User documentation: AI Agent Survey System MVP#

1. Intended use#

The AI Agent Survey System is an intelligent tool for conducting and evaluating open-ended surveys. It automatically recognises vague or unspecific answers and asks targeted follow-up questions in order to obtain high-quality, structured data for statistical analysis.

Basic principle: The system evaluates each answer using a clarity score (0 to 1) and independently decides whether a follow-up question is necessary. All responses are prepared for automatic clustering and can be tracked – from the original vague response to the final structured form.
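The threshold decision can be sketched in a few lines (hypothetical function name; the real implementation may differ):

```python
def needs_followup(clarity_score: float, threshold: float = 0.7) -> bool:
    """Return True when an answer is too vague and a follow-up question is needed."""
    if not 0.0 <= clarity_score <= 1.0:
        raise ValueError("clarity score must be between 0 and 1")
    return clarity_score < threshold

# A vague answer such as 'Cloud' might score 0.30 and trigger a follow-up:
print(needs_followup(0.30))  # True
print(needs_followup(0.92))  # False
```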

2. Range of functions#

Core functions#

  • Intelligent response evaluation: Automatic evaluation of response clarity with configurable thresholds (default: 0.7)
  • Adaptive follow-up questions: Generation of context-specific follow-up questions in case of unclear answers
  • Answer structuring: Preparation of final answers with categorisation and clustering suitability
  • Live tracking: Visualisation of answer development from original to structured
  • Session management: Complete management of survey runs with history
  • Performance monitoring: Detailed recording of processing times and system health

Operating modes#

  • Productive mode: Connection to real LLM APIs (e.g. Mistral, OpenAI-compatible)
  • Mock mode: Simulation for tests without LLM connection
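Mock mode can be imagined as a stand-in client that produces plausible evaluations without calling a model. The sketch below is purely illustrative (the class name and word-count heuristic are assumptions, not the system's actual logic):

```python
class MockLLMClient:
    """Simulates answer evaluation for tests without an LLM connection.

    Hypothetical heuristic: longer, more specific answers receive a
    higher mock clarity score; no real model is involved.
    """

    def evaluate(self, answer: str) -> dict:
        words = len(answer.split())
        score = min(1.0, 0.2 + 0.15 * words)
        return {"clarity_score": round(score, 2), "needs_followup": score < 0.7}

client = MockLLMClient()
print(client.evaluate("Cloud"))
print(client.evaluate("Office 365 for emails and OneDrive for documents"))
```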

User interface#

The system offers six specialised tabs:

  • Experimentation playground: Interactive chat for real-time testing with live feedback
  • Prompt engineering: Editing and optimisation of LLM prompt templates
  • Batch testing: Systematic testing with CSV test data
  • Question Editor: Management of test questions and help texts
  • Session Demo: End-to-end demonstration for stakeholders
  • Performance Tab: Live monitoring of system performance

3. Operation#

Working with the Experiment Playground#

  1. Set parameters:
     • Clarity threshold: Threshold value for follow-up questions (0.0–1.0, default: 0.7)
     • Max follow-ups: Maximum number of follow-up questions (1–3, default: 1)
     • Temperature: LLM creativity (0.0–1.0, default: 0.1)
  2. Start chat session:
     • Enter a question or use a predefined test question
     • Click ‘Start chat’
  3. Enter answers:
     • Enter your own answers in the text field
     • Alternatively: use the quick test buttons (‘Cloud’, ‘Office 365’, etc.)
     • Submit with ‘Send’ or the Enter key
  4. Observe results:
     • Chat history shows questions and follow-up questions
     • Live results show evaluation details
     • Final answer display shows structuring in real time
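The three playground parameters map naturally onto a small configuration object. This is a sketch only; the class name and validation rules are assumptions based on the ranges documented above:

```python
from dataclasses import dataclass

@dataclass
class PlaygroundConfig:
    """Hypothetical container for the playground parameters described above."""
    clarity_threshold: float = 0.7   # follow-up triggered below this score (0.0-1.0)
    max_followups: int = 1           # maximum follow-up questions (1-3)
    temperature: float = 0.1         # LLM creativity (0.0-1.0)

    def __post_init__(self) -> None:
        if not 0.0 <= self.clarity_threshold <= 1.0:
            raise ValueError("clarity_threshold must be between 0.0 and 1.0")
        if not 1 <= self.max_followups <= 3:
            raise ValueError("max_followups must be between 1 and 3")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")

config = PlaygroundConfig()
print(config)
```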

Perform batch testing#

  1. Prepare test data in CSV format:

     question,answer,expected_followup
     "Cloud technologies?","Cloud",true
     "Cloud technologies?","Office 365",false

  2. Insert the test data into the input field
  3. Click ‘Start batch test’
  4. Analyse results:
     • Summary shows success rate and metrics
     • Detailed table lists each individual test
     • Symbols: ✅ (successful), ⏰ (timeout), ❌ (error)
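Reading this CSV format boils down to standard CSV parsing plus converting the `expected_followup` column to a boolean. A minimal sketch (the function name is an assumption; the actual parser may differ):

```python
import csv
import io

csv_text = """question,answer,expected_followup
"Cloud technologies?","Cloud",true
"Cloud technologies?","Office 365",false
"""

def load_test_cases(text: str) -> list[dict]:
    """Parse batch-test CSV: header row required, at least 3 columns per row."""
    cases = []
    for row in csv.DictReader(io.StringIO(text)):
        if len(row) < 3:
            raise ValueError("each row needs at least 3 columns")
        # The expected_followup column arrives as text; normalise it to a bool.
        row["expected_followup"] = row["expected_followup"].strip().lower() == "true"
        cases.append(row)
    return cases

for case in load_test_cases(csv_text):
    print(case)
```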

Session demo#

  1. Start the demo in the session demo tab
  2. Answer questions one after the other:
     • System automatically presents the next question
     • Try vague (‘Cloud’) and specific answers
     • Observe live evaluations
  3. Evaluate results:
     • Final answer overview shows all structured answers
     • Demo statistics show overall performance
     • System recommendation evaluates production readiness

Important notes#

  • Maximum follow-ups: The system respects the configured max follow-ups and then stops asking further follow-up questions
  • Timeout values: LLM queries have a 15-second timeout for evaluations and a 10-second timeout for queries
  • Performance data: Automatically deleted after 30 minutes
  • CSV format: Header row required, at least 3 columns
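The documented timeouts (15 s for evaluations, 10 s for queries) can be enforced by running each LLM call with a bounded wait. The sketch below uses a thread pool; the real system may use async I/O instead, and the function names are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

EVALUATION_TIMEOUT = 15  # seconds, as documented above
QUERY_TIMEOUT = 10

def call_with_timeout(fn, timeout: float, *args):
    """Run a call in a worker thread and stop waiting after `timeout` seconds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except FuturesTimeout:
            return {"status": "timeout"}  # shown as ⏰ in the batch-test table

result = call_with_timeout(lambda q: {"status": "ok", "question": q},
                           QUERY_TIMEOUT, "Which cloud technologies?")
print(result)
```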

4. Application example#

Scenario: IT infrastructure survey#

Initial situation: You want to understand which cloud services are used in your organisation. The survey includes the question: ‘Which cloud technologies do you mainly use?’

Implementation:

  1. System starts chat with the question
  2. Participant replies: ‘Cloud’

Agent evaluation:

Clarity Score: 0.30
Needs Followup: Yes
Problem Areas: vague_terminology, missing_specificity

Agent query: ‘Which specific cloud services do you mean? For example, Office 365, AWS S3, Google Workspace or Azure?’

  3. Participant replies: ‘Office 365 for emails and OneDrive for documents’

Agent assessment:

Clarity Score: 0.92
Needs Followup: No
✅ Answer accepted

Final structured answer:

Original: ‘Cloud’
After follow-up: ‘Office 365 for emails and OneDrive for documents’
Structured: ‘Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)’
Category: Microsoft Services
Confidence: 0.92
Clustering quality: High

Result: The originally unusable answer ‘Cloud’ was transformed into a precise, clusterable answer through intelligent follow-up questions.
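The final structured answer in this example can be pictured as a simple record. The field names below mirror the labels shown above but are otherwise hypothetical:

```python
from dataclasses import dataclass

@dataclass
class StructuredAnswer:
    """Hypothetical shape of a final structured answer (field names assumed)."""
    original: str
    after_followup: str
    structured: str
    category: str
    confidence: float
    clustering_quality: str

answer = StructuredAnswer(
    original="Cloud",
    after_followup="Office 365 for emails and OneDrive for documents",
    structured="Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)",
    category="Microsoft Services",
    confidence=0.92,
    clustering_quality="high",
)
print(answer.category, answer.confidence)
```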

5. Recommendations for efficient use#

Best practices#

  • Set the clarity threshold optimally: A value between 0.6 and 0.8 balances follow-up frequency against acceptance
  • Mock mode for development: Use mock mode for quick iterations and testing
  • Batch testing before productive use: Validate your prompts with representative test data
  • Enable performance monitoring: Regularly monitor LLM response times
  • Customise prompt templates: Adapt the prompts to your specific domain
  • Limit max follow-ups: Keep the number at 1–2 for a better participant experience
  • Use session demo: Show stakeholders the live functionality
  • Carefully formulate test questions: Create questions that provoke different types of answers

Tips for optimal results#

  • Question wording: Formulate questions that are open enough to elicit different types of responses
  • Help texts: Define clear examples in the question editor
  • Categories: Use the category suggestions for later clustering
  • History analysis: Check the response development in the live display
  • Regular updates: Update prompts based on batch test results

6. System limitations#

Functional limitations#

  • No in-depth semantic analysis: The system primarily evaluates specificity, not content accuracy
  • Limited follow-up questions: The maximum number is configurable, but practically limited to 1–3
  • No multilingualism: System is optimised for German-language responses
  • No audio/video processing: Only text-based inputs are supported
  • No automatic response validation: System does not check for truthfulness or plausibility

Technical limitations#

  • LLM dependency: Quality of results depends on the LLM model used
  • Processing times: Average of 2–5 seconds per response evaluation
  • Token limits: Max. 500 tokens for LLM responses (configurable)
  • Timeout thresholds: 15 seconds for evaluations, 10 seconds for queries, 10 seconds for structuring
  • Concurrency: One request at a time per session
  • Storage: Performance data is automatically deleted after 30 minutes
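The 30-minute retention window for performance data amounts to pruning records older than the cutoff. A minimal sketch, assuming records are dicts with a Unix timestamp (the field names are hypothetical):

```python
import time

RETENTION_SECONDS = 30 * 60  # performance data is kept for 30 minutes

def prune_records(records: list[dict], now: float) -> list[dict]:
    """Drop performance records older than the retention window."""
    return [r for r in records if now - r["timestamp"] <= RETENTION_SECONDS]

now = time.time()
records = [
    {"timestamp": now - 10, "latency_s": 2.4},        # fresh, kept
    {"timestamp": now - 45 * 60, "latency_s": 3.1},   # 45 minutes old, dropped
]
print(len(prune_records(records, now)))  # 1
```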

Contextual limitations#

  • Mock mode: Simulates only basic behaviour, no real intelligence
  • Clustering suggestions: Are hints, not guaranteed classifications
  • Confidence values: Based on heuristic evaluations, not statistical validation