
User documentation: AI Agent Survey System MVP

1. Intended use

The AI Agent Survey System is an intelligent tool for conducting and evaluating open-ended surveys. It automatically recognises vague or unspecific answers and asks targeted follow-up questions in order to obtain high-quality, structured data for statistical analysis.

Basic principle: The system evaluates each answer using a clarity score (0 to 1) and independently decides whether a follow-up question is necessary. All responses are prepared for automatic clustering and can be tracked – from the original vague response to the final structured form.
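
The follow-up decision described above can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the function name `needs_followup` and its parameters are hypothetical, and it assumes that a score strictly below the threshold triggers a follow-up (the behaviour at exactly the threshold is not documented).

```python
def needs_followup(clarity_score, threshold=0.7, followups_asked=0, max_followups=1):
    """Decide whether another follow-up question should be asked.

    Assumption: scores strictly below the threshold count as unclear.
    """
    if not 0.0 <= clarity_score <= 1.0:
        raise ValueError("clarity_score must be between 0 and 1")
    # Never exceed the configured number of follow-up questions.
    if followups_asked >= max_followups:
        return False
    return clarity_score < threshold


# Scores taken from the application example in section 4.
print(needs_followup(0.30))  # vague answer 'Cloud'
print(needs_followup(0.92))  # specific answer after the follow-up
```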

2. Range of functions

Core functions

  • Intelligent response evaluation: Automatic evaluation of response clarity with configurable thresholds (default: 0.7)
  • Adaptive follow-up questions: Generation of context-specific follow-up questions in case of unclear answers
  • Answer structuring: Preparation of final answers with categorisation and clustering suitability
  • Live tracking: Visualisation of answer development from original to structured
  • Session management: Complete management of survey runs with history
  • Performance monitoring: Detailed recording of processing times and system health

Operating modes

  • Production mode: Connection to real LLM APIs (e.g. Mistral, OpenAI-compatible)
  • Mock mode: Simulation for tests without LLM connection

User interface

The system offers six specialised tabs:

  • Experiment Playground: Interactive chat for real-time testing with live feedback
  • Prompt engineering: Editing and optimisation of LLM prompt templates
  • Batch testing: Systematic testing with CSV test data
  • Question Editor: Management of test questions and help texts
  • Session Demo: End-to-end demonstration for stakeholders
  • Performance Tab: Live monitoring of system performance

3. Operation

Working with the Experiment Playground

  1. Set parameters:
     • Clarity threshold: Threshold value for follow-up questions (0.0–1.0, default: 0.7)
     • Max follow-ups: Maximum number of follow-up questions (1–3, default: 1)
     • Temperature: LLM creativity (0.0–1.0, default: 0.1)

  2. Start chat session:
     • Enter a question or use a predefined test question
     • Click ‘Start chat’

  3. Enter answers:
     • Enter your own answers in the text field
     • Alternatively: Use the quick test buttons (‘Cloud’, ‘Office 365’, etc.)
     • Submit with ‘Send’ or the Enter key

  4. Observe results:
     • Chat history shows questions and follow-up questions
     • Live results show evaluation details
     • Final answer display shows structuring in real time
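
The three playground parameters and their documented ranges could be modelled as a small validated settings object. This is a sketch only: the class name `PlaygroundSettings` and its field names are hypothetical and do not refer to the system's real configuration API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaygroundSettings:
    """Hypothetical container for the parameters listed above."""

    clarity_threshold: float = 0.7  # allowed range 0.0-1.0
    max_followups: int = 1          # allowed range 1-3
    temperature: float = 0.1        # allowed range 0.0-1.0

    def __post_init__(self):
        # Reject values outside the ranges given in the documentation.
        if not 0.0 <= self.clarity_threshold <= 1.0:
            raise ValueError("clarity_threshold must be in [0.0, 1.0]")
        if not 1 <= self.max_followups <= 3:
            raise ValueError("max_followups must be between 1 and 3")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be in [0.0, 1.0]")


defaults = PlaygroundSettings()
stricter = PlaygroundSettings(clarity_threshold=0.8, max_followups=2)
```

Validating at construction time means an out-of-range value fails immediately rather than during a chat session.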

Perform batch testing

  1. Prepare test data in CSV format:

       question,answer,expected_followup
       "Cloud technologies?","Cloud",true
       "Cloud technologies?","Office 365",false

  2. Insert test data into the input field

  3. Click ‘Start batch test’

  4. Analyse results:
     • Summary shows success rate and metrics
     • Detailed table lists each individual test
     • Symbols: ✅ (successful), ⏰ (timeout), ❌ (error)
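
To check test data before pasting it into the input field, a small script along the following lines may help. It is a sketch under stated assumptions: the helper names `parse_test_data` and `success_rate` are hypothetical, `predict` stands in for whatever decides on a follow-up, and a test is assumed to count as successful when that decision matches `expected_followup`.

```python
import csv
import io

REQUIRED_COLUMNS = ("question", "answer", "expected_followup")


def parse_test_data(text):
    """Parse CSV test data and verify that the header row is present."""
    reader = csv.DictReader(io.StringIO(text.strip()), skipinitialspace=True)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    cases = []
    for row in reader:
        cases.append({
            "question": row["question"],
            "answer": row["answer"],
            # 'true' / 'false' in the file become Python booleans.
            "expected_followup": row["expected_followup"].strip().lower() == "true",
        })
    return cases


def success_rate(cases, predict):
    """Share of cases where predict(question, answer) matches the expectation."""
    if not cases:
        return 0.0
    hits = sum(predict(c["question"], c["answer"]) == c["expected_followup"] for c in cases)
    return hits / len(cases)


sample = (
    "question,answer,expected_followup\n"
    '"Cloud technologies?","Cloud",true\n'
    '"Cloud technologies?","Office 365",false\n'
)
cases = parse_test_data(sample)
```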
    
Session demo

  1. Start the demo in the Session Demo tab

  2. Answer questions one after the other:
     • System automatically presents the next question
     • Try vague (‘cloud’) and specific answers
     • Observe live evaluations

  3. Evaluate results:
     • Final answer overview shows all structured answers
     • Demo statistics show overall performance
     • System recommendation evaluates production readiness
    
Important notes

  • Maximum follow-ups: The system stops asking follow-up questions once the configured maximum is reached
  • Timeout values: LLM requests time out after 15 seconds for evaluations and after 10 seconds for follow-up questions
  • Performance data: Automatically deleted after 30 minutes
  • CSV format: Header row required, at least 3 columns
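
The 30-minute retention of performance data can be pictured as a log that drops old entries whenever it is touched. The sketch below is illustrative only: the class `PerformanceLog` and its methods are hypothetical, and the system's actual storage mechanism is not documented here.

```python
import time
from collections import deque

RETENTION_SECONDS = 30 * 60  # 30 minutes, as stated above


class PerformanceLog:
    """Hypothetical in-memory log that forgets entries after the retention period."""

    def __init__(self, retention=RETENTION_SECONDS, clock=time.monotonic):
        self._retention = retention
        self._clock = clock
        self._entries = deque()  # (timestamp, operation, duration_seconds)

    def record(self, operation, duration):
        self._entries.append((self._clock(), operation, duration))
        self._prune()

    def entries(self):
        self._prune()
        return list(self._entries)

    def _prune(self):
        # Entries are appended in time order, so the oldest is always on the left.
        cutoff = self._clock() - self._retention
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()
```

Passing the clock in as a parameter makes the expiry behaviour testable without waiting 30 minutes.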
    
4. Application example

Scenario: IT infrastructure survey

Initial situation: You want to understand which cloud services are used in your organisation. The survey includes the question: ‘Which cloud technologies do you mainly use?’

Implementation:

  1. System starts chat with the question
  2. Participant replies: ‘Cloud’

Agent evaluation:

       Clarity Score: 0.30
       Needs Followup: Yes
       Problem Areas: vague_terminology, missing_specificity

Agent query: ‘Which specific cloud services do you mean? For example, Office 365, AWS S3, Google Workspace or Azure?’

  3. Participant replies: ‘Office 365 for emails and OneDrive for documents’

Agent assessment:

       Clarity Score: 0.92
       Needs Followup: No
       ✅ Answer accepted

Final structured answer:

       Original: ‘Cloud’
       After follow-up: ‘Office 365 for emails and OneDrive for documents’
       Structured: ‘Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)’
       Category: Microsoft Services
       Confidence: 0.92
       Clustering quality: High

Result: The originally unusable answer ‘Cloud’ was transformed into a precise, clusterable answer through intelligent follow-up questions.
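
The trace from the original answer to the structured form could be held in a record like the following. This is a sketch for illustration: the class `AnswerTrace` and its field names are hypothetical and are not taken from the system's data model; the values come from the scenario above.

```python
from dataclasses import dataclass, field


@dataclass
class AnswerTrace:
    """Hypothetical record linking an original answer to its structured form."""

    question: str
    original: str
    followups: list = field(default_factory=list)  # (follow-up question, reply) pairs
    structured: str = ""
    category: str = ""
    confidence: float = 0.0

    @property
    def final_raw(self):
        # The last follow-up reply if any was given, otherwise the original answer.
        return self.followups[-1][1] if self.followups else self.original


trace = AnswerTrace(
    question="Which cloud technologies do you mainly use?",
    original="Cloud",
)
trace.followups.append((
    "Which specific cloud services do you mean? "
    "For example, Office 365, AWS S3, Google Workspace or Azure?",
    "Office 365 for emails and OneDrive for documents",
))
trace.structured = "Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)"
trace.category = "Microsoft Services"
trace.confidence = 0.92
```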

5. Recommendations for efficient use

Best practices

  • Tune the clarity threshold: A value between 0.6 and 0.8 balances follow-up questions against accepted answers
  • Mock mode for development: Use mock mode for quick iterations and testing
  • Batch testing before production use: Validate your prompts with representative test data
  • Enable performance monitoring: Regularly monitor LLM response times
  • Customise prompt templates: Adapt the prompts to your specific domain
  • Limit max follow-ups: Keep the number at 1–2 for a better participant experience
  • Use session demo: Show stakeholders the live functionality
  • Carefully formulate test questions: Create questions that provoke different types of answers

Tips for optimal results

  • Question wording: Formulate questions that are open enough to elicit different types of responses
  • Help texts: Define clear examples in the question editor
  • Categories: Use the category suggestions for later clustering
  • History analysis: Check the response development in the live display
  • Regular updates: Update prompts based on batch test results

6. System limitations

Functional limitations

  • No in-depth semantic analysis: The system primarily evaluates specificity, not content accuracy
  • Limited follow-ups: The maximum number is configurable, but in practice limited to 1–3
  • No multilingual support: The system is optimised for German-language responses
  • No audio/video processing: Only text-based inputs are supported
  • No automatic response validation: System does not check for truthfulness or plausibility

Technical limitations

  • LLM dependency: Quality of results depends on the LLM model used
  • Processing times: Average of 2–5 seconds per response evaluation
  • Token limits: Max. 500 tokens for LLM responses (configurable)
  • Timeout thresholds: 15 seconds for evaluations, 10 seconds for follow-up questions, 10 seconds for structuring
  • Concurrency: One request at a time per session
  • Storage: Performance data is automatically deleted after 30 minutes
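
When integrating your own LLM calls, per-operation timeouts like those listed above can be enforced with `asyncio.wait_for`. The sketch below is an assumption about how such a wrapper might look: the operation keys, the helper `call_with_timeout` and the choice to return `None` on timeout are all hypothetical.

```python
import asyncio

# Timeout values in seconds, taken from the list above; the keys are hypothetical.
TIMEOUTS = {"evaluation": 15.0, "followup": 10.0, "structuring": 10.0}


async def call_with_timeout(operation, coro_factory, timeouts=TIMEOUTS):
    """Run an async LLM call and give up after the timeout for this operation.

    Assumption: a timeout is reported as None rather than raised.
    """
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeouts[operation])
    except asyncio.TimeoutError:
        return None


async def fake_llm_call():
    # Stand-in for a real LLM request.
    await asyncio.sleep(0.01)
    return {"clarity_score": 0.92}


result = asyncio.run(call_with_timeout("evaluation", fake_llm_call))
```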

Contextual limitations

  • Mock mode: Simulates only basic behaviour, no real intelligence
  • Clustering suggestions: Are hints, not guaranteed classifications
  • Confidence values: Based on heuristic evaluations, not statistical validation