User documentation: AI Agent Survey System MVP#
1. Intended use#
The AI Agent Survey System is an intelligent tool for conducting and evaluating open-ended surveys. It automatically recognises vague or unspecific answers and asks targeted follow-up questions in order to obtain high-quality, structured data for statistical analysis.
Basic principle: The system evaluates each answer using a clarity score (0 to 1) and independently decides whether a follow-up question is necessary. All responses are prepared for automatic clustering and can be tracked – from the original vague response to the final structured form.
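The threshold decision described above can be sketched as a simple comparison (a minimal illustration; the function name and signature are assumptions, not the system's actual API):

```python
def needs_followup(clarity_score: float, threshold: float = 0.7) -> bool:
    """Decide whether an answer is too vague and warrants a follow-up.

    clarity_score -- the agent's 0-1 rating of the answer
    threshold     -- configurable cut-off (default 0.7, as in the UI)
    """
    return clarity_score < threshold
```

With the default threshold, a vague answer scored 0.30 triggers a follow-up, while a specific answer scored 0.92 is accepted.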
2. Range of functions#
Core functions#
- Intelligent response evaluation: Automatic evaluation of response clarity with configurable thresholds (default: 0.7)
- Adaptive follow-up questions: Generation of context-specific follow-up questions in case of unclear answers
- Answer structuring: Preparation of final answers with categorisation and clustering suitability
- Live tracking: Visualisation of answer development from original to structured
- Session management: Complete management of survey runs with history
- Performance monitoring: Detailed recording of processing times and system health
Operating modes#
- Productive mode: Connection to real LLM APIs (e.g. Mistral, OpenAI-compatible)
- Mock mode: Simulation for tests without LLM connection
User interface#
The system offers six specialised tabs:
- Experimentation playground: Interactive chat for real-time testing with live feedback
- Prompt engineering: Editing and optimisation of LLM prompt templates
- Batch testing: Systematic testing with CSV test data
- Question Editor: Management of test questions and help texts
- Session Demo: End-to-end demonstration for stakeholders
- Performance Tab: Live monitoring of system performance
3. Operation#
Working with the Experiment Playground#
- Set parameters:
- Clarity threshold: Threshold value for follow-up questions (0.0–1.0, default: 0.7)
- Max follow-ups: Maximum number of follow-up questions (1–3, default: 1)
- Temperature: LLM creativity (0.0–1.0, default: 0.1)
- Start chat session:
- Enter question or use predefined test question
- Click ‘Start chat’
- Enter answers:
- Enter your own answers in the text field
- Alternatively: use the quick test buttons (‘Cloud’, ‘Office 365’, etc.)
- Submit with ‘Send’ or the Enter key
- Observe results:
- Chat history shows questions and follow-up questions
- Live results show evaluation details
- Final answer display shows structuring in real time
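The three playground parameters and their documented ranges can be bundled as a validated configuration object. A sketch only; the class and field names are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class PlaygroundConfig:
    """Illustrative bundle of the three playground parameters."""
    clarity_threshold: float = 0.7   # 0.0-1.0, default 0.7
    max_followups: int = 1           # 1-3, default 1
    temperature: float = 0.1         # 0.0-1.0, default 0.1

    def __post_init__(self) -> None:
        # Enforce the documented ranges up front.
        if not 0.0 <= self.clarity_threshold <= 1.0:
            raise ValueError("clarity_threshold must be in [0.0, 1.0]")
        if not 1 <= self.max_followups <= 3:
            raise ValueError("max_followups must be in [1, 3]")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be in [0.0, 1.0]")
```

Validating at construction time means an out-of-range value fails immediately rather than mid-session.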
Perform batch testing#
- Prepare test data in CSV format:
  question,answer,expected_followup
  "Cloud technologies?",Cloud,true
  "Cloud technologies?",Office 365,false
- Insert test data into the input field
- Click ‘Start batch test’
- Analyse results:
- Summary shows success rate and metrics
- Detailed table lists each individual test
- Symbols: ✅ (successful), ⏰ (timeout), ❌ (error)
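Parsing the three-column test format can be sketched with the standard `csv` module (the function name is an assumption; the real system's parser is not documented here):

```python
import csv
import io


def load_batch_cases(csv_text: str) -> list[dict]:
    """Parse batch test data in the documented three-column format."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    required = {"question", "answer", "expected_followup"}
    if not rows or not required <= set(rows[0]):
        raise ValueError(
            "header row with question,answer,expected_followup required")
    for row in rows:
        # Normalise the expected_followup column to a real boolean.
        row["expected_followup"] = row["expected_followup"].strip().lower() == "true"
    return rows
```

Each parsed case can then be run against the agent and its actual follow-up decision compared with `expected_followup` to compute the success rate.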
Session demo#
- Start the demo in the Session Demo tab
- Answer questions one after the other:
- System automatically presents the next question
- Try vague (‘Cloud’) and specific answers
- Observe live evaluations
- Evaluate results:
- Final answer overview shows all structured answers
- Demo statistics show overall performance
- System recommendation evaluates production readiness
Important notes#
- Maximum follow-ups: The system respects the configured maximum and stops asking follow-up questions once it is reached
- Timeout values: LLM calls time out after 15 seconds for evaluations and after 10 seconds for follow-up generation and structuring
- Performance data: Automatically deleted after 30 minutes
- CSV format: Header row required, at least 3 columns
4. Application example#
Scenario: IT infrastructure survey#
Initial situation: You want to understand which cloud services are used in your organisation. The survey includes the question: ‘Which cloud technologies do you mainly use?’
Implementation:
- System starts chat with the question
- Participant replies: ‘Cloud’
Agent evaluation:
Clarity Score: 0.30
Needs Followup: Yes
Problem Areas: vague_terminology, missing_specificity
Agent query: ‘Which specific cloud services do you mean? For example, Office 365, AWS S3, Google Workspace or Azure?’
- Participant replies: ‘Office 365 for emails and OneDrive for documents’
Agent evaluation:
Clarity Score: 0.92
Needs Followup: No
✅ Answer accepted
Final structured answer:
Original: ‘Cloud’
After follow-up: ‘Office 365 for emails and OneDrive for documents’
Structured: ‘Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)’
Category: Microsoft Services
Confidence: 0.92
Clustering quality: High
Result: The originally unusable answer ‘Cloud’ was transformed into a precise, clusterable answer through intelligent follow-up questions.
5. Recommendations for efficient use#
Best practices#
- Set the clarity threshold optimally: Value between 0.6 and 0.8 for balance between queries and acceptance
- Mock mode for development: Use mock mode for quick iterations and testing
- Batch testing before productive use: Validate your prompts with representative test data
- Enable performance monitoring: Regularly monitor LLM response times
- Customise prompt templates: Adapt the prompts to your specific domain
- Limit max follow-ups: Keep the number at 1–2 for a better participant experience
- Use session demo: Show stakeholders the live functionality
- Carefully formulate test questions: Create questions that provoke different types of answers
Tips for optimal results#
- Question wording: Formulate questions that are open enough to elicit different types of responses
- Help texts: Define clear examples in the question editor
- Categories: Use the category suggestions for later clustering
- History analysis: Check the response development in the live display
- Regular updates: Update prompts based on batch test results
6. System limitations#
Functional limitations#
- No in-depth semantic analysis: The system primarily evaluates specificity, not content accuracy
- Limited follow-ups: The maximum number of follow-up questions is configurable, but practically limited to 1–3
- No multilingual support: The system is optimised for German-language responses
- No audio/video processing: Only text-based inputs are supported
- No automatic response validation: System does not check for truthfulness or plausibility
Technical limitations#
- LLM dependency: Quality of results depends on the LLM model used
- Processing times: Average of 2–5 seconds per response evaluation
- Token limits: Max. 500 tokens for LLM responses (configurable)
- Timeout thresholds: 15 seconds for evaluations, 10 seconds for queries, 10 seconds for structuring
- Concurrency: One request at a time per session
- Storage: Performance data is automatically deleted after 30 minutes
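Enforcing the per-call timeout budgets on a blocking LLM call can be sketched with `concurrent.futures` (the helper name is an assumption; the real system's timeout mechanism is not documented here):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s, *args):
    """Run a blocking call and give up after timeout_s seconds.

    Mirrors the documented budgets: 15 s for evaluations, 10 s for
    follow-up generation and structuring. Returns None on timeout so
    the caller can mark the attempt accordingly (e.g. ⏰ in batch tests).
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return None
```

Note that the underlying call keeps running in its worker thread after the timeout; a production system would also need cancellation on the HTTP client side.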
Contextual limitations#
- Mock mode: Simulates only basic behaviour, no real intelligence
- Clustering suggestions: Are hints, not guaranteed classifications
- Confidence values: Based on heuristic evaluations, not statistical validation