User documentation: AI Agent Survey System MVP
1. Intended use
The AI Agent Survey System is an intelligent tool for conducting and evaluating open-ended surveys. It automatically recognises vague or unspecific answers and asks targeted follow-up questions in order to obtain high-quality, structured data for statistical analysis.
Basic principle: The system evaluates each answer using a clarity score (0 to 1) and independently decides whether a follow-up question is necessary. All responses are prepared for automatic clustering and can be tracked – from the original vague response to the final structured form.
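The decision rule behind this principle can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the function name `needs_followup` is hypothetical, the default of 0.7 is the threshold this documentation gives, and treating a score exactly equal to the threshold as clear enough is an assumption.

```python
DEFAULT_CLARITY_THRESHOLD = 0.7  # default threshold named in this documentation


def needs_followup(clarity_score: float,
                   threshold: float = DEFAULT_CLARITY_THRESHOLD) -> bool:
    """Return True when an answer is too vague and a follow-up should be asked.

    Assumption: a score equal to the threshold counts as clear enough.
    """
    if not 0.0 <= clarity_score <= 1.0:
        raise ValueError("clarity_score must lie between 0 and 1")
    return clarity_score < threshold
```

With the default threshold, a score of 0.30 would trigger a follow-up question, while a score of 0.92 would be accepted.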
2. Range of functions
Core functions
- Intelligent response evaluation: Automatic evaluation of response clarity with configurable thresholds (default: 0.7)
- Adaptive follow-up questions: Generation of context-specific follow-up questions in case of unclear answers
- Answer structuring: Preparation of final answers with categorisation and clustering suitability
- Live tracking: Visualisation of answer development from original to structured
- Session management: Complete management of survey runs with history
- Performance monitoring: Detailed recording of processing times and system health
Operating modes
- Production mode: Connection to real LLM APIs (e.g. Mistral, OpenAI-compatible)
- Mock mode: Simulation for tests without LLM connection
User interface
The system offers six specialised tabs:
- Experiment Playground: Interactive chat for real-time testing with live feedback
- Prompt engineering: Editing and optimisation of LLM prompt templates
- Batch testing: Systematic testing with CSV test data
- Question Editor: Management of test questions and help texts
- Session Demo: End-to-end demonstration for stakeholders
- Performance Tab: Live monitoring of system performance
3. Operation
Working with the Experiment Playground
1. Set parameters:
   - Clarity threshold: Threshold value for follow-up questions (0.0–1.0, default: 0.7)
   - Max follow-ups: Maximum number of follow-up questions (1–3, default: 1)
   - Temperature: LLM creativity (0.0–1.0, default: 0.1)
2. Start chat session:
   - Enter a question or use a predefined test question
   - Click ‘Start chat’
3. Enter answers:
   - Enter your own answers in the text field
   - Alternatively: Use the quick test buttons (‘Cloud’, ‘Office 365’, etc.)
   - Submit with ‘Send’ or the Enter key
4. Observe results:
   - Chat history shows questions and follow-up questions
   - Live results show evaluation details
   - Final answer display shows structuring in real time
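The three playground parameters and their documented ranges can be captured in a small settings object. The sketch below is illustrative only: the class name `PlaygroundSettings` and its field names are hypothetical and not taken from the system, while the ranges and defaults are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaygroundSettings:
    """Hypothetical container for the three playground parameters."""

    clarity_threshold: float = 0.7  # follow-up is asked below this score
    max_followups: int = 1          # upper bound on follow-up questions
    temperature: float = 0.1        # LLM sampling temperature

    def __post_init__(self) -> None:
        # Reject values outside the ranges documented above.
        if not 0.0 <= self.clarity_threshold <= 1.0:
            raise ValueError("clarity_threshold must be in 0.0-1.0")
        if not 1 <= self.max_followups <= 3:
            raise ValueError("max_followups must be in 1-3")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be in 0.0-1.0")
```

Validating at construction time means an out-of-range value such as `PlaygroundSettings(max_followups=5)` fails immediately rather than in the middle of a chat session.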
Perform batch testing
1. Prepare test data in CSV format:
```
question,answer,expected_followup
"Cloud technologies?","Cloud",true
"Cloud technologies?","Office 365",false
```
2. Insert test data into the input field
3. Click ‘Start batch test’
4. Analyse results:
   - Summary shows success rate and metrics
   - Detailed table lists each individual test
   - Symbols: ✅ (successful), ⏰ (timeout), ❌ (error)
Session demo
1. Start the demo in the Session Demo tab
2. Answer questions one after the other:
   - The system automatically presents the next question
   - Try vague (‘Cloud’) and specific answers
   - Observe live evaluations
3. Evaluate results:
   - Final answer overview shows all structured answers
   - Demo statistics show overall performance
   - System recommendation evaluates production readiness
Important notes
- Maximum follow-ups: The system stops asking follow-up questions once the configured maximum is reached
- Timeout values: LLM requests time out after 15 seconds for evaluations and after 10 seconds for follow-up questions
- Performance data: Automatically deleted after 30 minutes
- CSV format: Header row required, at least 3 columns
4. Application example
Scenario: IT infrastructure survey
Initial situation: You want to understand which cloud services are used in your organisation. The survey includes the question: ‘Which cloud technologies do you mainly use?’
Implementation:
1. The system starts the chat with the question
2. The participant replies: ‘Cloud’
Agent evaluation:
```
Clarity Score: 0.30
Needs Followup: Yes
Problem Areas: vague_terminology, missing_specificity
```
Evaluation after the participant's follow-up answer:
```
Clarity Score: 0.92
Needs Followup: No
✅ Answer accepted
```
Final structured answer:
```
Original: ‘Cloud’
After follow-up: ‘Office 365 for emails and OneDrive for documents’
Structured: ‘Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)’
Category: Microsoft Services
Confidence: 0.92
Clustering quality: High
```
Result: The originally unusable answer ‘Cloud’ was transformed into a precise, clusterable answer through intelligent follow-up questions.
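The control flow of this scenario (evaluate, ask a follow-up if needed, evaluate again, stop) can be sketched as a short loop. Everything here is an illustrative assumption rather than the system's implementation: `AnswerTrace`, `run_followup_loop` and the two callables are hypothetical names, the evaluator is a scripted stand-in that simply replays the scores from the example, and the final structuring step is left out because it requires an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (clarity score, problem areas) as returned by an evaluator
Evaluation = Tuple[float, List[str]]


@dataclass
class AnswerTrace:
    """Keeps every step from the original answer to the last evaluated one."""
    original: str
    steps: List[Tuple[str, float]] = field(default_factory=list)

    @property
    def final_answer(self) -> str:
        return self.steps[-1][0]


def run_followup_loop(
    first_answer: str,
    evaluate: Callable[[str], Evaluation],
    get_followup_answer: Callable[[str, List[str]], str],
    threshold: float = 0.7,
    max_followups: int = 1,
) -> AnswerTrace:
    trace = AnswerTrace(original=first_answer)
    answer = first_answer
    for asked in range(max_followups + 1):
        score, problems = evaluate(answer)
        trace.steps.append((answer, score))
        # Stop when the answer is clear enough or the follow-up budget is spent.
        if score >= threshold or asked == max_followups:
            break
        answer = get_followup_answer(answer, problems)
    return trace


# Scripted stand-ins that replay the values from the scenario above.
FOLLOWUP_ANSWER = "Office 365 for emails and OneDrive for documents"
SCRIPTED_SCORES = {
    "Cloud": (0.30, ["vague_terminology", "missing_specificity"]),
    FOLLOWUP_ANSWER: (0.92, []),
}

trace = run_followup_loop(
    "Cloud",
    evaluate=lambda answer: SCRIPTED_SCORES[answer],
    get_followup_answer=lambda answer, problems: FOLLOWUP_ANSWER,
)
print(trace.original, "->", trace.final_answer)
```

Recording every evaluated answer in `steps` is what makes the path from the vague original to the accepted answer traceable afterwards.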
5. Recommendations for efficient use
Best practices
- Set the clarity threshold optimally: A value between 0.6 and 0.8 balances the number of follow-up questions against the acceptance of answers
- Mock mode for development: Use mock mode for quick iterations and testing
- Batch testing before production use: Validate your prompts with representative test data
- Enable performance monitoring: Regularly monitor LLM response times
- Customise prompt templates: Adapt the prompts to your specific domain
- Limit max follow-ups: Keep the number at 1–2 for a better participant experience
- Use session demo: Show stakeholders the live functionality
- Carefully formulate test questions: Create questions that provoke different types of answers
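For the batch-testing recommendation, it can help to check test data outside the tool before pasting it in. The sketch below parses CSV text with the three columns from the documented sample and computes a success rate. It rests on assumptions: the function names are hypothetical, "success" is taken to mean that the follow-up decision matches `expected_followup`, and the predictor used in the demo is a deliberately naive placeholder, not the system's evaluation.

```python
import csv
import io
from typing import Callable, Dict, List

REQUIRED_COLUMNS = ("question", "answer", "expected_followup")


def load_test_cases(csv_text: str) -> List[Dict[str, object]]:
    """Parse batch-test CSV text (header row required) into test cases."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing CSV columns: {missing}")
    cases = []
    for row in reader:
        cases.append({
            "question": (row["question"] or "").strip(),
            "answer": (row["answer"] or "").strip(),
            "expected_followup":
                (row["expected_followup"] or "").strip().lower() == "true",
        })
    return cases


def success_rate(cases: List[Dict[str, object]],
                 predict_followup: Callable[[str, str], bool]) -> float:
    """Share of cases where the predicted decision matches the expectation."""
    if not cases:
        return 0.0
    hits = sum(
        predict_followup(c["question"], c["answer"]) == c["expected_followup"]
        for c in cases
    )
    return hits / len(cases)


SAMPLE = '''question,answer,expected_followup
"Cloud technologies?","Cloud",true
"Cloud technologies?","Office 365",false
'''

cases = load_test_cases(SAMPLE)
# Placeholder predictor: treats one-word answers as vague. Purely illustrative.
rate = success_rate(cases, lambda question, answer: len(answer.split()) < 2)
print(f"{len(cases)} cases, success rate {rate:.0%}")
```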
Tips for optimal results
- Question wording: Formulate questions that are open enough to elicit different types of responses
- Help texts: Define clear examples in the question editor
- Categories: Use the category suggestions for later clustering
- History analysis: Check the response development in the live display
- Regular updates: Update prompts based on batch test results
6. System limitations
Functional limitations
- No in-depth semantic analysis: The system primarily evaluates specificity, not content accuracy
- Limited follow-up questions: The maximum number is configurable, but only within the range 1–3
- No multilingual support: The system is optimised for German-language responses
- No audio/video processing: Only text-based inputs are supported
- No automatic response validation: System does not check for truthfulness or plausibility
Technical limitations
- LLM dependency: Quality of results depends on the LLM model used
- Processing times: Average of 2–5 seconds per response evaluation
- Token limits: Max. 500 tokens for LLM responses (configurable)
- Timeout thresholds: 15 seconds for evaluations, 10 seconds for follow-up questions, 10 seconds for structuring
- Concurrency: One request at a time per session
- Storage: Performance data is automatically deleted after 30 minutes
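The per-stage timeout thresholds listed above could be enforced with `asyncio.wait_for`. This is a sketch under stated assumptions, not the system's code: the stage names, `call_with_stage_timeout` and `fake_llm_call` are hypothetical, the 10-second value is read here as applying to follow-up question generation, and the demo uses shortened timeouts so that it finishes quickly.

```python
import asyncio

# Per-stage timeouts in seconds, as listed above (stage names are assumed).
STAGE_TIMEOUTS = {"evaluation": 15.0, "followup": 10.0, "structuring": 10.0}


async def call_with_stage_timeout(stage, coro, timeouts=STAGE_TIMEOUTS):
    """Await `coro`; return (status, result) instead of raising on timeout."""
    try:
        result = await asyncio.wait_for(coro, timeout=timeouts[stage])
        return "ok", result
    except asyncio.TimeoutError:
        return "timeout", None


async def fake_llm_call(delay: float) -> str:
    """Stand-in for a real LLM request; it only sleeps."""
    await asyncio.sleep(delay)
    return "done"


async def demo():
    # Shortened timeouts so the demo finishes quickly.
    short = {"evaluation": 0.05, "followup": 0.05, "structuring": 0.05}
    fast = await call_with_stage_timeout("evaluation", fake_llm_call(0.0), short)
    slow = await call_with_stage_timeout("followup", fake_llm_call(1.0), short)
    return fast, slow


results = asyncio.run(demo())
print(results)
```

Returning a status instead of raising lets a caller map a timed-out request directly to the ⏰ symbol used in the batch test results.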
Contextual limitations
- Mock mode: Simulates only basic behaviour, no real intelligence
- Clustering suggestions: Are hints, not guaranteed classifications
- Confidence values: Based on heuristic evaluations, not statistical validation