User documentation: AI Agent Survey System MVP#

1. Intended use#

The AI Agent Survey System is an intelligent tool for conducting and evaluating open-ended surveys. It automatically recognises vague or unspecific answers and asks targeted follow-up questions in order to obtain high-quality, structured data for statistical analysis.

Basic principle: The system evaluates each answer using a clarity score (0 to 1) and independently decides whether a follow-up question is necessary. All responses are prepared for automatic clustering and can be tracked – from the original vague response to the final structured form.
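The threshold decision can be sketched in a few lines (hypothetical function name; the real implementation may differ):

```python
def needs_followup(clarity_score: float, threshold: float = 0.7) -> bool:
    """Return True when an answer is too vague and a follow-up question is needed."""
    if not 0.0 <= clarity_score <= 1.0:
        raise ValueError("clarity score must be between 0 and 1")
    return clarity_score < threshold

# A vague answer such as 'Cloud' might score 0.30 and trigger a follow-up:
print(needs_followup(0.30))  # True
print(needs_followup(0.92))  # False
```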

2. Range of functions#

Core functions#

  • Intelligent response evaluation: Automatic evaluation of response clarity with configurable thresholds (default: 0.7)
  • Adaptive follow-up questions: Generation of context-specific follow-up questions in case of unclear answers
  • Answer structuring: Preparation of final answers with categorisation and clustering suitability
  • Live tracking: Visualisation of answer development from original to structured
  • Session management: Complete management of survey runs with history
  • Performance monitoring: Detailed recording of processing times and system health

Operating modes#

  • Productive mode: Connection to real LLM APIs (e.g. Mistral, OpenAI-compatible)
  • Mock mode: Simulation for tests without LLM connection
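Mock mode can be imagined as a stand-in client that produces plausible evaluations without calling a model. The sketch below is purely illustrative (the class name and word-count heuristic are assumptions, not the system's actual logic):

```python
class MockLLMClient:
    """Simulates answer evaluation for tests without an LLM connection.

    Hypothetical heuristic: longer, more specific answers receive a
    higher mock clarity score; no real model is involved.
    """

    def evaluate(self, answer: str) -> dict:
        words = len(answer.split())
        score = min(1.0, 0.2 + 0.15 * words)
        return {"clarity_score": round(score, 2), "needs_followup": score < 0.7}

client = MockLLMClient()
print(client.evaluate("Cloud"))
print(client.evaluate("Office 365 for emails and OneDrive for documents"))
```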

User interface#

The system offers six specialised tabs:

  • Experimentation playground: Interactive chat for real-time testing with live feedback
  • Prompt engineering: Editing and optimisation of LLM prompt templates
  • Batch testing: Systematic testing with CSV test data
  • Question Editor: Management of test questions and help texts
  • Session Demo: End-to-end demonstration for stakeholders
  • Performance Tab: Live monitoring of system performance

3. Operation#

Working with the Experiment Playground#

  1. Set parameters:
     • Clarity threshold: Threshold value for follow-up questions (0.0–1.0, default: 0.7)
     • Max follow-ups: Maximum number of follow-up questions (1–3, default: 1)
     • Temperature: LLM creativity (0.0–1.0, default: 0.1)
  2. Start chat session:
     • Enter a question or use a predefined test question
     • Click ‘Start chat’
  3. Enter answers:
     • Enter your own answers in the text field
     • Alternatively: use the quick test buttons (‘Cloud’, ‘Office 365’, etc.)
     • Submit with ‘Send’ or the Enter key
  4. Observe results:
     • Chat history shows questions and follow-up questions
     • Live results show evaluation details
     • Final answer display shows structuring in real time
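The three playground parameters map naturally onto a small configuration object. This is a sketch only; the class name and validation rules are assumptions based on the ranges documented above:

```python
from dataclasses import dataclass

@dataclass
class PlaygroundConfig:
    """Hypothetical container for the playground parameters described above."""
    clarity_threshold: float = 0.7   # follow-up triggered below this score (0.0-1.0)
    max_followups: int = 1           # maximum follow-up questions (1-3)
    temperature: float = 0.1         # LLM creativity (0.0-1.0)

    def __post_init__(self) -> None:
        if not 0.0 <= self.clarity_threshold <= 1.0:
            raise ValueError("clarity_threshold must be between 0.0 and 1.0")
        if not 1 <= self.max_followups <= 3:
            raise ValueError("max_followups must be between 1 and 3")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")

config = PlaygroundConfig()
print(config)
```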

Perform batch testing#

  1. Prepare test data in CSV format:

     question,answer,expected_followup
     "Cloud technologies?","Cloud",true
     "Cloud technologies?","Office 365",false

  2. Insert the test data into the input field
  3. Click ‘Start batch test’
  4. Analyse results:
     • Summary shows success rate and metrics
     • Detailed table lists each individual test
     • Symbols: ✅ (successful), ⏰ (timeout), ❌ (error)
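Reading this CSV format boils down to standard CSV parsing plus converting the `expected_followup` column to a boolean. A minimal sketch (the function name is an assumption; the actual parser may differ):

```python
import csv
import io

csv_text = """question,answer,expected_followup
"Cloud technologies?","Cloud",true
"Cloud technologies?","Office 365",false
"""

def load_test_cases(text: str) -> list[dict]:
    """Parse batch-test CSV: header row required, at least 3 columns per row."""
    cases = []
    for row in csv.DictReader(io.StringIO(text)):
        if len(row) < 3:
            raise ValueError("each row needs at least 3 columns")
        # The expected_followup column arrives as text; normalise it to a bool.
        row["expected_followup"] = row["expected_followup"].strip().lower() == "true"
        cases.append(row)
    return cases

for case in load_test_cases(csv_text):
    print(case)
```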

Session demo#

  1. Start the demo in the session demo tab
  2. Answer questions one after the other:
     • System automatically presents the next question
     • Try vague (‘Cloud’) and specific answers
     • Observe live evaluations
  3. Evaluate results:
     • Final answer overview shows all structured answers
     • Demo statistics show overall performance
     • System recommendation evaluates production readiness

Important notes#

  • Maximum follow-ups: The system respects the configured max follow-ups and then stops asking further follow-up questions
  • Timeout values: LLM queries have a 15-second timeout for evaluations and a 10-second timeout for queries
  • Performance data: Automatically deleted after 30 minutes
  • CSV format: Header row required, at least 3 columns
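The documented timeouts (15 s for evaluations, 10 s for queries) can be enforced by running each LLM call with a bounded wait. The sketch below uses a thread pool; the real system may use async I/O instead, and the function names are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

EVALUATION_TIMEOUT = 15  # seconds, as documented above
QUERY_TIMEOUT = 10

def call_with_timeout(fn, timeout: float, *args):
    """Run a call in a worker thread and stop waiting after `timeout` seconds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout)
        except FuturesTimeout:
            return {"status": "timeout"}  # shown as ⏰ in the batch-test table

result = call_with_timeout(lambda q: {"status": "ok", "question": q},
                           QUERY_TIMEOUT, "Which cloud technologies?")
print(result)
```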

4. Application example#

Scenario: IT infrastructure survey#

Initial situation: You want to understand which cloud services are used in your organisation. The survey includes the question: ‘Which cloud technologies do you mainly use?’

Implementation:

  1. System starts chat with the question
  2. Participant replies: ‘Cloud’

Agent evaluation:

Clarity Score: 0.30
Needs Followup: Yes
Problem Areas: vague_terminology, missing_specificity

Agent query: ‘Which specific cloud services do you mean? For example, Office 365, AWS S3, Google Workspace or Azure?’

  3. Participant replies: ‘Office 365 for emails and OneDrive for documents’

Agent assessment:

Clarity Score: 0.92
Needs Followup: No
✅ Answer accepted

Final structured answer:

Original: ‘Cloud’
After follow-up: ‘Office 365 for emails and OneDrive for documents’
Structured: ‘Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)’
Category: Microsoft Services
Confidence: 0.92
Clustering quality: High

Result: The originally unusable answer ‘Cloud’ was transformed into a precise, clusterable answer through intelligent follow-up questions.
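The final structured answer in this example can be pictured as a simple record. The field names below mirror the labels shown above but are otherwise hypothetical:

```python
from dataclasses import dataclass

@dataclass
class StructuredAnswer:
    """Hypothetical shape of a final structured answer (field names assumed)."""
    original: str
    after_followup: str
    structured: str
    category: str
    confidence: float
    clustering_quality: str

answer = StructuredAnswer(
    original="Cloud",
    after_followup="Office 365 for emails and OneDrive for documents",
    structured="Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)",
    category="Microsoft Services",
    confidence=0.92,
    clustering_quality="high",
)
print(answer.category, answer.confidence)
```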

5. Recommendations for efficient use#

Best practices#

  • Set the clarity threshold optimally: A value between 0.6 and 0.8 balances follow-up frequency against acceptance
  • Mock mode for development: Use mock mode for quick iterations and testing
  • Batch testing before productive use: Validate your prompts with representative test data
  • Enable performance monitoring: Regularly monitor LLM response times
  • Customise prompt templates: Adapt the prompts to your specific domain
  • Limit max follow-ups: Keep the number at 1–2 for a better participant experience
  • Use session demo: Show stakeholders the live functionality
  • Carefully formulate test questions: Create questions that provoke different types of answers

Tips for optimal results#

  • Question wording: Formulate questions that are open enough to elicit different types of responses
  • Help texts: Define clear examples in the question editor
  • Categories: Use the category suggestions for later clustering
  • History analysis: Check the response development in the live display
  • Regular updates: Update prompts based on batch test results

6. System limitations#

Functional limitations#

  • No in-depth semantic analysis: The system primarily evaluates specificity, not content accuracy
  • Limited follow-up questions: The maximum number is configurable, but practically limited to 1–3
  • No multilingualism: System is optimised for German-language responses
  • No audio/video processing: Only text-based inputs are supported
  • No automatic response validation: System does not check for truthfulness or plausibility

Technical limitations#

  • LLM dependency: Quality of results depends on the LLM model used
  • Processing times: Average of 2–5 seconds per response evaluation
  • Token limits: Max. 500 tokens for LLM responses (configurable)
  • Timeout thresholds: 15 seconds for evaluations, 10 seconds for queries, 10 seconds for structuring
  • Concurrency: One request at a time per session
  • Storage: Performance data is automatically deleted after 30 minutes
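The 30-minute retention window for performance data amounts to pruning records older than the cutoff. A minimal sketch, assuming records are dicts with a Unix timestamp (the field names are hypothetical):

```python
import time

RETENTION_SECONDS = 30 * 60  # performance data is kept for 30 minutes

def prune_records(records: list[dict], now: float) -> list[dict]:
    """Drop performance records older than the retention window."""
    return [r for r in records if now - r["timestamp"] <= RETENTION_SECONDS]

now = time.time()
records = [
    {"timestamp": now - 10, "latency_s": 2.4},        # fresh, kept
    {"timestamp": now - 45 * 60, "latency_s": 3.1},   # 45 minutes old, dropped
]
print(len(prune_records(records, now)))  # 1
```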

Contextual limitations#

  • Mock mode: Simulates only basic behaviour, no real intelligence
  • Clustering suggestions: Are hints, not guaranteed classifications
  • Confidence values: Based on heuristic evaluations, not statistical validation