# User documentation: AI Agent Survey System MVP

## 1. Intended use

The AI Agent Survey System is an intelligent tool for conducting and evaluating open-ended surveys. It automatically recognises vague or unspecific answers and asks targeted follow-up questions in order to obtain high-quality, structured data for statistical analysis.

**Basic principle:** The system scores the clarity of each answer on a scale from 0 to 1 and decides on its own whether a follow-up question is needed. All responses are prepared for automatic clustering and remain traceable, from the original vague response to the final structured form.
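The decision described above can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the function name `needs_followup` is hypothetical, and treating a score exactly equal to the threshold as clear enough is an assumption.

```python
def needs_followup(clarity_score: float, threshold: float = 0.7) -> bool:
    """Return True if the answer is too vague and a follow-up should be asked.

    Hypothetical helper: the name and the boundary handling are assumptions.
    """
    if not 0.0 <= clarity_score <= 1.0:
        raise ValueError("clarity_score must lie between 0 and 1")
    # Assumption: a score equal to the threshold counts as clear enough.
    return clarity_score < threshold


if __name__ == "__main__":
    for score in (0.30, 0.92):
        print(score, needs_followup(score))
```

With the default threshold of 0.7, a score of 0.30 triggers a follow-up question and a score of 0.92 does not.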

## 2. Range of functions

### Core functions

- **Intelligent response evaluation:** Automatic evaluation of response clarity against a configurable threshold (default: 0.7)
- **Adaptive follow-up questions:** Generation of context-specific follow-up questions when an answer is unclear
- **Answer structuring:** Preparation of final answers with categorisation and clustering suitability
- **Live tracking:** Visualisation of how an answer develops from original to structured
- **Session management:** Complete management of survey runs with history
- **Performance monitoring:** Detailed recording of processing times and system health

### Operating modes

- **Production mode:** Connection to real LLM APIs (e.g. Mistral, OpenAI-compatible)
- **Mock mode:** Simulation for tests without an LLM connection
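One common way to support two such modes is to hide the evaluator behind a shared interface and swap in a deterministic stand-in for tests. The sketch below assumes this design. `ClarityEvaluator`, `MockEvaluator` and `make_evaluator` are hypothetical names, and the word-count heuristic is only a placeholder, not the system's mock logic.

```python
from typing import Protocol


class ClarityEvaluator(Protocol):
    """Hypothetical interface shared by the real and the mock evaluator."""

    def evaluate(self, question: str, answer: str) -> float: ...


class MockEvaluator:
    """Deterministic stand-in that needs no LLM connection."""

    def evaluate(self, question: str, answer: str) -> float:
        # Placeholder heuristic (assumption): more words give a higher score.
        words = len(answer.split())
        return min(1.0, 0.3 * words)


def make_evaluator(mock: bool) -> ClarityEvaluator:
    """Pick the evaluator for the selected operating mode."""
    if mock:
        return MockEvaluator()
    # A production evaluator would wrap a real LLM client here.
    raise NotImplementedError("connect a real LLM client here")


if __name__ == "__main__":
    evaluator = make_evaluator(mock=True)
    print(evaluator.evaluate("Which cloud technologies?", "Cloud"))
```

Because the mock returns the same score for the same input, test runs stay reproducible and need no network access.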

### User interface

The system offers six specialised tabs:

- **Experiment Playground:** Interactive chat for real-time testing with live feedback
- **Prompt Engineering:** Editing and optimisation of LLM prompt templates
- **Batch Testing:** Systematic testing with CSV test data
- **Question Editor:** Management of test questions and help texts
- **Session Demo:** End-to-end demonstration for stakeholders
- **Performance:** Live monitoring of system performance

## 3. Operation

### Working with the Experiment Playground

1. **Set parameters:**
   - Clarity threshold: Score below which a follow-up question is asked (0.0–1.0, default: 0.7)
   - Max follow-ups: Maximum number of follow-up questions (1–3, default: 1)
   - Temperature: Sampling temperature of the LLM; higher values give more varied output (0.0–1.0, default: 0.1)

2. **Start chat session:**
   - Enter a question or use a predefined test question
   - Click ‘Start chat’

3. **Enter answers:**
   - Enter your own answers in the text field
   - Alternatively, use the quick test buttons (‘Cloud’, ‘Office 365’, etc.)
   - Submit with ‘Send’ or the Enter key

4. **Observe results:**
   - Chat history shows questions and follow-up questions
   - Live results show evaluation details
   - Final answer display shows structuring in real time
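The three playground parameters and their documented ranges can be captured in a small settings object that rejects out-of-range values. This is a sketch only: the class name `PlaygroundSettings` and its field names are hypothetical, while the ranges and defaults are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaygroundSettings:
    """Hypothetical container for the playground parameters."""

    clarity_threshold: float = 0.7  # 0.0-1.0
    max_followups: int = 1          # 1-3
    temperature: float = 0.1        # 0.0-1.0

    def __post_init__(self) -> None:
        # Validate each value against the range given in the documentation.
        if not 0.0 <= self.clarity_threshold <= 1.0:
            raise ValueError("clarity_threshold must be within 0.0-1.0")
        if not 1 <= self.max_followups <= 3:
            raise ValueError("max_followups must be within 1-3")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be within 0.0-1.0")


if __name__ == "__main__":
    print(PlaygroundSettings())
    print(PlaygroundSettings(clarity_threshold=0.6, max_followups=2))
```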

### Perform batch testing

1. **Prepare test data** in CSV format:

   ```csv
   question,answer,expected_followup
   "Cloud technologies?","Cloud",true
   "Cloud technologies?","Office 365",false
   ```

2. **Insert test data** into the input field

3. Click **‘Start batch test’**

4. **Analyse results:**
   - Summary shows success rate and metrics
   - Detailed table lists each individual test
   - Symbols: ✅ (successful), ⏰ (timeout), ❌ (error)
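The batch workflow above can be approximated offline with Python's standard `csv` module. The sketch below is illustrative, not the tool's own code: `parse_batch` and `summarise` are hypothetical names, `decide_followup` stands in for whatever evaluator is under test, and it assumes a test counts as successful when the follow-up decision matches `expected_followup`.

```python
import csv
import io
from collections import Counter

SAMPLE = '''question,answer,expected_followup
"Cloud technologies?","Cloud",true
"Cloud technologies?","Office 365",false
'''

REQUIRED = {"question", "answer", "expected_followup"}


def parse_batch(text):
    """Parse CSV test data; the header row with all three columns is required."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    if reader.fieldnames is None or not REQUIRED <= set(reader.fieldnames):
        raise ValueError(f"header row must contain {sorted(REQUIRED)}")
    return [
        {
            "question": row["question"],
            "answer": row["answer"],
            "expected_followup": str(row["expected_followup"] or "").strip().lower() == "true",
        }
        for row in reader
    ]


def summarise(cases, decide_followup):
    """Run decide_followup(question, answer) -> bool per case and tally outcomes."""
    counts = Counter()
    for case in cases:
        try:
            actual = decide_followup(case["question"], case["answer"])
        except TimeoutError:
            counts["timeout"] += 1
            continue
        except Exception:
            counts["error"] += 1
            continue
        # Assumption: success means the decision matches the expectation.
        counts["passed" if actual == case["expected_followup"] else "mismatch"] += 1
    total = len(cases)
    return {
        "total": total,
        "passed": counts["passed"],
        "mismatch": counts["mismatch"],
        "timeout": counts["timeout"],
        "error": counts["error"],
        "success_rate": counts["passed"] / total if total else 0.0,
    }


if __name__ == "__main__":
    # Toy decision rule for demonstration: one-word answers need a follow-up.
    print(summarise(parse_batch(SAMPLE), lambda q, a: len(a.split()) < 2))
```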

### Session demo

1. **Start demo** in the Session Demo tab

2. **Answer questions one after the other:**
   - System automatically presents the next question
   - Try vague (‘cloud’) and specific answers
   - Observe live evaluations

3. **Evaluate results:**
   - Final answer overview shows all structured answers
   - Demo statistics show overall performance
   - System recommendation evaluates production readiness

### Important notes

- **Maximum follow-ups:** The system respects the configured max follow-ups and stops asking follow-up questions once that limit is reached
- **Timeout values:** LLM calls time out after 15 seconds for evaluations and after 10 seconds for follow-up questions
- **Performance data:** Automatically deleted after 30 minutes
- **CSV format:** Header row required, at least 3 columns
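Per-step timeouts like those noted above are often enforced with `asyncio.wait_for`. The following sketch assumes an asynchronous LLM client and assumes the 10-second limit applies to generating the follow-up question. `TIMEOUTS` and `call_with_timeout` are hypothetical names, not part of the system.

```python
import asyncio

# Timeouts in seconds from the notes above; the keys are hypothetical step names.
TIMEOUTS = {"evaluation": 15.0, "followup": 10.0}


async def call_with_timeout(step, coro_factory, timeouts=TIMEOUTS):
    """Await coro_factory() and return (status, result).

    status is "ok" or "timeout"; the pending call is cancelled on timeout.
    """
    try:
        result = await asyncio.wait_for(coro_factory(), timeout=timeouts[step])
        return "ok", result
    except asyncio.TimeoutError:
        return "timeout", None


async def fake_evaluation():
    """Stand-in for a real LLM call (assumption for demonstration)."""
    await asyncio.sleep(0.01)
    return 0.92


if __name__ == "__main__":
    print(asyncio.run(call_with_timeout("evaluation", fake_evaluation)))
```

Returning a status instead of raising makes it straightforward to count a timed-out call separately from other errors.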

## 4. Application example

### Scenario: IT infrastructure survey

**Initial situation:** You want to understand which cloud services are used in your organisation. The survey includes the question: ‘Which cloud technologies do you mainly use?’

**Implementation:**

1. System starts chat with the question
2. Participant replies: ‘Cloud’

**Agent evaluation:**

- Clarity Score: 0.30
- Needs Followup: Yes
- Problem Areas: vague_terminology, missing_specificity

**Agent follow-up question:** ‘Which specific cloud services do you mean? For example, Office 365, AWS S3, Google Workspace or Azure?’

3. Participant replies: ‘Office 365 for emails and OneDrive for documents’

**Agent evaluation:**

- Clarity Score: 0.92
- Needs Followup: No
- ✅ Answer accepted

**Final structured answer:**

- Original: ‘Cloud’
- After follow-up: ‘Office 365 for emails and OneDrive for documents’
- Structured: ‘Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)’
- Category: Microsoft Services
- Confidence: 0.92
- Clustering quality: High

**Result:** The originally unusable answer ‘Cloud’ was turned into a precise, clusterable answer by a single targeted follow-up question.
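For downstream clustering it can help to keep the whole history of an answer in one record. The sketch below shows one possible shape, filled with the values from this example. The class `StructuredAnswer` and its field names are hypothetical and do not describe the system's actual export format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StructuredAnswer:
    """Hypothetical record tracing an answer from original to structured form."""

    original: str
    after_followup: str
    structured: str
    category: str
    confidence: float
    clustering_quality: str


record = StructuredAnswer(
    original="Cloud",
    after_followup="Office 365 for emails and OneDrive for documents",
    structured="Microsoft Cloud Services (Office 365 for email, OneDrive for file storage)",
    category="Microsoft Services",
    confidence=0.92,
    clustering_quality="High",
)

if __name__ == "__main__":
    # Serialise to JSON, e.g. for handing over to a clustering step.
    print(json.dumps(asdict(record), indent=2))
```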

## 5. Recommendations for efficient use

### Best practices

- **Choose the clarity threshold deliberately:** A value between 0.6 and 0.8 balances the number of follow-up questions against accepting answers as given
- **Mock mode for development:** Use mock mode for quick iterations and testing
- **Batch testing before production use:** Validate your prompts with representative test data
- **Enable performance monitoring:** Monitor LLM response times regularly
- **Customise prompt templates:** Adapt the prompts to your specific domain
- **Limit max follow-ups:** Keep the number at 1–2 for a better participant experience
- **Use the Session Demo:** Show stakeholders the live functionality
- **Formulate test questions carefully:** Create questions that provoke different types of answers
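To get a feel for the threshold recommendation, one can check how many answers would trigger a follow-up at different threshold values. The sketch below uses made-up clarity scores purely for illustration. `followup_rate` is a hypothetical helper, and real scores would come from a batch test run.

```python
def followup_rate(scores, threshold):
    """Share of answers whose clarity score falls below the threshold."""
    if not scores:
        return 0.0
    return sum(score < threshold for score in scores) / len(scores)


# Made-up scores for illustration only (assumption, not measured data).
sample_scores = [0.30, 0.55, 0.65, 0.75, 0.92]

if __name__ == "__main__":
    for threshold in (0.6, 0.7, 0.8):
        rate = followup_rate(sample_scores, threshold)
        print(f"threshold={threshold:.1f} -> follow-up rate {rate:.0%}")
```

A higher threshold asks more participants to clarify, which improves data quality at the cost of a longer survey.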

### Tips for optimal results

- **Question wording:** Formulate questions that are open enough to elicit different types of responses
- **Help texts:** Define clear examples in the Question Editor
- **Categories:** Use the category suggestions for later clustering
- **History analysis:** Check how responses develop in the live display
- **Regular updates:** Update prompts based on batch test results

## 6. System limitations

### Functional limitations

- **No in-depth semantic analysis:** The system primarily evaluates specificity, not content accuracy
- **Limited follow-up questions:** The maximum number is configurable but in practice limited to 1–3
- **No multilingualism:** The system is optimised for German-language responses
- **No audio/video processing:** Only text-based inputs are supported
- **No automatic response validation:** The system does not check answers for truthfulness or plausibility

### Technical limitations

- **LLM dependency:** Quality of results depends on the LLM used
- **Processing times:** Average of 2–5 seconds per response evaluation
- **Token limits:** Max. 500 tokens for LLM responses (configurable)
- **Timeout thresholds:** 15 seconds for evaluations, 10 seconds for follow-up questions, 10 seconds for structuring
- **Concurrency:** One request at a time per session
- **Storage:** Performance data is automatically deleted after 30 minutes
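A 30-minute retention window like the one described for performance data can be implemented by pruning old entries on every write and read. The sketch below is one possible approach, not the system's storage layer. `PerformanceLog` is a hypothetical class, and the injectable `clock` exists only to make the behaviour testable.

```python
import time
from collections import deque
from typing import Optional

RETENTION_SECONDS = 30 * 60  # 30 minutes, as stated above


class PerformanceLog:
    """Hypothetical in-memory log that forgets entries older than the retention window."""

    def __init__(self, retention=RETENTION_SECONDS, clock=time.monotonic):
        self._retention = retention
        self._clock = clock
        self._entries = deque()  # (timestamp, duration_seconds), oldest first

    def record(self, duration_seconds: float) -> None:
        self._entries.append((self._clock(), duration_seconds))
        self._prune()

    def _prune(self) -> None:
        # Drop entries from the left while they are older than the cutoff.
        cutoff = self._clock() - self._retention
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()

    def average(self) -> Optional[float]:
        """Mean duration of the retained entries, or None if there are none."""
        self._prune()
        if not self._entries:
            return None
        return sum(duration for _, duration in self._entries) / len(self._entries)


if __name__ == "__main__":
    log = PerformanceLog()
    log.record(2.4)
    log.record(4.1)
    print(log.average())
```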

### Contextual limitations

- **Mock mode:** Simulates only basic behaviour, no real intelligence
- **Clustering suggestions:** These are hints, not guaranteed classifications
- **Confidence values:** Based on heuristic evaluations, not statistical validation