KI-Umfrage
KI-Umfrage is an experimental tool that explores how open-ended surveys can be improved by using a large language model. The application processes free-form answers through a multi-stage LLM pipeline: answers are first evaluated for clarity, refined through automatically generated follow-up questions where necessary, and finally structured for downstream evaluation. The central question is whether a language model can reliably detect vague or unspecific input and clarify it through targeted follow-ups to the point where the resulting answers become suitable for clustering.
At a glance
- Pose open-ended survey questions and collect the answers through an LLM-driven dialogue; vague answers are detected automatically and clarified.
- Answers are scored against an adjustable clarity threshold; only answers below the threshold trigger follow-up questions.
- Complete conversations are condensed into a single, structured final answer prepared for statistical evaluation and clustering.
- Prompt templates can be edited, compared, and tested against sample answers in a dedicated UI — without code changes.
- Larger answer datasets can be loaded as CSV and run against the pipeline in batch mode, including a comparison with expected behavior.
- A guided demo session walks through the full flow from the first question to the final structured answer.
- Per-operation runtimes and processing phases are recorded and shown in a dedicated performance tab.
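
The threshold gate and the batch comparison described above can be sketched in a few lines. This is a minimal illustration, not the application's actual code: `GateSettings`, `needs_followup`, `batch_followup_report`, the default values, and the CSV column names are all assumptions made for the example. For simplicity the clarity score is read from the CSV; in the real pipeline it would come from the LLM evaluation step.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class GateSettings:
    # Hypothetical names and defaults; the real configuration keys
    # are not given in this document.
    clarity_threshold: float = 0.7
    max_followups: int = 2


def needs_followup(score: float, followups_asked: int, settings: GateSettings) -> bool:
    """Ask a follow-up only if the answer scores below the threshold
    and the per-question follow-up budget is not yet used up."""
    return score < settings.clarity_threshold and followups_asked < settings.max_followups


def batch_followup_report(csv_text: str, settings: GateSettings) -> dict:
    """Compare the gate decision with the expected behavior for each row.

    Assumed columns: answer, score, expect_followup ("yes" or "no").
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    triggered = 0
    mismatches = []
    for row in rows:
        decision = needs_followup(float(row["score"]), 0, settings)
        triggered += int(decision)
        if decision != (row["expect_followup"] == "yes"):
            mismatches.append(row["answer"])
    rate = triggered / len(rows) if rows else 0.0
    return {"followup_rate": rate, "mismatches": mismatches}


SAMPLE_CSV = """answer,score,expect_followup
"It was fine.",0.2,yes
"The login page timed out twice on mobile.",0.9,no
"Better support.",0.4,no
"""

report = batch_followup_report(SAMPLE_CSV, GateSettings())
print(report["followup_rate"], report["mismatches"])
```

With the sample data, two of three answers fall below the assumed threshold of 0.7, and one of them ("Better support.") disagrees with its expected behavior, which is the kind of discrepancy a batch run is meant to surface.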
Highlights
Unlike a direct prompt to an LLM or a simple script, the application does not handle answers in a single step: it sharpens them through a dialogue and then transforms them into a structured, evaluable form.
- Multi-stage LLM pipeline. Answers pass through three separate LLM calls: a clarity evaluation, an optional follow-up generation step, and a final structuring of the entire conversation. Each stage uses its own task-specific prompt template.
- Automatic detection of vague answers. The model assigns a clarity score between 0 and 1 and labels the problem types (e.g. "vague terminology", "missing specificity", "too short"). The threshold for triggering a follow-up is configurable.
- Targeted follow-ups instead of fixed scripts. Follow-up questions are generated at runtime from the original question, the actual answer, and the identified problem types, making them more specific than predefined fallback questions.
- Structuring for clustering. The full conversation is condensed into a single, dense final answer enriched with a main category, a list of specific items, and a confidence value. This final answer is the basis for evaluation across many sessions.
- Live prompt engineering. All prompt templates can be loaded into the UI, edited, compared against sample answers, and verified against the raw LLM output.
- Batch test mode. Pre-built answer datasets (CSV) can be run against the pipeline to systematically observe scoring behavior and follow-up rate.
- Configurable evaluation strictness. Clarity threshold, maximum follow-ups per question, and follow-up aggressiveness can be adjusted in the UI without restarting.
- OpenAI-compatible with a configurable endpoint. The application talks to any OpenAI-compatible chat-completion API; a mock mode allows development and testing without a reachable model.
- Reproducible results. Sessions, conversations, and final answers are persisted and contain not only the structured outcome but also the raw answer, conversation history, and evaluation metadata.
- One external dependency. The only external source is an OpenAI-compatible LLM endpoint; configuration and survey definitions are loaded locally from YAML files.
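
To make the three stages concrete, the following sketch wires a clarity evaluation, an optional follow-up, and a final structuring step around a mock model. It is an illustration under assumptions: the function names, the `EVALUATE`/`FOLLOWUP`/`STRUCTURE` prompt prefixes, the JSON field names, and the mock's word-count heuristic are invented for the example and do not reflect the application's real prompt templates or data model. Only the overall shape (score between 0 and 1, problem types, main category, item list, confidence) follows the description above.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# An "LLM" here is any function that maps a prompt string to a response string.
LLM = Callable[[str], str]


@dataclass
class Evaluation:
    clarity_score: float  # between 0 and 1
    problem_types: list[str] = field(default_factory=list)


@dataclass
class FinalAnswer:
    text: str
    main_category: str
    items: list[str]
    confidence: float


def mock_llm(prompt: str) -> str:
    """Stand-in for a real endpoint (mock mode): canned output per stage."""
    if prompt.startswith("EVALUATE"):
        answer = prompt.split("ANSWER:", 1)[1].strip()
        # Assumed heuristic: treat answers with fewer than five words as vague.
        if len(answer.split()) < 5:
            return json.dumps({"clarity_score": 0.3, "problem_types": ["too short"]})
        return json.dumps({"clarity_score": 0.9, "problem_types": []})
    if prompt.startswith("FOLLOWUP"):
        return "Could you describe a concrete example?"
    return json.dumps({
        "text": "Checkout crashes on mobile when paying by card.",
        "main_category": "usability",
        "items": ["checkout", "mobile", "card payment"],
        "confidence": 0.8,
    })


def evaluate(llm: LLM, question: str, answer: str) -> Evaluation:
    data = json.loads(llm(f"EVALUATE\nQUESTION: {question}\nANSWER: {answer}"))
    return Evaluation(float(data["clarity_score"]), list(data["problem_types"]))


def generate_followup(llm: LLM, question: str, answer: str, problems: list[str]) -> str:
    return llm(f"FOLLOWUP\nQUESTION: {question}\nANSWER: {answer}\nPROBLEMS: {', '.join(problems)}")


def structure(llm: LLM, history: list[tuple[str, str]]) -> FinalAnswer:
    transcript = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return FinalAnswer(**json.loads(llm(f"STRUCTURE\n{transcript}")))


def run_pipeline(question, first_answer, ask_user, llm=mock_llm,
                 threshold=0.7, max_followups=2):
    """Evaluate, follow up while the answer is unclear, then structure."""
    history = [(question, first_answer)]
    answer = first_answer
    for _ in range(max_followups):
        evaluation = evaluate(llm, question, answer)
        if evaluation.clarity_score >= threshold:
            break
        followup = generate_followup(llm, question, answer, evaluation.problem_types)
        answer = ask_user(followup)
        history.append((followup, answer))
    return history, structure(llm, history)


history, final = run_pipeline(
    "What should we improve?",
    "It was bad.",
    ask_user=lambda q: "The checkout page crashed on my phone when paying by card.",
)
print(len(history), final.main_category, final.items)
```

Passing the model as a plain callable keeps the pipeline independent of the endpoint: the same `run_pipeline` would work with a function that calls an OpenAI-compatible chat-completion API, while `mock_llm` allows the flow to be exercised without a reachable model.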