KI-Umfrage¶

KI-Umfrage is an experimental tool that explores how open-ended surveys can be improved by using a large language model. The application processes free-form answers through a multi-stage LLM pipeline: answers are first evaluated for clarity, refined through automatically generated follow-up questions where necessary, and finally structured for downstream evaluation. The central question is whether a language model can reliably detect vague or unspecific input and clarify it through targeted follow-ups to the point where the resulting answers become suitable for clustering.

At a glance¶

Pose open-ended survey questions and have them answered through an LLM-driven dialogue — vague answers are detected automatically and clarified.
Answers are scored against an adjustable clarity threshold; only answers below the threshold trigger follow-up questions.
Complete conversations are condensed into a single, structured final answer prepared for statistical evaluation and clustering.
Prompt templates can be edited, compared, and tested against sample answers in a dedicated UI — without code changes.
Larger answer datasets can be loaded as CSV and run against the pipeline in batch mode, including a comparison with expected behavior.
A guided demo session walks through the full flow from the first question to the final structured answer.
Per-operation runtimes and processing phases are recorded and shown in a dedicated performance tab.

Highlights¶

The application differs from a direct prompt to an LLM or a simple script in that answers are not handled in a single step but are sharpened through a dialogue and then transformed into a structured, evaluable form.

Multi-stage LLM pipeline. Answers pass through three separate LLM calls: a clarity evaluation, an optional follow-up generation step, and a final structuring of the entire conversation. Each stage uses its own task-specific prompt template.
Automatic detection of vague answers. The model assigns a clarity score between 0 and 1 and labels the problem types (e.g. "vague terminology", "missing specificity", "too short"). The threshold for triggering a follow-up is configurable.
Targeted follow-ups instead of fixed scripts. Follow-up questions are generated at runtime from the original question, the actual answer, and the identified problem types, making them more specific than predefined fallback questions.
Structuring for clustering. The full conversation is condensed into a single, dense final answer enriched with a main category, a list of specific items, and a confidence value — the basis for evaluation across many sessions.
Live prompt engineering. All prompt templates can be loaded into the UI, edited, compared against sample answers, and verified against the raw LLM output.
Batch test mode. Pre-built answer datasets (CSV) can be run against the pipeline to systematically observe scoring behavior and follow-up rate.
Configurable evaluation strictness. Clarity threshold, maximum follow-ups per question, and follow-up aggressiveness can be adjusted in the UI without restarting.
OpenAI-compatible with a configurable endpoint. The application talks to any OpenAI-compatible chat-completion API; a mock mode allows development and testing without a reachable model.
Reproducible results. Sessions, conversations, and final answers are persisted and contain not only the structured outcome but also the raw answer, conversation history, and evaluation metadata.
Connection to 1 external source: an OpenAI-compatible LLM endpoint. Configuration and survey definitions are loaded locally from YAML files.