Features¶

KI-Umfrage provides a complete workspace for the experimental design, execution, and evaluation of LLM-supported survey dialogues. The feature set covers the interactive answering of individual questions, the refinement of vague input, the structuring of results, the editing of underlying prompts, and batch evaluation.

Use cases¶

Concept testing on sample surveys. With the bundled demo questions (cloud usage, IT security posture), it is possible to follow how the system recognizes a vague answer such as "cloud" or "good", formulates a specific follow-up question, and finally structures the combined answer.
Comparing prompt variants. Two formulations of the same evaluation or follow-up prompt can be run against identical sample answers, making the impact of individual changes on clarity score and follow-up behavior immediately visible.
Systematic validation against an answer dataset. A prepared CSV file with question–answer pairs and expected follow-up behavior can be processed in batch mode to check how often the pipeline matches the expected outcome.
Fine-tuning evaluation strictness. By varying clarity threshold, maximum follow-up count, and LLM temperature, it becomes observable how these parameters change the system's reaction to typical answer patterns.
Stakeholder demos. A dedicated demo mode runs through a two-question mini-survey chronologically and visualizes — after each question — the clarity score, any follow-ups that occurred, and the final structured answer.
Performance observation. Processing times for each pipeline stage are captured and can be inspected during testing to identify bottlenecks (such as slow LLM responses).

At a glance¶

Six thematically separated tabs: playground, prompt engineering, batch testing, question editor, session demo, performance.
Per-round decision driven by a configurable clarity threshold.
A configurable maximum number of follow-ups per main question; each follow-up carries its own reasoning and confidence value.
Live-editable prompt templates for evaluation, follow-up, structuring, and the handling of irrelevant answers.
Import via YAML (configuration, survey definition) and CSV (batch tests); session export as JSON.
Mock mode for development and testing without a reachable LLM.
Sessions including the full conversation history are persisted.

Connectors and data sources¶

External connector:

OpenAI-compatible chat-completion API. Through a configurable endpoint (api_base, api_key, model), any model implementing the OpenAI chat API can be addressed. The application requests structured JSON responses and falls back to rule-based replacement answers on timeout or error to keep the flow stable.

Internal components:

YAML-based survey and system configuration. Evaluation thresholds, maximum follow-up count, LLM connection details, and the survey itself (questions, expected answer patterns, help texts) are loaded from YAML files; environment variables can override individual values.
Local result storage. Completed sessions are stored in a configurable results directory along with the conversation history, evaluation metadata, and final structured answers.

Import and export formats¶

YAML (import). Split into a central system configuration and a survey definition. The latter contains the questions, their context information, and example answers in the categories "vague" and "clear", from which the evaluation strictness used in prompting is derived.
CSV (import, batch test). Columns for question, answer, and expected follow-up behavior. The batch run compares the actual pipeline reaction with the expected outcome.
JSON (export). Complete session results with question, answer, and evaluation data, conversation history, cluster suggestion, confidence value, and processing metadata.

Quality assurance features¶

Multi-stage processing. Answers are not handled in a single LLM call but are evaluated, optionally clarified, and finally structured in separate steps. Intermediate results are individually inspectable and correctable.
Clarity score with threshold logic. Each answer is scored between 0 and 1; only values below the configured threshold (default 0.7) trigger a follow-up.
Validation of evaluation output. The structured evaluation fields (clarity score, follow-up necessity, problem types, suggested clarifications) are checked against allowed value ranges; invalid entries are reset to defaults.
Confidence value for the final answer. The structured final answer carries its own confidence value; if it falls below a threshold, the answer is automatically flagged for manual review.
Limit on follow-up depth. The maximum number of follow-ups per main question is configurable to prevent infinite loops on repeatedly vague answers.
Robust fallback paths. On timeout, connection failure, or invalid JSON, every pipeline stage falls back to a rule-based replacement (heuristic clarity scoring, predefined follow-up patterns, local structuring), so that a traceable result is produced even in failure cases.
Traceability. Every session is recorded together with its individual interactions, the model's reasoning text, and processing times, and remains available for later evaluation.

Additional features¶

Live prompt engineering. All prompt templates can be loaded, edited, and tested against sample answers from within the UI; the raw LLM response can be inspected for diagnostic purposes.
Question editor. Question texts and their associated help hints can be adjusted without editing YAML files directly.
Performance monitoring. Runtimes for each operation (evaluation, follow-up generation, structuring, end-to-end run) are collected and presented in an aggregated overview; target values are referenced.
Configurable controls. Clarity threshold, maximum follow-ups, follow-up aggressiveness, and LLM temperature can be changed per session.
Mock mode. Toggled via an environment variable; produces canned responses in the pipeline's structure so the UI remains operable without an LLM.