LLM-supported coding for multi-stage workflows: Detailed documentation of a survey tool experiment#

Summary#

This article documents a learning experiment to develop an AI-supported survey tool using large language models. The project explored the methodological limitations and possibilities of LLM coding in complex, multi-stage workflows. The key finding: when developing such systems, understanding the architecture is the limiting factor, not the code generation itself. The article offers detailed insights into technical implementation, the development process and transferable methodological learnings.

1. Experimental context#

1.1 Motivation and learning objectives#

The project arose in the context of planned surveys and the question: To what extent can LLM tools be used to create survey-like systems that ensure that responses are actually usable? The primary learning objective was the methodological exploration of LLM coding in more complex architectures, specifically:

  • How do you orchestrate multi-stage workflows with LLMs?
  • Which architectural patterns are suitable for agent-based systems?
  • What are the practical limits of automatic criteria derivation?
  • How can specifications for such tasks be optimally structured?

The tool itself was explicitly designed as a learning vehicle, not as a production system. The insights gained were to feed into an improved follow-up project.

1.2 Positioning in the learning curve#

The experiment took place in the middle of a longer learning phase with LLM-supported coding. Basic knowledge of prompt engineering and simple tool development already existed, but there was no experience with complex agent workflow interactions. This positioning was deliberately chosen: the project was intended to explore the limits of previous understanding.

2. Technical implementation#

2.1 Architecture overview#

The system implements a single-agent approach with clearly separated responsibilities. Contrary to the initial assumption that several separate agents would be needed, it became clear during development that a single SurveyAgent with specialised methods offers the appropriate level of complexity:

Core components:

  • SurveyAgent (src/agent.py): Main component with three core functions
    • evaluate_answer(): Evaluates answer clarity and specificity (clarity score 0-1)
    • generate_followup(): Generates context-specific follow-up questions for unclear answers
    • structure_final_answer(): Prepares final answers for clustering
  • ConversationManager (src/agent.py): Orchestrates the survey workflow
    • Manages question sequences
    • Controls follow-up logic
    • Tracks session state
  • LLMClient (src/llm_client.py): OpenAI-compatible API integration
    • Supports local and remote LLMs
    • Robust error handling with retry logic
    • Mock client for testing without an LLM
  • PromptManager (src/prompts.py): Central prompt management
    • Templates for all agent functions
    • Structured JSON output definitions
    • Versioning and fallback mechanisms
  • Interface (interface/): Modular Gradio UI
    • Experimental playground for live testing
    • Batch testing functionality
    • Performance monitoring
    • Demo mode for stakeholders
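To make the component split concrete, here is a minimal sketch of how the SurveyAgent's three core functions might hang together. The class and method names come from the list above; the prompt strings, the `complete()` method, and the `MockLLMClient` stand-in are illustrative assumptions, not the actual implementation.

```python
import json
from dataclasses import dataclass


@dataclass
class MockLLMClient:
    """Hypothetical stand-in for the LLMClient component; returns a canned reply."""
    reply: str = '{"clarity_score": 0.3, "needs_followup": true}'

    def complete(self, prompt: str) -> str:
        return self.reply


class SurveyAgent:
    """Single agent exposing the three specialised methods listed above."""

    def __init__(self, llm_client) -> None:
        self.llm = llm_client

    def evaluate_answer(self, question: str, answer: str) -> dict:
        # The real system validates this JSON with Pydantic models (see below).
        raw = self.llm.complete(f"Evaluate clarity.\nQ: {question}\nA: {answer}")
        return json.loads(raw)

    def generate_followup(self, question: str, answer: str) -> str:
        return self.llm.complete(f"Ask a clarifying follow-up to: {answer}")

    def structure_final_answer(self, answer: str) -> str:
        return self.llm.complete(f"Normalise for clustering: {answer}")
```

Swapping the mock for a real client leaves the agent unchanged, which is what makes the mock-based testing mentioned above possible.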

2.2 Technology stack#

The choice of technology stack was based on existing experience in order to enable focus on the actual learning objectives:

  • Python 3.x: Main language
  • Gradio 4.x: UI framework (simple, fast, suitable for prototypes)
  • Pydantic 2.x: Data validation and type safety
  • asyncio: Asynchronous LLM requests and timeout handling
  • Mistral Small 2506: Runtime LLM for evaluations and queries
  • OpenAI-compatible API client: Flexibility for different LLM backends

This combination made it possible to create a functional system within a short time (4 hours of active development) without wasting time learning new technology.

2.3 Modularity and structuring#

The modular structure was defined in the specification from the outset. This proved to be crucial for maintainability:

Project structure (4,000 lines, 15 files):
├── src/
│   ├── agent.py           # Agent core logic (500 lines)
│   ├── llm_client.py      # LLM integration (300 lines)
│   ├── prompts.py         # Prompt management (600 lines)
│   ├── models.py          # Pydantic data models (400 lines)
│   └── config_loader.py   # Configuration (200 lines)
├── interface/
│   ├── gradio_interface.py    # UI logic (800 lines)
│   ├── gradio_handlers.py     # Event handler (600 lines)
│   ├── gradio_tabs.py         # UI layout (500 lines)
│   └── timing_metrics.py      # Performance monitoring (400 lines)
└── config/
    ├── config.yaml        # System configuration
    └── survey_mvp.yaml    # Survey definitions

The clear separation between data models, business logic and presentation layer enabled iterative development without major refactoring.

3. The development process#

3.1 Specification phase (2 hours)#

The initial specification defined:

  • Basic requirements: follow-up questioning, response evaluation, clustering preparation
  • Rough architecture structure: agent-based approach, modularisation
  • Technology stack: Python, Gradio, Pydantic
  • Non-functional requirements: timeouts, error handling, performance monitoring

Key insight: The specification was deliberately kept at a conceptual level rather than as a detailed technical blueprint. The reason: With novel architectural patterns, there is a lack of understanding of optimal structuring. The specification defined the ‘what’ and ‘why,’ but left the ‘how exactly’ to the iterative process.

3.2 Development with LLM (4 hours over 3 days)#

Interaction pattern: The collaboration with the LLM followed a structured dialogue approach:

  1. Presentation of the specification and architecture requirements
  2. Discussion of possible implementation approaches
  3. Step-by-step implementation of the components
  4. Iterative refinement based on tests

Phases:

Phase 1 – Basic framework (1 hour):

  • LLM client integration with retry logic
  • Basic data models (question, answer evaluation, etc.)
  • Configuration system
  • Initial prompt templates

Phase 2 – Agent logic (1.5 hours):

  • Implementation of evaluate_answer()
  • Development of evaluation criteria
  • Initial tests with mock data
  • Challenge: The LLM proposed complex scoring mechanisms with weighted sub-scores; this had to be reduced to a simple 0-1 score evaluation.

Phase 3 - Workflow orchestration (1 hour):

  • ConversationManager implementation
  • State management between questions
  • Integration of follow-up logic
  • Challenge: Coordination between agent decisions and workflow control was not trivial. Several rounds of iteration were necessary to avoid race conditions.

Phase 4 - UI and refinement (0.5 hours):

  • Gradio interface setup
  • Performance monitoring
  • Error handling improvements
  • Challenge: Gradio-specific type format issues required wrapper functions.

3.3 Prompt engineering iterations (5-6 rounds)#

Developing consistent prompt templates for response evaluation required the most iterations:

Iteration 1-2: Initial prompts were too open-ended, leading to inconsistent evaluations.

Iteration 3-4: Adding examples of ‘vague’ vs. ‘clear’ responses significantly improved consistency. Structured JSON output definition became central.

Iteration 5-6: Fine-tuning of evaluation criteria. Explicit instructions against overly lenient evaluations were necessary.

Final template structure:

System prompt: Role, evaluation criteria, output format
User prompt: Question context, response to be evaluated, examples
Expected output: JSON with clarity_score, needs_followup, reasoning, problem_areas
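The three-part template structure above can be sketched as follows. The wording of the prompts is a hypothetical reconstruction, not the actual content of src/prompts.py; only the structure (system prompt, user prompt, expected JSON fields) follows the description.

```python
# Hypothetical templates following the system/user/output split described above.
SYSTEM_PROMPT = (
    "You are a survey answer evaluator.\n"
    "Respond ONLY with JSON: {\"clarity_score\": <float 0.0-1.0>, "
    "\"needs_followup\": <bool>, \"reasoning\": \"<string>\", "
    "\"problem_areas\": [\"<string>\"]}"
)

USER_PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Answer to evaluate: {answer}\n\n"
    "Examples of VAGUE answers: 'cloud', 'good', 'OK'.\n"
    "Examples of CLEAR answers: 'Office 365 for email', 'AWS S3 for backups'."
)


def build_messages(question: str, answer: str) -> list:
    """Assemble an OpenAI-style chat message list from the templates."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": USER_PROMPT_TEMPLATE.format(question=question, answer=answer)},
    ]
```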

3.4 Dealing with LLM overengineering#

A recurring pattern: The LLM tended towards complex solutions:

Example 1 – Evaluation logic: LLM suggestion: Multi-dimensional scoring with weightings for specificity, completeness, relevance, clusterability, separate sub-scores. Actually implemented: Simple 0-1 clarity score with boolean needs_followup.

Example 2 – State management: LLM proposal: Complex state machine pattern with explicit state transitions. Actually implemented: Simple session object with list of interactions.

Example 3 – Error handling: LLM suggestion: Hierarchical exception system with custom exceptions for each error source. Actually implemented: Robust try-catch blocks with fallback values.
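The "simple session object with a list of interactions" from Example 2 can be sketched like this; the field and method names are assumptions chosen for illustration, not the project's actual identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Interaction:
    question: str
    answer: str
    clarity_score: float
    was_followup: bool = False


@dataclass
class Session:
    """Flat session state: an ordered list of interactions, no state machine."""
    interactions: list = field(default_factory=list)

    def record(self, question: str, answer: str, score: float,
               followup: bool = False) -> None:
        self.interactions.append(Interaction(question, answer, score, followup))

    def followups_used(self) -> int:
        # Enough to enforce a max_followups limit without explicit state transitions.
        return sum(1 for i in self.interactions if i.was_followup)
```

An append-only list keeps the full history available for later clustering, whereas a state machine would have added transitions without adding information.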

4. Methodological findings#

4.1 Conceptual complexity as a limiting factor#

The key insight from this experiment: The challenge was not the amount of code (4,000 lines), but understanding the required architectural patterns.

Specifically: Without prior experience with agent workflow orchestration, it was difficult to specify:

  • How do the agent and ConversationManager interact optimally?
  • When should the agent make decisions, and when should the manager?
  • How can circular dependencies be prevented?
  • What state information needs to be stored where?

These questions could not be answered through intensive reflection, but only through practical experimentation. This leads to a fundamental principle:

It seems helpful to first familiarise yourself with the architectural patterns through exploratory prototypes and then specify and implement them in a targeted manner.

4.2 Optimal specification strategies#

This project has resulted in concrete recommendations for the specification of LLM-supported development projects:

What works:

  • Clear requirements at the conceptual level
  • Examples of expected behaviour
  • Non-functional requirements (timeouts, error handling)
  • Rough architecture structure with clear responsibilities
  • Explicit constraints (‘simplest solution’, ‘no premature abstraction’)

What does not work:

  • Detailed technical blueprints for unknown architectural patterns
  • Too open formulations without examples
  • Implicit expectations of code quality
  • Lack of guidelines against overengineering

Optimal approach: Iterative process consisting of rough specification, prototypical implementation, architecture learning and subsequent more precise specification for the production system.

4.3 Structured JSON outputs#

The consistent use of structured JSON responses was central to the functioning of the system:

Advantages:

  • Type-safe processing through Pydantic validation
  • Clear interfaces between components
  • Simple state management
  • Predictable data flows

Implementation:

from typing import List

from pydantic import BaseModel


class AnswerEvaluation(BaseModel):
    clarity_score: float  # 0.0-1.0
    needs_followup: bool
    reasoning: str
    problem_areas: List[str]
    suggested_clarifications: List[str]
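In the system itself, raw LLM output is validated against the Pydantic model; this dependency-free sketch shows the same guard pattern with stdlib json only, with fallback values of my choosing: malformed output must never crash the workflow.

```python
import json

# Neutral fallback evaluation, used whenever the LLM reply cannot be parsed.
FALLBACK = {
    "clarity_score": 0.5,
    "needs_followup": True,
    "reasoning": "fallback: could not parse LLM response",
    "problem_areas": [],
    "suggested_clarifications": [],
}

REQUIRED_KEYS = {"clarity_score", "needs_followup", "reasoning"}


def parse_evaluation(raw: str) -> dict:
    """Parse an evaluation reply; return the neutral fallback on any error."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK)
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return dict(FALLBACK)
    return data
```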

Critical aspect: JSON-driven systems can be prone to loops. Specifically observed:

  • Incorrect evaluation → Follow-up question → Same evaluation → Another follow-up question
  • Solution: Explicit max_followups limits, timeout mechanisms, fallback evaluations

4.4 Limitations of criteria derivation#

The biggest technical challenge was the automatic derivation of evaluation criteria for answer completeness:

Areas that work well:

  • Recognition of very vague responses (“cloud,” “good,” “OK”)
  • Recognition of very specific responses (“Office 365 for email and document processing”)
  • Evaluation of response length

Problematic areas:

  • Assessment of completeness (What is still missing?)
  • Context-dependent specificity (When is “AWS” specific enough?)
  • Ambiguous responses (Does “cloud” mean storage or software?)

Cause: The LLM does not have the full context of the survey objectives. It can recognize syntactic and superficial semantic patterns, but cannot assess whether a response is sufficient for the specific research question.

Practical consequence: Such systems work best for standardized surveys with clear evaluation criteria, less so for exploratory surveys with open-ended objectives.

4.5 Workflow orchestration as a core competency#

Orchestrating the individual components proved to be more complex than implementing individual functions:

Challenges:

  • Synchronization between agent evaluations and UI updates
  • State consistency for asynchronous LLM requests
  • Error handling across component boundaries
  • Performance optimization without loss of functionality

Solution: Clear event-based architecture with defined interfaces and robust error handling at every level. Threading locks to avoid race conditions.

Learning: In multi-component systems, more development time should be allocated to orchestration than to individual components.
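The threading locks mentioned above might look like the following sketch; the class and its methods are hypothetical names, not the project's actual code.

```python
import threading


class SessionStore:
    """Guards shared session state against concurrent UI and LLM callbacks."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._interactions = []

    def append(self, interaction: dict) -> None:
        with self._lock:
            self._interactions.append(interaction)

    def snapshot(self) -> list:
        # Hand out a copy so readers never iterate a list being mutated.
        with self._lock:
            return list(self._interactions)
```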

5. Validation and practical suitability#

5.1 Functional tests#

The system was validated with two test questions:

  1. “Which cloud technologies do you mainly use?”
  2. “How would you rate your current IT security situation?”

Results:

  • In the case of clearly vague answers (“cloud,” “good”), the follow-up system worked reliably
  • Follow-up questions were contextually relevant and led to more precise answers
  • Structuring for clustering worked well with clear answers
  • Problems with borderline cases and assessment of completeness

Quantitative metrics:

  • Average response time: 2-4 seconds per assessment
  • Follow-up rate: 30-40% for typical test answers
  • Consistency of evaluations: Good for clear cases, inconsistent for borderline cases

5.2 Practical limitation: Risk of infinite loops#

A critical learning from the tests: JSON-driven LLM systems can run in infinite loops if insufficiently secured.

Observed scenario:

  1. User response: “Cloud services”
  2. Agent evaluation: needs_followup = true
  3. Follow-up question: “Which specific cloud services?”
  4. User response: “Cloud services”
  5. Agent evaluation: needs_followup = true (criteria not met)
  6. Follow-up question…

Implemented safeguards:

  • Hard limits: max_followups = 1-3 (configurable)
  • Timeouts: Maximum processing time per evaluation
  • Fallback evaluations: Accept in case of timeout or error
  • Monitoring: Logging of all iteration counts
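The hard-limit safeguard can be sketched as a loop whose exit condition never depends on the LLM; `get_answer` and `evaluate` here are illustrative stand-ins for the UI callback and the agent evaluation.

```python
MAX_FOLLOWUPS = 2  # hard exit condition, independent of any LLM output


def run_question(question: str, get_answer, evaluate) -> str:
    """Ask a question, follow up at most MAX_FOLLOWUPS times, then accept."""
    answer = get_answer(question)
    for _ in range(MAX_FOLLOWUPS):
        evaluation = evaluate(question, answer)
        if not evaluation.get("needs_followup", False):
            break  # the LLM is satisfied; otherwise the range simply runs out
        answer = get_answer(f"Could you be more specific? ({question})")
    return answer
```

Even against the stubborn “cloud services” user from the scenario above, this loop terminates after a bounded number of turns.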

General recommendation: All LLM-controlled loop systems require explicit exit conditions that do not depend on the LLM output.

5.3 Use and follow-up work#

The tool was used exclusively for exploratory purposes. The main value lay in the insights gained into the architectural design of such systems.

Direct benefits:

  • Understanding of agent workflow orchestration
  • Insights into the limitations of automatic criteria derivation
  • Prompt engineering patterns for structured outputs
  • Awareness of sources of error (infinite loops, race conditions)

Follow-up project “ppt-helper”: The insights were incorporated into an improved system with several specialized agents. There, the architectural patterns could be applied in a more targeted manner, as the basic understanding was already in place.

This illustrates the real value of such learning projects: not immediate production use, but the systematic development of expertise for more complex follow-up implementations.

6. Transferable findings#

6.1 Development workflow for complex LLM systems#

Based on this experiment, the following workflow can be recommended:

Phase 1 - Conceptual specification (20% of time):

  • Requirements at the conceptual level
  • Examples of expected behavior
  • Rough architecture structure
  • Explicit constraints against overengineering

Phase 2 - Exploratory prototype (30% of time):

  • Focus on architecture learning, not production quality
  • Rapid iteration without perfection
  • Documentation of learnings
  • Identification of critical points

Phase 3 - Refined specification (10% of time):

  • Incorporation of prototype findings
  • Refinement of architecture
  • Definition of robust interfaces
  • Establishment of safeguards

Phase 4 - Production implementation (40% of time):

  • Implementation with proven patterns
  • Focus on robustness and error handling
  • Systematic testing
  • Performance optimization

6.2 Prompt engineering for structured workflows#

Proven patterns:

  1. Explicit JSON schema definition in the prompt:

     You must ALWAYS respond with valid JSON in the following format:
     {
       "clarity_score": <float 0.0-1.0>,
       "needs_followup": <boolean>,
       "reasoning": "<string>"
     }

  2. Concrete examples instead of abstract rules: Instead of: "Evaluate the specificity of the answer." Better: "Examples of VAGUE: 'Cloud', 'Software'. Examples of CLEAR: 'Office 365', 'AWS S3'."

  3. Explicit strictness requirements: "Only accept answers with a clarity score > 0.7. Be strict in your evaluation."

  4. Fallback instructions: "If evaluation is not possible, use clarity_score: 0.5, needs_followup: true."

6.3 Architecture patterns for agent systems#

Single agent with specialized methods (as in this project) is suitable for:

  • Sequential workflows
  • Clear task sequences
  • Simple state management

Multi-agent systems are suitable for:

  • Parallel processing
  • Specialized subtasks with different prompts
  • Complex decision trees

Important: Complexity increases non-linearly with the number of agents. Only switch to multi-agent if there is a clear benefit.

6.4 Error handling and robustness#

Critical areas:

  1. LLM timeouts: Always work with asyncio.wait_for() and fallback values
  2. JSON parsing: Robust error handling, never crash on invalid response
  3. State inconsistencies: Threading locks for parallel access
  4. Infinite loops: Hard limits independent of LLM output

Implementation principle: Any LLM interaction can fail. The system must be able to continue running in all cases.
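The timeout pattern from point 1 can be sketched with `asyncio.wait_for`; the slow call is a deliberate stand-in for a hanging LLM request, and the fallback JSON mirrors the fallback-instruction pattern from section 6.2.

```python
import asyncio


async def slow_llm_call(prompt: str) -> str:
    """Stand-in for an LLM request that hangs far beyond the time budget."""
    await asyncio.sleep(10)
    return '{"clarity_score": 0.9, "needs_followup": false}'


async def evaluate_with_timeout(prompt: str, timeout: float = 0.1) -> str:
    """Bound every LLM call; on timeout, return a fallback evaluation."""
    try:
        return await asyncio.wait_for(slow_llm_call(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        # Accept rather than block: the survey must keep moving.
        return ('{"clarity_score": 0.5, "needs_followup": false, '
                '"reasoning": "timeout fallback"}')
```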

7. Metrics and effort#

Project scope:

  • 4,000 lines of code
  • 15 Python files
  • Modular structure: src/ (5 files), interface/ (4 files), config/ (2 files)

Development time:

  • Specification: 2 hours
  • Implementation: 4 hours (spread over 3 days)
  • Total: 6 hours
  • Iterations: 4-5 main iterations
  • Prompt refinement: 5-6 rounds

Component size:

  • Agent logic: ~500 lines
  • LLM client: ~300 lines
  • Prompt management: ~600 lines
  • UI interface: ~1,900 lines
  • Data models: ~400 lines
  • Monitoring/utils: ~300 lines

Efficiency assessment: The development time of 6 hours for a functional 4,000-line system demonstrates the efficiency of LLM-supported coding. However, the real learning value lay not in the rapid code generation, but in the acquired understanding of architecture.