LLM-supported coding for multi-stage workflows: Detailed documentation of a survey tool experiment#

Summary#

This article documents a learning experiment to develop an AI-supported survey tool using large language models. The project explored the methodological limitations and possibilities of LLM coding in complex, multi-stage workflows. The key finding: when developing such systems, understanding the architecture is the limiting factor, not the code generation itself. The article offers detailed insights into technical implementation, the development process and transferable methodological learnings.

1. Experimental context#

1.1 Motivation and learning objectives#

The project arose in the context of planned surveys and the question: To what extent can LLM tools be used to create survey-like systems that ensure that responses are actually usable? The primary learning objective was the methodological exploration of LLM coding in more complex architectures, specifically:

  • How do you orchestrate multi-stage workflows with LLMs?
  • Which architectural patterns are suitable for agent-based systems?
  • What are the practical limits of automatic criteria derivation?
  • How can specifications for such tasks be optimally structured?

The tool itself was explicitly designed as a learning vehicle, not as a production system. The insights gained were to feed into an improved follow-up project.

1.2 Positioning in the learning curve#

The experiment took place in the middle of a longer learning phase with LLM-supported coding. Basic knowledge of prompt engineering and simple tool development already existed, but there was no experience with complex agent workflow interactions. This positioning was deliberately chosen: the project was intended to explore the limits of previous understanding.

2. Technical implementation#

2.1 Architecture overview#

The system implements a single-agent approach with clearly separated responsibilities. Contrary to the initial assumption that several separate agents would be needed, it became clear during development that a single SurveyAgent with specialised methods offers the appropriate level of complexity:

Core components:

  • SurveyAgent (src/agent.py): Main component with three core functions
    • evaluate_answer(): Evaluates answer clarity and specificity (clarity score 0-1)
    • generate_followup(): Generates context-specific follow-up questions for unclear answers
    • structure_final_answer(): Prepares final answers for clustering
  • ConversationManager (src/agent.py): Orchestrates the survey workflow
    • Manages question sequences
    • Controls follow-up logic
    • Tracks session state
  • LLMClient (src/llm_client.py): OpenAI-compatible API integration
    • Supports local and remote LLMs
    • Robust error handling with retry logic
    • Mock client for testing without an LLM
  • PromptManager (src/prompts.py): Central prompt management
    • Templates for all agent functions
    • Structured JSON output definitions
    • Versioning and fallback mechanisms
  • Interface (interface/): Modular Gradio UI
    • Experimental playground for live testing
    • Batch testing functionality
    • Performance monitoring
    • Demo mode for stakeholders
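To make the component split concrete, here is a minimal sketch of how the SurveyAgent's three core functions might hang together. The class and method names come from the list above; the prompt strings, the `complete()` method, and the `MockLLMClient` stand-in are illustrative assumptions, not the actual implementation.

```python
import json
from dataclasses import dataclass


@dataclass
class MockLLMClient:
    """Hypothetical stand-in for the LLMClient component; returns a canned reply."""
    reply: str = '{"clarity_score": 0.3, "needs_followup": true}'

    def complete(self, prompt: str) -> str:
        return self.reply


class SurveyAgent:
    """Single agent exposing the three specialised methods listed above."""

    def __init__(self, llm_client) -> None:
        self.llm = llm_client

    def evaluate_answer(self, question: str, answer: str) -> dict:
        # The real system validates this JSON with Pydantic models (see below).
        raw = self.llm.complete(f"Evaluate clarity.\nQ: {question}\nA: {answer}")
        return json.loads(raw)

    def generate_followup(self, question: str, answer: str) -> str:
        return self.llm.complete(f"Ask a clarifying follow-up to: {answer}")

    def structure_final_answer(self, answer: str) -> str:
        return self.llm.complete(f"Normalise for clustering: {answer}")
```

Swapping the mock for a real client leaves the agent unchanged, which is what makes the mock-based testing mentioned above possible.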

2.2 Technology stack#

The choice of technology stack was based on existing experience in order to enable focus on the actual learning objectives:

  • Python 3.x: Main language
  • Gradio 4.x: UI framework (simple, fast, suitable for prototypes)
  • Pydantic 2.x: Data validation and type safety
  • asyncio: Asynchronous LLM requests and timeout handling
  • Mistral Small 2506: Runtime LLM for evaluations and queries
  • OpenAI-compatible API client: Flexibility for different LLM backends

This combination made it possible to create a functional system within a short time (4 hours of active development) without wasting time learning new technology.

2.3 Modularity and structuring#

The modular structure was defined in the specification from the outset. This proved to be crucial for maintainability:

Project structure (4,000 lines, 15 files):
├── src/
│   ├── agent.py           # Agent core logic (500 lines)
│   ├── llm_client.py      # LLM integration (300 lines)
│   ├── prompts.py         # Prompt management (600 lines)
│   ├── models.py          # Pydantic data models (400 lines)
│   └── config_loader.py   # Configuration (200 lines)
├── interface/
│   ├── gradio_interface.py    # UI logic (800 lines)
│   ├── gradio_handlers.py     # Event handler (600 lines)
│   ├── gradio_tabs.py         # UI layout (500 lines)
│   └── timing_metrics.py      # Performance monitoring (400 lines)
└── config/
    ├── config.yaml        # System configuration
    └── survey_mvp.yaml    # Survey definitions

The clear separation between data models, business logic and presentation layer enabled iterative development without major refactoring.

3. The development process#

3.1 Specification phase (2 hours)#

The initial specification defined:

  • Basic requirements: follow-up questioning, response evaluation, clustering preparation
  • Rough architecture structure: agent-based approach, modularisation
  • Technology stack: Python, Gradio, Pydantic
  • Non-functional requirements: timeouts, error handling, performance monitoring

Key insight: The specification was deliberately kept at a conceptual level rather than as a detailed technical blueprint. The reason: With novel architectural patterns, there is a lack of understanding of optimal structuring. The specification defined the ‘what’ and ‘why,’ but left the ‘how exactly’ to the iterative process.

3.2 Development with LLM (4 hours over 3 days)#

Interaction pattern: The collaboration with the LLM followed a structured dialogue approach:

  1. Presentation of the specification and architecture requirements
  2. Discussion of possible implementation approaches
  3. Step-by-step implementation of the components
  4. Iterative refinement based on tests

Phases:

Phase 1 – Basic framework (1 hour):

  • LLM client integration with retry logic
  • Basic data models (question, answer evaluation, etc.)
  • Configuration system
  • Initial prompt templates

Phase 2 – Agent logic (1.5 hours):

  • Implementation of evaluate_answer()
  • Development of evaluation criteria
  • Initial tests with mock data
  • Challenge: The LLM proposed complex scoring mechanisms with weighted sub-scores; this had to be reduced to a simple 0-1 score evaluation.

Phase 3 - Workflow orchestration (1 hour):

  • ConversationManager implementation
  • State management between questions
  • Integration of follow-up logic
  • Challenge: Coordination between agent decisions and workflow control was not trivial. Several rounds of iteration were necessary to avoid race conditions.

Phase 4 - UI and refinement (0.5 hours):

  • Gradio interface setup
  • Performance monitoring
  • Error handling improvements
  • Challenge: Gradio-specific type format issues required wrapper functions.

3.3 Prompt engineering iterations (5-6 rounds)#

Developing consistent prompt templates for response evaluation required the most iterations:

Iteration 1-2: Initial prompts were too open-ended, leading to inconsistent evaluations.

Iteration 3-4: Adding examples of ‘vague’ vs. ‘clear’ responses significantly improved consistency. Structured JSON output definition became central.

Iteration 5-6: Fine-tuning of evaluation criteria. Explicit instructions against overly lenient evaluations were necessary.

Final template structure:

System prompt: Role, evaluation criteria, output format
User prompt: Question context, response to be evaluated, examples
Expected output: JSON with clarity_score, needs_followup, reasoning, problem_areas
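The three-part template structure above can be sketched as follows. The wording of the prompts is a hypothetical reconstruction, not the actual content of src/prompts.py; only the structure (system prompt, user prompt, expected JSON fields) follows the description.

```python
# Hypothetical templates following the system/user/output split described above.
SYSTEM_PROMPT = (
    "You are a survey answer evaluator.\n"
    "Respond ONLY with JSON: {\"clarity_score\": <float 0.0-1.0>, "
    "\"needs_followup\": <bool>, \"reasoning\": \"<string>\", "
    "\"problem_areas\": [\"<string>\"]}"
)

USER_PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Answer to evaluate: {answer}\n\n"
    "Examples of VAGUE answers: 'cloud', 'good', 'OK'.\n"
    "Examples of CLEAR answers: 'Office 365 for email', 'AWS S3 for backups'."
)


def build_messages(question: str, answer: str) -> list:
    """Assemble an OpenAI-style chat message list from the templates."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": USER_PROMPT_TEMPLATE.format(question=question, answer=answer)},
    ]
```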

3.4 Dealing with LLM overengineering#

A recurring pattern: The LLM tended towards complex solutions:

Example 1 – Evaluation logic: LLM suggestion: Multi-dimensional scoring with weightings for specificity, completeness, relevance, clusterability, separate sub-scores. Actually implemented: Simple 0-1 clarity score with boolean needs_followup.

Example 2 – State management: LLM proposal: Complex state machine pattern with explicit state transitions. Actually implemented: Simple session object with list of interactions.

Example 3 – Error handling: LLM suggestion: Hierarchical exception system with custom exceptions for each error source. Actually implemented: Robust try-catch blocks with fallback values.
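The "simple session object with a list of interactions" from Example 2 can be sketched like this; the field and method names are assumptions chosen for illustration, not the project's actual identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Interaction:
    question: str
    answer: str
    clarity_score: float
    was_followup: bool = False


@dataclass
class Session:
    """Flat session state: an ordered list of interactions, no state machine."""
    interactions: list = field(default_factory=list)

    def record(self, question: str, answer: str, score: float,
               followup: bool = False) -> None:
        self.interactions.append(Interaction(question, answer, score, followup))

    def followups_used(self) -> int:
        # Enough to enforce a max_followups limit without explicit state transitions.
        return sum(1 for i in self.interactions if i.was_followup)
```

An append-only list keeps the full history available for later clustering, whereas a state machine would have added transitions without adding information.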

4. Methodological findings#

4.1 Conceptual complexity as a limiting factor#

The key insight from this experiment: The challenge was not the amount of code (4,000 lines), but understanding the required architectural patterns.

Specifically: Without prior experience with agent workflow orchestration, it was difficult to specify:

  • How do the agent and ConversationManager interact optimally?
  • When should the agent make decisions, and when should the manager?
  • How can circular dependencies be prevented?
  • What state information needs to be stored where?

These questions could not be answered through intensive reflection, but only through practical experimentation. This leads to a fundamental principle:

It seems helpful to first familiarise yourself with the architectural patterns through exploratory prototypes and then specify and implement them in a targeted manner.

4.2 Optimal specification strategies#

This project has resulted in concrete recommendations for the specification of LLM-supported development projects:

What works:

  • Clear requirements at the conceptual level
  • Examples of expected behaviour
  • Non-functional requirements (timeouts, error handling)
  • Rough architecture structure with clear responsibilities
  • Explicit constraints (‘simplest solution’, ‘no premature abstraction’)

What does not work:

  • Detailed technical blueprints for unknown architectural patterns
  • Too open formulations without examples
  • Implicit expectations of code quality
  • Lack of guidelines against overengineering

Optimal approach: Iterative process consisting of rough specification, prototypical implementation, architecture learning and subsequent more precise specification for the production system.

4.3 Structured JSON outputs#

The consistent use of structured JSON responses was central to the functioning of the system:

Advantages:

  • Type-safe processing through Pydantic validation
  • Clear interfaces between components
  • Simple state management
  • Predictable data flows

Implementation:

from typing import List

from pydantic import BaseModel


class AnswerEvaluation(BaseModel):
    clarity_score: float  # 0.0-1.0
    needs_followup: bool
    reasoning: str
    problem_areas: List[str]
    suggested_clarifications: List[str]
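In the system itself, raw LLM output is validated against the Pydantic model; this dependency-free sketch shows the same guard pattern with stdlib json only, with fallback values of my choosing: malformed output must never crash the workflow.

```python
import json

# Neutral fallback evaluation, used whenever the LLM reply cannot be parsed.
FALLBACK = {
    "clarity_score": 0.5,
    "needs_followup": True,
    "reasoning": "fallback: could not parse LLM response",
    "problem_areas": [],
    "suggested_clarifications": [],
}

REQUIRED_KEYS = {"clarity_score", "needs_followup", "reasoning"}


def parse_evaluation(raw: str) -> dict:
    """Parse an evaluation reply; return the neutral fallback on any error."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK)
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return dict(FALLBACK)
    return data
```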

Critical aspect: JSON-driven systems can be prone to loops. Specifically observed:

  • Incorrect evaluation → Follow-up question → Same evaluation → Another follow-up question
  • Solution: Explicit max_followups limits, timeout mechanisms, fallback evaluations

4.4 Limitations of criteria derivation#

The biggest technical challenge was the automatic derivation of evaluation criteria for answer completeness:

Areas that work well:

  • Recognition of very vague responses (“cloud,” “good,” “OK”)
  • Recognition of very specific responses (“Office 365 for email and document processing”)
  • Evaluation of response length

Problematic areas:

  • Assessment of completeness (What is still missing?)
  • Context-dependent specificity (When is “AWS” specific enough?)
  • Ambiguous responses (Does “cloud” mean storage or software?)

Cause: The LLM does not have the full context of the survey objectives. It can recognize syntactic and superficial semantic patterns, but cannot assess whether a response is sufficient for the specific research question.

Practical consequence: Such systems work best for standardized surveys with clear evaluation criteria, less so for exploratory surveys with open-ended objectives.

4.5 Workflow orchestration as a core competency#

Orchestrating the individual components proved to be more complex than implementing individual functions:

Challenges:

  • Synchronization between agent evaluations and UI updates
  • State consistency for asynchronous LLM requests
  • Error handling across component boundaries
  • Performance optimization without loss of functionality

Solution: Clear event-based architecture with defined interfaces and robust error handling at every level. Threading locks to avoid race conditions.

Learning: In multi-component systems, more development time should be allocated to orchestration than to individual components.
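The threading locks mentioned above might look like the following sketch; the class and its methods are hypothetical names, not the project's actual code.

```python
import threading


class SessionStore:
    """Guards shared session state against concurrent UI and LLM callbacks."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._interactions = []

    def append(self, interaction: dict) -> None:
        with self._lock:
            self._interactions.append(interaction)

    def snapshot(self) -> list:
        # Hand out a copy so readers never iterate a list being mutated.
        with self._lock:
            return list(self._interactions)
```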

5. Validation and practical suitability#

5.1 Functional tests#

The system was validated with two test questions:

  1. “Which cloud technologies do you mainly use?”
  2. “How would you rate your current IT security situation?”

Results:

  • In the case of clearly vague answers (“cloud,” “good”), the follow-up system worked reliably
  • Follow-up questions were contextually relevant and led to more precise answers
  • Structuring for clustering worked well with clear answers
  • Problems with borderline cases and assessment of completeness

Quantitative metrics:

  • Average response time: 2-4 seconds per assessment
  • Follow-up rate: 30-40% for typical test answers
  • Consistency of evaluations: Good for clear cases, inconsistent for borderline cases

5.2 Practical limitation: Risk of infinite loops#

A critical learning from the tests: JSON-driven LLM systems can run in infinite loops if insufficiently secured.

Observed scenario:

  1. User response: “Cloud services”
  2. Agent evaluation: needs_followup = true
  3. Follow-up question: “Which specific cloud services?”
  4. User response: “Cloud services”
  5. Agent evaluation: needs_followup = true (criteria not met)
  6. Follow-up question…

Implemented safeguards:

  • Hard limits: max_followups = 1-3 (configurable)
  • Timeouts: Maximum processing time per evaluation
  • Fallback evaluations: Accept in case of timeout or error
  • Monitoring: Logging of all iteration counts
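The hard-limit safeguard can be sketched as a loop whose exit condition never depends on the LLM; `get_answer` and `evaluate` here are illustrative stand-ins for the UI callback and the agent evaluation.

```python
MAX_FOLLOWUPS = 2  # hard exit condition, independent of any LLM output


def run_question(question: str, get_answer, evaluate) -> str:
    """Ask a question, follow up at most MAX_FOLLOWUPS times, then accept."""
    answer = get_answer(question)
    for _ in range(MAX_FOLLOWUPS):
        evaluation = evaluate(question, answer)
        if not evaluation.get("needs_followup", False):
            break  # the LLM is satisfied; otherwise the range simply runs out
        answer = get_answer(f"Could you be more specific? ({question})")
    return answer
```

Even against the stubborn “cloud services” user from the scenario above, this loop terminates after a bounded number of turns.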

General recommendation: All LLM-controlled loop systems require explicit exit conditions that do not depend on the LLM output.

5.3 Use and follow-up work#

The tool was used exclusively for exploratory purposes. The main value lay in the insights gained into the architectural design of such systems.

Direct benefits:

  • Understanding of agent workflow orchestration
  • Insights into the limitations of automatic criteria derivation
  • Prompt engineering patterns for structured outputs
  • Awareness of sources of error (infinite loops, race conditions)

Follow-up project “ppt-helper”: The insights were incorporated into an improved system with several specialized agents. There, the architectural patterns could be applied in a more targeted manner, as the basic understanding was already in place.

This illustrates the real value of such learning projects: not immediate production use, but the systematic development of expertise for more complex follow-up implementations.

6. Transferable findings#

6.1 Development workflow for complex LLM systems#

Based on this experiment, the following workflow can be recommended:

Phase 1 - Conceptual specification (20% of time):

  • Requirements at the conceptual level
  • Examples of expected behavior
  • Rough architecture structure
  • Explicit constraints against overengineering

Phase 2 - Exploratory prototype (30% of time):

  • Focus on architecture learning, not production quality
  • Rapid iteration without perfection
  • Documentation of learnings
  • Identification of critical points

Phase 3 - Refined specification (10% of time):

  • Incorporation of prototype findings
  • Refinement of architecture
  • Definition of robust interfaces
  • Establishment of safeguards

Phase 4 - Production implementation (40% of time):

  • Implementation with proven patterns
  • Focus on robustness and error handling
  • Systematic testing
  • Performance optimization

6.2 Prompt engineering for structured workflows#

Proven patterns:

  1. Explicit JSON schema definition in the prompt:

     You must ALWAYS respond with valid JSON in the following format:
     {
       "clarity_score": <float 0.0-1.0>,
       "needs_followup": <boolean>,
       "reasoning": "<string>"
     }

  2. Concrete examples instead of abstract rules: Instead of: "Evaluate the specificity of the answer." Better: "Examples of VAGUE: 'Cloud', 'Software'. Examples of CLEAR: 'Office 365', 'AWS S3'."

  3. Explicit strictness requirements: "Only accept answers with a clarity score > 0.7. Be strict in your evaluation."

  4. Fallback instructions: "If evaluation is not possible, use clarity_score: 0.5, needs_followup: true."

6.3 Architecture patterns for agent systems#

Single agent with specialized methods (as in this project) is suitable for:

  • Sequential workflows
  • Clear task sequences
  • Simple state management

Multi-agent systems are suitable for:

  • Parallel processing
  • Specialized subtasks with different prompts
  • Complex decision trees

Important: Complexity increases non-linearly with the number of agents. Only switch to multi-agent if there is a clear benefit.

6.4 Error handling and robustness#

Critical areas:

  1. LLM timeouts: Always work with asyncio.wait_for() and fallback values
  2. JSON parsing: Robust error handling, never crash on invalid response
  3. State inconsistencies: Threading locks for parallel access
  4. Infinite loops: Hard limits independent of LLM output

Implementation principle: Any LLM interaction can fail. The system must be able to continue running in all cases.
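The timeout pattern from point 1 can be sketched with `asyncio.wait_for`; the slow call is a deliberate stand-in for a hanging LLM request, and the fallback JSON mirrors the fallback-instruction pattern from section 6.2.

```python
import asyncio


async def slow_llm_call(prompt: str) -> str:
    """Stand-in for an LLM request that hangs far beyond the time budget."""
    await asyncio.sleep(10)
    return '{"clarity_score": 0.9, "needs_followup": false}'


async def evaluate_with_timeout(prompt: str, timeout: float = 0.1) -> str:
    """Bound every LLM call; on timeout, return a fallback evaluation."""
    try:
        return await asyncio.wait_for(slow_llm_call(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        # Accept rather than block: the survey must keep moving.
        return ('{"clarity_score": 0.5, "needs_followup": false, '
                '"reasoning": "timeout fallback"}')
```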

7. Metrics and effort#

Project scope:

  • 4,000 lines of code
  • 15 Python files
  • Modular structure: src/ (5 files), interface/ (4 files), config/ (2 files)

Development time:

  • Specification: 2 hours
  • Implementation: 4 hours (spread over 3 days)
  • Total: 6 hours
  • Iterations: 4-5 main iterations
  • Prompt refinement: 5-6 rounds

Component size:

  • Agent logic: ~500 lines
  • LLM client: ~300 lines
  • Prompt management: ~600 lines
  • UI interface: ~1,900 lines
  • Data models: ~400 lines
  • Monitoring/utils: ~300 lines

Efficiency assessment: The development time of 6 hours for a functional 4,000-line system demonstrates the efficiency of LLM-supported coding. However, the real learning value lay not in the rapid code generation, but in the acquired understanding of architecture.