LLM-supported development of a diagram generator: Insights into code validation#

1. Introduction and project context#

1.1 Motivation and learning objectives#

The AI diagram generator was created as an exploratory learning project with the primary goal of testing multi-agent architectures in LLM-supported software development. The central research question was: How can the reliability of LLM-generated syntax be increased from an initial 70% to nearly 100%?

The motivation came from two sources: On the one hand, there was interest in the possibilities and limitations of LLM-supported coding for complex, syntax-sensitive tasks. On the other hand, there was a practical need for reliably generated diagrams in various formats. From the outset, the tool was designed as a learning vehicle, with the possibility of later productive use left open.

The project built on findings from a previous project (“ppt-helper”), which had shown that modular multi-component architectures can be developed well with LLMs. The AI diagram generator was intended to expand this approach and supplement it with specific validation and quality assurance mechanisms.

1.2 What the tool does#

The AI diagram generator is a Gradio-based web application that automatically generates diagrams in various formats from natural language descriptions or structured data. Users can describe their requirements via a chat interface, enter structured data (CSV, JSON, indented lists), or choose from over 40 ready-made templates.

The tool supports 12 diagram types:

  • Flowcharts (process flow diagrams)
  • Mind maps (hierarchical idea structures)
  • Org charts (organizational structures)
  • Gantt charts (schedules)
  • Sequence diagrams (system interactions)
  • State diagrams (state transitions)
  • ER diagrams (database structures)
  • Network graphs
  • Pie charts
  • User journey diagrams
  • Swimlane diagrams (processes with responsibilities)
  • Kanban boards

The generated code is validated in real time and can be manually edited in the integrated code editor with an undo function. Diagrams can be exported as SVG, PNG, or source code.

2. Technical Architecture#

2.1 Multi-Agent System#

The architecture is based on three specialized agents with clearly defined responsibilities:

ChatAgent (intent analysis and coordination):

  • Analyzes natural language user queries
  • Identifies whether new diagrams should be created or existing ones modified
  • Coordinates the interaction between the user and the generation pipeline
  • Does not provide code output directly to the user, only explanations

DiagramAgent (code generation):

  • Generates Mermaid code based on structured data or chat context
  • Supports two generation modes: direct generation and generation via a JSON intermediate representation (see 2.4)
  • Implements diagram type-specific generators with structured fallbacks
  • Uses complete syntax references for template-based generation

ValidationAgent (quality assurance):

  • Performs pre-validation and automatic syntax corrections
  • Tests code by actually rendering it with Mermaid CLI
  • Corrects faulty syntax through iterative LLM-assisted repair (max. 3 attempts)
  • Falls back to predefined templates for unreparable code

This structure emerged organically during development and was not fully planned from the outset. The clear modularization proved conducive to LLM collaboration, as each agent has a manageable, clearly defined task.
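The three-agent split described above can be sketched as follows. This is an illustrative assumption, not the project's actual API: class names, method signatures, and the stubbed decision logic stand in for the real LLM-backed implementations.

```python
from dataclasses import dataclass

@dataclass
class Intent:
    action: str        # "create" or "modify"
    diagram_type: str  # e.g. "flowchart"

class ChatAgent:
    """Analyzes the user request and returns an intent, never code."""
    def analyze(self, message: str) -> Intent:
        # Stub: the real agent asks an LLM; here a keyword check suffices
        action = "modify" if "change" in message.lower() else "create"
        return Intent(action=action, diagram_type="flowchart")

class DiagramAgent:
    """Generates Mermaid code for a given intent (stubbed here)."""
    def generate(self, intent: Intent, data: str) -> str:
        return f"flowchart TD\n    A[{data}] --> B[Done]"

class ValidationAgent:
    """Validates (and if needed repairs) the generated code."""
    def validate(self, code: str) -> str:
        return code  # real version: render test + correction loop

def handle_request(message: str, data: str) -> str:
    """Wire the three agents into the pipeline described above."""
    intent = ChatAgent().analyze(message)
    code = DiagramAgent().generate(intent, data)
    return ValidationAgent().validate(code)
```

The point of the sketch is the narrow interface between the agents: the ChatAgent never touches code, and the ValidationAgent never needs to know where the code came from.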

2.2 UI architecture and components#

The Gradio-based user interface is divided into four main areas:

  1. Chat interface: Natural language interaction for diagram requests
  2. Data panel: Input of structured data with smart convert function, template selection, and file upload
  3. Code editor: Split view with input data on the left and editable, generated code on the right; undo stack for iterative adjustments; manual render button instead of auto-rendering
  4. Diagram gallery: Display of generated variants with selection options and export functions

The decision to use Gradio was pragmatic—it enables rapid prototype development. Modularization into separate UI areas was part of the concept from the outset, as it seemed necessary to cleanly separate different interaction modes (chat vs. structured input vs. manual code editing).

2.3 Validation Pipeline#

The validation pipeline is the technical heart of the project and goes through several stages:

Stage 1 - Pre-validation:

  • Detection and removal of multiple diagram declarations
  • Diagram type-specific syntax corrections
  • Sanitization of node IDs and labels (special characters, invalid keywords)
  • Mind map-specific indentation normalization (2 spaces per level)
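Two of these pre-validation steps can be sketched as follows. Both functions are simplified stand-ins; the real rules (keyword list, indentation detection) are assumptions and more extensive in the project:

```python
import re

def sanitize_node_id(raw: str) -> str:
    """Replace characters Mermaid treats as syntax and avoid reserved words."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", raw)
    # Prefix IDs that collide with Mermaid keywords or start with a digit
    if cleaned.lower() in {"end", "graph", "subgraph"} or cleaned[:1].isdigit():
        cleaned = "n_" + cleaned
    return cleaned

def normalize_mindmap_indent(code: str, src_unit: int = 4) -> str:
    """Rescale indentation to 2 spaces per level, assuming src_unit spaces
    (or one tab) per level in the input."""
    out = []
    for line in code.splitlines():
        stripped = line.lstrip(" \t")
        prefix = line[: len(line) - len(stripped)]
        level = prefix.count("\t") + (prefix.count(" ") // src_unit)
        out.append("  " * level + stripped)
    return "\n".join(out)
```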

Stage 2 - Rendering test:

  • Actual rendering with Mermaid CLI
  • Success → Code is valid
  • Error → Continue to stage 3

Stage 3 - LLM-supported correction:

  • Prompt contains: Faulty code + error message + complete Mermaid syntax reference
  • On repeated failures, the temperature is raised step by step (0.0 → 0.05 → 0.1 → 0.15)
  • Extract and retest the corrected code
  • Maximum of 3 correction attempts

Stage 4 - Template fallback:

  • If all correction attempts fail: Fall back on predefined, validated templates
  • Template is also validated again (should always work)
  • Last fallback: Simplest flowchart example

This pipeline increased the success rate from an initial 70% to 95% after the final iteration.
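Under the assumption that `mmdc` (the Mermaid CLI) is on the PATH and with the LLM call abstracted as a callable, the four stages could be wired together roughly like this. This is a sketch, not the project's implementation; `pre_validate` is reduced to a trivial stub here:

```python
import pathlib
import subprocess
import tempfile

FALLBACK_TEMPLATE = "flowchart TD\n    A[Start] --> B[End]"

def pre_validate(code: str) -> str:
    """Stage 1, stubbed: the real version also fixes declarations,
    node IDs, labels, and mind map indentation."""
    return code.strip()

def render_ok(code: str) -> tuple[bool, str]:
    """Stage 2: test the code by actually rendering it with Mermaid CLI."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "diagram.mmd"
        src.write_text(code, encoding="utf-8")
        proc = subprocess.run(
            ["mmdc", "-i", str(src), "-o", str(src.with_suffix(".svg"))],
            capture_output=True, text=True,
        )
        return proc.returncode == 0, proc.stderr

def validate(code: str, llm_fix, render=render_ok) -> str:
    """Stages 1-4: pre-validate, render-test, LLM-repair, fall back."""
    code = pre_validate(code)
    ok, error = render(code)
    if ok:
        return code
    # Stage 3: max. 3 repair attempts with a rising temperature schedule
    for temperature in (0.0, 0.05, 0.1):
        code = llm_fix(code, error, temperature=temperature)
        ok, error = render(code)
        if ok:
            return code
    return FALLBACK_TEMPLATE  # stage 4: last resort, known-good template
```

Injecting the renderer as a parameter keeps the loop testable without the Mermaid CLI installed.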

2.4 JSON intermediate layer#

For structured diagram types (flowchart, mind map, org chart), a two-stage generation process was implemented:

Stage 1 - LLM → JSON:

  • LLM generates structured JSON representation of the diagram
  • Defined schemas with clear constraints (e.g., node ID conventions)
  • Validation of the JSON structure before conversion
  • Retry loop for invalid JSON (max. 2 attempts)

Stage 2 - JSON → Mermaid (deterministic):

  • JSONToMermaidConverter translates structured JSON into syntactically correct Mermaid code
  • Deterministic, no LLM variability
  • Guaranteed valid syntax with valid input JSON

This approach was an attempt to improve code quality. The separation between “what should be displayed” (JSON) and “how it is encoded syntactically” (converter) reduces the sources of error.
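A minimal, flowchart-only sketch of stage 2; the field names and schema are assumptions, and the real JSONToMermaidConverter covers more diagram types and richer constraints:

```python
import json

def json_to_mermaid_flowchart(spec: dict) -> str:
    """Deterministically convert a structured spec into Mermaid code."""
    # Stage 1 contract: validate the structure before converting
    for key in ("nodes", "edges"):
        if key not in spec:
            raise ValueError(f"missing required key: {key}")
    lines = ["flowchart TD"]
    for node in spec["nodes"]:
        lines.append(f'    {node["id"]}["{node["label"]}"]')
    for edge in spec["edges"]:
        lines.append(f'    {edge["from"]} --> {edge["to"]}')
    return "\n".join(lines)

# Example of the JSON the LLM would produce in stage 1
spec = json.loads("""
{
  "nodes": [{"id": "A", "label": "Start"}, {"id": "B", "label": "End"}],
  "edges": [{"from": "A", "to": "B"}]
}
""")
```

Given valid input JSON, the output is syntactically correct by construction, which is exactly the property that removes LLM variability from stage 2.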

3. Development process#

3.1 Five-phase development#

Development took place in five clearly structured phases over a period of two weeks:

Phase 1 - Specification (90 minutes):

  • Intensive architecture discussion with the LLM
  • Evaluation of possible patterns and architectures
  • Conscious selection based on KISS principles
  • Result: Detailed technical specification with multi-agent approach

Phase 2 - Basic implementation (30 minutes):

  • Implementation of core components
  • Basic UI with chat and data panel
  • Initial diagram generation without validation
  • Result: Functional prototype with ~70% success rate

Phase 3 - Validation Agent Integration (30-45 minutes):

  • Add automatic code validation through rendering tests
  • Initial LLM-supported correction attempts
  • Result: Improved success rate, but still unstable

Phase 4 - Pattern Integration (30-45 minutes):

  • Advanced error correction patterns
  • Diagram type-specific validation rules
  • Pre-validation with automatic fixes
  • Result: More stable generation, fewer correction iterations required

Phase 5 - Mermaid syntax reference (30-45 minutes):

  • Integration of complete syntax specification (>500 lines) into system prompts
  • Documentation of all Mermaid diagram types with examples
  • Explicit error pattern catalogs with corrections
  • Result: 95% success rate

Total effort: Approximately 6 hours, spread across short sessions of 10-15 minutes fitted in alongside other work.

3.2 Working with the LLM#

Development was carried out with locally available LLMs in a structured process:

  1. Preliminary discussion: Intensive exploration of possible architectures and patterns with the LLM before each implementation phase
  2. Precise specification: The result of the discussion was translated into clear technical requirements
  3. Implementation in one round: Each phase was implemented in one go, not iteratively within the phase
  4. Evaluation and planning: After each phase, evaluation of the results and planning of the next improvement

The insight from the preliminary project—that clear patterns and architectures should be discussed before implementation—proved its worth. The LLM served both as an implementation partner and as a partner in architectural discussions.

3.3 KISS principle in practice#

Over-engineering was prevented by several mechanisms:

  • Explicit KISS specifications in architecture discussions
  • Pragmatic tool selection: Gradio instead of custom framework, Mermaid instead of custom rendering
  • Focus on core functionality: No unnecessary features such as user management, database persistence (except session storage)
  • Gradual increase in complexity: First the basics, then validation, then patterns, then syntax reference

Modularity arose organically, but was not overengineered. The 27 Python files have clear responsibilities without unnecessary levels of abstraction.

4. Methodological findings#

4.1 Proven approaches#

Multi-agent architectures seem to be beneficial: The division into specialized agents (chat, generation, validation) proved to be a structuring framework for LLM collaboration. Each agent has a manageable, clearly defined task, which simplifies both specification and implementation. However, this should not be understood as a general rule—the sample size is too small.

Syntax references in the prompt are effective: The integration of the complete Mermaid syntax reference (>500 lines) into the ValidationAgent system prompt led to the greatest leap in quality. This allowed the LLM to accurately “look up” what the correct syntax is, rather than generating it from “memory.” This seems to be particularly relevant for domain-specific syntax requirements.
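How such a correction prompt might be assembled is sketched below; the wording and the one-line `SYNTAX_REFERENCE` are placeholders for the project's actual prompt and its >500-line reference document:

```python
# Stand-in for the full Mermaid syntax reference embedded in the prompt
SYNTAX_REFERENCE = "flowchart: nodes are `id[Label]`, edges are `A --> B`, ..."

def build_correction_prompt(code: str, error: str) -> str:
    """Bundle faulty code, renderer error, and syntax reference."""
    return (
        "The following Mermaid code fails to render.\n\n"
        f"Code:\n{code}\n\n"
        f"Renderer error:\n{error}\n\n"
        "Consult this syntax reference and return only corrected code:\n"
        f"{SYNTAX_REFERENCE}"
    )
```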

Validation loops with LLM correction can be effective: The iterative correction of erroneous code by the LLM itself (with error message + syntax reference as input) worked surprisingly well. Of 100 erroneous diagrams, about 80-85 could be repaired by the correction loop. This is a significant improvement, but does not completely solve the problem.

Structured preliminary discussion reduces iterations: The intensive architecture discussion in phase 1 (90 minutes) paid off. The subsequent implementation phases each proceeded in a single large iteration without many rework steps. This suggests that LLMs can be good at discussing different architecture options before code is generated.

JSON as an intermediate format can help: The separation between “what to represent” (JSON) and “how to encode syntactically” (deterministic converter) reduced sources of error in structured diagrams. However, the effort was higher and the quality gain was moderate – not worthwhile for all diagram types.

4.2 Remaining challenges#

Subsequent code correction remains difficult: This is the key finding of the project. Even with a complete syntax reference and multiple correction attempts, the repair of invalid code only succeeds in about 80-85% of cases. In about 15-20% of errors, even iterative LLM correction does not lead to success. Typical failure modes:

  • LLM repeats the same errors despite error messages
  • LLM introduces new errors when attempting to correct old ones
  • Some syntax constructs seem difficult to “debug” for LLMs

Intent recognition is not reliable enough: Identifying which diagram type is optimal for a natural language query remains unreliable. The ChatAgent only selects the correct type on the first attempt in about 60-70% of cases. Users often have to explicitly specify the desired type. This problem was addressed in a follow-up project (“Chart Tool”).

Basic LLM reliability is limiting: Despite all validation and correction mechanisms, the basic quality of the initial code generation remains dependent on the reliability of the LLM used. With weaker models, the success rate drops significantly even with validation. The 95% achieved applies to the local models used; other models show different success rates.

Mermaid does not cover all requirements: While the tool works well for Mermaid-compatible diagrams, there are many visualization requirements that Mermaid does not cover (complex scientific plots, highly customized graphics, interactive visualizations). The tool is therefore only suitable for a subset of diagram requirements.

4.3 Workflow changes#

The following work practices have been established through the project:

Pattern and syntax integration as standard: For future projects with domain-specific syntax, the integration of complete references is planned from the outset, not just retrospectively. The effort (1-2 hours for preparing the reference) is worthwhile due to the significantly higher code quality.

Validation as an integral part: Instead of using LLM-generated code directly, a validation step is planned as standard. This can be test execution, rendering, or other forms of verification, depending on the project type.

5. Validation and practical use#

5.1 Functional testing#

The tool is currently being actively tested with the following observations:

Success rates by diagram type:

  • Flowcharts, mind maps, org charts: ~95% valid on first attempt
  • Gantt, sequence, state: ~90% valid
  • More complex types (ER, swimlane, Kanban): ~85% valid
  • Overall success rate in practice: ~95%

Manual rework: For the 5% of non-valid diagrams, the code editor allows for quick manual correction (1-2 minutes) in most cases. Only in rare cases (<1%) is it necessary to completely recreate the diagram.

Quality of visualizations: The generated diagrams are generally semantically correct and visually appealing. The biggest weakness lies in intent recognition (incorrect diagram type selected), not in the implementation of a correctly identified type.

5.2 Productive use#

The tool is occasionally used productively, but remains primarily an experiment. Typical usage scenarios:

  • Quick prototypes: For presentations or documentation where speed is more important than perfection
  • Template basis: As a starting point, which is then further processed in specialized tools
  • Exploration: To try out different visualization options for a data set

Not suitable for:

  • Production documentation with high quality requirements (error rate still too high)
  • Complex, highly customized diagrams (Mermaid limitations)
  • Batch processing of large amounts of data (performance reasons)

5.3 Findings from use#

Practical testing confirmed the theoretical findings:

The “last percent” is difficult: Getting from 95% to 99%+ reliability would probably require exponentially more effort. For a learning project, 95% is a good point, but not yet sufficient for production software.

Human-in-the-loop is valuable: The combination of automatic generation and manual editing works better than pure automation. Users can quickly make corrections where automation falls short.

Intent recognition remains the bottleneck: Even perfect code generation does not help if the wrong diagram type is selected. This issue was addressed in follow-up projects.

6. Metrics and comparability#

Scope:

  • 13,000 lines of code in total
  • 12,400 lines of Python
  • 27 Python files
  • 12 supported diagram types
  • 40+ templates in the library

Development time:

  • Specification: ~3.5 hours (initial 90 min + 4x 30 min refinement)
  • Implementation: ~2.5 hours (4 phases of 30-45 min each)
  • Total effort: ~6 hours over 2 weeks
  • Working method: Short sessions (10-15 min) in between

Quality development:

  • Initial: ~70% valid code
  • After ValidationAgent: ~80%
  • After pattern integration: ~88%
  • After syntax reference: ~95%

Performance:

  • Average generation time: 2-5 seconds (depending on the LLM)
  • Validation with correction: 5-15 seconds
  • Template fallback: <1 second

7. Conclusions#

The AI diagram generator demonstrates that systematic validation and syntax integration can significantly increase the reliability of LLM-generated syntax – from an initial ~70% to 95% in this project. The multi-agent architecture proved to be a structuring framework for LLM collaboration, although it remains unclear whether this can be generalized.

The key remaining challenge is the subsequent correction of erroneous code by LLMs themselves. Even with a complete syntax reference and multiple iterations, this is only successful in 80-85% of cases. This points to fundamental limitations of current LLM generations in complex syntax debugging tasks.

The following practices have proven valuable for future projects:

  • Integration of complete syntax references from the outset
  • Validation as an integral part, not a retrospective add-on
  • Gradual quality iteration instead of big-bang perfection
  • Human-in-the-loop for the “last few percent”

Important limitation: All findings come from a single project and should not be understood as universally valid. They represent observations under specific conditions (specific LLMs, specific domain, specific developer) and require further validation.

The tool itself has not yet reached production readiness, but it functions as a learning vehicle and occasional aid. The methodological findings—especially regarding the validation pipeline and pattern integration—have already been incorporated into follow-up projects and are being further developed there.