LLM-supported data visualisation through code generation: A detailed case study#
1. Experimental Context and Motivation#
1.1 The Problem#
Despite their impressive capabilities in many areas, large language models (LLMs) show weaknesses when dealing with tabular data and numerical calculations. The central question of this experiment was: How can LLMs work effectively with structured data despite these limitations?
The chosen approach was based on code indirection: instead of confronting the LLM directly with numbers and calculations, it generates Python code for data analysis and visualisation. This code is executed in a controlled, secure environment – the actual calculations are performed by proven libraries such as Pandas and Plotly.
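A minimal sketch of this indirection, with a hard-coded stand-in for the LLM's output (in the real system the snippet would come back from the model):

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "S", "N", "S"],
                   "sales": [120, 80, 95, 110]})

# This string stands in for what the LLM would generate: the model
# writes code, and pandas performs the actual calculation.
llm_generated_code = "result = df.groupby('region', as_index=False)['sales'].sum()"

scope = {"pd": pd, "df": df.copy()}
exec(llm_generated_code, scope)
print(scope["result"])  # region N -> 215, S -> 190
```

The LLM never touches the numbers itself; it only has to produce a plausible pandas expression, which is a much more reliable task.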
Another learning objective was to investigate multi-agent systems: How can different specialised agents interact to solve complex tasks? Specifically, agents for intent recognition, planning, and execution were to work together in a coordinated manner.
1.2 Functional objective#
The functional question was pragmatic in nature: Is it possible to develop a tool that helps users quickly select suitable graphics for their data and creates simple visualisations largely automatically?
The approach was deliberately designed as a pure learning project – the primary aim was to find out how far agents and code generation could be used to solve problems, not to develop a tool ready for production.
2. The tool: Interactive Chart Generator#
2.1 Range of functions#
The Interactive Chart Generator processes CSV and Excel files (including multi-sheet support with automatic type recognition) and enables the creation of interactive Plotly visualisations via natural language chat requests.
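The multi-sheet loading with automatic type recognition can be sketched roughly as follows; the classification rules here are illustrative, not the tool's actual logic:

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None), which returns a
# dict of DataFrames; built inline so the sketch runs without a file.
sheets = {
    "Sales": pd.DataFrame({"month": pd.to_datetime(["2024-01-01", "2024-02-01"]),
                           "revenue": [1200, 1350]}),
    "Staff": pd.DataFrame({"team": ["A", "B"], "headcount": [5, 7]}),
}

def classify_columns(df):
    """Map each column to a coarse type later usable for chart planning."""
    kinds = {}
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            kinds[col] = "datetime"
        elif pd.api.types.is_numeric_dtype(df[col]):
            kinds[col] = "numeric"
        else:
            kinds[col] = "categorical"
    return kinds

metadata = {name: classify_columns(df) for name, df in sheets.items()}
print(metadata)
```

This per-sheet metadata, rather than the raw data, is what a planning step needs to choose a suitable chart type.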
Examples of user interactions:
- ‘Create bar charts for all sheets’
- ‘Show me the top 10 categories as a bar chart’
- ‘Colour the bars green for positive values’
- ‘Add percentage values above the bars’
The system distinguishes between different types of intent: modification of an existing chart, creation of a single chart, creation of charts for multiple sheets, or pure analysis queries without direct chart creation.
2.2 Technical stack#
The implementation is based on:
- Gradio (web interface): Fixed infrastructure for all experiments in this series
- Local LLMs (HU models) via OpenAI-compatible API: llm3 model
- Pandas: Proven solution for structured data processing
- Plotly Express: Pragmatic choice for interactive visualisations
- Python 3.11+: Development environment
The system comprises approximately 8,500 lines of code spread across 33 files (27 Python files) and was developed entirely with LLM support.
3. Architecture evolution: From two agents to Intent→Plan→Execute#
3.1 Version 1: Direct chat agent + chart agent#
The initial approach implemented two agents:
- A chat agent for communication
- A chart agent for visualisation creation
This approach proved to be too unstable. The main problem was that the system could not reliably distinguish between exploratory discussion (‘Which charts would be useful?’) and concrete execution requests (‘Create a chart now’). The error rate was high, as charts were often generated when only advice was desired, or vice versa.
3.2 Version 2: Introduction of intent recognition#
The second iteration introduced an IntentService that analyses and classifies user requests:
- modify: Change to the current chart
- create_single: Create a chart for a specific sheet
- create_multiple: Charts for multiple or all sheets
- analyze: Recommendations only, no chart creation
This separation significantly improved accuracy. The system was now better able to decide which action should be performed.
3.3 Version 3: The complete Intent→Plan→Execute workflow#
The final architecture introduced an additional planning step, as it became clear that direct translation from intent to execution was not sufficient for complex requests.
IntentService: Analyses the user request and extracts the intent
- Input: user message, available sheets, current chart status
- Output: UserIntent with action type, target sheets, reasoning, confidence

PlanService: Translates the intent into a concrete execution plan
- Input: UserIntent, DataFrame metadata, current chart code (when modifying)
- Output: ExecutionPlan with detailed ChartStep or ModificationStep objects

ExecutionService: Coordinates the actual execution
- Input: ExecutionPlan, session data
- Output: ExecutionResult with created charts or error messages
- Implements retry logic and error correction
These three services communicate via clearly defined data models (UserIntent, ExecutionPlan, ChartStep, ExecutionResult), which significantly improves testability and maintainability.
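A rough reconstruction of these data models as dataclasses; field names beyond those mentioned above are assumptions, not the project's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class UserIntent:
    action: str        # "modify" | "create_single" | "create_multiple" | "analyze"
    target_sheets: list
    reasoning: str
    confidence: float

@dataclass
class ChartStep:
    sheet: str
    pattern_id: str    # e.g. "PATTERN_SIMPLE_BAR"
    columns: dict      # role -> column name, e.g. {"x": "region", "y": "sales"}

@dataclass
class ExecutionPlan:
    intent: UserIntent
    steps: list = field(default_factory=list)

@dataclass
class ExecutionResult:
    charts: list = field(default_factory=list)
    errors: list = field(default_factory=list)

plan = ExecutionPlan(
    intent=UserIntent("create_single", ["Sales"], "explicit request", 0.95),
    steps=[ChartStep("Sales", "PATTERN_SIMPLE_BAR", {"x": "month", "y": "revenue"})],
)
print(plan.steps[0].pattern_id)
```

Typed models like these make each service's contract explicit, which is what enables the isolated testing mentioned above.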
3.4 Secure code execution#
Code execution takes place in a controlled sandbox:
safe_globals = {
    "pd": pd,
    "px": px,
    "go": go,
    "np": np,
    "df": df.copy(),
}
safe_builtins = {
    "len": len, "str": str, "int": int, "float": float,
    # ... other permitted functions, but WITHOUT __import__
}
safe_globals["__builtins__"] = safe_builtins
safe_locals = {}
exec(code, safe_globals, safe_locals)

This sandbox implementation was added as a separate learning field after it became clear that chart code can become quite complex. Restricting the namespace to predefined modules (and omitting __import__) prevents potentially dangerous operations.
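A self-contained variant of this sandbox can be exercised as follows (restricted to pandas so the sketch runs without Plotly; the helper name is invented):

```python
import pandas as pd

def run_sandboxed(code, df):
    """Execute generated code against a whitelisted namespace, no __import__."""
    safe_builtins = {"len": len, "str": str, "int": int, "float": float,
                     "min": min, "max": max, "round": round}
    safe_globals = {"pd": pd, "df": df.copy(), "__builtins__": safe_builtins}
    safe_locals = {}
    exec(code, safe_globals, safe_locals)
    return safe_locals

df = pd.DataFrame({"v": [1, 2, 3]})
out = run_sandboxed("total = int(df['v'].sum())", df)
print(out["total"])  # 6

# Imports fail because __import__ is missing from the builtins whitelist.
try:
    run_sandboxed("import os", df)
except ImportError as e:
    print("blocked:", e)
```

Because every `import` statement ultimately calls `__import__`, leaving it out of the whitelist is enough to block module loading inside the executed code.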
4. Development decision: Pattern libraries#
4.1 The problem: Unstable code generation#
A key insight during development was that pure LLM-based code generation for chart creation did not work reliably enough. Common problems:
- Inconsistent use of .groupby() (with/without as_index=False)
- Forgotten .dropna() calls, leading to errors on missing values
- Complex and fragile constructs such as pd.Categorical() with nested lists
- .size() results were often not renamed from "size" to "count", leading to column conflicts
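The stable form that the pattern library later enforced looks roughly like this:

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "a", None], "val": [1, 2, 3, 4]})

# Fragile form the anti-patterns warn against:
#   df.groupby("cat").size()   -> index-based result, "size" column conflicts
# Stable form encoded in the pattern library:
counts = (
    df.dropna(subset=["cat"])              # explicit NaN handling
      .groupby("cat", as_index=False)      # keep "cat" as a regular column
      .size()
      .rename(columns={"size": "count"})   # avoid the "size" naming conflict
)
print(counts)
```

Each of the three chained calls addresses one of the recurring failure modes listed above.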
4.2 The solution: hybrid approach with 44 code patterns#
The implementation of extensive pattern libraries resolved this instability. The system comprises:
23 working implementation patterns:
- PATTERN_SIMPLE_BAR: Simple bar chart with aggregation
- PATTERN_GROUPED_BAR_TWO_VARS: Grouped bars with colour coding
- PATTERN_TIME_SERIES_SIMPLE: Time series visualisation
- PATTERN_TOP_N_SIMPLE: Top N display with sorting
- PATTERN_SEMANTIC_COLOR_DISCRETE_MAP: Semantic colour mapping
- … and 18 more
7 anti-patterns for error prevention:
- ANTI_PATTERN_COMPLEX_CATEGORICAL: Warns against pd.Categorical() with complex lists
- ANTI_PATTERN_NO_DROPNA: Enforces explicit NaN handling
- ANTI_PATTERN_MISSING_RESET_INDEX: Prevents index problems after groupby
- … and 4 more
9 modification patterns:
- MODIFICATION_PATTERN_SINGLE_COLOR: Simple colour change
- MODIFICATION_PATTERN_ADD_LABELS: Value labels above bars
- MODIFICATION_PATTERN_CHANGE_TITLE: Title adjustment
- … and 6 more
5 additional patterns:
- Semantic colour recognition
- Gradients for ordinal scales
- Pattern selection strategies
4.3 How the hybrid approach works#
The PlanService receives this pattern library in the prompt. The LLM selects the appropriate pattern based on:
- Data types (numeric, categorical, datetime)
- The user query
- The number of data points
It then adapts the chosen pattern by replacing the column names. This approach combines LLM intelligence (selection, adaptation) with template stability (proven implementations).
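The selection-plus-substitution step can be sketched as follows; the template format and helper name are invented for illustration, not taken from the project:

```python
# Hypothetical pattern entry; the library's real template format is not
# shown in the source, so this shape is an assumption.
PATTERN_SIMPLE_BAR = (
    "agg = df.groupby('{x}', as_index=False)['{y}'].sum()\n"
    "fig = px.bar(agg, x='{x}', y='{y}', title='{title}')\n"
)

def instantiate(pattern, **fields):
    """The mechanical core of the adaptation step: fill in column names."""
    return pattern.format(**fields)

code = instantiate(PATTERN_SIMPLE_BAR, x="region", y="sales",
                   title="Sales by region")
print(code)
```

The LLM's creative freedom is thus narrowed to choosing a pattern and naming the columns; the proven code skeleton stays fixed.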
5. Development process and workflow#
5.1 Specification-driven approach#
The development process was highly specification-driven, not exploratory-iterative. For each development stage, a comprehensive specification was created and then implemented.
Typical cycle:
- Specification creation: Discussion of technical approaches, architecture, UI design, component interactions, functional goals
- Implementation: LLM-assisted code generation based on specification
- Adjustment round: Fine-tuning and bug fixes
- Total per iteration stage: ~1 hour
The initial specification was particularly extensive and took about an hour. The total development time across all iteration stages was approximately 3 hours, plus 30 minutes for Docker deployment (more complex requirements).
5.2 Why specifications instead of micro-prompting?#
The conscious decision to use high-quality, detailed specifications was based on the realisation that small-scale micro-prompting leads to fragmented, inconsistent development. When components are developed in many small prompt loops:
- The overall context is missing
- Inconsistent naming conventions arise
- The architecture becomes unstructured
- The coordination effort increases
Comprehensive specifications enable the LLM to:
- Understand the overall architecture
- Implement consistently across components
- Anticipate interfaces
- Use consistent naming and structuring
5.3 KISS principle as a counterbalance#
With an already complex solution (multi-agent system, pattern libraries, secure code execution), the KISS principle (Keep It Small and Simple) was a deliberate strategy to avoid over-engineering.
Specific decisions:
- No excessive abstraction (e.g. no factory patterns for simple object creation)
- Direct function calls instead of complex event systems
- Sequential workflows instead of asynchronous architectures
- Clear data models instead of nested dictionaries
6. Methodological insights: Transferable learnings#
6.1 Intent→Plan→Execute as a proven architecture pattern#
The three-stage separation proved to be robust for complex NLP-driven systems:
Advantages:
- Clear responsibilities per service
- Better testability (each phase can be tested in isolation)
- More targeted error handling
- Possibility for dry runs (create plan without execution)
Transferability:
This pattern is not specific to chart generation. It is suitable for any system where natural language queries need to be translated into technical actions:
- Code generation tools
- Automation workflows
- Database query systems
- Configuration management
6.2 Pattern libraries stabilise code generation#
The integration of code patterns significantly improved stability. Quantitative statements on error reduction are difficult, but qualitatively the difference was clearly noticeable.
Findings:
- LLMs are good at learning from examples and adapting them
- Pattern libraries can effectively support abstract descriptions
- Anti-patterns are just as important as positive patterns
- The combination (LLM selection + template stability) is more robust than pure generation
Transferability:
For domains with known best practices and frequent error patterns (web development, API integration, database operations), pattern libraries are a promising approach.
6.3 LLM control outperforms heuristics#
Despite imperfect reliability, LLM-controlled processing proved superior to rule-based heuristics.
Example of multi-sheet handling: One heuristic could be: ‘If “all” is in the text → process all sheets’. But users formulate in many different ways:
- ‘for all sheets’
- ‘every sheet’
- ‘all tables’
- ‘all data’
- ‘create charts everywhere’
LLMs generalise better across these variations. The challenge remains: 100% reliability is difficult to achieve. But 90-95% with LLM is more practical than 70% with heuristics, which must be constantly expanded to accommodate new formulations.
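The brittleness of such a keyword heuristic is easy to demonstrate:

```python
import re

# The naive trigger-word heuristic from the text: look for "all".
def heuristic_wants_all_sheets(message):
    return bool(re.search(r"\ball\b", message.lower()))

requests = [
    "for all sheets",            # caught
    "every sheet",               # missed
    "create charts everywhere",  # missed
]
hits = [heuristic_wants_all_sheets(r) for r in requests]
print(hits)  # [True, False, False]
```

Every missed formulation requires another hand-written rule, whereas an LLM classifier generalises over phrasings it has never seen in a rule table.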
6.4 Retry logic and self-correction#
The implemented fix_code() method enables LLM-based error correction:
for attempt in range(max_retries):
    try:
        fig = execute_code(code, df)
        return fig
    except Exception as e:
        if attempt < max_retries - 1:
            code = llm.fix_code(
                failed_code=code,
                error_message=str(e),
                df_metadata=metadata,
            )
Findings:
- Works for standard errors (TypeError, KeyError, AttributeError)
- Moderately reduces error rate (qualitatively: ‘slightly suppressed’)
- Does not achieve 100% reliability
- Risk of retry loops in case of systematic problems
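The retry loop can be exercised end-to-end with a stubbed corrector in place of the LLM; all names here are invented stand-ins:

```python
def execute_code(code):
    # Stand-in executor: runs the snippet and returns its "result" variable.
    scope = {}
    exec(code, scope)
    return scope["result"]

def fix_code(failed_code, error_message):
    # Stand-in for llm.fix_code(): here it just repairs one known typo.
    return failed_code.replace("lenght", "len")

def run_with_retry(code, max_retries=3):
    for attempt in range(max_retries):
        try:
            return execute_code(code)
        except Exception as e:
            if attempt < max_retries - 1:
                code = fix_code(code, str(e))
            else:
                raise

print(run_with_retry("result = lenght([1, 2, 3])"))  # 3
```

The first attempt raises a NameError, the corrector patches the code, and the second attempt succeeds; exhausting the retries re-raises the last error instead of looping forever.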
Transferability:
Retry with LLM correction is useful for:
- Syntactic errors
- Simple logical errors
- Incorrect API usage
Less suitable for:
- Complex algorithm problems
- Performance optimisations
- Architectural errors
6.5 Examples compensate for model weaknesses#
The local LLM used, Qwen3-30B-A3B-Instruct-2507, handles large contexts (up to 250,000 tokens) but shows weaknesses in code generation.
The extensive pattern libraries successfully compensated for this weakness. This points to a general strategy: For weaker models, extensive, high-quality examples in context can reduce the performance gap to stronger models.
7. Challenges and limitations#
7.1 Multi-sheet handling#
Distinguishing between requests for single vs. multiple graphics was initially problematic. User formulations are diverse:
- ‘Create a chart’ (singular, but which sheet?)
- ‘For all sheets’ (clearly plural)
- ‘Show the data’ (unclear)
Solution:
- Intent patterns with explicit trigger words
- Contextual analysis (how many sheets are available?)
- Confidence scoring for ambiguous cases
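A simplified sketch of this disambiguation; trigger words, thresholds, and confidence values are invented for illustration:

```python
# Trigger words, thresholds, and confidence values here are assumptions.
MULTI_TRIGGERS = ("all sheets", "every sheet", "all tables", "everywhere")

def resolve_scope(message, sheet_names):
    text = message.lower()
    if any(t in text for t in MULTI_TRIGGERS):
        return ("create_multiple", 0.9)
    if len(sheet_names) == 1:
        return ("create_single", 0.9)   # only one sheet: nothing ambiguous
    return ("create_single", 0.5)       # ambiguous: low confidence, ask back

print(resolve_scope("Create a chart", ["Sales"]))          # ('create_single', 0.9)
print(resolve_scope("Charts for all sheets", ["A", "B"]))  # ('create_multiple', 0.9)
print(resolve_scope("Show the data", ["A", "B"]))          # ('create_single', 0.5)
```

A low confidence score is the signal to ask the user a clarifying question rather than to guess.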
7.2 Semantic colour mapping#
Natural language colour specifications such as ‘green for positive values’ required additional logic. The implemented SemanticColorHelper module:
- Recognises ordinal scales (Likert scales with ‘very satisfied’ to ‘very dissatisfied’)
- Extracts colour intentions from user messages
- Generates
color_discrete_mapfor category-based mappings
This was necessary to go beyond simple position-based colour assignments and enable semantically correct visualisations.
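The Likert-to-gradient part of such a helper might look like this; colour values and category ordering are illustrative, as the actual SemanticColorHelper is not shown in the source:

```python
# Colour values and category ordering are illustrative assumptions.
LIKERT_ORDER = ["very dissatisfied", "dissatisfied", "neutral",
                "satisfied", "very satisfied"]
GRADIENT = ["#d73027", "#fc8d59", "#ffffbf", "#91cf60", "#1a9850"]

def likert_color_map(categories):
    """Build a color_discrete_map (as accepted by px.bar) for ordinal data."""
    return {c: GRADIENT[LIKERT_ORDER.index(c)]
            for c in categories if c in LIKERT_ORDER}

cmap = likert_color_map(["satisfied", "neutral", "very dissatisfied"])
print(cmap)
```

The resulting dict can be passed straight to Plotly Express via `px.bar(..., color=answer_column, color_discrete_map=cmap)`, so agreement categories get semantically consistent colours regardless of their position in the data.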
7.3 Reliability: The 100% hurdle#
The fundamental limitation remains: It is difficult to implement solutions that work 100% of the time, not just almost always.
Observations:
- The system works reliably in the majority of cases (estimated 85-90%)
- Edge cases and unusual data structures lead to errors
- User formulations outside the trained spectrum may fail
Pragmatic approach:
Instead of striving for perfect reliability, the focus was placed on:
- Graceful degradation (communicating errors transparently)
- Retry mechanisms
- Extensive validation
- Clear error messages
This is acceptable for learning tools and exploratory tools, but would not be sufficient for production-critical systems.
8. Validation and status#
8.1 Test strategy#
The tool is currently in the testing phase. The validation strategy includes:
- Tests with various Excel files of varying complexity
- Multi-sheet documents with up to 10+ sheets
- Different data types (numerical, categorical, time series, ordinal scales)
- Different user formulations and intent types
Results to date:
- Multi-sheet handling works reliably
- Charts are generated correctly for different data types
- Intent recognition works accurately for standard formulations
8.2 Next steps#
The tool is currently being made available to a wider user group in order to:
- Identify edge cases
- Test unusual data structures
- Collect feedback on usability and functionality
- Further improve robustness
9. Transferable insights: Summary#
The following principles can be derived for future LLM-based coding projects:
- Multi-agents with clear phase separation (intent→plan→execute) work better than monolithic approaches
- Specification-driven development is superior to iterative micro-prompting for coherent, maintainable systems
- Pattern libraries significantly stabilise code generation, especially in domains with known best practices
- LLM control generalises better than heuristics, even if 100% reliability remains difficult to achieve
- Code indirection is a promising approach for LLM limitations in numerical and tabular tasks
- Extensive examples in context can reduce performance gaps in weaker models
- KISS principles are essential as a counterbalance in already complex systems
- Retry with LLM correction reduces errors moderately, but does not replace a solid basic architecture