LLM-supported data visualisation through code generation: A detailed case study#
1. Experimental Context and Motivation#
1.1 The Problem#
Despite their impressive capabilities in many areas, large language models (LLMs) show weaknesses when dealing with tabular data and numerical calculations. The central question of this experiment was: How can LLMs work effectively with structured data despite these limitations?
The chosen approach was based on code indirection: instead of confronting the LLM directly with numbers and calculations, it generates Python code for data analysis and visualisation. This code is executed in a controlled, secure environment – the actual calculations are performed by proven libraries such as Pandas and Plotly.
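A minimal sketch of this indirection, with a hard-coded stand-in for the LLM's output (in the real system the snippet would come back from the model):

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "S", "N", "S"],
                   "sales": [120, 80, 95, 110]})

# This string stands in for what the LLM would generate: the model
# writes code, and pandas performs the actual calculation.
llm_generated_code = "result = df.groupby('region', as_index=False)['sales'].sum()"

scope = {"pd": pd, "df": df.copy()}
exec(llm_generated_code, scope)
print(scope["result"])  # region N -> 215, S -> 190
```

The LLM never touches the numbers itself; it only has to produce a plausible pandas expression, which is a much more reliable task.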
Another learning objective was to investigate multi-agent systems: How can different specialised agents interact to solve complex tasks? Specifically, agents for intent recognition, planning, and execution were to work together in a coordinated manner.
1.2 Functional objective#
The functional question was pragmatic in nature: Is it possible to develop a tool that helps users quickly select suitable graphics for their data and creates simple visualisations largely automatically?
The approach was deliberately designed as a pure learning project – the primary aim was to find out how far agents and code generation could be used to solve problems, not to develop a tool ready for production.
2. The tool: Interactive Chart Generator#
2.1 Range of functions#
The Interactive Chart Generator processes CSV and Excel files (including multi-sheet support with automatic type recognition) and enables the creation of interactive Plotly visualisations via natural language chat requests.
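The multi-sheet loading with automatic type recognition can be sketched roughly as follows; the classification rules here are illustrative, not the tool's actual logic:

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None), which returns a
# dict of DataFrames; built inline so the sketch runs without a file.
sheets = {
    "Sales": pd.DataFrame({"month": pd.to_datetime(["2024-01-01", "2024-02-01"]),
                           "revenue": [1200, 1350]}),
    "Staff": pd.DataFrame({"team": ["A", "B"], "headcount": [5, 7]}),
}

def classify_columns(df):
    """Map each column to a coarse type later usable for chart planning."""
    kinds = {}
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            kinds[col] = "datetime"
        elif pd.api.types.is_numeric_dtype(df[col]):
            kinds[col] = "numeric"
        else:
            kinds[col] = "categorical"
    return kinds

metadata = {name: classify_columns(df) for name, df in sheets.items()}
print(metadata)
```

This per-sheet metadata, rather than the raw data, is what a planning step needs to choose a suitable chart type.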
Examples of user interactions:
- ‘Create bar charts for all sheets’
- ‘Show me the top 10 categories as a bar chart’
- ‘Colour the bars green for positive values’
- ‘Add percentage values above the bars’
The system distinguishes between different types of intent: modification of an existing chart, creation of a single chart, creation of charts for multiple sheets, or pure analysis queries without direct chart creation.
2.2 Technical stack#
The implementation is based on:
- Gradio (web interface): Fixed infrastructure for all experiments in this series
- Local LLMs (HU models) via OpenAI-compatible API: llm3 model
- Pandas: Proven solution for structured data processing
- Plotly Express: Pragmatic choice for interactive visualisations
- Python 3.11+: Development environment
The system comprises approximately 8,500 lines of code spread across 33 files (27 Python files) and was developed entirely with LLM support.
3. Architecture evolution: From two agents to Intent→Plan→Execute#
3.1 Version 1: Direct chat agent + chart agent#
The initial approach implemented two agents:
- A chat agent for communication
- A chart agent for visualisation creation
This approach proved to be too unstable. The main problem was that the system could not reliably distinguish between exploratory discussion (‘Which charts would be useful?’) and concrete execution requests (‘Create a chart now’). The error rate was high, as charts were often generated when only advice was desired, or vice versa.
3.2 Version 2: Introduction of intent recognition#
The second iteration introduced an IntentService that analyses and classifies user requests:
- modify: Change to the current chart
- create_single: Create a chart for a specific sheet
- create_multiple: Charts for multiple or all sheets
- analyze: Recommendations only, no chart creation
This separation significantly improved accuracy. The system was now better able to decide which action should be performed.
3.3 Version 3: The complete Intent→Plan→Execute workflow#
The final architecture introduced an additional planning step, as it became clear that direct translation from intent to execution was not sufficient for complex requests.
IntentService: Analyses the user request and extracts the intent
- Input: user message, available sheets, current chart status
- Output: UserIntent with action type, target sheets, reasoning, confidence

PlanService: Translates the intent into a concrete execution plan
- Input: UserIntent, DataFrame metadata, current chart code (when modifying)
- Output: ExecutionPlan with detailed ChartStep or ModificationStep objects

ExecutionService: Coordinates the actual execution
- Input: ExecutionPlan, session data
- Output: ExecutionResult with created charts or error messages
- Implements retry logic and error correction
These three services communicate via clearly defined data models (UserIntent, ExecutionPlan, ChartStep, ExecutionResult), which significantly improves testability and maintainability.
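A rough reconstruction of these data models as dataclasses; field names beyond those mentioned above are assumptions, not the project's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class UserIntent:
    action: str        # "modify" | "create_single" | "create_multiple" | "analyze"
    target_sheets: list
    reasoning: str
    confidence: float

@dataclass
class ChartStep:
    sheet: str
    pattern_id: str    # e.g. "PATTERN_SIMPLE_BAR"
    columns: dict      # role -> column name, e.g. {"x": "region", "y": "sales"}

@dataclass
class ExecutionPlan:
    intent: UserIntent
    steps: list = field(default_factory=list)

@dataclass
class ExecutionResult:
    charts: list = field(default_factory=list)
    errors: list = field(default_factory=list)

plan = ExecutionPlan(
    intent=UserIntent("create_single", ["Sales"], "explicit request", 0.95),
    steps=[ChartStep("Sales", "PATTERN_SIMPLE_BAR", {"x": "month", "y": "revenue"})],
)
print(plan.steps[0].pattern_id)
```

Typed models like these make each service's contract explicit, which is what enables the isolated testing mentioned above.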
3.4 Secure code execution#
Code execution takes place in a controlled sandbox:
safe_globals = {
    "pd": pd,
    "px": px,
    "go": go,
    "np": np,
    "df": df.copy(),
}
safe_builtins = {
    "len": len, "str": str, "int": int, "float": float,
    # ... other permitted functions, but WITHOUT __import__
}
safe_globals["__builtins__"] = safe_builtins
safe_locals = {}
exec(code, safe_globals, safe_locals)

This sandbox implementation was added as a separate learning field after it became clear that chart code can become quite complex. Restricting the namespace to predefined modules (and omitting __import__) prevents potentially dangerous operations.
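A self-contained variant of this sandbox can be exercised as follows (restricted to pandas so the sketch runs without Plotly; the helper name is invented):

```python
import pandas as pd

def run_sandboxed(code, df):
    """Execute generated code against a whitelisted namespace, no __import__."""
    safe_builtins = {"len": len, "str": str, "int": int, "float": float,
                     "min": min, "max": max, "round": round}
    safe_globals = {"pd": pd, "df": df.copy(), "__builtins__": safe_builtins}
    safe_locals = {}
    exec(code, safe_globals, safe_locals)
    return safe_locals

df = pd.DataFrame({"v": [1, 2, 3]})
out = run_sandboxed("total = int(df['v'].sum())", df)
print(out["total"])  # 6

# Imports fail because __import__ is missing from the builtins whitelist.
try:
    run_sandboxed("import os", df)
except ImportError as e:
    print("blocked:", e)
```

Because every `import` statement ultimately calls `__import__`, leaving it out of the whitelist is enough to block module loading inside the executed code.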
4. Development decision: Pattern libraries#
4.1 The problem: Unstable code generation#
A key insight during development was that pure LLM-based code generation for chart creation did not work reliably enough. Common problems:
- Inconsistent use of .groupby() (with/without as_index=False)
- Forgotten .dropna() calls, leading to errors on missing values
- Complex and fragile constructs such as pd.Categorical() with nested lists
- .size() results were often not renamed from "size" to "count", leading to column conflicts
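The stable form that the pattern library later enforced looks roughly like this:

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "a", None], "val": [1, 2, 3, 4]})

# Fragile form the anti-patterns warn against:
#   df.groupby("cat").size()   -> index-based result, "size" column conflicts
# Stable form encoded in the pattern library:
counts = (
    df.dropna(subset=["cat"])              # explicit NaN handling
      .groupby("cat", as_index=False)      # keep "cat" as a regular column
      .size()
      .rename(columns={"size": "count"})   # avoid the "size" naming conflict
)
print(counts)
```

Each of the three chained calls addresses one of the recurring failure modes listed above.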
4.2 The solution: hybrid approach with 44 code patterns#
The implementation of extensive pattern libraries resolved this instability. The system comprises:
23 working implementation patterns:
- PATTERN_SIMPLE_BAR: Simple bar chart with aggregation
- PATTERN_GROUPED_BAR_TWO_VARS: Grouped bars with colour coding
- PATTERN_TIME_SERIES_SIMPLE: Time series visualisation
- PATTERN_TOP_N_SIMPLE: Top N display with sorting
- PATTERN_SEMANTIC_COLOR_DISCRETE_MAP: Semantic colour mapping
- … and 18 more
7 anti-patterns for error prevention:
- ANTI_PATTERN_COMPLEX_CATEGORICAL: Warns against pd.Categorical() with complex lists
- ANTI_PATTERN_NO_DROPNA: Enforces explicit NaN handling
- ANTI_PATTERN_MISSING_RESET_INDEX: Prevents index problems after groupby
- … and 4 more
9 modification patterns:
- MODIFICATION_PATTERN_SINGLE_COLOR: Simple colour change
- MODIFICATION_PATTERN_ADD_LABELS: Value labels above bars
- MODIFICATION_PATTERN_CHANGE_TITLE: Title adjustment
- … and 6 more
5 additional patterns:
- Semantic colour recognition
- Gradients for ordinal scales
- Pattern selection strategies
4.3 How the hybrid approach works#
The PlanService receives this pattern library in the prompt. The LLM selects the appropriate pattern based on:
- Data types (numeric, categorical, datetime)
- The user query
- The number of data points
It then adapts the chosen pattern by replacing the column names. This approach combines LLM intelligence (selection, adaptation) with template stability (proven implementations).
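The selection-plus-substitution step can be sketched as follows; the template format and helper name are invented for illustration, not taken from the project:

```python
# Hypothetical pattern entry; the library's real template format is not
# shown in the source, so this shape is an assumption.
PATTERN_SIMPLE_BAR = (
    "agg = df.groupby('{x}', as_index=False)['{y}'].sum()\n"
    "fig = px.bar(agg, x='{x}', y='{y}', title='{title}')\n"
)

def instantiate(pattern, **fields):
    """The mechanical core of the adaptation step: fill in column names."""
    return pattern.format(**fields)

code = instantiate(PATTERN_SIMPLE_BAR, x="region", y="sales",
                   title="Sales by region")
print(code)
```

The LLM's creative freedom is thus narrowed to choosing a pattern and naming the columns; the proven code skeleton stays fixed.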
5. Development process and workflow#
5.1 Specification-driven approach#
The development process was highly specification-driven, not exploratory-iterative. For each development stage, a comprehensive specification was created and then implemented.
Typical cycle:
- Specification creation: Discussion of technical approaches, architecture, UI design, component interactions, functional goals
- Implementation: LLM-assisted code generation based on specification
- Adjustment round: Fine-tuning and bug fixes
- Total per iteration stage: ~1 hour
The initial specification was particularly extensive and took about an hour. The total development time across all iteration stages was approximately 3 hours, plus 30 minutes for Docker deployment (more complex requirements).
5.2 Why specifications instead of micro-prompting?#
The conscious decision to use high-quality, detailed specifications was based on the realisation that small-scale micro-prompting leads to fragmented, inconsistent development. When components are developed in many small prompt loops:
- The overall context is missing
- Inconsistent naming conventions arise
- The architecture becomes unstructured
- The coordination effort increases
Comprehensive specifications enable the LLM to:
- Understand the overall architecture
- Implement consistently across components
- Anticipate interfaces
- Use consistent naming and structuring
5.3 KISS principle as a counterbalance#
With an already complex solution (multi-agent system, pattern libraries, secure code execution), the KISS principle (Keep It Small and Simple) was a deliberate strategy to avoid over-engineering.
Specific decisions:
- No excessive abstraction (e.g. no factory patterns for simple object creation)
- Direct function calls instead of complex event systems
- Sequential workflows instead of asynchronous architectures
- Clear data models instead of nested dictionaries
6. Methodological insights: Transferable learnings#
6.1 Intent→Plan→Execute as a proven architecture pattern#
The three-stage separation proved to be robust for complex NLP-driven systems:
Advantages:
- Clear responsibilities per service
- Better testability (each phase can be tested in isolation)
- More targeted error handling
- Possibility for dry runs (create plan without execution)
Transferability:
This pattern is not specific to chart generation. It is suitable for any system where natural language queries need to be translated into technical actions:
- Code generation tools
- Automation workflows
- Database query systems
- Configuration management
6.2 Pattern libraries stabilise code generation#
The integration of code patterns significantly improved stability. Quantitative statements on error reduction are difficult, but qualitatively the difference was clearly noticeable.
Findings:
- LLMs are good at learning from examples and adapting them
- Pattern libraries can effectively support abstract descriptions
- Anti-patterns are just as important as positive patterns
- The combination (LLM selection + template stability) is more robust than pure generation
Transferability:
For domains with known best practices and frequent error patterns (web development, API integration, database operations), pattern libraries are a promising approach.
6.3 LLM control outperforms heuristics#
Despite imperfect reliability, LLM-controlled processing proved superior to rule-based heuristics.
Example of multi-sheet handling: One heuristic could be: ‘If “all” is in the text → process all sheets’. But users formulate in many different ways:
- ‘for all sheets’
- ‘every sheet’
- ‘all tables’
- ‘all data’
- ‘create charts everywhere’
LLMs generalise better across these variations. The challenge remains: 100% reliability is difficult to achieve. But 90-95% with LLM is more practical than 70% with heuristics, which must be constantly expanded to accommodate new formulations.
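The brittleness of such a keyword heuristic is easy to demonstrate:

```python
import re

# The naive trigger-word heuristic from the text: look for "all".
def heuristic_wants_all_sheets(message):
    return bool(re.search(r"\ball\b", message.lower()))

requests = [
    "for all sheets",            # caught
    "every sheet",               # missed
    "create charts everywhere",  # missed
]
hits = [heuristic_wants_all_sheets(r) for r in requests]
print(hits)  # [True, False, False]
```

Every missed formulation requires another hand-written rule, whereas an LLM classifier generalises over phrasings it has never seen in a rule table.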
6.4 Retry logic and self-correction#
The implemented fix_code() method enables LLM-based error correction:
for attempt in range(max_retries):
    try:
        fig = execute_code(code, df)
        return fig
    except Exception as e:
        if attempt < max_retries - 1:
            code = llm.fix_code(
                failed_code=code,
                error_message=str(e),
                df_metadata=metadata,
            )
Findings:
- Works for standard errors (TypeError, KeyError, AttributeError)
- Moderately reduces error rate (qualitatively: ‘slightly suppressed’)
- Does not achieve 100% reliability
- Risk of retry loops in case of systematic problems
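The retry loop can be exercised end-to-end with a stubbed corrector in place of the LLM; all names here are invented stand-ins:

```python
def execute_code(code):
    # Stand-in executor: runs the snippet and returns its "result" variable.
    scope = {}
    exec(code, scope)
    return scope["result"]

def fix_code(failed_code, error_message):
    # Stand-in for llm.fix_code(): here it just repairs one known typo.
    return failed_code.replace("lenght", "len")

def run_with_retry(code, max_retries=3):
    for attempt in range(max_retries):
        try:
            return execute_code(code)
        except Exception as e:
            if attempt < max_retries - 1:
                code = fix_code(code, str(e))
            else:
                raise

print(run_with_retry("result = lenght([1, 2, 3])"))  # 3
```

The first attempt raises a NameError, the corrector patches the code, and the second attempt succeeds; exhausting the retries re-raises the last error instead of looping forever.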
Transferability:
Retry with LLM correction is useful for:
- Syntactic errors
- Simple logical errors
- Incorrect API usage
Less suitable for:
- Complex algorithm problems
- Performance optimisations
- Architectural errors
6.5 Examples compensate for model weaknesses#
The local LLM used, Qwen3-30B-A3B-Instruct-2507, handles large contexts (up to 250,000 tokens) but shows weaknesses in code generation.
The extensive pattern libraries successfully compensated for this weakness. This points to a general strategy: For weaker models, extensive, high-quality examples in context can reduce the performance gap to stronger models.
7. Challenges and limitations#
7.1 Multi-sheet handling#
Distinguishing between requests for single vs. multiple graphics was initially problematic. User formulations are diverse:
- ‘Create a chart’ (singular, but which sheet?)
- ‘For all sheets’ (clearly plural)
- ‘Show the data’ (unclear)
Solution:
- Intent patterns with explicit trigger words
- Contextual analysis (how many sheets are available?)
- Confidence scoring for ambiguous cases
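A simplified sketch of this disambiguation; trigger words, thresholds, and confidence values are invented for illustration:

```python
# Trigger words, thresholds, and confidence values here are assumptions.
MULTI_TRIGGERS = ("all sheets", "every sheet", "all tables", "everywhere")

def resolve_scope(message, sheet_names):
    text = message.lower()
    if any(t in text for t in MULTI_TRIGGERS):
        return ("create_multiple", 0.9)
    if len(sheet_names) == 1:
        return ("create_single", 0.9)   # only one sheet: nothing ambiguous
    return ("create_single", 0.5)       # ambiguous: low confidence, ask back

print(resolve_scope("Create a chart", ["Sales"]))          # ('create_single', 0.9)
print(resolve_scope("Charts for all sheets", ["A", "B"]))  # ('create_multiple', 0.9)
print(resolve_scope("Show the data", ["A", "B"]))          # ('create_single', 0.5)
```

A low confidence score is the signal to ask the user a clarifying question rather than to guess.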
7.2 Semantic colour mapping#
Natural language colour specifications such as ‘green for positive values’ required additional logic. The implemented SemanticColorHelper module:
- Recognises ordinal scales (Likert scales with ‘very satisfied’ to ‘very dissatisfied’)
- Extracts colour intentions from user messages
- Generates
color_discrete_mapfor category-based mappings
This was necessary to go beyond simple position-based colour assignments and enable semantically correct visualisations.
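The Likert-to-gradient part of such a helper might look like this; colour values and category ordering are illustrative, as the actual SemanticColorHelper is not shown in the source:

```python
# Colour values and category ordering are illustrative assumptions.
LIKERT_ORDER = ["very dissatisfied", "dissatisfied", "neutral",
                "satisfied", "very satisfied"]
GRADIENT = ["#d73027", "#fc8d59", "#ffffbf", "#91cf60", "#1a9850"]

def likert_color_map(categories):
    """Build a color_discrete_map (as accepted by px.bar) for ordinal data."""
    return {c: GRADIENT[LIKERT_ORDER.index(c)]
            for c in categories if c in LIKERT_ORDER}

cmap = likert_color_map(["satisfied", "neutral", "very dissatisfied"])
print(cmap)
```

The resulting dict can be passed straight to Plotly Express via `px.bar(..., color=answer_column, color_discrete_map=cmap)`, so agreement categories get semantically consistent colours regardless of their position in the data.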
7.3 Reliability: The 100% hurdle#
The fundamental limitation remains: It is difficult to implement solutions that work 100% of the time, not just almost always.
Observations:
- The system works reliably in the majority of cases (estimated 85-90%)
- Edge cases and unusual data structures lead to errors
- User formulations outside the trained spectrum may fail
Pragmatic approach:
Instead of striving for perfect reliability, the focus was placed on:
- Graceful degradation (communicating errors transparently)
- Retry mechanisms
- Extensive validation
- Clear error messages
This is acceptable for learning tools and exploratory tools, but would not be sufficient for production-critical systems.
8. Validation and status#
8.1 Test strategy#
The tool is currently in the testing phase. The validation strategy includes:
- Tests with various Excel files of varying complexity
- Multi-sheet documents with up to 10+ sheets
- Different data types (numerical, categorical, time series, ordinal scales)
- Different user formulations and intent types
Results to date:
- Multi-sheet handling works reliably
- Charts are generated correctly for different data types
- Intent recognition works accurately for standard formulations
8.2 Next steps#
The tool is currently being made available to a wider user group in order to:
- Identify edge cases
- Test unusual data structures
- Collect feedback on usability and functionality
- Further improve robustness
9. Transferable insights: Summary#
The following principles can be derived for future LLM-based coding projects:
- Multi-agents with clear phase separation (intent→plan→execute) work better than monolithic approaches
- Specification-driven development is superior to iterative micro-prompting for coherent, maintainable systems
- Pattern libraries significantly stabilise code generation, especially in domains with known best practices
- LLM control generalises better than heuristics, even if 100% reliability remains difficult to achieve
- Code indirection is a promising approach for LLM limitations in numerical and tabular tasks
- Extensive examples in context can reduce performance gaps in weaker models
- KISS principles are essential as a counterbalance in already complex systems
- Retry with LLM correction reduces errors moderately, but does not replace a solid basic architecture