LLM-assisted coding: From specification to implementation of a document chat tool#

Introduction#

This documentation describes an experiment exploring LLM-assisted coding, which is part of an ongoing series investigating the possibilities and limitations of this development methodology. The focus is on how detailed specifications influence the quality and speed of code generation and which architecture patterns prove themselves for LLM-based development.

The tool developed primarily served as a vehicle for answering specific methodological questions. The practical usability of the tool was initially secondary, but proved to be unexpectedly high in the course of the experiment.

Experimental context and motivation#

Initial situation#

After successfully bringing a large language model with a 256K-token context into local operation, the question arose as to the practical limits of this context size. Previous experiments had shown that the consistency of LLM responses typically decreases as the token count grows. It was unclear whether and how large context windows could be used productively for processing extensive documents.

Learning objectives of the experiment#

The experiment pursued several methodological objectives:

  • Practical evaluation of the usability of long contexts for document processing
  • Systematic investigation of response consistency at high token utilisation
  • Deepening the methodology of rapid development through high-quality specification
  • Exploration of optimal token usage without classic chunking strategies

Selection of the exercise example#

The development of a ‘Talk to Documents’ tool appeared to be a suitable test field, as it combines several dimensions of complexity:

  • Processing of different document formats
  • Intelligent content optimisation for token efficiency
  • Multi-document management with context limitation
  • Implementation of a reference system for source citations
  • Integration of a web interface for practical usability

The complexity was sufficient to generate meaningful methodological insights without exceeding the limits of LLM-based development.

How the developed tool works#

Core functionality#

The tool allows up to 10 documents in various formats (PDF, Word, Excel, PowerPoint, Text, Markdown, reStructuredText, HTML, CSV, RTF) to be uploaded. After uploading, the documents go through several processing steps:

  • Extraction: Use of the Unstructured library to extract structured content from various formats
  • Content cleaning: Removal of headers, footers, page numbers and other recurring metadata
  • Deduplication: Identification and removal of duplicate or nearly identical content
  • Markdown formatting: Conversion to structure-preserving Markdown representation
  • Referencing: Automatic insertion of [P1], [P2] tags for source references
  • Token counting: Calculation and visualisation of context utilisation
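
The steps above can be sketched as a minimal pipeline. Function names, the paragraph-level granularity, and the simple cleaning rule are illustrative assumptions, not the tool's actual API:

```python
import re

# Hypothetical sketch of the processing pipeline described above.
# Names and rules are illustrative, not the tool's actual code.

def remove_page_numbers(paragraphs):
    """Content cleaning (reduced here to one rule): drop bare page numbers."""
    return [p for p in paragraphs if not re.fullmatch(r"\s*\d{1,4}\s*", p)]

def deduplicate(paragraphs):
    """Remove exact duplicates while preserving order."""
    seen, result = set(), []
    for p in paragraphs:
        if p not in seen:
            seen.add(p)
            result.append(p)
    return result

def add_paragraph_tags(paragraphs):
    """Referencing: prefix each paragraph with a [P1], [P2], ... tag."""
    return [f"[P{i}] {p}" for i, p in enumerate(paragraphs, start=1)]

def process_document(paragraphs):
    cleaned = remove_page_numbers(paragraphs)
    deduped = deduplicate(cleaned)
    return "\n\n".join(add_paragraph_tags(deduped))

doc = ["Intro paragraph.", "42", "Body text.", "Body text."]
print(process_document(doc))
# [P1] Intro paragraph.
#
# [P2] Body text.
```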

The processed documents are kept entirely in the LLM context, without external databases or retrieval mechanisms. Queries are sent directly to the LLM with the combined document context.
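
A minimal sketch of this direct-context approach, assuming a simple prompt layout (one heading per document plus the question; the tool's actual prompt format is not documented here):

```python
# Illustrative sketch: combine all processed documents into one prompt.
# The layout and the instruction wording are assumptions.

def build_prompt(documents, question):
    """Concatenate all documents into one context block plus the question."""
    parts = [f"## Document: {name}\n\n{content}"
             for name, content in documents.items()]
    context = "\n\n".join(parts)
    return (
        "Answer using only the documents below. "
        "Cite sources with their [P#] tags.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

docs = {
    "report.pdf": "[P1] Revenue grew by 12 percent.",
    "notes.md": "[P2] Growth was driven by exports.",
}
prompt = build_prompt(docs, "Why did revenue grow?")
print(prompt.splitlines()[0])
```

The assembled prompt is then sent to the LLM in a single request, with no retrieval step in between.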

Modularisation#

Modularisation follows a clear structure:

  • Extractors: Document processing and content extraction
  • Optimisers: Content cleaning, deduplication, formatting
  • LLM integration: API communication and multi-document management
  • UI: Gradio-based web interface

Each module remains under 1000 lines of code, which has proven to be a practical limit for LLM maintainability. The main application comprises 800 lines, the Word exporter 400 lines, the LLM interface 300 lines, and formatting 250 lines.

Development process#

Phase 1: Specification creation (90 minutes)#

The main effort of the project focused on creating a detailed specification in interactive collaboration with an LLM. This phase included:

  • Discussion of different architectural approaches and weighing their advantages and disadvantages
  • Evaluation of different reference systems for source citations
  • Definition of concrete implementation details for all modules
  • Specification of interfaces between components
  • Establishing conventions and limitations

Several alternatives were discussed for the reference system, including complex metadata structures, parallel documents with annotations, and various retrieval mechanisms. The final solution, simple [P1], [P2] markers, prevailed because it was both reliable and easy to implement.
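
One reason the flat marker scheme is easy to work with: resolving citations in a model answer needs only a single regex pass. A sketch, with illustrative names and data:

```python
import re

# Sketch: map [P#] citations in a model answer back to the tagged
# source paragraphs. Names and example data are illustrative.

def resolve_citations(answer, tagged_paragraphs):
    """Return the source paragraphs cited in the answer, in citation order."""
    index = {}
    for p in tagged_paragraphs:
        m = re.match(r"\[(P\d+)\]", p)
        if m:
            index[m.group(1)] = p
    cited = []
    for tag in re.findall(r"\[(P\d+)\]", answer):
        if tag in index and index[tag] not in cited:
            cited.append(index[tag])
    return cited

sources = ["[P1] Revenue grew by 12 percent.",
           "[P2] Growth came from exports."]
print(resolve_citations("Growth was export-driven [P2].", sources))
```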

A central theme during the specification process was the consistent application of the KISS principle (Keep It Small and Simple). This required active countermeasures, as LLMs tend to suggest complex solutions that may be contained in their training data but cannot always be cleanly replicated.

Phase 2: Code generation (60 minutes)#

Once the specification was complete, code generation was performed by transferring the complete specification document to the LLM. The implementation proceeded in three main iterations:

  1. Iteration 1: Generation of the core modules (extractors, optimisers, LLM interface)
  2. Iteration 2: Implementation of UI components and integration
  3. Iteration 3: Finalisation with Word export and deployment configuration

Most modules worked on the first run; only the document formatting and the integration of high-quality prompts for everyday use required minor adjustments.

Phase 3: Documentation and deployment (approx. 60-90 minutes)#

The final phase included the creation of README, deployment configuration and initial testing. The entire development process took three days, which allowed for iterative reflection phases between work sessions.

LLMs used#

During development, various LLMs were used for both specification creation and code generation. Qwen3-30B-A3B-Instruct-2507, a locally operated model with 256K token context, is used for the productive operation of the tool.

Technical features of the implementation#

Content optimisation#

Token efficiency was a key aspect, as it was unclear at the start of the project how far the available context could be exploited. The optimisation strategy comprised several parts:

Content cleaning: Automatic removal of recurring elements such as page numbers, headers and footers, copyright notices and graphic artefacts (lines, separators). Recognition is achieved through pattern matching and identification of repetitive structures.
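
A hedged sketch of such pattern-based cleaning; the patterns, the repeat threshold, and the length cut-off are assumptions:

```python
import re
from collections import Counter

# Sketch of recurring-element removal: lines matching simple artefact
# patterns (page numbers, separator lines) are dropped, as are short
# lines that repeat across many pages (likely headers/footers).
# Patterns and thresholds are illustrative assumptions.

ARTEFACT_PATTERNS = [
    re.compile(r"^\s*(page\s*)?\d{1,4}\s*$", re.IGNORECASE),  # page numbers
    re.compile(r"^[\s\-_=*]{4,}$"),                           # separator lines
]

def clean_lines(lines, repeat_threshold=3):
    counts = Counter(lines)
    cleaned = []
    for line in lines:
        if any(p.match(line) for p in ARTEFACT_PATTERNS):
            continue
        # a short line repeated on many pages is likely a header/footer
        if counts[line] >= repeat_threshold and len(line) < 60:
            continue
        cleaned.append(line)
    return cleaned

pages = ["ACME Annual Report", "Revenue grew.", "3",
         "ACME Annual Report", "Costs fell.", "4",
         "ACME Annual Report", "----", "Outlook is positive."]
print(clean_lines(pages))
```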

Deduplication: Three-step approach to eliminating redundant content: exact duplicates via hash comparisons, nearly identical content via fuzzy matching, structural duplicates such as repeated table headers via structural analysis.
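
The first two steps could look like this; the 0.9 similarity threshold and the use of difflib are assumptions, and the pairwise fuzzy comparison shown here is quadratic, so a real implementation would need some form of bucketing for large documents:

```python
import hashlib
from difflib import SequenceMatcher

# Sketch of the first two deduplication steps described above:
# exact duplicates via hashes, near-duplicates via fuzzy matching.
# The 0.9 threshold is an assumption.

def deduplicate(paragraphs, similarity=0.9):
    kept, hashes = [], set()
    for p in paragraphs:
        h = hashlib.sha256(p.encode()).hexdigest()
        if h in hashes:                    # step 1: exact duplicate
            continue
        if any(SequenceMatcher(None, p, q).ratio() >= similarity
               for q in kept):             # step 2: near-duplicate
            continue
        hashes.add(h)
        kept.append(p)
    return kept

paras = ["Revenue grew by 12 percent.",
         "Revenue grew by 12 percent.",    # exact duplicate
         "Revenue grew by 12  percent.",   # near-duplicate (extra space)
         "Costs fell sharply."]
print(deduplicate(paras))
```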

Markdown formatting: Conversion of tables, lists and headings into compact Markdown syntax while preserving the document structure. This significantly reduces token consumption compared to full HTML or proprietary formats.
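
A minimal sketch of the table conversion, assuming the extraction step delivers header and rows as plain lists:

```python
# Sketch of converting an extracted table into compact Markdown.
# The input structure (header list plus row lists) is an assumption
# about what the extraction step delivers.

def table_to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)

md = table_to_markdown(["Quarter", "Revenue"], [["Q1", 100], ["Q2", 112]])
print(md)
```

The Markdown form carries the same structure in far fewer tokens than the HTML or XML a format like DOCX would otherwise contribute.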

Token counting and visualisation#

The integration of Tiktoken for approximate token counting proved to be a helpful addition compared to previous projects. Visualising context utilisation allows the remaining capacity to be understood and prevents the context limit from being exceeded.
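
A sketch of the counting and visualisation; note that tiktoken's cl100k_base encoding only approximates the tokeniser of non-OpenAI models such as Qwen, and the character-based fallback heuristic is rougher still. The bar width and the 262,144-token limit are illustrative:

```python
# Sketch of approximate token counting with a context-utilisation bar.
# cl100k_base is an approximation for non-OpenAI tokenisers; the
# fallback heuristic (~4 characters per token) even more so.

def count_tokens(text):
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except ImportError:
        return max(1, len(text) // 4)   # rough heuristic fallback

def utilisation_bar(tokens, limit=262_144, width=20):
    filled = min(width, round(width * tokens / limit))
    return f"[{'#' * filled}{'.' * (width - filled)}] {tokens:,}/{limit:,} tokens"

print(utilisation_bar(131_072))   # half-full context
```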

This component has been adopted in subsequent projects and is establishing itself as a standard pattern for applications with context limitations.

Streaming and interactivity#

The implementation of token-by-token streaming significantly improves the perceived response speed. A stop button allows long responses to be cancelled, which proved necessary in practice.

A technical challenge was minimising UI flicker in LaTeX formulas during streaming. The solution uses a buffer mechanism that collects characters and performs updates only at semantic boundaries (sentence ends, paragraph breaks).
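
The buffering strategy can be sketched as follows; the exact boundary set is an assumption:

```python
# Sketch of the buffer mechanism: stream chunks are collected and the
# UI is only updated at semantic boundaries, so partially streamed
# LaTeX is never rendered. The boundary set is an assumption.

BOUNDARIES = (". ", "! ", "? ", "\n\n")

def buffered_updates(token_stream):
    """Yield UI updates only when the buffer ends at a boundary."""
    buffer = ""
    for chunk in token_stream:
        buffer += chunk
        if buffer.endswith(BOUNDARIES):
            yield buffer
            buffer = ""
    if buffer:            # flush whatever remains at stream end
        yield buffer

stream = ["The formula ", "$e^{i\\pi}", " = -1$ holds. ", "Next ", "sentence."]
print(list(buffered_updates(stream)))
```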

Methodological findings#

Specification quality as a success factor#

A key finding of the experiment: high-quality specifications translate directly into predominantly error-free, maintainable implementations. An LLM-compatible specification must go beyond functional requirements and establish the explicit connection between function, architecture and technical implementation.

Specifically, this means:

  • Not only ‘what’ is implemented, but also ‘how’ and ‘why’
  • Explicit mention of design decisions and their justifications
  • Conscious decisions on the frameworks used and their versions
  • Concrete examples of expected inputs and outputs
  • Specification of error handling and edge cases
  • Definition of interfaces between components

The quality of the specification continuously evolves over several projects. Each experiment provides insights into which aspects need to be formulated more explicitly.

KISS principle as an active control task#

Enforcing simple solutions required continuous attention during specification creation. LLMs tend to suggest complex, possibly over-engineered solutions based on their training data. These solutions are often not optimally replicable or result in code that is difficult to maintain.

The appropriateness of proposed approaches was continuously questioned:

  • Is this approach appropriate for the specific problem?
  • Are there simpler alternatives?
  • Are current versions of frameworks taken into account?
  • Can the solution be reliably implemented by the LLM?
  • Does the code remain maintainable and understandable?

This critical evaluation took place during the specification creation phase, not just during the code review.

Modularisation and size limitation#

The limit of 1000 lines per file proved to be a practical heuristic. This limitation is based on the observation that LLMs cannot always fully comprehend greater complexity. When maintaining or modifying larger files, the error rate increases noticeably.

Modularisation according to clear areas of responsibility enabled isolated processing of individual components. Each module remains understandable and testable. The interfaces between modules were explicitly defined in the specification.
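
A simple check for the 1000-line heuristic could be scripted like this; the directory name "src" is hypothetical:

```python
from pathlib import Path

# Small sketch of enforcing the 1000-line-per-module heuristic across
# a repository. The directory name "src" is hypothetical.

def oversized_modules(root="src", limit=1000):
    """Return {path: line count} for Python files above the limit."""
    base = Path(root)
    if not base.exists():
        return {}
    report = {}
    for path in base.rglob("*.py"):
        lines = len(path.read_text(encoding="utf-8", errors="ignore").splitlines())
        if lines > limit:
            report[str(path)] = lines
    return report

for name, lines in oversized_modules().items():
    print(f"{name}: {lines} lines (over the limit)")
```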

Limits and possibilities of LLM-supported development#

Despite considerable progress, specific limitations became apparent:

Complexity management: LLMs were not yet able to independently break down larger system complexity into smaller components. Breaking down the problem into manageable sub-problems remained the task of human developers.

Architectural decisions: Fundamental design decisions still required human judgement. LLMs were able to point out alternatives and discuss their advantages and disadvantages, but the final decision was made in an informed manner.

Error diagnosis: In the case of unexpected errors or edge cases, human intervention was often more efficient for diagnosis than iterative LLM queries.

Transferable patterns#

Several aspects proved to be potentially transferable beyond this project:

Specification-first approach: Investing nearly half of the total time in specification accelerated implementation and significantly reduced errors.

Token visualisation: The integration of token counting and context utilisation display is establishing itself as a standard pattern for context-based applications.

Modular size limitation: The 1000-line limit per file proved to be a robust heuristic for LLM-maintainable code.

Iterative specification refinement: The continuous improvement of specification quality across multiple projects had a measurable effect on implementation quality.

Validation and practical performance#

Functional tests#

The tool was tested by various people with different document types. The feedback was consistently positive, both in terms of functionality and speed.

Surprising strengths#

The tool was well received by testers and proved convincing in several respects:

Response consistency: The central experimental question about consistency at high token utilisation was answered positively. Even with documents containing many tokens, the responses remained consistent and accurate.

Processing speed: Response times were in the range of a few seconds and corresponded to typical LLM response times.

Versatile applicability: Beyond its original intention, the tool proved useful for document version comparisons and structured analyses.

Identified potential for improvement#

The source citation function worked reliably, but could be optimised in terms of detail and presentation. This primarily concerns the formatting and display of references, not the basic functionality.

Typical context utilisation was well below 250K tokens, leaving room for additional features or more complex document collections.

Productive use#

Following the test phase, the tool is intended for productive use. What was originally a purely experimental idea has developed into a productive tool through practical application. This was not foreseeable at the start of the project, but it demonstrates the importance of functional validation even in primarily methodologically motivated experiments.

Metrics and quantification#

Code size#

The final system comprises 3,100 lines of code, distributed across 16 Python files. The breakdown by component is as follows:

  • Main application (gradio_app.py): 800 lines
  • Word exporter: 400 lines
  • LLM interface: 300 lines
  • Content formatting: 250 lines
  • Remaining modules (extractors, optimisers, configuration): approx. 1350 lines

Development time#

Total effort: 3-4 hours, spread over 3 days

  • Specification creation: 90 minutes (approx. 40% of total time)
  • Code generation: 60 minutes (approx. 30% of total time)
  • Documentation and deployment: 60-90 minutes (approx. 30% of total time)

Spreading the work over several days allowed for reflection phases between work sessions.

Iterations#

Three main iterations led from the initial code to the final version. The low number of iterations was a direct result of the quality of the specification.

Supported formats#

The tool processes ten document formats (14 file extensions): PDF, Word (DOCX/DOC), Excel (XLSX/XLS), PowerPoint (PPTX/PPT), Text (TXT), Markdown (MD), reStructuredText (RST), HTML (HTML/HTM), CSV, RTF.

Conclusions and outlook#

The experiment confirmed several key hypotheses about LLM-assisted coding:

  • High-quality specifications were the critical success factor for efficient LLM-based development
  • KISS principles (Keep It Small and Simple) had to be actively enforced and did not arise automatically
  • Modular structuring with clear size limitations significantly improved LLM maintainability
  • The limitations of LLM-assisted development were primarily in complexity management and architectural decisions

The continuous refinement of specification techniques across multiple projects showed a measurable effect: each experiment contributed to the development of more robust patterns. Future experiments will focus on specific aspects: scaling to larger projects, handling different document domains, and integrating additional LLM capabilities such as vision models for OCR.

The methodology of the specification-first approach has proven to be functional and potentially scalable. The documented patterns provide guidance for your own LLM-supported development projects.