LLM coding experiment: Document translation system#
Methodological findings on the role of specification quality and the possibilities of LLM-supported development#
Summary#
This article documents the development of a web-based document translation system as an experiment in LLM-supported coding. The project investigated how far clear specifications can carry more complex applications and where the practical challenges of the methodology lie. With a total effort of 6-7 hours, a productively used application of about 12,000 lines of code emerged. A key finding: the quality of the specification – in particular, making technical dependencies explicit – proved to be a decisive success factor. At the same time, the project revealed limits: larger systems may require even more systematic specification processes.
1. Experimental context#
1.1 Motivation and learning objectives#
The idea for the project arose from a practical need: translation tasks for documents in various formats come up regularly in everyday work. At the same time, the aim was to explore the complexity hidden in a seemingly simple use case such as a translation tool. Translation is a natural application for LLMs, but translating documents of arbitrary size in various formats while preserving structure and stylistic continuity poses additional technical challenges.
The primary learning objective was to investigate how well LLMs can handle complex, multi-stage development tasks when a clear specification is available. From the outset, the project was designed both as a learning vehicle and as a potentially productive tool.
1.2 Technical challenges#
What initially appears simple turns out, on closer inspection, to be a complex system with several non-trivial sub-problems:
- Processing different document formats while preserving structure
- Intelligent chunking of large documents, taking into account sentence and paragraph boundaries
- Parallel API processing with rate limiting and error handling
- Consistency of translation across chunk boundaries
- Special handling of complex elements such as tables
- User-friendly interface with progress bar
- Session management for parallel use
This project was chosen because it combines several challenging areas and thus represents a realistic reflection of more complex development tasks.
2. Technical implementation#
2.1 Architecture and system design#
The system was developed using Python and Gradio as the web framework. The architecture follows a modular structure with a clear separation of concerns:
Core components:
- Document processing: Extraction and structure recognition from PDF, Word, text and Markdown
- Context management system: Ensuring stylistic continuity
- Chunking system: Intelligent segmentation taking semantic boundaries into account
- Translation engine: Parallel translation with context management
- Export system: Output in various formats (Markdown, Word, HTML)
- UI components: Gradio-based interface with side-by-side view
The architecture was developed interactively with the LLM. A key design decision was the ‘unified pipeline’: all document formats are first converted into a common Markdown representation, which then runs through the same processing pipeline. This significantly reduces complexity compared to the format-specific processing stages the LLM initially proposed during specification.
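As a sketch, the entry point of such a unified pipeline might be a suffix-based dispatcher. The function and converter names here are illustrative assumptions; in the real system the PDF and Word converters would be backed by pymupdf4llm and python-docx, which are omitted here:

```python
from pathlib import Path

def text_to_markdown(raw: str) -> str:
    # Plain text: treat blank-line-separated blocks as paragraphs.
    return "\n\n".join(block.strip() for block in raw.split("\n\n") if block.strip())

CONVERTERS = {
    ".md": lambda raw: raw,     # already Markdown, pass through unchanged
    ".txt": text_to_markdown,
    # ".pdf": converter based on pymupdf4llm (omitted in this sketch)
    # ".docx": converter based on python-docx (omitted in this sketch)
}

def to_unified_markdown(path: str, raw: str) -> str:
    """Convert any supported input into the common Markdown representation."""
    suffix = Path(path).suffix.lower()
    try:
        convert = CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported format: {suffix}")
    return convert(raw)
```

Everything downstream (chunking, translation, export) then only ever sees Markdown, which is the simplification the design aims for.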
2.2 Libraries used#
The main libraries were selected pragmatically:
- Gradio: Proven in many projects for rapid UI development
- PyMuPDF/pymupdf4llm: For PDF processing, based on previous experience
- python-docx: Standard for Word document processing
- aiohttp: For asynchronous HTTP requests to the LLM API
- tenacity: Retry logic with exponential backoff
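The retry behaviour named above can be sketched with the standard library alone; the actual project uses tenacity for this, so the helper below is a simplified stand-in, not the real implementation:

```python
import asyncio
import random

async def with_retries(coro_factory, attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry an async call with exponential backoff (stand-in for tenacity)."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff with a little jitter to avoid thundering herds.
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, base_delay))
```

With tenacity, the equivalent would be a `@retry(wait=wait_exponential(...), stop=stop_after_attempt(...))` decorator on the API-calling coroutine.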
The complete application comprises approximately 12,000 lines of code (excluding comments) spread across 40 files, organised into clearly structured modules for document processing, chunking, translation, UI and session management.
2.3 Special implementation details#
Uniform pipeline: Initially, the LLM proposed format-specific processing stages – separate pipelines for PDF, Word and other formats. Discussing with the LLM how processing quality differed across formats made it clear that a uniform Markdown-based pipeline is significantly simpler and more maintainable.
Context management for translation consistency: A key challenge was to ensure stylistic and terminological consistency across chunk boundaries. The implemented solution passes the following to each translation chunk:
- An optional glossary with defined translations
- The last 50 translated key terms
- Summaries of the previous 3 chunks (200 characters each)
- Position information in the document
This approach worked surprisingly well and delivered consistent translations even for large documents.
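A minimal sketch of how such a context payload might be assembled. The field names and the exact structure are assumptions; only the limits (50 terms, 3 summaries of 200 characters, position information) come from the description above:

```python
def build_chunk_context(index, total, glossary, recent_terms, prev_summaries):
    """Assemble the per-chunk translation context (illustrative field names)."""
    return {
        # Position information in the document
        "position": f"chunk {index + 1} of {total}",
        # Optional glossary with defined translations
        "glossary": glossary or {},
        # The last 50 translated key terms
        "recent_terms": recent_terms[-50:],
        # Summaries of the previous 3 chunks, 200 characters each
        "previous_summaries": [s[:200] for s in prev_summaries[-3:]],
    }
```

This payload would then be rendered into the translation prompt alongside the chunk text itself.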
Smart Boundary Detection: The chunking system respects a hierarchy: headings > paragraphs > sentences. It never splits in the middle of a sentence. Tables are treated as complete units; if the token limits are exceeded, they are split with repeated headers. This logic proved to be technically challenging, but it works reliably.
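The headings > paragraphs > sentences hierarchy can be illustrated with a greedy chunker. This is a deliberately simplified sketch (naive sentence splitting, character rather than token limits, no table handling); the real system's logic is described as considerably more involved:

```python
import re

def split_sentences(text):
    # Naive sentence splitter; a real implementation needs more care.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_markdown(markdown, max_chars=1000):
    """Greedy chunking honouring headings > paragraphs > sentences."""
    chunks, current = [], ""

    def flush():
        nonlocal current
        if current.strip():
            chunks.append(current.strip())
        current = ""

    for block in markdown.split("\n\n"):
        if block.lstrip().startswith("#"):           # headings start a new chunk
            flush()
            current = block + "\n\n"
            continue
        if len(current) + len(block) <= max_chars:   # whole paragraph fits
            current += block + "\n\n"
            continue
        for sentence in split_sentences(block):      # fall back to sentences,
            if len(current) + len(sentence) > max_chars:
                flush()                              # but never split mid-sentence
            current += sentence + " "
        current += "\n\n"
    flush()
    return chunks
```

The key property is that a cut can only ever land on a heading, paragraph, or sentence boundary, never inside a sentence.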
Parallel processing with rate limiting: By default, the system processes 5 chunks in parallel, respecting API limits and implementing retry logic with exponential backoff. Asynchronous processing was a deliberately chosen learning field and works robustly.
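The "5 chunks in parallel" behaviour maps naturally onto an `asyncio.Semaphore`. The sketch below assumes a placeholder `translate_chunk` coroutine standing in for the real aiohttp call to the LLM API:

```python
import asyncio

async def translate_chunk(chunk):
    # Stand-in for the real aiohttp request to the translation API.
    await asyncio.sleep(0)
    return chunk.upper()

async def translate_all(chunks, max_parallel=5):
    """Translate chunks concurrently, with at most max_parallel in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(i, chunk):
        async with sem:  # blocks when max_parallel tasks are already running
            return i, await translate_chunk(chunk)

    results = await asyncio.gather(*(bounded(i, c) for i, c in enumerate(chunks)))
    return [text for _, text in sorted(results)]  # restore document order
```

Rate limiting and the retry logic mentioned above would wrap the API call inside `translate_chunk`; the semaphore only bounds concurrency.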
3. Development process with LLM#
3.1 Workflow and methodology#
The development process followed a clear pattern:
- Specification development (30-60 minutes): Iterative discussion with the LLM to develop functional and technical requirements
- Code generation in large steps: Implementation of complete modules instead of incremental individual steps
- Iterative corrections: 2-3 iterations for complex modules such as chunking and translation engine
- Documentation and deployment (approx. 60 minutes)
The total effort was 6-7 hours, spread over three main iterations. This contrasts significantly with the 4-6 weeks originally estimated for manual development.
3.2 LLMs used#
Various LLMs were used during development. In productive operation, the system runs on Mistral Small 2506, which suggests that even for more complex translation tasks, the largest models are not necessarily required.
3.3 Quality of the specification#
The specification was already clear in terms of functionality, but had not yet been fully thought through in terms of technical implementation and the technical dependencies of functional decisions. This meant that important technical aspects had to be clarified iteratively:
- The decision to use a unified pipeline instead of format-specific processing
- The specific chunking procedure with smart boundaries
- The implementation of context management
- Dealing with table splitting
This project marked an intermediate step in the development of the specification methodology: better specification than in previous projects, but not yet fully thought through down to the last detail.
4. Methodological findings#
4.1 Key finding: specification quality was crucial#
The demands on specifications for LLM-supported coding proved to be high. The more thoroughly the functional requirements and technical dependencies were worked out in advance, the smoother and more error-free the subsequent implementation. Interacting with LLMs tempts one to gloss over details and complications – and this is precisely where errors occurred.
Aspects that proved to be important in this project:
- Functional requirements were formulated so clearly that technical dependencies became comprehensible
- Technically difficult aspects were explicitly described and discussed
- Alternative solutions were weighed up and the decision documented
- Architectural decisions were clearly justified
4.2 Technical dependencies proved to be central#
An important insight was that functional clarity was not enough. The technical implementation and, in particular, the dependencies between components also had to be clearly identified. Examples from this project:
- The decision to use Markdown as an intermediate format had an impact on the entire pipeline
- The chunking process directly influenced the translation quality and was coordinated with context management
- The choice of parallel processing required appropriate session management
If these dependencies were not clarified in advance, inconsistencies arose or refactoring was required.
4.3 LLMs tend towards overengineering#
A consistent observation was the tendency of LLMs to propose solutions that were more complex than necessary. In problem areas that appeared simple, complicated but ‘clear’ approaches were often suggested. The better solution usually resulted from careful elaboration and consideration of the alternatives.
Specific examples:
- Format-specific pipelines instead of uniform Markdown processing
- Complex state management solutions where simple approaches would have sufficed
- Excessive abstraction for functions that were initially used only once
The role of the developers shifted: it was less about code creation and more about critically evaluating the LLM proposals and actively demanding simpler solutions.
4.4 Limits of the methodology#
With around 12,000 lines of code, this project already approaches a limit in scope. For larger projects, even clearer and more systematic specification processes may be necessary. The insight from this experiment: with a good, though not exceptionally high-quality, specification, systems of this complexity could still be implemented successfully. For larger systems, however, an even higher standard may be required.
Identified limitations:
- Above a certain system size, functional clarity was no longer sufficient
- Architectural decisions required more explicit documentation and justification
- Interfaces between modules required more precise specification in this case
- Performance requirements proved to be important to quantify
4.5 Proven patterns and structuring approaches#
Uniform pipeline: Consolidating to a uniform intermediate format (Markdown) simplifies the system considerably and reduces sources of error. This is a potentially transferable pattern: normalise heterogeneous inputs to a common format, then process them uniformly.
Modular prompt structures: The use of separate, combinable prompt modules for different aspects (base translation, glossary, context, styles) proved to be maintainable and flexible.
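The modular prompt structure can be sketched as small, combinable builder functions. The module names and prompt wording below are illustrative assumptions; only the set of aspects (base translation, glossary, context, styles) comes from the text:

```python
def base_prompt(target_lang):
    return f"Translate the following Markdown into {target_lang}. Preserve all formatting."

def glossary_module(glossary):
    lines = "\n".join(f"- {src} -> {dst}" for src, dst in glossary.items())
    return "Use these fixed translations:\n" + lines

def context_module(summaries):
    return "Context from earlier chunks:\n" + "\n".join(summaries)

def compose_prompt(target_lang, glossary=None, summaries=None):
    """Combine only the prompt modules that apply to this chunk."""
    parts = [base_prompt(target_lang)]
    if glossary:
        parts.append(glossary_module(glossary))
    if summaries:
        parts.append(context_module(summaries))
    return "\n\n".join(parts)
```

Because each module is independent, new aspects (e.g. a style module) can be added without touching the others, which is what makes the structure maintainable.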
Context management across boundaries: The approach of explicitly including context when processing segmented data (chunks) could also be transferable to other domains – for example, when processing large codebases or long documents.
5. Validation and productive use#
5.1 Functional validation#
Validation was carried out by:
- Systematic testing with different document types (PDF, Word, text) and sizes
- Manual checking of translation quality and structure preservation
- Performance testing with large documents
- Checking of output formats
The system proved to be robust and delivers high-quality translations with reliable structure preservation.
5.2 Productive use#
The tool is now in productive use. Because output goes to files, it is particularly useful for larger translations. The system has grown from an initial 6 to 10 supported languages. Additional languages are currently being requested; for each, it must first be verified that the LLMs used can produce them at sufficient quality.
The translation speed depends heavily on the document size or the number of chunks. Smaller files are processed very quickly, while larger documents benefit from parallel processing.
5.3 Evolution from experiment to productive tool#
The tool has evolved from a pure experiment to a productively used tool. This was both planned and a positive surprise in terms of code quality and robustness. The fact that the system is stable in use and has been expanded validates the fundamental feasibility of LLM-supported development for this class of complexity.
6. Observations for future projects#
6.1 Investment in specification quality#
A key observation for future projects: In this project, considerable effort was invested in genuinely clear specifications – for the functional descriptions as well as for the architecture and detailed technical implementation concepts.
The time saved (6-7 hours instead of several weeks) was made possible by clarifying the requirements in advance. It became apparent that the more complex the system, the higher the demands on the specification.
6.2 Role of the developers#
In this project, the role of the developers shifted from code creation to:
- Making and justifying architectural decisions
- Critically evaluating LLM proposals
- Actively demanding simple solutions
- Quality assurance of the specification
- Identifying and explaining technical dependencies
6.3 Balance between functional and technical specification#
A purely functional specification was not sufficient in this project. The technical implementation – in particular architectural decisions and dependencies between components – proved to be just as important to clarify. This required technical understanding already in the specification phase.
6.4 Iterative refinement as observation#
Even with a good specification, 2-3 iterations were necessary for complex modules. The challenge was to clarify the right aspects in the first iteration and not to invest too much time in revision.
7. Summary and outlook#
The experiment showed that LLM-supported coding worked efficiently for systems of medium complexity (here: ~12,000 lines of code) given a good specification. The time savings were considerable. At the same time, limits became apparent: how the specification is developed and how the implementation process is structured are the decisive factors.
For larger projects, even more systematic specification processes may be necessary. The challenge lies in documenting functional and technical requirements as well as architectural decisions in such a way that LLMs can generate consistent, maintainable code from them.
The tendency of LLMs to over-engineer required active control by the developers. Simple solutions had to be demanded and alternatives had to be weighed up. The role of developers is changing from code writers to architects and quality assurance specialists.
Metrics at a glance:
- Development effort: 6-7 hours (including 30-60 minutes of specification)
- Code size: ~12,000 lines in 40 files
- Main iterations: 3 (initial → uniform pipeline → final version)
- Supported languages: 6 initially, 10 currently
- Status: In productive use
This article is part of a series on the methodical documentation of LLM-supported development projects.