Document summarisation with LLM coding: The beginning of a methodical learning journey#
Introduction: The first systematic experiment#
This document describes the development of a document summarisation tool as the starting point for a series of methodical experiments on LLM-supported coding. Unlike later, more complex projects in the series, this tool – developed more than a year ago – marks the conscious entry into systematic LLM coding.
The temporal context is important: since then, all the technologies involved have evolved considerably. Today, LLMs offer greater context, better reasoning abilities and more reliable prompt following. Libraries and frameworks have been refined. The findings documented here should therefore be understood in their historical context – they show the state of the art and our own competence at the beginning of the learning journey.
However, this starting point is important in the context of this series of articles: it shows how to get started with simple tools and, from there, move on to increasingly complex programmes. The principles developed here remained relevant in later projects.
Unlike purely exploratory projects, this tool arose from a specific practical need: the rapid summarisation of arXiv papers according to various questions, even for large documents.
The experiment had a dual purpose: on the one hand, the aim was to create a functional tool; on the other hand, this was the first conscious attempt to systematically explore LLM-supported coding. Three key challenges needed to be understood:
- Dealing with context limitations when processing large documents
- Avoiding hallucinations in generated summaries
- Enforcing prompt following for complex, multi-layered tasks
The result was surprisingly simple: 280 lines of Python code, developed in 5-6 hours, with several hundred daily uses in productive operation.
Technical architecture and implementation#
System overview#
The tool is based on a multi-layered architecture:
**UI layer:** Gradio was deliberately chosen as the web framework because it is easy to understand for inexperienced users and provides all the necessary components for LLM applications. The decision to use a web interface enabled uncomplicated document processing while maintaining data protection.
**Backend LLM:** Mistral Small 2506 with 32K token context, operated locally. The decision against commercial APIs was made deliberately due to token consumption and data protection requirements. The 32K context already enabled direct summarisation without chunking for many documents.
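The LLM wrapper class itself is not reproduced in the source; the following is a minimal sketch of how a wrapper around a locally hosted, OpenAI-compatible endpoint might look. The URL, model name and parameter values are assumptions for illustration, not the tool's actual code:

```python
import json
from urllib import request

LLM_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local endpoint

def build_chat_payload(system_prompt: str, user_text: str,
                       temperature: float = 0.2, max_tokens: int = 2048) -> dict:
    """Assemble an OpenAI-style chat payload for the locally hosted model."""
    return {
        "model": "mistral-small",  # placeholder model identifier
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def call_llm(payload: dict) -> str:
    """POST the payload to the local server and return the reply text."""
    req = request.Request(LLM_URL, data=json.dumps(payload).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Keeping payload construction separate from the HTTP call makes such a wrapper easy to test without a running server.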
**Processing pipeline:** The code base comprised 10 main functions in one file:
- LLM class as API wrapper
- Three format-specific extraction functions (PDF, DOCX, ODT)
- Central summarisation function with retry logic
- Gradio UI configuration
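The wiring between these functions is not shown in the source; a plausible sketch of how the three format-specific extraction functions could be dispatched by file extension (the extractor bodies here are illustrative stubs):

```python
from pathlib import Path

# Illustrative stubs standing in for the real format-specific extractors
def extract_pdf(path):
    return f"pdf text from {path}"

def extract_docx(path):
    return f"docx text from {path}"

def extract_odt(path):
    return f"odt text from {path}"

EXTRACTORS = {".pdf": extract_pdf, ".docx": extract_docx, ".odt": extract_odt}

def extract_text(path: str) -> str:
    """Select the extractor matching the file extension, or fail clearly."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported file format: {suffix}")
    return EXTRACTORS[suffix](path)
```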
The chunking strategy: Two-stage summarisation#
The core of the implementation is the strategy for documents that exceed the LLM context:
**Stage 1 – Segmentation and individual summarisation:**

```python
# Chunk size in characters, determined experimentally
chunk_size = 10000
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

summaries = []
for chunk in chunks:
    partial_summary = summarise_text(llm, chunk, instructions, summary_type, prompt_only, language)
    summaries.append(partial_summary)
```

Documents were broken down into 10,000-character segments. This chunk size was determined through systematic experimentation and represented a compromise: larger chunks reduced the number of API calls, while smaller chunks improved quality. Practical use showed that reducing the chunk size further improved summary quality, but at the expense of longer processing times (sometimes several minutes).
**Stage 2 – Meta summary:**

```python
# Prepend an explicit meta-instruction before combining the partial summaries
summaries.insert(0, "HERE ARE VARIOUS SUMMARIES THAT SHOULD BE SUMMARISED INTO A SINGLE SUMMARY. TAKE ALL PARTS INTO ACCOUNT. THE ORDER IS IMPORTANT. THE MOST IMPORTANT PARTS COME FIRST...")
final_summary = summarise_text(llm, "\n".join(summaries), instructions, summary_type, prompt_only, language)
```

The individual summaries were combined with an explicit meta-prompt and synthesised into a final summary. This two-stage strategy was developed co-creatively with the LLM, balancing ease of implementation and high performance.
Structure extraction: The success of pymupdf4llm#
A critical success factor was the use of pymupdf4llm for PDF processing. This library converted PDF content directly into LLM-friendly Markdown, which made processing PDF files for LLM consumption considerably easier.
In addition, the tool explicitly extracted structural elements:
```python
# For ODT files (similar for other formats)
if style_name and "Heading" in style_name:
    extracted_text += f"[{style_name}] {teletype.extractText(element)}\n"

# Tables
extracted_text += "[Table]\n"
for row in table_element.getElementsByType(table.TableRow):
    ...  # Cell extraction

# Image captions
if caption_text.strip().startswith("Caption:"):
    extracted_text += f"[Image Caption] {caption_text}\n"
```

This structured annotation with tags such as `[Heading]`, `[Table]` and `[Image Caption]` was particularly important for identifying key content and measurably improved the quality of the summaries.
Six summary types: prompt engineering in practice#
The tool implemented six different analysis modes through specialised system prompts. Of particular interest was the ‘Thematic Summary,’ which deliberately worked in a structure-agnostic manner:
```text
Generate a concise and professionally written summary of the document
without exaggeration, focusing on the topics rather than the slides.
Therefore, not every paragraph needs to be described, only the
important topics of the document. A short paragraph should be written for each topic
...
```

This type was particularly interesting and challenging to develop, as it attempted to describe the content thematically, independently of the document structure.
The ‘Critical Reflection’ implemented an explicit 4-step process:
```text
Complete the task in individual steps:
1. Consider how the task can best be completed...
2. Choose an approach that you can implement yourself...
3. Plan the individual steps...
4. Carry out the individual steps and develop the partial result...
```

These prompt templates were not only expanded and refined during the initial development, but were also refined continuously outside the development phases.
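The six modes amount to a lookup from mode name to system prompt; a minimal sketch with abbreviated placeholder prompts (the full production prompts are not reproduced in the source):

```python
# Abbreviated placeholders; the production prompts were considerably longer
SUMMARY_PROMPTS = {
    "Thematic Summary": "Generate a concise summary focusing on the topics of the document ...",
    "Critical Reflection": "Complete the task in individual steps: ...",
    # ... four further modes in the real tool
}

DEFAULT_TYPE = "Thematic Summary"

def system_prompt_for(summary_type: str, language: str) -> str:
    """Combine the mode-specific prompt with the language instruction."""
    base = SUMMARY_PROMPTS.get(summary_type, SUMMARY_PROMPTS[DEFAULT_TYPE])
    return f"{base}\nIMPORTANT: You generate all texts in the following language: {language}."
```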
Development process: iterative co-creation#
Time structure and iterations#
Development took place over several weeks, with a total time of 5-6 hours. Noteworthy was the very short specification phase of only 30 minutes, followed by iterative development. The architecture required 4-5 main iterations, primarily to optimise the chunking strategy.
Development was still very iterative: first via a minimal specification, then interactively. Various LLMs were used to test different code generation capabilities. In productive operation, however, the tool ran with Mistral Small 2506.
Avoiding overengineering#
A key methodological challenge in LLM coding is the tendency of LLMs to produce overly complex solutions with many levels of abstraction. In this project, overengineering was avoided by employing a specific strategy: incrementally adding features.
The individual functions were added one at a time, rather than specifying a complex overall architecture in advance; each addition was functionally justified and kept minimal in complexity.
The role of architecture ownership#
A key finding of the experiment was that the architecture proposals of the LLMs were not very usable. This contradicts the often-expressed hope that LLMs could also take over strategic architecture decisions.
Coding was easily possible in clear, small steps as long as the functionality was very narrowly defined. However, strategic architectural decisions – such as the choice of chunking strategy, the structuring of summary types, or the balance between simplicity and functionality – still required human expertise.
Prompt engineering: concrete techniques for production environments#
Capitalisation as signal amplification#
A surprisingly effective technique was the use of capitalisation for critical instructions:
```text
HERE ARE VARIOUS SUMMARIES THAT NEED TO BE COMBINED INTO A SINGLE SUMMARY.
TAKE ALL PARTS INTO ACCOUNT. THE ORDER IS IMPORTANT...
```

Because these instructions were short relative to the content they governed, capitalisation was very helpful: where the prompt-to-content ratio was unfavourable, it significantly reinforced the assertiveness of the instructions.
Structuring system prompts#
The prompts were designed according to established principles of system prompt development:
- Context: Define role and task (‘You are a helpful assistant who summarises texts’)
- Restrictions: Explicit prohibitions (‘Only consider information that is actually based on the processed document’)
- Tone: Specify writing style (‘The writing style should be academic and professional’)
- Output format: Specify structure (‘Structure the text using Markdown’)
- Language specification: Redundant reinforcement (‘IMPORTANT: You generate all texts in the following language: … This is IMPORTANT: Generate all texts only in the following language: …’)
The redundant repetition of the language instruction proved necessary to ensure consistent multilingual output.
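Assembled along these lines, such a system prompt might be built as follows; the wording is illustrative and this is a sketch, not the tool's actual code:

```python
def build_system_prompt(language: str) -> str:
    """Assemble a system prompt from context, restrictions, tone, format and language parts."""
    parts = [
        # Context: role and task
        "You are a helpful assistant who summarises texts.",
        # Restrictions: explicit prohibition of hallucinated content
        "Only consider information that is actually based on the processed document.",
        # Tone
        "The writing style should be academic and professional.",
        # Output format
        "Structure the text using Markdown.",
        # Language specification, deliberately stated twice for reinforcement
        f"IMPORTANT: You generate all texts in the following language: {language}.",
        f"This is IMPORTANT: Generate all texts only in the following language: {language}.",
    ]
    return "\n".join(parts)
```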
Limitations and practical insights#
Productive use and feedback#
The tool was used several hundred times a month by different users in the VPN. The feedback was positive, but not excellent – a realistic assessment that reflected the limitations of the technology.
Typical documents ranged from a few pages to hundreds of pages. The disadvantage of the chunking approach was that although the analyses produced good results, they took a long time – sometimes several minutes for large documents.
Identified limitations#
Graphically complex PDFs: Some files – especially PDFs with unusual, graphically complex layouts – could not be summarised well. Text extraction reached fundamental limits here that even better prompt engineering could not overcome.
Chunk size as a critical parameter: Reducing the chunk size improved the quality somewhat, but proportionally increased the processing time and the number of API calls. This revealed a fundamental trade-off in the architecture.
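The trade-off can be made concrete with simple arithmetic; a sketch assuming one API call per chunk plus one meta-summary call (the per-page character count is an illustrative assumption):

```python
import math

def calls_needed(doc_chars: int, chunk_size: int) -> int:
    """One summarisation call per chunk, plus one meta-summary call when chunking occurs."""
    n_chunks = math.ceil(doc_chars / chunk_size)
    return n_chunks + 1 if n_chunks > 1 else 1

# A 200-page document at roughly 3,000 characters per page
doc_chars = 200 * 3000
print(calls_needed(doc_chars, 10_000))  # 61 calls at the default chunk size
print(calls_needed(doc_chars, 5_000))   # 121 calls: halving the chunk size roughly doubles the work
```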
Fundamental LLM limitations: Coding was manageable when the LLM was combined with complementary in-memory data structures. However, limitations remained: the restricted context size, hallucinations in the responses, and insufficient prompt following.
Retry logic for production robustness#
The implementation of retry logic was necessary to cope with the high usage load. The tool was used by many people, which is why timeouts sometimes occurred. This robustness was not in the initial specification, but developed in response to practical use.
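The retry code is not shown in the source; a minimal sketch of retry with exponential backoff around the LLM call, using the five-attempt limit reported in the metrics section (the delay values are illustrative):

```python
import time

MAX_RETRIES = 5  # matches the limit listed in the metrics section

def with_retries(call, *args, base_delay: float = 1.0, **kwargs):
    """Retry a flaky call with exponential backoff; re-raise after the final attempt."""
    for attempt in range(MAX_RETRIES):
        try:
            return call(*args, **kwargs)
        except TimeoutError:
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

In production, the except clause would typically also cover connection errors raised by the HTTP client.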
Methodological insights and transferable principles#
The surprising simplicity of functional solutions#
What was most surprising was that such a relatively short program was sufficient to generate high-quality, helpful summaries. The chunking strategy – individual summaries followed by an overall summary – worked very well given the small amount of code.
This simplicity contrasted with traditional software development assumptions, which would suggest that significantly more complex architectures would be required for comparable functionality.
Workflow evolution: the value of advance specification#
The workflow improved significantly with this tool: from tools with very limited functionality to one that generated added value for many users, while remaining very manageable in scope.
A key insight: The effort put into specification and the coordination and definition of the overall architecture in advance was helpful. Iterative co-creation with LLMs worked best when strategic decisions were made in advance and LLMs were used for tactical implementation steps.
Transferable principles for LLM coding#
This experiment validated several principles for effective LLM-assisted coding:
Architecture ownership remains human: Strategic architecture decisions could not be delegated. LLMs were suitable for narrowly defined implementation tasks, but not for architectural planning.
Prefer libraries with LLM integration: Choosing pymupdf4llm was a major simplification. Libraries that already produced LLM-friendly output accelerated development.
Incremental addition of features reduces complexity: Instead of complex pre-architectures, gradual development worked better when each extension was functionally justified.
Prompt engineering is system architecture: The development of robust prompts was not downstream, but an integral part of the architecture. Techniques such as capitalisation, explicit hallucination prevention, and redundant reinforcement were systematically applicable tools.
Simplicity over premature optimisation: The working solution was surprisingly simple. Refraining from complex optimisations in favour of clarity and maintainability proved successful.
Technical metrics and resource consumption#
Code base:
- 280 lines of Python code in one file
- 10 main functions
- 7 core libraries (Gradio, PyMuPDF, pymupdf4llm, python-docx, odfpy, requests, filelock)
- 6 summary types with specialised prompts
Development effort:
- 5-6 hours total time
- 30 minutes initial specification
- 4-5 architecture iterations
- Spread over several weeks, intermittent sessions
Chunking parameters:
- Chunk size: 10,000 characters
- Max. retry attempts: 5
Usage:
- Several hundred accesses per month in the VPN
- Document size: a few to hundreds of pages
- Processing time: seconds to several minutes
- User group: university members from various disciplines
Conclusion: Simplicity as an emergent property – and the beginning of a journey#
This first systematic experiment demonstrated that LLM-supported coding could be highly efficient in clearly defined contexts. The development of a productively used tool in a few hours, which was used several hundred times a day, would not have been feasible in this timeframe without LLM support.
However, a key finding was not the speed, but the surprising simplicity of the working solution. 280 lines of code were sufficient for a robust, multilingual M