LLM-Enabled Development: Cascading LLM Workflows#
A case study on multi-stage text processing with focused prompts
Abstract#
This article documents insights gained from the development of a tool for processing speech-to-text transcripts using cascading LLM workflows. Focused, sequential prompts, each with a specific task, led to significantly better results than complex multi-task prompts. The project provided transferable insights into scaling limits, prompt engineering patterns, and the role of experimental environments in LLM-based development.
1. Experimental Context and Motivation#
1.1 The Underlying Problem#
The project was motivated by two factors: a specific practical problem and an exploratory technical learning objective.
Practical reason: An eight-hour training course on the background of large language models was available as an audio recording. The goal was to create documentation and an AI-based assistant from this training content. Automatic speech-to-text transcription provided the complete text, but in the form of spoken language: filler words, repetitions, incomplete sentences, colloquial expressions and other typical features of spontaneous oral communication.
Manually converting eight hours of transcript material into written language was not feasible within the available time. At the same time, the conversion had to be done carefully to minimise information loss and preserve the integrity of the content.
Technical learning objective: At the same time, there was interest in technically implementing multi-stage LLM workflows and evaluating their possibilities. How can cascading processing pipelines be structured? Which patterns prove themselves in the transformation of longer texts? What technical challenges arise in asynchronous processing with local LLM instances?
1.2 Institutional context#
Humboldt University of Berlin has had an established speech-to-text system since 2024, which is integrated into the processing chains of OpenCast and Moodle. This system transcribes audio and video recordings up to 4 GB in size and returns the results in various formats via email.
While verbatim transcriptions are indispensable for certain purposes (e.g. subtitling with time stamps), there was a need for careful transformation into readable continuous text for other use cases, in particular for further processing in RAG systems and AI assistants, or for scripts.
1.3 From learning project to productive use#
The tool was originally conceived as an exploratory project. Its evolution towards productive use happened gradually, as development and initial testing made clear how robust the results were and how great the institutional need for such text processing was.
This unexpected development shows how iterative, exploratory development with LLMs can lead to solutions whose quality and applicability exceed initial expectations.
2. Technical architecture and implementation#
2.1 Architecture overview#
The technical architecture was deliberately chosen based on existing infrastructure and proven components:
Frontend: Gradio 5.0 for the web-based user interface. Gradio was chosen because of existing deployment experience and the ability to quickly create iterative prototypes.
Backend: Python with asyncio for asynchronous processing. The implementation uses aiohttp for non-blocking API calls and enables parallel processing of multiple text chunks.
Job management: Gearman for asynchronous background processing. This component had already been used in other projects and proved to be necessary, as processing longer texts can take several minutes to hours due to numerous LLM API calls. Results are delivered by email.
LLM connection: Local LLM instance via OpenAI-compatible API. This enables the use of institutional computing resources and maintains data protection requirements.
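The concurrency pattern behind this architecture can be sketched as follows. This is a minimal illustration only: `call_llm` is a stub standing in for what, in the real tool, would be an aiohttp POST against the OpenAI-compatible endpoint, and the parallelism limit of 4 is an assumed value, not the tool's actual setting.

```python
import asyncio

# Stub standing in for the real aiohttp call to the OpenAI-compatible API.
async def call_llm(prompt: str, chunk: str) -> str:
    await asyncio.sleep(0)   # stand-in for the network round trip
    return chunk.upper()     # stand-in for the model's rewrite

async def process_chunks(prompt: str, chunks: list[str], parallelism: int = 4) -> list[str]:
    """Process chunks concurrently, bounded by a semaphore, preserving order."""
    sem = asyncio.Semaphore(parallelism)

    async def worker(chunk: str) -> str:
        async with sem:
            return await call_llm(prompt, chunk)

    # gather() returns results in submission order regardless of completion order.
    return await asyncio.gather(*(worker(c) for c in chunks))

results = asyncio.run(process_chunks("clean up:", ["one", "two", "three"]))
```

The semaphore bound matters in practice: without it, a long transcript would fire hundreds of simultaneous requests at the local LLM instance.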
2.2 Modular structure#
The implementation comprises 1,100 lines of code in three main files.
This modularisation neatly separates presentation, processing and orchestration logic, facilitating maintenance and testing.
2.3 Chunking strategy#
Longer texts typically exceed the context window limits of LLMs. The implementation therefore uses simple character-based chunking:
The decision to use character-based chunking instead of more complex approaches (e.g. sentence- or paragraph-based) was made because tests showed it was sufficient for this scenario.
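Character-based chunking with a small overlap can be sketched in a few lines. The chunk size and overlap values below are illustrative assumptions, not the tool's actual settings:

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks.

    Consecutive chunks share `overlap` characters so that sentences cut at a
    chunk boundary appear in full in at least one chunk.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The simplicity is the point: no tokeniser, no sentence detection, and the behaviour is trivially predictable for texts of any quality.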
2.4 Three-stage processing cascade#
The core concept is sequential processing through three focused prompts:
Stage 1 β Clean-up and error correction:
- Correction of obvious transcription errors
- Completion of truncated sentences
- Removal of filler words and repetitions
- Careful improvement of readability
Stage 2 β Stylistic revision:
- Reformulation of colloquial expressions into factual language
- Transformation into professional writing style
- Active formulations instead of passive
- Direct speech
Stage 3 β Markdown formatting:
- Incorporation of headings and structural elements
- Division into paragraphs
- Appropriate terminology
- Clear, professional structure
Each stage receives the output of the previous stage as input. The prompts contain a {context} variable in which users can enter specialist areas or specific terminology to improve the quality of technical texts.
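The cascade itself reduces to a small loop. The prompt texts below are hypothetical paraphrases of the three stages described above, not the production prompts, and `call_llm` is a placeholder for the actual API call:

```python
# Hypothetical stage prompts; the production prompts are not quoted in the article.
STAGE_PROMPTS = [
    "Correct obvious transcription errors and remove filler words. Context: {context}",
    "Rewrite colloquial wording into a factual, professional style. Context: {context}",
    "Format the text as Markdown with headings and paragraphs. Context: {context}",
]

def run_cascade(text: str, context: str, call_llm) -> str:
    """Run the three focused stages in sequence.

    Each stage receives the previous stage's output; the user-supplied
    {context} (e.g. a subject area or terminology) is injected into every prompt.
    """
    for template in STAGE_PROMPTS:
        prompt = template.format(context=context)
        text = call_llm(prompt, text)
    return text
```

Fixing the stage order in code (rather than making the workflow configurable) is one of the deliberate simplifications discussed in section 3.5.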
2.5 Error handling and robustness#
The implementation includes several mechanisms to ensure robustness:
Retry logic: In the event of API errors, up to five retries are made with exponential backoff (20-second base delay).
Fallback strategy: If the processing of a chunk fails, the original text is accepted with an error marker.
Timeout handling: API calls have a timeout of 300 seconds.
File validation: Input files are checked for type, size (maximum 10 MB), encoding and binary data.
Graceful degradation: In the event of resource overload, parallelism can be reduced automatically.
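The retry and fallback mechanisms can be combined into one small helper. The retry count and base delay mirror the figures above (five retries, 20-second base delay); the error-marker text is illustrative, as the article does not specify the actual marker:

```python
import asyncio

async def process_with_fallback(chunk, call, retries=5, base_delay=20.0):
    """Retry an async LLM call with exponential backoff.

    If all attempts fail, fall back to the original text with an error
    marker so the pipeline can continue instead of aborting.
    """
    for attempt in range(retries):
        try:
            return await call(chunk)
        except Exception:
            if attempt < retries - 1:
                # 20 s, 40 s, 80 s, ... between attempts
                await asyncio.sleep(base_delay * 2 ** attempt)
    return f"[PROCESSING FAILED] {chunk}"  # illustrative marker text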
3. Development process with LLM#
3.1 Iterative development approach#
Development took place over several weeks with about ten main iterations. The total development time was approximately 6-8 hours, of which about three hours were spent on prompt optimisation alone, a significant share of 37-50% of the total time.
The process was deliberately designed to be iterative:
- Initial specification of requirements in discussion with the LLM
- Implementation of a first working version
- Testing with real transcripts
- Identification of weak points
- Refinement of prompts or code
- Retesting
This cycle was repeated until the quality of the results met the requirements.
3.2 Role of the development interface#
A central approach was the creation of an extended development interface. While the final production version offers a reduced interface focused on the main usage scenarios, the development version contained additional functionality:
- Editable prompt fields for all three processing stages
- Display of intermediate results after each stage
- Detailed statistics and timing information
- Configurable parameters for chunk size and parallelism
This interface served as an experimental sandbox in which prompts could be iteratively refined without modifying the code or redeploying the tool. The ability to inspect intermediate results was crucial for understanding where problems occurred in the pipeline.
Only after the prompts had been optimised in the development interface was the final user interface derived and the prompts fixed in the code.
3.3 LLM used#
The text processing itself is performed by a local LLM instance.
The choice of the local LLM for processing was motivated by institutional availability and data protection requirements. At the same time, this enabled experiments with prompt optimisation for a specific model.
3.4 Specification and communication#
Communication with the code-generating LLM was primarily iterative and dialogue-based. There was no comprehensive up-front specification; instead, requirements and implementation developed step by step.
3.5 Management of complexity#
Overengineering was avoided primarily through experience: when it became apparent that proposed solutions were too complex, simplification was actively pursued. Examples:
- Avoiding complex NLP-based chunking algorithms in favour of simple character-based division
- Fixing the prompt order instead of dynamic workflow configuration
- Simple file validation instead of comprehensive content analysis
The KISS principle (Keep It Small and Simple) was applied consistently: complexity only where it was demonstrably necessary.
4. Key methodological findings#
4.1 Cascading single-purpose prompts#
A key finding of this project concerns the prompt architecture:
Initial attempts with a single comprehensive prompt, which instructed the LLM to clean up, stylistically revise and format the text simultaneously, led to severely damaged and truncated results. The model attempted to perform too many transformations at once, resulting in information loss, hallucinations and inconsistent output.
The approach that worked was focused sequencing. Each processing stage was given exactly one clearly defined task. Instead of "clean, revise and format", three separate prompts were used:
- Clean only
- Stylistic revision only
- Formatting only
This cascading allowed for careful, precise text editing with minimal information loss. Each stage built on the output of the previous one, resulting in incremental improvement rather than abrupt transformation.
4.2 Scaling limits without deep specification#
One pragmatic insight concerns the scalability of the approach:
In the initial experiments, approximately 1,000 to 1,500 lines of code marked the maximum that could be achieved without in-depth prior specification. Beyond this order of magnitude, the complexity of the context, the number of dependencies and the need for consistent architectural decisions increase so much that more structured approaches become necessary.
4.3 Prompt engineering as a core competence#
Up to half of the development time was spent on prompt optimisation. This underlines the importance of prompt engineering as a core competence in LLM-based development.
Successful prompts for text processing had the following characteristics:
Explicit output constraints: clear instructions on what the output may and may not contain
Context integration: information on what the text is about and its thematic context
Structural cues: instructions on how to handle the text's structure
Positive phrasing: tasks formulated as actions to perform rather than as prohibitions
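A hypothetical stage-1 prompt illustrating all four characteristics (the production prompts are not quoted in the article):

```python
# Each line maps to one of the four characteristics above.
CLEANUP_PROMPT = (
    "You are editing a speech-to-text transcript on the topic: {context}.\n"   # context integration
    "Correct obvious transcription errors and complete truncated sentences.\n"  # positive phrasing
    "Keep the original paragraph order and content.\n"                          # structural cue
    "Return only the corrected text, with no commentary.\n"                     # explicit output constraint
)

rendered = CLEANUP_PROMPT.format(context="large language models")
```

Note that the last line is doing double duty: it constrains the output format and implicitly forbids the meta-chatter ("Here is your corrected text...") that models otherwise tend to add.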
4.4 Constraints of LLMs in text processing#
Two important constraints became apparent:
Speed: LLMs are not fast. Processing longer texts with several hundred chunks can take hours. This necessitates asynchronous architectures with job queues and email notifications.
Context boundaries: Despite growing context windows, boundaries remain. Chunking is unavoidable for longer documents, and merging the chunks requires careful handling of overlaps.
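For raw, unprocessed chunks produced with a fixed overlap, merging is a one-liner; this sketch shows only that trivial case. After LLM rewriting, the overlapping text at each boundary is no longer an exact copy, which is precisely why the careful handling mentioned above is needed in the real tool:

```python
def merge_chunks(chunks: list[str], overlap: int = 200) -> str:
    """Merge chunks that share an exact `overlap` of characters.

    Drops the leading overlap of every chunk after the first. Only valid
    for unmodified chunks; rewritten chunks need fuzzier boundary matching.
    """
    if not chunks:
        return ""
    return chunks[0] + "".join(c[overlap:] for c in chunks[1:])
```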
These limitations are not weaknesses of the approach, but rather constraints that must influence architectural decisions.
4.5 Development environments as a methodological tool#
The additional development UI proved to be an important element in making rapid progress during development. The ability to interactively adjust prompts and inspect intermediate results significantly accelerated iteration.
A transferable observation: for LLM workflows, it proved helpful to invest in experimental environments that:
- Enable rapid iteration without deployment
- Make intermediate results visible
- Facilitate parameter tuning
- Provide sufficient meta-information for processing
- Serve as a sandbox for user feedback
Once the functions have been stabilised, the final production interface can be derived.
5. Validation and productive use#
5.1 Validation process#
Initial validation was performed using the eight-hour training transcript that was the original reason for development. Further tests included lecture transcripts, seminar recordings, and spontaneous verbal notes.
The validation focused on:
- Content accuracy (no loss of information)
- Linguistic quality (fluent, professional wording)
- Structural coherence (logical structure, meaningful layout)
- Robustness with different text lengths and qualities
Productive use became possible when it became clear that the quality of the results consistently met expectations.
Metadata#
Project scope: 1,100 lines of code in 3 files
Development time: 6-8 hours over several weeks
Iterations: ~10 main iterations
LLM used: Mistral Small 2506 (code development), local llm3 (text processing)
Technology stack: Python, Gradio 5.0, asyncio, aiohttp, Gearman, OpenAI-compatible API
Status: In productive use
This documentation is part of a series on methodological insights from LLM-supported development projects. The focus is on transferable patterns and limitations of different development approaches. Further case studies will follow.