From transcript to text: How multi-stage AI workflows help#

Raw transcripts are barely readable. A three-step approach turns them into professional texts – without any loss of information.

We have developed a tool that converts spoken language into written text in three steps. What makes it special is that each step does only one thing, and that one thing works.

What was the challenge?#

Eight hours of training material. All transcribed.

But transcribed speech is a problem. It contains filler words such as ‘um’ and ‘so’, repetitions, broken sentences and colloquial expressions. No one reads through that.

Manually processing eight hours of material? Not practical. At the same time, we wanted to understand: How do you technically set up multi-stage AI workflows? Which approaches work for text processing?

How does the tool work?#

The tool processes texts in three consecutive steps:

  • Stage 1 – Cleaning: Correction of obvious errors (e.g. incorrectly recognised words, typos), completion of broken sentences, removal of filler words
  • Stage 2 – Revision: Reformulation of colloquial expressions into professional language, improvement of text structure
  • Stage 3 – Formatting: Incorporating headings, outline, structuring with Markdown

An example: ‘LLMs never say, I don’t know’ becomes ‘LLMs never say: I don’t know’ – identical in content, linguistically professional.
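The cascade above can be sketched as three sequential, single-purpose calls. A minimal sketch, assuming a stubbed model call – `call_model` and the stage prompts below are illustrative placeholders, not the project's actual prompts or API:

```python
# Minimal sketch of the three-stage cascade. Each stage carries exactly one
# focused instruction; the output of one stage is the input of the next.

STAGE_PROMPTS = [
    ("cleaning", "Fix recognition errors and remove filler words. Change nothing else."),
    ("revision", "Rephrase colloquial wording into professional language. Keep the meaning."),
    ("formatting", "Add headings and Markdown structure. Do not rewrite sentences."),
]

def call_model(prompt: str, text: str) -> str:
    """Placeholder for a real LLM API call; here it returns the text unchanged."""
    return text

def process(transcript: str) -> str:
    text = transcript
    for name, prompt in STAGE_PROMPTS:
        # One stage, one transformation – never all three at once.
        text = call_model(prompt, text)
    return text
```

The point of the structure is visible even in the stub: no prompt ever asks for more than one transformation.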

How did we proceed?#

We built the tool using common technologies:

  • User interface (Gradio): Simple web interface for uploading
  • Processing (Python with asyncio): Asynchronous processing of the three stages
  • Job management (Gearman): Background processing for longer texts
  • Text splitting: Character-based chunking for large documents
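As a rough illustration of the asyncio side: the three stages must run in order for any given chunk, but independent chunks can be processed concurrently. The function names and the stage stub below are assumptions for the sketch, not the project's code:

```python
import asyncio

async def run_stage(stage: str, text: str) -> str:
    """Placeholder for an async LLM call; a real version would await an API."""
    await asyncio.sleep(0)  # yield control, simulating network I/O
    return text

async def process_chunk(chunk: str) -> str:
    text = chunk
    for stage in ("cleaning", "revision", "formatting"):
        # Stages are sequential within a chunk: each depends on the previous output.
        text = await run_stage(stage, text)
    return text

async def process_document(chunks: list[str]) -> list[str]:
    # Independent chunks, however, can run concurrently.
    return await asyncio.gather(*(process_chunk(c) for c in chunks))
```

With a real model call behind `run_stage`, this pattern overlaps the network waits of all chunks while preserving stage order.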

Total effort: Approximately 6-8 hours spread over several weeks, including 3 hours for prompt optimisation. Code scope: 1,100 lines in three files. Approximately 10 main iterations.

Development was iterative: we first built an extended development version that allowed interactive prompt adjustments during processing. Only after optimisation did we create the final user interface.

Why did this work so well?#

Because each step had only one task.

Our first attempts with a single comprehensive prompt failed. The language model tried to perform all transformations at once – cleaning, revision, formatting. The result: severely damaged and truncated texts with massive loss of information.

The breakthrough came with the division. Each stage was given exactly one focused task. Not ‘clean up, revise and format’, but three separate prompts, each with a clear goal. This cascading enabled careful, precise text editing.

Key findings#

1. Focused prompts beat complex prompts

Clear, single-purpose prompts delivered noticeably more precise results than multi-task prompts. The reason: the language model can concentrate on one transformation without having to juggle other aspects.

2. Prompt optimisation takes time

Half of the development time was spent on prompt optimisation. That sounds like a lot, but it was necessary. Without good prompts, the tool doesn’t work. The investment was worth it: the final prompts consistently deliver good results.

3. Development interfaces speed up iteration

An extended development version with direct access to prompts and intermediate results was crucial. We were able to experiment during processing without deploying new versions. This significantly reduced the optimisation cycles.

4. Simple techniques are often sufficient

Character-based chunking worked well enough for our use case. More complex approaches (e.g. semantic chunking) were not necessary. Sometimes the simplest solution is the right one.
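A character-based chunker of the kind described can be very short. The sketch below is a plausible minimal version; the window size and the whitespace backtracking are illustrative choices, not the project's exact parameters:

```python
def chunk_text(text: str, size: int = 4000) -> list[str]:
    """Fixed-size character chunking, backing up to the last space in the
    window so words are not cut in half. The size is an illustrative default."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Prefer to break at the last space inside the current window.
            cut = text.rfind(" ", start + 1, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end].strip())
        start = end
    return chunks
```

No embeddings, no sentence detection – for feeding bounded-size pieces to a model, this was enough.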

5. Around 1,000-1,500 lines is the practical limit

With this code size, development with language models still works well without detailed pre-specification. Larger projects require more structured approaches and more planning.

What can others learn from this?#

  • Break complex tasks down into individual, focused steps – each step should only do one thing.
  • Invest time in prompt optimisation; this makes the difference between ‘works sometimes’ and ‘works reliably’.
  • Build development versions with direct access to internal states; this speeds up experiments considerably.
  • Start with simple technical approaches and only optimise when really necessary.
  • For projects over 1,500 lines of code: plan in more detail in advance.

Conclusion#

✔ Multi-stage workflows work better than all-in-one approaches for complex text processing.

✔ Focused prompts with a clear goal deliver more precise results than multi-task prompts.

✔ Iterative development with experimental environments saves time during optimisation.

This is part of a series on experiences with AI-supported software development. The focus is on what can be learned from such projects – not just on the results.