
Exploration of agentic LLM systems: Development of a presentation preparation tool#

Summary#

This article documents the development of a multi-agent system for AI-supported presentation preparation. The aim of the experiment was to explore methods and possibilities of agentic LLM architectures. Over a period of one week, a functional tool with approximately 3,000 lines of code was created, based on a two-agent architecture. The methodological insights gained from this project – in particular regarding the importance of prompt-following capabilities, structured agent protocols, and detailed specifications – are relevant beyond the specific project for further LLM-based software development.

1. Experimental context and motivation#

1.1 Learning objectives#

The primary learning objective of this experiment was to explore the use of agentic LLM systems. The focus was on the question of how multiple specialised agents can be coordinated to solve a complex task. Specifically, the aim was to investigate the extent to which one agent can analyse user-generated chat conversations for structurally relevant content and a second agent can convert this information into a structured artefact.

1.2 Choice of presentation preparation tool#

A presentation preparation tool was chosen for two reasons. Firstly, there was a practical need for a solution that would assist in structuring presentations – not for creating complete presentations with invented content, as many existing tools do. Secondly, this task represented a suitable level of complexity for the exploration of agentic systems: it required document processing, state management over several rounds of interaction, structured output and coordination between conversation and structuring logic.

1.3 Development approach#

The development process followed a structured methodology. First, a functional specification was developed in iterative discussions until it met the requirements. This was followed by a detailed technical specification. Both documents served as the basis for the LLM-supported code implementation. The total development time, including specification and deployment, was approximately three to four hours, spread over a week.

1.4 How the tool works in detail#

In order to properly classify the methodological learnings, it is important to understand what the tool actually does and how it works.

Basic principle: The tool takes documents as input and structures them into a presentable form without inventing its own content. It works exclusively with existing material and organises it according to the user’s wishes (e.g. target group, duration, context).

Workflow from the user’s perspective:

  1. Document upload: Users upload one or more documents – for example, a scientific paper (PDF), a concept document (DOCX) or an existing presentation (PPTX). The system processes these documents with the Unstructured Library and converts them internally into Markdown format.

  2. Initial structuring: The artefact agent analyses the uploaded documents and creates an initial presentation artefact. In doing so, it identifies main topics, structures them as slides and extracts relevant key points. If important information is missing (e.g. target audience, presentation duration), it generates questions.

  3. Iterative refinement: The user clarifies details in the chat: ‘The presentation is for executives, 20 minutes long, focus on business implications’ or ‘Methodology can be kept brief, more detailed on the results’. After each user input, the artefact agent analyses whether information relevant to the structure is included and adjusts the presentation structure accordingly.

  4. Export: The final artefact can be exported in three formats: as a Markdown file, as a PowerPoint presentation (PPTX) with slides and speaker notes, or as a Word document (DOCX).

The artefact – the heart of the system:

The artefact is a Markdown document with a clear structure. An example:

# Meta
- Duration: 15 minutes
- Target audience: Project team & supervisor
- Event: Interim presentation of master's thesis

# Introduction: Project status
- Survey completed (n=120 participants)
- Initial evaluations available
- Timeline: Week 8 of 12

> Speaker's note: Show distribution chart here

# Main results
- Clear correlation between ...
- Control group shows ...
- Three interesting outliers identified
  - Case 1: ...
  - Case 2: ...

# Methodology (brief)
- Online survey with standardised questionnaire
- Statistical evaluation with R
- Qualitative interviews for further analysis (n=12)

# Next steps
- Detailed analysis of outliers
- Discuss results and compare with literature
- Begin writing process (chapters 3-4)

Each # heading represents a slide. Bullet points are displayed as key points on the slide, with sub-points indented. Block quotes (>) become speaker notes. The first ‘meta’ slide is optional and serves for internal documentation of contextual information.
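As a minimal illustration, the mapping just described could be parsed along these lines (a hypothetical sketch, not the project's actual parser):

```python
def parse_artefact(markdown: str) -> list[dict]:
    """Split the artefact Markdown into slides: each '#' heading starts
    a slide, '-' lines become bullets (two-space indent = sub-point),
    and '>' lines become speaker notes."""
    slides = []
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if line.startswith("# "):
            slides.append({"title": line[2:].strip(), "bullets": [], "notes": []})
        elif slides and stripped.startswith("- "):
            level = (len(line) - len(stripped)) // 2
            slides[-1]["bullets"].append({"text": stripped[2:].strip(), "level": level})
        elif slides and stripped.startswith("> "):
            slides[-1]["notes"].append(stripped[2:].strip())
    return slides
```

A structure like this can then be handed directly to the PPTX or DOCX exporter.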

Two-agent coordination in practice:

For example, if the user writes ‘Please provide more details on aspect X’, the chat agent first responds: ‘I will add the information on aspect X from your documents. Should I create a separate slide for each aspect or summarise them all on one slide?’

At the same time, the artefact agent analyses the chat turn. It recognises that details on individual aspects are desired, searches the uploaded documents for relevant information and expands the structure. If details are missing, it generates a question such as ‘Which specific aspects should be presented? (A, B, C, strategic significance?)’, which the chat agent asks in the next turn.

This coordination allows users to have a natural conversation while the presentation structure is continuously refined in the background.

Practical benefits:

Tests with real documents (up to 60 pages in length) showed that the tool is particularly valuable for:

  • Time planning: Automatic adjustment of the number of slides to the duration of the presentation
  • Target group adaptation: Prioritisation and detailing according to audience
  • Structure finding: Identification of a logical presentation sequence from unstructured documents
  • Speaker notes: Assigning relevant details and source references to slides

2. Technical implementation#

2.1 Architectural decision: Two-agent system#

The decision to use a two-agent architecture was made from the outset in order to start with a simple scenario. The division of tasks followed a clear logic:

Chat agent: Responsible for conversing with users. This agent asks clarifying questions about presentation parameters (target audience, presentation duration, event context, focus areas) and receives information requests from the artefact agent, which it transparently integrates into the conversation.

Artefact agent: Responsible for maintaining the presentation structure. After each chat turn, this agent is triggered and autonomously decides whether the last conversation exchange contains information relevant to the structure. If so, it updates the artefact – a Markdown document that represents the slide structure. If information is missing, it generates questions that the chat agent passes on to the user.

This sequential orchestration proved to be feasible, but required a subsequent conceptual extension (see section 3.2).
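The division of responsibilities can be sketched as a simple sequential loop (hypothetical agent interfaces and state keys, for illustration only):

```python
def handle_user_turn(state, user_message, chat_agent, artefact_agent):
    """Sequential orchestration: the chat agent answers first, then the
    artefact agent inspects the same turn and may update the artefact.
    Questions from the artefact agent are queued for the chat agent."""
    reply = chat_agent.respond(user_message, state)
    state["chat_history"].append({"user": user_message, "assistant": reply})

    result = artefact_agent.analyse(state["chat_history"][-1], state)
    if result.get("update_required"):
        state["artefact"] = result["artefact"]
    if result.get("needs_clarification"):
        # handed to the chat agent, which asks them in the next turn
        state["pending_questions"].extend(result.get("clarification_questions", []))
    return reply
```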

2.2 Technology stack#

The selection of technologies was based on proven components from previous projects:

  • Framework: Gradio was chosen because it is already established in the organisation for many applications and enables rapid prototype development.
  • LLMs: Mistral Small 2506 was used for the productive agents. Various LLMs were used for code development.
  • Document parsing: Unstructured had proven itself in previous projects and supports various formats (PDF, DOCX, TXT, Markdown, PPTX).
  • Export libraries: python-pptx and python-docx for PowerPoint and Word export were already familiar.
  • Token management: Tiktoken with cl100k_base encoding for better token estimation.
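Token estimation with Tiktoken might look like the following sketch (the chars/4 fallback is an assumption added here for portability, not part of the described stack):

```python
def estimate_tokens(text: str) -> int:
    """Estimate the token count of a text. Uses tiktoken's cl100k_base
    encoding when available; otherwise falls back to a rough chars/4
    heuristic (an assumption, not the project's exact logic)."""
    if not text:
        return 0
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except Exception:  # tiktoken missing or encoding unavailable
        return max(1, len(text) // 4)
```

An estimate like this can be used to warn before the context limit is approached.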

2.3 Module structure and LLM maintainability#

The project structure was developed iteratively with the LLM, with LLM maintainability as a key principle. Based on previous experience, code files with fewer than 1000 lines can be maintained much better by LLMs. This is not a hard limit but a guideline: it demands more modularisation, yet increases development speed and maintainability.

The resulting structure includes:

presentation-prep-tool/
├── app.py (~400 lines)
├── config.py
├── core/
│   ├── document_processor.py
│   ├── llm_client.py
│   └── state_manager.py
├── agents/
│   ├── chat_agent.py (~400 lines)
│   ├── artefakt_agent.py (~900 lines)
│   └── prompts.py
└── export/
    ├── markdown_parser.py
    ├── pptx_exporter.py
    └── docx_exporter.py

The largest component is the artefact agent with around 900 lines – close to the benchmark, but still manageable. Modularisation according to areas of responsibility (core/, agents/, export/) made it possible to further develop individual modules in isolation.

2.4 State management#

A deliberate simplification was the complete session-based data storage in the working memory without persistence. This significantly reduces technical complexity and simplifies data protection aspects, as no user data is stored. The application state includes:

  • Uploaded documents (raw text and metadata)
  • Chat history
  • Current artefact and version history (last 5 versions)
  • Meta information (target group, duration, event)
  • Pending questions from the artefact agent
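A session state of this shape could be modelled as a plain dataclass (field names are illustrative, not the project's actual code):

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    """In-memory application state per session; nothing is persisted."""
    documents: list = field(default_factory=list)          # raw text + metadata
    chat_history: list = field(default_factory=list)
    artefact: str = ""
    artefact_versions: list = field(default_factory=list)  # last 5 versions
    meta: dict = field(default_factory=dict)               # audience, duration, event
    pending_questions: list = field(default_factory=list)  # from the artefact agent
```

Using `default_factory` keeps the mutable containers independent across sessions.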

3. Special features of the implementation#

3.1 Sequential agent orchestration#

Coordination between agents follows a sequential pattern: after each input, the chat agent responds first. The artefact agent is then triggered, which analyses the last chat turn. This architecture has proven to be feasible, but required a conceptual extension.

3.2 The ‘back channel’ – an iterative refinement#

Initially, it was assumed that the prompt-following capabilities of LLMs would be sufficient to implicitly coordinate the agents. In practice, however, coordination problems arose. The artefact agent sometimes needed information that had not yet been clarified in the chat, but was unable to communicate this directly.

The solution was a structured communication protocol – a ‘back channel’. In its JSON response, the artefact agent not only returns the updated artefact, but also:

{
  "update_required": true/false,
  "artefact": "# Meta\n- Duration: 20min\n\n# Slide 1\n...",
  "diff_summary": "Slide 'Methodology' expanded by 2 bullet points",
  "needs_clarification": true/false,
  "clarification_questions": ["Question 1", "Question 2"]
}

The clarification_questions are written by the system into the chat agent’s state; the chat agent then puts them to the user in the next turn. After implementing this back channel, the results improved significantly. This shows that structured protocols are more robust than free-text communication between agents.

3.3 JSON response parsing and fault tolerance#

The JSON-based interface between the system and the artefact agent raised the question of how reliably Mistral Small follows this structure. In practice, the model adhered to the specified format reliably. Nevertheless, the implementation contains robust fallback logic: in the event of parsing errors or missing keys, the artefact is not changed and an error message is written to the diff summary. This fault tolerance was rarely needed in practice, but it gives the system additional stability.
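The fallback behaviour described above can be sketched as follows (hypothetical function; the defaults are assumptions based on the protocol in section 3.2):

```python
import json

def parse_agent_response(raw: str, current_artefact: str) -> dict:
    """Parse the artefact agent's JSON reply. On any parse error or
    missing 'artefact' key, keep the current artefact unchanged and
    record the problem in the diff summary."""
    fallback = {
        "update_required": False,
        "artefact": current_artefact,
        "diff_summary": "parse error: artefact left unchanged",
        "needs_clarification": False,
        "clarification_questions": [],
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or "artefact" not in data:
        return fallback
    # fill optional keys with safe defaults
    data.setdefault("update_required", False)
    data.setdefault("needs_clarification", False)
    data.setdefault("clarification_questions", [])
    data.setdefault("diff_summary", "")
    return data
```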

3.4 Version history and undo functionality#

From the outset, an undo function was planned so that it would be possible to revert to previous work statuses in the event of errors in the structuring. The implementation stores the last 5 artefact versions with their respective diff summary and timestamp. The undo function replaces the current artefact with the previous version. This functionality was implemented by the LLM without any problems and proved to be valuable in practice.

4. Development process with LLM#

4.1 Structured specification methodology#

The development process followed a clear three-step structure:

Phase 1: Functional specification (approx. 2 hours): The functional specification was developed in 4-5 iterations over several days. Typical topics of discussion were:

  • How exactly should the interaction between artefact and chat be organised?
  • Which mechanisms enable the artefact agent to control queries?
  • Which alternative architectures are available and which are leaner?

Technical feasibility aspects were already discussed in this phase in order to avoid unrealistic features and reduce complexity.

Phase 2: Technical specification (part of the 2 hours): After completion of the functional spec, a detailed technical specification was created with exact data structures, interface definitions, module division and error handling strategies.

Phase 3: Implementation (30-60 minutes): The finished specifications were given to the LLMs for implementation. Due to the clarity of the specs, no further prompt iteration was necessary. The LLMs generated the modules directly in a usable form.

4.2 Minimal rework on the code#

The rework consisted less of debugging code errors and more of conceptual adjustments. The need for the ‘back channel’ only became apparent during use. This conceptual iteration then required another round with the LLM to implement the corresponding functionality.

4.3 Avoiding overengineering#

A conscious goal was to adhere to the KISS principle (Keep It Small and Simple). This was achieved by interactively searching for possible simplifications – both functional and technical – during the specification process. Discussing alternative approaches helped to identify the leanest solution. LLMs tend to suggest complex solutions if the specification allows it. Clear constraints and an explicit call for simplicity helped to avoid this.

4.4 Choice of different LLMs#

Interestingly, different LLMs were used for code development than for the production agents. This allowed for a separation between the development and runtime environments. The exact selection of the code-generating LLMs was less critical than the choice of the runtime model (see section 5.4).

5. Methodological findings#

5.1 Central importance of prompt following#

An important finding of this experiment concerns the prompt-following capabilities of LLMs in agentic systems. Initially, it was assumed that precise system prompts would be sufficient to coordinate the agents. In practice, however, ‘fuzziness’ became apparent – the agents sometimes interpreted their instructions inconsistently or ignored certain constraints.

Prompt following is therefore a critical factor. When selecting models for agentic applications, the ability to follow structured instructions precisely should be weighted more heavily than, for example, creative text generation or general reasoning.

5.2 Structured protocols instead of free-text communication#

The transition from implicit coordination to an explicit JSON-based protocol has enabled significant improvements. Free-text communication between agents may seem more natural, but it is more prone to errors. A structured protocol offers several advantages:

  • Parsing security: JSON can be validated robustly.
  • Explicit semantics: Each field has a defined meaning.
  • Debugging: Errors in agent communication are easier to identify.
  • Prompt stability: Structured outputs are easier to enforce than free text with implicit conventions.
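The ‘parsing security’ point can be made concrete with a small validator against the section 3.2 protocol (a sketch; the key set is taken from the example response shown there):

```python
REQUIRED_KEYS = {"update_required", "artefact", "diff_summary",
                 "needs_clarification", "clarification_questions"}

def validate_agent_response(payload: dict) -> list[str]:
    """Return a list of protocol violations; an empty list means the
    response conforms to the JSON interface sketched in section 3.2."""
    problems = sorted(f"missing key: {key}" for key in REQUIRED_KEYS - payload.keys())
    if not isinstance(payload.get("update_required", False), bool):
        problems.append("update_required must be a boolean")
    if not isinstance(payload.get("clarification_questions", []), list):
        problems.append("clarification_questions must be a list")
    return problems
```

Explicit checks like these are what make errors in agent communication easy to localise.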

Further projects will show whether this approach can also be transferred to other tasks.

5.3 Necessity of bidirectional communication#

The subsequent implementation of the back channel shows that a purely unidirectional data flow between the agents was insufficient in this system. Mechanisms were needed to:

  • Request missing information
  • Signal doubts or ambiguities
  • Influence the work of other agents

This insight could possibly have been anticipated in the specification phase, but only became apparent in practical use. For future projects, such feedback mechanisms will be investigated from the outset.

5.4 Model selection: Small models with good prompt following#

Mistral Small 2506 proved to be sufficient for the agents. Larger models would probably have worked more accurately, but the high inference speed of the smaller model (response times of a few seconds) ensured a good user experience.

When selecting models for agentic systems, prompt-following capability should be the primary selection criterion, not model size or general benchmark scores. A smaller model with excellent instruction following may be superior to a larger model with weaker compliance – especially when latency is a factor.

5.5 Value of detailed specifications#

Spending two hours on functional and technical specifications enabled an implementation time of only 30-60 minutes for 3,000 lines of code. This ratio underscores the advantage of detailed specifications for LLM-based development.

The separation between the functional and technical levels was crucial. The functional specification clarified the ‘what’ and ‘why’, while the technical specification defined the ‘how’. Important details in the specs were:

  • Information on the technical structure
  • Exact data structures (e.g. the application state)
  • Interface definitions (e.g. the JSON format of the artefact agent)
  • Error handling strategies
  • Architecture constraints (e.g. the <1000-line rule)

Without this clarity, development would not have been possible in such a short time.

5.6 Smaller code files for LLM maintainability#

The rule of thumb that code files under 1000 lines are easier for LLMs to maintain was confirmed once again. This is not an absolute limit – the largest file in this project, the artefact agent at ~900 lines, stays just below it. However, the guideline forces sensible modularisation.

For architectural decisions, this framework condition requires a structure that is as modular as possible, even for projects that could potentially be implemented with a single code file. It requires responsibilities in the code to be distributed across several smaller units. This pays off not only in terms of LLM maintainability, but also in terms of human readability and testability.

5.7 Iterative simplification in the specification process#

The 4-5 iterations in the specification phase served not only to clarify details, but also to examine possible simplifications. By discussing alternative approaches and explicitly searching for simplification options, overengineering was avoided. This iterative refinement phase was important in order to ultimately obtain an implementable and maintainable solution.

6. Validation and practical use#

6.1 Test scenarios and robustness#

The tool was tested by various stakeholders from different areas using real documents. The input documents ranged from a few pages to 60 pages. Despite this variance, the token limit of approximately 200k was practically never reached, confirming that the dimensioning was appropriate.

The tests covered various document types (scientific papers, concept documents, existing presentations) and different use cases (conference presentations, internal meetings, training presentations). The structuring worked robustly in all scenarios.

6.2 Unexpected strengths#

A surprising finding was the quality of the results in terms of time planning and target group adaptation. The tool proved to be particularly valuable for:

  • Estimating the number of slides required based on the duration of the presentation
  • Prioritising content according to relevance for different target groups
  • Developing a realistic time structure

These aspects were not anticipated as primary strengths, but proved to be particularly valuable in use. The tool currently has more of an ‘inspirational value’ – it provides structured suggestions as a starting point, not complete presentations.

6.3 Usage behaviour#

The tool is used productively on occasion, but would need to be further developed for intensive use. Additional user requests would need to be integrated, which may happen in a future development round. However, the current version already fulfils its experimental purpose and demonstrates the feasibility of the approach.

7. Quantitative metrics#

7.1 Code scope and structure#

  • Total: ~3,000 lines of code
  • Largest components:
    • Artefact agent: ~900 lines
    • Chat agent: ~400 lines
    • Main application (app.py): ~400 lines
  • Number of files: approx. 15 modules

7.2 Development time#

  • Specification: 2 hours over several days
    • Functional specification: 4-5 iterations
    • Technical specification: part of the 2 hours
  • Implementation: 30-60 minutes
  • Deployment and documentation: 30-60 minutes
  • Total duration: One week (distributed sessions)
  • Conceptual iteration (back channel): One additional LLM session

7.3 Runtime metrics#

  • Response times: A few seconds per agent turn
  • Token limit: 200k (practically never reached)
  • Tested document sizes: A few to 60 pages
  • Maximum state: Never limiting

8. Transferable principles for future projects#

The following principles for LLM-supported multi-agent development can be derived from this experiment:

  1. Effort for specifications. The 2:1 ratio between specification and implementation time is not overhead, but rather a gain in efficiency.

  2. Separation of functional and technical specifications. Clearly clarifying the questions of ‘what’ and ‘how’ increases efficiency.

  3. Structured protocols. JSON-based interfaces are more robust than free-text communication between agents.

  4. Bidirectional communication. In this use case, it was the back channel that ensured the application’s stability.

  5. Prompt following. This appears to be an important factor for agentic systems.

  6. Modularisation with a view to LLM maintainability. The <1000-line guideline forces meaningful structuring.

  7. Simplifications. Avoid overengineering by actively searching for leaner alternatives.

  8. Conceptual iterations. Not all problems become apparent during the specification phase. The iterations also require a workflow that is as clear as possible.

9. Outlook#

The structured approach – functional specification, technical specification, then implementation – has established itself as a reproducible workflow and will be retained in future projects. The findings on prompt following, structured protocols and agent coordination may be transferable to more complex multi-agent scenarios.

Future experiments could investigate:

  • Systems with more than two agents
  • Dynamic agent orchestration instead of sequential processing
  • Specialised models for different agent roles
  • Formal verification of agent protocols

The documentation of these results serves as a basis for possible further developments in LLM-supported software development and is intended to provide reproducible patterns for similar projects.