Architecture

The application is built as a containerised web application with a clear separation of layers. The user interface, the orchestration of processing steps, the document and LLM connections, and the export components are organised in independent modules. The complete editing state of a session is held in memory; persistent storage is not provided.

At a glance

Containerised operation based on a Python slim image, with the system libraries required for document processing
Web-based user interface that runs in the browser and requires no client-side installation
Layer separation into UI, orchestration (agents and state), core functions (document, LLM, state), and export
Two-agent pipeline with separate system prompts for dialogue and artefact processing
LLM connection through an OpenAI-compatible interface, with retry logic and exponential backoff
Session-based stateful storage without disk persistence
Configuration through environment variables, including API endpoint, model name, and limits

Architecture description

The application is organised into four logical layers. The UI layer provides document upload, the chat window, the display of the current slide outline, and the export and undo controls. The orchestration layer comprises the two agents (chat agent and artefact agent) and the central state management. The core layer encapsulates the reusable building blocks: reading and tokenising documents, the LLM client with retry logic, and the state object with version history. The export layer converts the internal Markdown representation into PowerPoint and Word files.

flowchart TB
    User[User]

    subgraph UI[UI layer]
        Upload[Upload area]
        Chat[Chat window]
        ArtView[Artefact display]
        Export[Export controls]
    end

    subgraph Orch[Orchestration layer]
        State[State management<br/>documents, chat, artefact, history]
        ChatAgent[Chat agent]
        ArtAgent[Artefact agent<br/>two-phase logic]
    end

    subgraph Core[Core layer]
        DocProc[Document parser<br/>+ token counter]
        LLMClient[LLM client<br/>retry/backoff]
    end

    subgraph Exp[Export layer]
        MDExp[Markdown export]
        PPTXExp[PowerPoint export]
        DOCXExp[Word export]
    end

    LLM[(LLM API<br/>OpenAI-compatible)]

    User --> Upload
    User --> Chat
    User --> Export

    Upload --> DocProc
    DocProc --> State
    Chat --> ChatAgent
    ChatAgent --> State
    ChatAgent --> LLMClient
    State --> ArtAgent
    ArtAgent --> LLMClient
    ArtAgent --> State
    State --> ArtView
    LLMClient --> LLM

    State --> MDExp
    State --> PPTXExp
    State --> DOCXExp
    MDExp --> User
    PPTXExp --> User
    DOCXExp --> User

UI layer

The user interface is implemented as a web-based single-page application. It contains three main areas: an upload area with token display, the chat window, and a display of the current slide outline with export and undo controls and status indicators. The components are connected to the state management via events; after each event (upload, chat input, undo, export) the display is re-rendered.

Orchestration layer

The processing core consists of two agents and the state management. The chat agent handles user interaction: it accepts free-form text input, uses rules to extract statements about talk duration, audience, focus topics, and target language, adds the current slide context, and calls the LLM with a dialogue-oriented system prompt. The artefact agent runs after each dialogue turn and on the initial upload. It operates in two phases: the first phase generates or updates the Markdown outline from the documents and the user instruction; the second phase adjusts the slide count to the target size and triggers a translation when the language is switched. The agent expects a structured JSON response from the LLM, parses it, and removes the internal source references from the artefact intended for display.
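The rule-based extraction in the chat agent can be illustrated with a few regular expressions. This is a minimal sketch covering only duration and target language; the patterns, the `TalkParameters` class, and the `extract_parameters` function are hypothetical names, and the application's actual rules may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns; the application's real rules are not documented here.
_DURATION = re.compile(r"(\d+)\s*(?:-\s*)?(?:min(?:ute)?s?)\b", re.IGNORECASE)
_LANGUAGE = re.compile(r"\bin\s+(english|german|french|spanish)\b", re.IGNORECASE)


@dataclass
class TalkParameters:
    duration_minutes: Optional[int] = None
    language: Optional[str] = None


def extract_parameters(message: str) -> TalkParameters:
    """Pull talk duration and target language out of free-form text by rule."""
    params = TalkParameters()
    duration = _DURATION.search(message)
    if duration:
        params.duration_minutes = int(duration.group(1))
    language = _LANGUAGE.search(message)
    if language:
        params.language = language.group(1).lower()
    return params
```

Values that no rule matches stay `None`, so the caller can decide whether to ask a clarifying question or keep the previous setting.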

The state management holds the complete session state: uploaded documents with their token counts, the chat history, the current artefact, a version history of the five most recent artefact states, and any open clarifying questions. It provides methods for adding documents and messages, updating the artefact with a diff description, and reverting the most recent change. The state is serialised and deserialised between calls of the user interface.
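A bounded version history with undo can be sketched as follows. The class and field names are hypothetical, and the sketch assumes that the history stores the five states preceding the current artefact; the application may count the current state differently.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

HISTORY_LIMIT = 5  # the text states that five artefact states are kept


@dataclass
class SessionState:
    """Hypothetical in-memory session state; field names are illustrative."""
    documents: List[Tuple[str, int]] = field(default_factory=list)  # (name, tokens)
    messages: List[Tuple[str, str]] = field(default_factory=list)   # (role, text)
    artefact: str = ""
    # Each entry is (previous artefact, diff description).
    history: Deque[Tuple[str, str]] = field(
        default_factory=lambda: deque(maxlen=HISTORY_LIMIT)
    )

    def add_document(self, name: str, tokens: int) -> None:
        self.documents.append((name, tokens))

    def add_message(self, role: str, text: str) -> None:
        self.messages.append((role, text))

    def update_artefact(self, new_artefact: str, diff: str) -> None:
        # Store the outgoing version; deque(maxlen=...) drops the oldest entry.
        self.history.append((self.artefact, diff))
        self.artefact = new_artefact

    def undo(self) -> bool:
        """Revert the most recent change; return False if there is nothing to undo."""
        if not self.history:
            return False
        self.artefact, _ = self.history.pop()
        return True
```

Using `deque(maxlen=...)` keeps the memory footprint of the history constant without explicit trimming code.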

Core layer

The core layer encapsulates the reusable building blocks. The document parser validates uploaded files against the configured size limit and the list of supported formats, reads them through a generic partitioning library, and converts the result into a unified Markdown representation. A BPE-based tokeniser then determines the token count; the same tokeniser is also used to compute the total across documents, chat history, and artefact. The LLM client wraps the calls to the OpenAI-compatible interface, distinguishes between free-text and JSON calls, and is equipped with retry logic using exponential backoff. For JSON calls, the client extracts the JSON from embedded code blocks before parsing it, so that replies wrapped in a code fence are still accepted.
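Extracting JSON from a reply that wraps it in a code block might look like the sketch below. The function name `parse_llm_json` and the brace-based fallback are assumptions for illustration, not the client's documented behaviour.

```python
import json
import re
from typing import Any

_TICKS = "`" * 3  # the three-backtick fence marker, built indirectly for readability
# Matches a fenced block (optionally tagged "json") and captures its body.
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL | re.IGNORECASE)


def parse_llm_json(raw: str) -> Any:
    """Parse JSON from a model reply that may wrap it in a code fence or prose."""
    match = _FENCE.search(raw)
    candidate = (match.group(1) if match else raw).strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fallback: take the span from the first "{" to the last "}".
        start, end = candidate.find("{"), candidate.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(candidate[start : end + 1])
```

If neither the fenced body nor the brace-delimited span is valid JSON, the `json.JSONDecodeError` propagates, which lets the surrounding retry logic treat it like any other failed call.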

Export layer

The export layer converts the Markdown representation into the target formats. A common Markdown slide parser identifies slides by first-level headings, gathers the associated bullet points (including sub-levels) and speaker notes, and provides them as structured slide objects. A text formatter interprets Markdown markers for bold, italic, and inline code. The format-specific exporters generate PowerPoint files with separate slides and notes fields, or Word documents with slide-level sections and separator lines.
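The slide parser's job can be illustrated with a plain line scan. The technology overview lists markdown-it-py for Markdown processing; this sketch deliberately swaps in standard-library string handling to stay self-contained. The `Slide` class, the `Notes:` prefix for speaker notes, and the two-space indentation per sub-level are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NOTES_PREFIX = "Notes:"  # hypothetical marker for speaker notes


@dataclass
class Slide:
    title: str
    bullets: List[Tuple[int, str]] = field(default_factory=list)  # (level, text)
    notes: List[str] = field(default_factory=list)


def parse_slides(markdown: str) -> List[Slide]:
    """Split a Markdown outline into slides at first-level headings."""
    slides: List[Slide] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("# "):
            slides.append(Slide(title=stripped[2:].strip()))
        elif not slides or not stripped:
            continue  # ignore blank lines and text before the first heading
        elif stripped.startswith(("- ", "* ")):
            # Assumed convention: two spaces of indentation per sub-level.
            indent = len(line) - len(line.lstrip(" "))
            slides[-1].bullets.append((indent // 2, stripped[2:].strip()))
        elif stripped.startswith(NOTES_PREFIX):
            slides[-1].notes.append(stripped[len(NOTES_PREFIX):].strip())
    return slides
```

The resulting `Slide` objects carry no format-specific detail, so a PowerPoint exporter and a Word exporter could both consume the same list.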

Role of the AI components

The LLM is called at exactly two points: in the chat agent to generate responses to the user, and in the artefact agent to generate or update the slide outline. Both agents use the same LLM client but different system prompts and call modes (free text in the chat agent, JSON response in the artefact agent). The agentic control (phases, clarifying questions, length adjustment, translation triggers) is performed in the application code and not by the LLM itself. Embedders and rerankers are not used; the prioritisation of document sections is rule-based, derived from position, heading level, length, and key terms.
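A rule-based prioritisation over the four named signals could be sketched as a weighted score. The `Section` class, the direction of each rule (earlier and higher-level sections rank higher), the 200-word cap, and all weights are hypothetical; the application's actual rules are not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    title: str
    text: str
    position: int       # 0-based index within the document
    heading_level: int  # 1 = top-level heading


def score_section(section: Section, total: int, key_terms: List[str]) -> float:
    """Combine position, heading level, length, and key terms into one score."""
    # Earlier sections score higher (1.0 for the first section).
    position_score = 1.0 - section.position / max(total, 1)
    # Higher-level headings score higher: level 1 -> 1.0, level 2 -> 0.5, ...
    level_score = 1.0 / max(section.heading_level, 1)
    # Length contribution saturates at 200 words (hypothetical cap).
    length_score = min(len(section.text.split()), 200) / 200
    haystack = f"{section.title} {section.text}".lower()
    term_score = sum(1 for term in key_terms if term.lower() in haystack)
    # Hypothetical weights, chosen only for illustration.
    return position_score + level_score + 0.5 * length_score + 1.5 * term_score


def prioritise(sections: List[Section], key_terms: List[str]) -> List[Section]:
    """Return the sections ordered from highest to lowest score."""
    total = len(sections)
    return sorted(sections, key=lambda s: score_section(s, total, key_terms), reverse=True)
```

Because the score is a deterministic function of the section metadata, the ordering is reproducible and needs no embedding model or reranker.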

Concurrency, robustness, and configuration

The application processes requests sequentially within a session. Robustness against transient LLM errors is achieved through the retry logic in the LLM client; the number of attempts and the waiting times are configurable. On permanent failure, the previous artefact state is preserved and the error is shown in the status. On upload, file size and format are validated; if the token budget is exceeded, the upload is rejected. Configuration is performed entirely through environment variables (LLM endpoint, model name, API key, limits, server port and path).
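The interplay of configurable retries and state preservation can be sketched with the standard library alone. The technology overview names tenacity for retry logic; this example swaps in a hand-written loop so that it runs without dependencies. The environment variable names `LLM_MAX_ATTEMPTS` and `LLM_RETRY_BASE_DELAY` and both function names are hypothetical.

```python
import os
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical variable names; the application's real names are not documented here.
MAX_ATTEMPTS = int(os.environ.get("LLM_MAX_ATTEMPTS", "3"))
BASE_DELAY = float(os.environ.get("LLM_RETRY_BASE_DELAY", "1.0"))


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = MAX_ATTEMPTS,
    base_delay: float = BASE_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with delays base, 2*base, 4*base, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # permanent failure: hand the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def update_or_keep(previous: str, generate: Callable[[], str], **retry_kwargs) -> Tuple[str, str]:
    """Return (artefact, status); keep the previous artefact on permanent failure."""
    try:
        return call_with_retry(generate, **retry_kwargs), "ok"
    except Exception as exc:
        return previous, f"error: {exc}"
```

Passing `sleep` as a parameter keeps the waiting behaviour testable: a test can record the requested delays instead of actually pausing.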

Technology overview

Language and runtime: Python 3.13
User interface: Gradio
LLM connection: OpenAI Python client against an OpenAI-compatible interface (local operation possible)
Tokenisation: tiktoken
Retry logic: tenacity
Document parsing: unstructured
PowerPoint export: python-pptx
Word export: python-docx
Markdown processing: markdown-it-py
Image processing (for the document parser): Pillow
Containerisation: Docker, Python slim base image, with system packages for document and image processing as well as text recognition
Configuration: environment variables via dotenv