Architecture¶

Grafix is built as a containerised web application with clear separation of layers. A browser-side interface communicates with a Python backend, which performs both the intent recognition via an LLM endpoint and the rule-based diagram generation. Diagram generation is consistently separated from intent recognition: the language model decides exclusively what should be done; how the diagram is drawn is determined by a deterministic template engine.

At a glance¶

Container-based delivery via Docker with reverse-proxy integration.
Browser-side canvas rendering via the Fabric.js library; server-side logic in Python.
Multi-agent pipeline with a single LLM call per request; all other stages run rule-based.
Strict separation of intent recognition (LLM) and layout generation (template engine).
Connection to an internal LLM service via an OpenAI-compatible interface.
Session management in application memory with a complete history per session.
27 templates implemented as standalone classes derived from a shared base class.

Architecture overview¶

The application is organised in five layers:

Interface (browser) — A web UI provided through Gradio with chat input, template gallery, editing forms, and a canvas area. The canvas is rendered via Fabric.js; selection, movement, and text editing happen directly in the browser.
Orchestration — A central orchestrator receives the request, calls the Intent Agent, executes the recognised intents, hands the result to the Validation Agent, and finally to the Consistency Agent.
Agents — Four specialised components: Intent Agent (LLM-based), execution logic (rule-based), Validation Agent (LLM-based, on demand), Consistency Agent (rule-based).
Template engine — A set of 27 template classes, each with its own geometry and scaling logic. Given a template and a list of elements, the engine produces a Fabric.js-compatible JSON document with absolute positions.
Session and export services — Management of the action history per session, preparation of the context for subsequent LLM calls, preparation of export data for PNG (client-side) and JSON (server-side).

Workflow of a request¶

flowchart TD
    A[User request in chat] --> B[Session service]
    B --> C[Context preparation<br/>history + current canvas]
    C --> D[Intent Agent<br/>LLM classification]
    D --> E{Confidence<br/>sufficient?}
    E -- no --> F[Clarification question]
    E -- yes --> G[Orchestrator<br/>rule-based]
    G --> H{Intent type}
    H -- create --> I[Template engine<br/>compute layout]
    H -- modify_slot --> J[Slot update<br/>on existing canvas]
    H -- modify_style --> K[Style update<br/>on existing canvas]
    H -- upgrade_template --> L[Template switch<br/>transfer content]
    I --> M[Validation Agent]
    J --> M
    K --> M
    L --> M
    M --> N[Consistency Agent<br/>rule-based]
    N --> O[Record action<br/>in session history]
    O --> P[Canvas to browser<br/>Fabric.js rendering]
    F --> O

A user request first reaches the session service, which determines the active session or creates a new one. A context is then assembled from the previous action history and the current canvas state. This context, together with the actual user input, is sent to the Intent Agent, which calls a language model through an OpenAI-compatible interface. The language model returns a structured classification as JSON: recognised action types (create, modify slot, modify style, add or remove element, switch template, clarify), associated parameters, confidence value, and reasoning.

If the confidence is too low or the request is ambiguous, a clarification question is returned instead of an action. Otherwise the Orchestrator takes over execution. This stage runs entirely rule-based and does not call the language model again. Depending on the recognised intent, the Orchestrator either invokes the template engine (for new creation or template switch) or modifies the existing canvas JSON in a targeted way (for slot or style changes).

The template engine is the core of deterministic diagram generation. Each of the 27 templates is implemented as a standalone class derived from a shared base class. The class defines minimum and maximum element counts, scaling logic, colour and shadow handling, and the concrete geometry calculation. The result is a JSON document in Fabric.js format with absolute coordinates for each object.

The Validation Agent then checks whether the result matches the recognised intent. A second check is performed by the Consistency Agent: visibility, spacing, alignment, overlaps, and colour usage are checked in a rule-based manner. The agent recognises structured templates by typical object IDs (such as level1_, content_, text_left) and refrains from automatic corrections in those cases in order not to damage the geometry of the template. Auto-fixes are applied only to manually edited or clearly faulty layouts.

The final step records the action in the session history and transfers the canvas JSON to the interface. In the browser, Fabric.js renders the diagram and enables subsequent direct editing.

Role of the language model¶

The language model is used exclusively for intent classification, not for generating the canvas JSON. This separation has two consequences. First, layout problems that typically arise when an LLM generates coordinates are avoided. Second, a single LLM call per user request is sufficient. The Validation Agent can be invoked additionally but is not required for the standard case.

The structured output of the Intent Agent is expected as JSON. In the case of parsing errors or incomplete responses, a fallback mechanism triggers a clarification question to the user instead of executing a faulty action.

Session and context¶

Sessions are kept in application memory and contain an ordered list of actions. Each action documents the original input, the recognised action type, the resulting action JSON, the canvas state before and after the action, and the confidence and reasoning of the classification.

Before each LLM call, the history is prepared in two representations — as a natural-language summary and as compact JSON. Both representations are presented to the language model as context. This approach acts as an implicit few-shot mechanism: previously correctly classified actions serve as examples for the current classification.

Concurrency, robustness, and configuration¶

The application runs as a single container process that holds multiple parallel sessions in memory. Session IDs are generated server-side. Actions of one session are not mixed across sessions.

Robustness against unreliable LLM responses is achieved through several mechanisms: tolerant JSON parsing (markdown code blocks and imprecise bracketing are accepted), fallback for unparsable responses (clarification instead of hallucination), confidence-based escalation to the Validation Agent, and layout correction by the Consistency Agent.

Configuration is performed via environment variables — among other things for the LLM endpoint, the model, the confidence threshold, and the URL prefix for reverse-proxy operation. A sample configuration file is provided with the application.

Technology overview¶

Language and runtime — Python 3.12 as the server-side runtime.
Web interface — Gradio for serving the user interface in the browser.
Canvas rendering — Fabric.js for client-side display and direct editing of diagrams.
LLM connection — OpenAI-compatible interface for communication with the internal language model service.
Data modelling — Pydantic for typed data models and configuration-driven settings.
Image processing — Pillow for server-side image handling in export.
Containerisation — Docker and docker-compose for build and operation; integration with an upstream reverse proxy for serving under a URL path.