Architecture¶

Vision Model Interface is implemented as a lean three-module Python application. A Gradio-based browser interface encapsulates user interaction; two specialised modules handle PDF processing and export. The model binding is decoupled from the application via an external, OpenAI-compatible endpoint; all configuration is supplied through environment variables or a .env file.

At a glance¶

Three-module architecture: main application (vision.py), PDF processing (pdf_processor.py), export (export_handler.py)
Gradio 6 as the web framework, with a tab-based interface and per-session state
Sequential processing with streaming status updates to the interface (generator pattern)
Separation of image processing (Pillow), PDF rendering (PyMuPDF), and document export (python-docx, custom Markdown converter)
OpenAI-compatible chat completions endpoint with vision support as the external model service
Structured result data classes (PDFResult, ProcessingResult, PageAnalysis, PDFAnalysisResult) with consistent error handling
Configuration exclusively through environment variables; no persistent application data

Layers and components¶

The application can be decomposed into four layers.

User interface (Gradio). A single-page application with two tabs (image analysis, PDF analysis). Several Gradio state objects are kept per session — loaded PDF metadata, the selection list, the original thumbnails, and the analysis result. Inputs and outputs are bound through Gradio components (file upload, gallery, Markdown, HTML, slider, radio); there is no shared state across tabs.

Processing and orchestration layer. vision.py contains the functions for image pre-processing, model calls, and orchestration of the multi-stage PDF analysis. Images are loaded with Pillow, EXIF-oriented, rescaled, and converted to RGB. PDF jobs are processed page by page through a generator function and intermediate results are returned to the interface.

Domain modules. Two separate modules encapsulate specialised tasks:

pdf_processor.py loads and validates PDFs with PyMuPDF, generates low-resolution thumbnails for the preview and higher-resolution page images for analysis, parses page-selection expressions, and returns structured result objects.
export_handler.py contains the data classes for the analysis result and exports it as Markdown, Word, or HTML. A dedicated Markdown-to-Word converter translates the Markdown returned by the model into native Word formatting (headings, lists, tables, inline emphasis, blockquotes, code).

Model access. All model calls go through two functions (call_api for vision requests, call_text_api for plain text requests) against an OpenAI-compatible chat completions endpoint. Endpoint, model name, and access key are configurable via environment variables; the application itself makes no assumptions about the concrete model.

Workflow¶

The typical course of a PDF analysis runs through several stages, shown in the diagram below:

flowchart TD
    User[User]
    UI[Gradio interface]

    subgraph Processing
        Validate[Validation & pre-processing]
        PDFLoad[Load PDF / metadata]
        Thumbs[Generate thumbnails]
        Select[Page selection]
        Render[Render page]
        ImgPrep[Image preparation]
        VisionCall[Vision model call]
        SummaryCall[Summary call]
        Export[Export Markdown / Word / HTML]
    end

    LLM[OpenAI-compatible<br/>chat completions endpoint]

    User -->|Upload image or PDF| UI
    UI --> Validate
    Validate -->|PDF| PDFLoad
    PDFLoad --> Thumbs
    Thumbs --> UI
    UI -->|Page selection| Select
    Select --> Render
    Validate -->|Image| ImgPrep
    Render --> ImgPrep
    ImgPrep --> VisionCall
    VisionCall <--> LLM
    VisionCall -->|per page| UI
    VisionCall --> SummaryCall
    SummaryCall <--> LLM
    SummaryCall --> UI
    UI -->|Export click| Export
    Export --> User

After upload, PDFs and image inputs are first checked (size, format, password protection, EXIF). For PDFs, the processing module generates low-resolution thumbnails for the preview gallery and returns the list to the interface. The user selects the pages to be analysed — either by clicking thumbnails, by manual range entry, or by choosing "all pages".

The actual analysis runs sequentially: for each selected page a higher-resolution image is rendered, encoded as a base64 data URL, and sent with the chosen prompt to the vision model. After each page, the current state is yielded back to the interface so that progress and intermediate results become visible. If a page fails, the error is recorded and processing continues with the next page.

Once all pages have been processed, the application assembles a second prompt from the successful per-page results and calls the model again without an image part to produce a consolidated overall summary. Only at this point is the complete result object available, from which the exports are generated.

Use of the model¶

The application uses a single vision language model in two modes: first as an image-processing describer for each individual page, and then as a pure text generator for the consolidating summary across the per-page results. Neither embedding methods, nor reranking, nor agentic control are used; the multi-stage character results from the two-stage prompt chain and the deterministic flow in the orchestration layer.

Concurrency and robustness¶

Processing of a PDF analysis runs synchronously within a single call but yields incremental updates to the Gradio layer through Python's generator mechanism. The interface therefore remains responsive during analysis and shows a progress display. Robustness is achieved through typed error classes (ErrorType, PDFErrorType) and consistent return objects: every operation yields a result with a success flag, data, and a typed error message where applicable, so that errors can be surfaced in the interface in a differentiated way. A per-page error tolerance is applied so that a single render or model failure does not abort the overall run.

Configuration and deployment¶

Configuration is supplied entirely through environment variables or an optional .env file (handled by python-dotenv). Configurable settings include the endpoint URL and model name for the LLM API, the server port and URL path of the Gradio application, and render and preview resolutions for PDF processing. The application stores no persistent data; temporary files from upload and export are cleaned up automatically by Gradio's session management. Telemetry is disabled; the fonts shipped are limited to system fonts.

Technology overview¶

Web interface — Gradio 6 with Blocks/Tabs layout, state components, and event bindings
Image processing — Pillow (PIL) for loading, EXIF correction, rescaling, colour-space conversion, JPEG encoding
PDF processing — PyMuPDF (fitz) for loading, metadata, thumbnail and page rendering
HTTP communication — requests for model calls and URL image retrieval
Configuration — python-dotenv for .env support
Word export — python-docx with custom Markdown-to-Word conversion
Markdown and HTML export — own dependency-free converters in the export_handler.py module
Model API — OpenAI-compatible chat completions endpoint with vision support