Vision Model Interface¶

Vision Model Interface is a web-based application for analysing images and PDF documents with vision language models. It accepts images or multi-page PDFs, forwards them to a vision-capable model via an OpenAI-compatible interface, and returns the results in a structured form. For PDFs, individual pages are analysed and consolidated into an overall summary; results can be exported in several formats.

At a glance¶

Upload images and PDF documents through the browser and have them described or analysed by a vision model
Pick individual pages or page ranges from a PDF for targeted analysis
Choose between predefined analysis modes (e.g. alt-text generation for accessibility, text recognition, brief or detailed analysis) or supply a custom prompt
Generate accessible text descriptions from posters and other visual sources
Receive a coherent overall summary across all selected pages of a multi-page PDF
Export results as Markdown, Word, or HTML files for downstream use

Highlights¶

In contrast to a direct prompt to a language model, the application takes care of the full preparation of visual material, the orchestration of multi-step analyses, and the format-conformant output. The result is output that can be reused without further manual processing.

Two-stage PDF processing — Each selected page is analysed individually by the vision model; a follow-up text-only call consolidates the per-page results into an overall summary. Token budgets and image detail level are configurable separately for the per-page calls and for the summary call.
Click-based page selection — A preview gallery shows PDF pages as thumbnails. Clicking a thumbnail adds or removes the page from the selection; selected pages are visually marked with a coloured border and check-mark icon. Alternatives are an "all pages" mode and a manual range expression (e.g. 1-3, 5, 7-10).
Predefined prompt templates — Tested prompt templates are provided for recurring tasks such as detailed page analysis, alt-text generation, brief summarisation, and text recognition (OCR). Custom prompts can be added.
Multiple image inputs — Images can be supplied via file upload, URL, clipboard, or webcam. Local files and images fetched over HTTP are handled identically.
Automatic image preparation — Before analysis, the application corrects EXIF orientation, downscales large images to a model-compatible edge length, converts to RGB, and encodes as JPEG. Inputs are therefore processed consistently regardless of source and format.
Three export formats — Results are available as Markdown, Word, or HTML files. Word and HTML exports embed the corresponding page images so that result and source are documented together.
Accessibility as a use case — The alt-text template generates structured, screen-reader-suitable descriptions of images, diagrams, and graphics on a page. Existing posters and printed material can therefore be supplemented with accessible text descriptions after the fact.
Configurable model binding — The interface expects an OpenAI-compatible chat completions endpoint with vision support. Endpoint, model name, and access key are set through environment variables; the model is exchangeable.
GDPR-conformant defaults — The application is delivered without Google Fonts and without telemetry. Temporary files from upload and export are cleaned up automatically through the built-in session management.