Features¶

Vision Model Interface offers two main work areas: single-image analysis and multi-page PDF analysis. Both share the same model access but differ in input, control, and output. The emphasis is on repeatable, quality-assured workflows — for example, the generation of alt-texts or the page-by-page processing of large documents.

Application scenarios¶

Accessible versions of posters — Scanned posters or poster files are turned into structured image descriptions through the alt-text template. The descriptions are screen-reader-suitable and support retrofitting accessibility on existing material.
Detailed analysis of individual images — Charts, diagrams, or photographs are analysed with a freely formulated prompt — for example, to describe content, extract data from charts, or perform structured visual analysis.
Processing of multi-page PDF documents — Scanned or generated PDFs are analysed page by page and condensed into an overall summary, making the content of long documents accessible at a glance.
Text recognition from visual material — The OCR template extracts visible text from pages and returns it with structure preserved (headings, paragraphs, lists), suitable for further processing of scanned originals.
Per-page brief summaries — Long documents are reduced to two or three sentences per page — for example, for pre-screening or to produce compact tables of contents.

At a glance¶

Two work areas: image analysis and PDF analysis with page preview
Four image input sources: upload, URL, clipboard, webcam
Five prompt templates (detailed analysis, alt-text, brief summary, OCR, custom)
Three modes for PDF page selection: all pages, click-based selection, manual range entry
Three export formats: Markdown, Word, HTML — Word and HTML with embedded page images
Configurable token budgets per page and for the summary
Selectable image detail level (auto / low / high)

Input sources¶

The application processes images and PDFs from various sources without further external data connectors. The material to be analysed is supplied directly by the user.

File upload (image) — Local image files are submitted through the browser interface. Common image formats are supported; the image is automatically prepared before analysis.
URL input (image) — Images are fetched over HTTP/HTTPS directly from a given address. Content type and size are checked before processing.
Clipboard (image) — Pasted images are taken directly from the clipboard, without intermediate storage on the local device.
Webcam (image) — Captures can be taken directly within the browser interface — for example, to analyse material currently at hand.
File upload (PDF) — PDF documents are uploaded, validated, and prepared as a thumbnail gallery for preview. Password-protected and invalid files are rejected with a comprehensible error message.

Model binding¶

The application communicates with an OpenAI-compatible chat completions endpoint that supports vision input. Endpoint URL, model name, and access key are configured through environment variables; the concrete model instance is therefore exchangeable. The image part of a request is passed as a base64-encoded data URL; the image detail level (auto, low, high) can be selected per request.

Analysis modes¶

For PDF analyses, prepared prompt templates are available, each aimed at a specific output format:

Detailed analysis — Structured description of main content, illustrations, relevant data, and layout of a page.
Alt-text for images — Accessible descriptions of all visual elements in a screen-reader-suitable format with element type, content, and function.
Brief summary — Condensation of a page to two or three sentences.
Text recognition (OCR) — Full transcription of visible text with structure preserved (headings, paragraphs, lists).
Custom — Free-form prompt for special cases.

In the image analysis area, the prompt is formulated freely; the templates do not apply there.

Page selection in PDFs¶

Three modes determine which pages of a PDF enter the analysis:

All pages — Sequential analysis of the entire document.
Selection from preview — Clicking thumbnails adds or removes individual pages from the selection; selected pages receive a coloured border and check-mark indicator.
Manual entry — Mixed lists and ranges (e.g. 1-3, 5, 7-10); keywords such as last and the to-end pattern (5-last) are also recognised.

Export formats¶

Results of a PDF analysis can be written out in three formats. Image inputs are not currently exported; the result can be copied from the browser.

Markdown (.md) — Plain text output with metadata, summary, and per-page analyses. Suitable for versioning and reuse in documentation systems.
Word (.docx) — Fully formatted document with heading hierarchy, lists, tables, and embedded page images. Markdown returned by the model is translated into native Word formatting.
HTML (.html) — Self-contained HTML file with base64-embedded page images and minimal CSS. Can be passed on or archived without further dependencies.

Quality-assurance functions¶

Several mechanisms support reproducible and traceable results:

Input validation — PDFs are checked for existence, size, validity, and password protection. Images are checked for availability, size, and format before the model call.
Image pre-processing — EXIF orientation is corrected, large images are rescaled to a model-compatible edge length, RGBA/palette images are flattened against a white background and converted to RGB. The material delivered to the model is therefore uniform regardless of source.
Markdown normalisation — Unicode variants of Markdown characters returned by the model (full-width asterisk, smart quotes, and others) are converted back to ASCII before further processing, so that Word, HTML, and Markdown exports render uniformly.
Per-page error tolerance — If the analysis of a single PDF page fails (render error or model error), the run is not aborted. Successful pages flow into the overall summary; failed pages are documented with their error message.
Structured error classes — Errors are classified by type (network, API, image, validation, configuration) and surfaced to the interface with comprehensible messages.
Progress and cost display — During analysis, processing status is shown page by page. After completion, the number of analysed pages, runtime, and the token usage reported by the endpoint are displayed.
Reproducible configuration — Model access, model name, render resolutions, and processing parameters are set through environment variables or a .env file. An analysis with identical settings can therefore be repeated.