Features¶
Vision Model Interface offers two main work areas: single-image analysis and multi-page PDF analysis. Both share the same model access but differ in input, control, and output. The emphasis is on repeatable, quality-assured workflows — for example, the generation of alt-texts or the page-by-page processing of large documents.
Application scenarios¶
- Accessible versions of posters — Scanned posters or poster files are turned into structured image descriptions through the alt-text template. The descriptions are screen-reader-suitable and support retrofitting accessibility on existing material.
- Detailed analysis of individual images — Charts, diagrams, or photographs are analysed with a freely formulated prompt — for example, to describe content, extract data from charts, or perform structured visual analysis.
- Processing of multi-page PDF documents — Scanned or generated PDFs are analysed page by page and condensed into an overall summary, making the content of long documents accessible at a glance.
- Text recognition from visual material — The OCR template extracts visible text from pages and returns it with structure preserved (headings, paragraphs, lists), suitable for further processing of scanned originals.
- Per-page brief summaries — Long documents are reduced to two or three sentences per page — for example, for pre-screening or to produce compact tables of contents.
At a glance¶
- Two work areas: image analysis and PDF analysis with page preview
- Four image input sources: upload, URL, clipboard, webcam
- Five prompt templates (detailed analysis, alt-text, brief summary, OCR, custom)
- Three modes for PDF page selection: all pages, click-based selection, manual range entry
- Three export formats: Markdown, Word, HTML — Word and HTML with embedded page images
- Configurable token budgets per page and for the summary
- Selectable image detail level (auto / low / high)
Input sources¶
The application processes images and PDFs from various sources without further external data connectors. The material to be analysed is supplied directly by the user.
- File upload (image) — Local image files are submitted through the browser interface. Common image formats are supported; the image is automatically prepared before analysis.
- URL input (image) — Images are fetched over HTTP/HTTPS directly from a given address. Content type and size are checked before processing.
- Clipboard (image) — Pasted images are taken directly from the clipboard, without intermediate storage on the local device.
- Webcam (image) — Captures can be taken directly within the browser interface — for example, to analyse material currently at hand.
- File upload (PDF) — PDF documents are uploaded, validated, and prepared as a thumbnail gallery for preview. Password-protected and invalid files are rejected with a comprehensible error message.
Model binding¶
The application communicates with an OpenAI-compatible chat completions endpoint that supports vision input. Endpoint URL, model name, and access key are configured through environment variables; the concrete model instance is therefore exchangeable. The image part of a request is passed as a base64-encoded data URL; the image detail level (auto, low, high) can be selected per request.
Analysis modes¶
For PDF analyses, prepared prompt templates are available, each aimed at a specific output format:
- Detailed analysis — Structured description of main content, illustrations, relevant data, and layout of a page.
- Alt-text for images — Accessible descriptions of all visual elements in a screen-reader-suitable format with element type, content, and function.
- Brief summary — Condensation of a page to two or three sentences.
- Text recognition (OCR) — Full transcription of visible text with structure preserved (headings, paragraphs, lists).
- Custom — Free-form prompt for special cases.
In the image analysis area, the prompt is formulated freely; the templates do not apply there.
Page selection in PDFs¶
Three modes determine which pages of a PDF enter the analysis:
- All pages — Sequential analysis of the entire document.
- Selection from preview — Clicking thumbnails adds or removes individual pages from the selection; selected pages receive a coloured border and check-mark indicator.
- Manual entry — Mixed lists and ranges (e.g.
1-3, 5, 7-10); keywords such aslastand the to-end pattern (5-last) are also recognised.
Export formats¶
Results of a PDF analysis can be written out in three formats. Image inputs are not currently exported; the result can be copied from the browser.
- Markdown (
.md) — Plain text output with metadata, summary, and per-page analyses. Suitable for versioning and reuse in documentation systems. - Word (
.docx) — Fully formatted document with heading hierarchy, lists, tables, and embedded page images. Markdown returned by the model is translated into native Word formatting. - HTML (
.html) — Self-contained HTML file with base64-embedded page images and minimal CSS. Can be passed on or archived without further dependencies.
Quality-assurance functions¶
Several mechanisms support reproducible and traceable results:
- Input validation — PDFs are checked for existence, size, validity, and password protection. Images are checked for availability, size, and format before the model call.
- Image pre-processing — EXIF orientation is corrected, large images are rescaled to a model-compatible edge length, RGBA/palette images are flattened against a white background and converted to RGB. The material delivered to the model is therefore uniform regardless of source.
- Markdown normalisation — Unicode variants of Markdown characters returned by the model (full-width asterisk, smart quotes, and others) are converted back to ASCII before further processing, so that Word, HTML, and Markdown exports render uniformly.
- Per-page error tolerance — If the analysis of a single PDF page fails (render error or model error), the run is not aborted. Successful pages flow into the overall summary; failed pages are documented with their error message.
- Structured error classes — Errors are classified by type (network, API, image, validation, configuration) and surfaced to the interface with comprehensible messages.
- Progress and cost display — During analysis, processing status is shown page by page. After completion, the number of analysed pages, runtime, and the token usage reported by the endpoint are displayed.
- Reproducible configuration — Model access, model name, render resolutions, and processing parameters are set through environment variables or a
.envfile. An analysis with identical settings can therefore be repeated.