Skip to content

Features

Vision Model Interface offers two main work areas: single-image analysis and multi-page PDF analysis. Both share the same model access but differ in input, control, and output. The emphasis is on repeatable, quality-assured workflows — for example, the generation of alt-texts or the page-by-page processing of large documents.

Application scenarios

  • Accessible versions of posters — Scanned posters or poster files are turned into structured image descriptions through the alt-text template. The descriptions are screen-reader-suitable and support retrofitting accessibility on existing material.
  • Detailed analysis of individual images — Charts, diagrams, or photographs are analysed with a freely formulated prompt — for example, to describe content, extract data from charts, or perform structured visual analysis.
  • Processing of multi-page PDF documents — Scanned or generated PDFs are analysed page by page and condensed into an overall summary, making the content of long documents accessible at a glance.
  • Text recognition from visual material — The OCR template extracts visible text from pages and returns it with structure preserved (headings, paragraphs, lists), suitable for further processing of scanned originals.
  • Per-page brief summaries — Long documents are reduced to two or three sentences per page — for example, for pre-screening or to produce compact tables of contents.

At a glance

  • Two work areas: image analysis and PDF analysis with page preview
  • Four image input sources: upload, URL, clipboard, webcam
  • Five prompt templates (detailed analysis, alt-text, brief summary, OCR, custom)
  • Three modes for PDF page selection: all pages, click-based selection, manual range entry
  • Three export formats: Markdown, Word, HTML — Word and HTML with embedded page images
  • Configurable token budgets per page and for the summary
  • Selectable image detail level (auto / low / high)

Input sources

The application processes images and PDFs from various sources without further external data connectors. The material to be analysed is supplied directly by the user.

  • File upload (image) — Local image files are submitted through the browser interface. Common image formats are supported; the image is automatically prepared before analysis.
  • URL input (image) — Images are fetched over HTTP/HTTPS directly from a given address. Content type and size are checked before processing.
  • Clipboard (image) — Pasted images are taken directly from the clipboard, without intermediate storage on the local device.
  • Webcam (image) — Captures can be taken directly within the browser interface — for example, to analyse material currently at hand.
  • File upload (PDF) — PDF documents are uploaded, validated, and prepared as a thumbnail gallery for preview. Password-protected and invalid files are rejected with a comprehensible error message.

Model binding

The application communicates with an OpenAI-compatible chat completions endpoint that supports vision input. Endpoint URL, model name, and access key are configured through environment variables; the concrete model instance is therefore exchangeable. The image part of a request is passed as a base64-encoded data URL; the image detail level (auto, low, high) can be selected per request.

Analysis modes

For PDF analyses, prepared prompt templates are available, each aimed at a specific output format:

  • Detailed analysis — Structured description of main content, illustrations, relevant data, and layout of a page.
  • Alt-text for images — Accessible descriptions of all visual elements in a screen-reader-suitable format with element type, content, and function.
  • Brief summary — Condensation of a page to two or three sentences.
  • Text recognition (OCR) — Full transcription of visible text with structure preserved (headings, paragraphs, lists).
  • Custom — Free-form prompt for special cases.

In the image analysis area, the prompt is formulated freely; the templates do not apply there.

Page selection in PDFs

Three modes determine which pages of a PDF enter the analysis:

  • All pages — Sequential analysis of the entire document.
  • Selection from preview — Clicking thumbnails adds or removes individual pages from the selection; selected pages receive a coloured border and check-mark indicator.
  • Manual entry — Mixed lists and ranges (e.g. 1-3, 5, 7-10); keywords such as last and the to-end pattern (5-last) are also recognised.

Export formats

Results of a PDF analysis can be written out in three formats. Image inputs are not currently exported; the result can be copied from the browser.

  • Markdown (.md) — Plain text output with metadata, summary, and per-page analyses. Suitable for versioning and reuse in documentation systems.
  • Word (.docx) — Fully formatted document with heading hierarchy, lists, tables, and embedded page images. Markdown returned by the model is translated into native Word formatting.
  • HTML (.html) — Self-contained HTML file with base64-embedded page images and minimal CSS. Can be passed on or archived without further dependencies.

Quality-assurance functions

Several mechanisms support reproducible and traceable results:

  • Input validation — PDFs are checked for existence, size, validity, and password protection. Images are checked for availability, size, and format before the model call.
  • Image pre-processing — EXIF orientation is corrected, large images are rescaled to a model-compatible edge length, RGBA/palette images are flattened against a white background and converted to RGB. The material delivered to the model is therefore uniform regardless of source.
  • Markdown normalisation — Unicode variants of Markdown characters returned by the model (full-width asterisk, smart quotes, and others) are converted back to ASCII before further processing, so that Word, HTML, and Markdown exports render uniformly.
  • Per-page error tolerance — If the analysis of a single PDF page fails (render error or model error), the run is not aborted. Successful pages flow into the overall summary; failed pages are documented with their error message.
  • Structured error classes — Errors are classified by type (network, API, image, validation, configuration) and surfaced to the interface with comprehensible messages.
  • Progress and cost display — During analysis, processing status is shown page by page. After completion, the number of analysed pages, runtime, and the token usage reported by the endpoint are displayed.
  • Reproducible configuration — Model access, model name, render resolutions, and processing parameters are set through environment variables or a .env file. An analysis with identical settings can therefore be repeated.