Flex-Kartierung: Highlights

Flex-Kartierung is an application for the structured capture of university offerings from publicly accessible web pages. Operators define in YAML files which content types are captured (for example research projects, AI services, guidelines, or cooperative services), which fields a record contains, and how it is rendered as a profile (Steckbrief). From submitted URLs, a multi-stage LLM-supported pipeline produces structured profiles and publishes them on a searchable static website. The application emphasizes two things: a dedicated quality-assurance layer that attaches a validation assessment and a confidence score to every extracted value, and a fully configurable profile definition.

At a glance

Produce structured profiles from arbitrary university web pages without manual copying per page.
Define custom content types with fields, prompts, and profile templates in YAML, so new profile types require no program code.
Inspect, correct, and selectively re-extract extraction results through a review queue.
Unify inconsistent spellings of universities and locations (for example "TU Munich" / "Technische Universität München") through entity management.
Generate bilingual profiles (German / English) automatically, with protection for proper names and URLs against unintended translation.
Export a fully static public website with category pages, detail pages, full-text search, sitemap, and API.
Run crawling, extraction, and generation as asynchronous background jobs without blocking the operating interface.
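
To make the idea of a configurable content type more concrete, the following is a minimal sketch of how a definition with fields, prompts, and a profile template might look once loaded from YAML. The class names, attribute names (`FieldSpec`, `ContentType`, `required_confidence`), the 0 to 1 confidence scale, and the `{field}` placeholder syntax are assumptions for illustration, not the application's actual schema.

```python
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """One field of a record (all attribute names here are hypothetical)."""
    name: str
    prompt: str
    required_confidence: float = 0.7  # assumed scale: 0.0 .. 1.0


@dataclass
class ContentType:
    """One capture type, as it might look after loading a YAML file."""
    key: str
    fields: list[FieldSpec]
    template: str  # Markdown template with {field_name} placeholders

    def render(self, values: dict[str, str]) -> str:
        # Fields without an extracted value get an explicit "n/a" marker.
        filled = {f.name: values.get(f.name, "n/a") for f in self.fields}
        return self.template.format(**filled)


research_project = ContentType(
    key="research_project",
    fields=[
        FieldSpec("title", "What is the title of the project?"),
        FieldSpec("institution", "Which university runs the project?", 0.8),
    ],
    template="# {title}\n\nInstitution: {institution}\n",
)

profile = research_project.render({"title": "Example Project"})
print(profile)
```

In this sketch, adding a new profile type means adding another `ContentType` instance, which mirrors the claim that a new type is a matter of configuration rather than code.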

Highlights

In contrast to a direct LLM prompt or a simple crawl script, Flex-Kartierung combines a configurable profile definition with a multi-stage processing pipeline and a dedicated quality-assurance layer. As a result, the evaluation of each field is traceable, each value can be corrected manually, and results remain reproducible over time.

Custom profile definition per content type. Categories, field groups, prompts, confidence thresholds, and Markdown templates are described entirely in YAML. A new capture type is created through a configuration file, not through code changes.
Two-stage extraction with a dedicated validation phase. For each field, the system performs an extract call followed by a validate call; the validate phase returns a quality class (HIGH / MEDIUM / LOW / INSUFFICIENT), a score, and a justification.
Confidence-based review queue. Fields below the per-prompt required confidence are automatically flagged for manual review and made available with their source context for inline editing.
LLM-supported entity normalization with thresholds and review. Extractions with entity references (universities, locations) are matched against a canonical entity inventory; above a minimum confidence the link is established automatically, otherwise a review item is created.
Prompt dependencies for context-aware fields. Results of individual fields can be passed as context into downstream prompts (for example "Which subject area at {institution} runs the project?"), executed in dependency-resolved waves.
Multi-stage translation pipeline with proper-name protection. Validated fields are translated to English individually, with explicit rules for personal, institutional, and product names, URLs, and technical abbreviations.
Connectors to external sources and backends: public university web pages via a crawler that respects robots.txt and rate limits; an OpenAI-compatible LLM backend; PostgreSQL for persistence; and Redis for queue and cache.
Robustness layer between pipeline and backend. A circuit breaker pauses requests to the LLM after repeated errors; a retry manager with exponential backoff and jitter absorbs transient failures.
Fully traceable processing. For each source, the crawled Markdown, the raw extract result, the validated result, manual corrections, and translations are stored separately; processing is observable through structured logs and Prometheus metrics.
Static delivery of the public content. The generated website needs no application server and can be served as plain files behind a CDN or web server; client-side search and API endpoints for machine consumers are included.
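
The confidence-based review queue described in the highlights can be illustrated with a short sketch. It assumes that the validate phase yields a quality class, a score between 0 and 1, and a justification, and that a field is flagged when its score falls below the per-prompt required confidence. The names `ValidatedField`, `needs_review`, and `build_review_queue`, as well as the ordering by lowest score first, are hypothetical choices for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Quality(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"
    INSUFFICIENT = "INSUFFICIENT"


@dataclass
class ValidatedField:
    name: str
    value: str
    quality: Quality
    score: float                # assumed range: 0.0 .. 1.0
    justification: str
    required_confidence: float  # per-prompt threshold (assumed semantics)


def needs_review(f: ValidatedField) -> bool:
    """Flag a field when its score is below the per-prompt threshold."""
    return f.score < f.required_confidence


def build_review_queue(fields: list[ValidatedField]) -> list[ValidatedField]:
    # Lowest scores first, so reviewers see the weakest values at the top.
    return sorted((f for f in fields if needs_review(f)), key=lambda f: f.score)


fields = [
    ValidatedField("title", "Example Project", Quality.HIGH, 0.95,
                   "stated in the page heading", 0.7),
    ValidatedField("contact", "unknown", Quality.LOW, 0.30,
                   "no contact section found", 0.7),
    ValidatedField("institution", "TU Munich", Quality.MEDIUM, 0.65,
                   "mentioned only in the footer", 0.8),
]

queue = build_review_queue(fields)
print([f.name for f in queue])
```

Note that "institution" is flagged even though its score is higher than the default threshold, because its own prompt requires 0.8; this is the effect of a per-prompt rather than a global threshold.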
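
The dependency-resolved waves mentioned under prompt dependencies can be sketched with Python's standard `graphlib` module. This is one plausible way to plan such waves, not necessarily how the application does it; the field names and the `plan_waves` helper are assumptions. Every field in one wave depends only on fields from earlier waves, so the prompts within a wave could run concurrently.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group fields into waves; each wave only needs results of earlier waves.

    `dependencies` maps a field name to the set of fields whose results
    its prompt needs as context.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical field graph: "subject_area" needs the extracted institution.
deps = {
    "title": set(),
    "institution": set(),
    "subject_area": {"institution"},
    "contact": {"institution", "subject_area"},
}

waves = plan_waves(deps)
print(waves)

# Results of earlier waves are substituted into downstream prompts.
prompt = "Which subject area at {institution} runs the project?"
print(prompt.format(institution="Technische Universität München"))
```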
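
The robustness layer combines two well-known patterns, and a compact sketch may help show how they interact. The code below is a generic illustration under stated assumptions: the thresholds, the cooldown, the "full jitter" delay formula, and all names (`CircuitBreaker`, `call_with_retry`, `backoff_delay`) are hypothetical and not taken from the application.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # time at which the breaker opened, if any

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("backend paused after repeated errors")
            # Cooldown elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the breaker again
        return result


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(fn, breaker, max_attempts=4, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # do not keep hitting a paused backend
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))


# Demo: a backend call that fails twice, then succeeds.
calls = {"n": 0}


def flaky_llm_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = call_with_retry(flaky_llm_call, CircuitBreaker(failure_threshold=5),
                         sleep=delays.append)
print(result, len(delays))
```

The retry loop handles short, transient failures, while the breaker stops further attempts once errors accumulate, so a struggling LLM backend is not flooded with retries.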