Flex-Kartierung: Highlights
Flex-Kartierung is an application for the structured capture of university offerings from publicly accessible web pages. Operators define in YAML files which content types are captured (for example research projects, AI services, guidelines, or cooperative services), which fields a record contains, and how it is rendered as a profile (Steckbrief). A multi-stage, LLM-supported pipeline turns submitted URLs into structured profiles and publishes them on a searchable static website. The application emphasizes two things: a dedicated quality-assurance layer that assigns a checked statement and a confidence score to every extracted value, and a fully configurable profile definition.
At a glance
- Produce structured profiles from arbitrary university web pages without manual copying per page.
- Define custom content types with fields, prompts, and profile templates in YAML, so new profile types require no program code.
- Inspect, correct, and selectively re-extract extraction results through a review queue.
- Unify inconsistent spellings of universities and locations (for example "TU Munich" / "Technische Universität München") through entity management.
- Generate bilingual profiles (German / English) automatically, with protection for proper names and URLs against unintended translation.
- Export a fully static public website with category pages, detail pages, full-text search, sitemap, and API.
- Run crawling, extraction, and generation as asynchronous background jobs without blocking the operating interface.
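The review queue mentioned in the list above can be illustrated with a minimal sketch. It assumes that every validated field carries a confidence score between 0 and 1 and a required confidence taken from its prompt configuration; the `FieldResult` class, its attribute names, and the sample values are hypothetical and not taken from the application.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    """One extracted field after validation (hypothetical shape)."""
    name: str
    value: str
    score: float                # confidence from validation, assumed in [0.0, 1.0]
    required_confidence: float  # threshold assumed to come from the prompt config


def split_for_review(
    results: list[FieldResult],
) -> tuple[list[FieldResult], list[FieldResult]]:
    """Return (accepted, review_queue); fields below their threshold go to review."""
    accepted = [r for r in results if r.score >= r.required_confidence]
    review_queue = [r for r in results if r.score < r.required_confidence]
    return accepted, review_queue


# Illustrative sample data only.
results = [
    FieldResult("institution", "Technische Universität München", 0.93, 0.80),
    FieldResult("contact_email", "info@example.org", 0.55, 0.90),
]
accepted, review_queue = split_for_review(results)
print([r.name for r in accepted])      # fields that pass their threshold
print([r.name for r in review_queue])  # fields flagged for manual review
```

Keeping the threshold on each field rather than using one global value mirrors the idea that different prompts can demand different levels of confidence.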
Highlights
In contrast to a direct LLM prompt or a simple crawl script, Flex-Kartierung combines a configurable profile definition with a multi-stage processing pipeline and a dedicated quality-assurance layer. As a result, each field is traceably evaluated, manually correctable, and reproducible over time.
- Custom profile definition per content type. Categories, field groups, prompts, confidence thresholds, and Markdown templates are described entirely in YAML. A new capture type is created through a configuration file, not through code changes.
- Two-stage extraction with a dedicated validation phase. For each field, the system performs an extract call followed by a validate call; the validate phase returns a quality class (HIGH / MEDIUM / LOW / INSUFFICIENT), a score, and a justification.
- Confidence-based review queue. Fields below the per-prompt required confidence are automatically flagged for manual review and made available with their source context for inline editing.
- LLM-supported entity normalization with thresholds and review. Extractions with entity references (universities, locations) are matched against a canonical entity inventory; above a minimum confidence the link is established automatically, otherwise a review item is created.
- Prompt dependencies for context-aware fields. Results of individual fields can be passed as context into downstream prompts (for example "Which subject area at {institution} runs the project?"); the prompts are executed in dependency-resolved waves.
- Multi-stage translation pipeline with proper-name protection. Validated fields are translated to English individually, with explicit rules for personal, institutional, and product names, URLs, and technical abbreviations.
- Connectors to the source and backend systems: public university web pages via a crawler with robots.txt and rate-limit awareness; an OpenAI-compatible LLM backend; PostgreSQL for persistence; Redis for queue and cache.
- Robustness layer between pipeline and backend. A circuit breaker pauses requests to the LLM after repeated errors; a retry manager with exponential backoff and jitter absorbs transient failures.
- Fully traceable processing. For each source, the crawled Markdown, the raw extract result, the validated result, manual corrections, and translations are stored separately; processing is observable through structured logs and Prometheus metrics.
- Static delivery of the public content. The generated website needs no application server and can be operated as plain files behind a CDN or web server; client-side search and API endpoints for machine consumers are included.
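The dependency-resolved waves mentioned under prompt dependencies can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the `PROMPTS` mapping, the field names, and the helper functions are hypothetical, and the ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical prompt definitions: field name -> (template, fields it depends on).
PROMPTS = {
    "institution": ("Which institution publishes this page?", set()),
    "project_title": ("What is the title of the project?", set()),
    "subject_area": (
        "Which subject area at {institution} runs the project?",
        {"institution"},
    ),
}


def plan_waves(prompts: dict[str, tuple[str, set[str]]]) -> list[list[str]]:
    """Group prompts into waves; each wave depends only on earlier waves."""
    graph = {name: deps for name, (_, deps) in prompts.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


def render(name: str, prompts: dict, answers: dict[str, str]) -> str:
    """Fill a prompt template with the answers of the fields it depends on."""
    template, deps = prompts[name]
    return template.format(**{dep: answers[dep] for dep in deps})


print(plan_waves(PROMPTS))
print(render("subject_area", PROMPTS, {"institution": "TU Munich"}))
```

Prompts within one wave have no dependencies on each other, so they could be sent to the LLM backend concurrently, while a later wave waits for the answers it needs as context.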