Flex-Kartierung — Highlights

Flex-Kartierung is an application for the structured capture of university offerings from publicly accessible web pages. In YAML files, operators define which content types to capture (for example research projects, AI services, guidelines, or cooperative services), which fields a record contains, and how it is rendered as a profile (Steckbrief). A multi-stage, LLM-supported pipeline turns submitted URLs into structured profiles and publishes them on a searchable static website. The application emphasizes two things: a dedicated quality-assurance layer that assigns a checked statement and a confidence score to every extracted value, and a fully configurable profile definition.

At a glance

  • Produce structured profiles from arbitrary university web pages without manual copying per page.
  • Define custom content types with fields, prompts, and profile templates in YAML — new profile types without program code.
  • Inspect, correct, and selectively re-extract extraction results through a review queue.
  • Unify inconsistent spellings of universities and locations (for example "TU Munich" / "Technische Universität München") through entity management.
  • Generate bilingual profiles (German / English) automatically, with protection for proper names and URLs against unintended translation.
  • Export a fully static public website with category pages, detail pages, full-text search, sitemap, and API.
  • Run crawling, extraction, and generation as asynchronous background jobs without blocking the operating interface.

Highlights

In contrast to a direct LLM prompt or a simple crawl script, Flex-Kartierung combines a configurable profile definition with a multi-stage processing pipeline and a dedicated quality-assurance layer. As a result, each field is traceably evaluated, manually correctable, and reproducible over time.

  • Custom profile definition per content type. Categories, field groups, prompts, confidence thresholds, and Markdown templates are described entirely in YAML. A new capture type is created through a configuration file, not through code changes.

  • Two-stage extraction with a dedicated validation phase. For each field, the system performs an extract call followed by a validate call; the validate phase returns a quality class (HIGH / MEDIUM / LOW / INSUFFICIENT), a score, and a justification.

  • Confidence-based review queue. Fields below the per-prompt required confidence are automatically flagged for manual review and made available with their source context for inline editing.

  • LLM-supported entity normalization with thresholds and review. Extractions with entity references (universities, locations) are matched against a canonical entity inventory; above a minimum confidence the link is established automatically, otherwise a review item is created.

  • Prompt dependencies for context-aware fields. Results of individual fields can be passed as context into downstream prompts (for example "Which subject area at {institution} runs the project?"). Prompts are executed in dependency-resolved waves, so a prompt runs only after the fields it depends on are available.

  • Multi-stage translation pipeline with proper-name protection. Validated fields are translated to English individually, with explicit rules for personal, institutional, and product names, URLs, and technical abbreviations.

  • Connectors to four source/backend types: public university web pages via a crawler with robots.txt and rate-limit awareness; an OpenAI-compatible LLM backend; PostgreSQL for persistence; Redis for queue and cache.

  • Robustness layer between pipeline and backend. A circuit breaker pauses requests to the LLM after repeated errors; a retry manager with exponential backoff and jitter absorbs transient failures.

  • Fully traceable processing. For each source, the crawled Markdown, the raw extract result, the validated result, manual corrections, and translations are stored separately; processing is observable through structured logs and Prometheus metrics.

  • Static delivery of the public content. The generated website needs no application server and can be operated as plain files behind a CDN or web server; client-side search and API endpoints for machine consumers are included.
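
The dependency-resolved waves mentioned under prompt dependencies can be illustrated with a small sketch. It is not taken from the Flex-Kartierung code base: the function name resolve_waves, the field names, and the dictionary layout are assumptions chosen for illustration. Each wave contains only fields whose dependencies were completed in earlier waves, so the prompts within one wave could run in parallel.

```python
from typing import Dict, List, Set


def resolve_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group fields into waves; a field appears only after all its dependencies."""
    remaining = {field: set(deps) for field, deps in dependencies.items()}

    # Reject references to fields that are not defined at all.
    for field, deps in remaining.items():
        unknown = deps - set(remaining)
        if unknown:
            raise ValueError(f"{field} depends on unknown fields: {sorted(unknown)}")

    done: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        # A field is ready when every dependency was finished in an earlier wave.
        ready = sorted(f for f, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"cyclic dependency among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for field in ready:
            del remaining[field]
    return waves


# Hypothetical field definitions: each field maps to the fields it needs as context.
FIELD_DEPENDENCIES = {
    "institution": set(),
    "project_title": set(),
    "subject_area": {"institution"},
    "contact": {"subject_area", "institution"},
}

waves = resolve_waves(FIELD_DEPENDENCIES)

# Results of earlier waves are substituted into downstream prompt templates.
template = "Which subject area at {institution} runs the project?"
prompt = template.format(institution="Technische Universität München")
```

With the definitions above, institution and project_title form the first wave, subject_area the second, and contact the third. A cycle between fields is reported as an error instead of looping forever.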
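
The confidence-based review queue can be sketched as a simple filter over validated fields. This is a minimal illustration under stated assumptions: the class ValidatedField, the 0 to 1 score scale, and the field names are hypothetical; only the four quality classes and the idea of a per-prompt required confidence come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class QualityClass(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"
    INSUFFICIENT = "INSUFFICIENT"


@dataclass
class ValidatedField:
    name: str
    value: str
    quality: QualityClass
    score: float                 # assumed scale: 0.0 (no confidence) to 1.0
    justification: str
    required_confidence: float   # per-prompt threshold (assumed to share the scale)


def needs_review(field: ValidatedField) -> bool:
    """A field is flagged when its score is below the prompt's required confidence."""
    return field.score < field.required_confidence


def build_review_queue(fields: Iterable[ValidatedField]) -> List[ValidatedField]:
    """Return flagged fields, least confident first."""
    return sorted((f for f in fields if needs_review(f)), key=lambda f: f.score)


# Hypothetical validation results for one source page.
results = [
    ValidatedField("institution", "TU Munich", QualityClass.HIGH, 0.95,
                   "Named in the page header.", 0.80),
    ValidatedField("funding_period", "2023-2026", QualityClass.MEDIUM, 0.55,
                   "Dates appear only in a news item.", 0.70),
    ValidatedField("contact", "unknown", QualityClass.INSUFFICIENT, 0.30,
                   "No contact person found on the page.", 0.60),
]

queue = build_review_queue(results)
queue_names = [f.name for f in queue]
```

Here contact and funding_period end up in the queue, in that order, while institution passes its threshold. Sorting by score is a design choice of this sketch, not a documented behavior.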
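
The robustness layer combines two well-known patterns, a circuit breaker and retries with exponential backoff and jitter. The sketch below shows one common way to combine them. It is not the application's implementation: all names, default values, the "full jitter" strategy, and the choice of retryable exception types are assumptions.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are paused."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(func: Callable[[], T], breaker: CircuitBreaker,
                    max_attempts: int = 4,
                    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("backend paused by circuit breaker")
        try:
            result = func()
        except retry_on:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last transient error
            sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
    raise AssertionError("unreachable")


# Usage: a hypothetical backend call that fails twice, then succeeds.
calls = {"count": 0}

def flaky_llm_call() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # record delays instead of really sleeping
answer = call_with_retry(flaky_llm_call, CircuitBreaker(failure_threshold=5),
                         sleep=delays.append)
```

Only exception types listed in retry_on are treated as transient; anything else propagates immediately. Injecting sleep, clock, and rng keeps the sketch testable without real waiting.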