LLM coding experiment: Development of a generic information extraction system#
Part of a series on methodological insights from LLM-supported development projects
Introduction and context of the series#
Over several months, various software tools with LLM support were developed, primarily as learning projects to explore the possibilities and limitations of LLM-supported coding. This documentation describes one of these projects: a generic system for extracting information from websites that generates structured profiles.
The focus of this documentation is on methodological insights into LLM-supported development, not on the tool itself. The experiences described are intended to provide transferable principles for similar projects.
Initial situation and motivation#
The problem with the previous system#
The project did not arise as an isolated experiment, but as a response to the limitations of an existing system. An ‘AI mapping’ system for recording AI initiatives at German universities already existed, but proved to be too inflexible. The system could only generate one type of profile. Each new mapping request – such as AI handouts, AI services, general research projects or cooperative projects to provide services to several universities – would have required considerable programming effort.
The inflexibility manifested itself on several levels: new fields required code changes, alternative profile structures were not supported, and quality assurance of the extracted data was insufficient.
Objective of the new system#
The central requirement was a system that would enable different types of mapping without the need for individual programming for each profile type. Flexibility was to be achieved through configuration rather than code.
Specifically, the following design goals were defined:
- Prompt-based configuration of new data fields in minutes instead of weeks
- Quality assurance as an integral part of the extraction process
- Entity normalisation for comprehensive searchability
- Scalable architecture for parallel processing of many sources
- Admin interface for prompt management, review, and inline editing
The system should also be usable in production, not just serve as a prototype.
How the system works#
Basic concept: From Markdown to profile#
The system follows a clear pipeline: Website → Markdown → LLM extraction → LLM validation → structured data → profile → static website.
The first step is to crawl the source URL. Since relevant information is often spread across several subpages – team information on /team, contact details on /contact, details on /publications – the system crawls up to five pages per source. Subpages are prioritised by rule-based keyword scoring: links containing terms such as ‘team’, ‘contact’ or ‘about’ are given higher priority. This rule-based prioritisation was a conscious decision against an LLM-based variant in order to reduce complexity and API costs.
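The keyword scoring described above can be sketched in a few lines. This is a hypothetical illustration, not the system's actual code; the keyword weights and the helper names (`score_link`, `prioritise_links`) are assumptions, only the five-page limit comes from the text.

```python
# Hypothetical sketch of rule-based link prioritisation via keyword scoring.
# Keyword weights are illustrative assumptions; the page limit is from the text.
KEYWORD_SCORES = {"team": 10, "contact": 10, "about": 8, "publications": 6}
MAX_PAGES = 5  # the system crawls up to five pages per source

def score_link(url: str) -> int:
    """Score a link by keywords found in its URL."""
    path = url.lower()
    return sum(score for kw, score in KEYWORD_SCORES.items() if kw in path)

def prioritise_links(links: list[str]) -> list[str]:
    """Return the highest-scoring subpages, capped at MAX_PAGES - 1
    (the start page itself occupies one slot)."""
    ranked = sorted(links, key=score_link, reverse=True)
    return ranked[: MAX_PAGES - 1]
```

Compared with an LLM-based ranker, this needs no API call per link and behaves deterministically, which is exactly the complexity/cost trade-off the design decision aimed at.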
The crawled HTML is converted into structured Markdown, which summarises all pages with metadata (URL, page type, crawl time). This consolidated Markdown forms the basis for all further extractions.
Prompt-based configuration#
At the core of the system is a prompt-based configuration concept. A profile is defined as a Markdown template with variables, for example:
```markdown
# {project_name}

**Institution:** {institution}
**Lead:** {project_lead}

## Description
{description}
```

There is a prompt pair for each variable:
Extract prompt: Describes precisely which information is to be extracted from the source text. The prompt contains examples of correct and incorrect extractions to give the LLM clear guidance.
Validate prompt: Evaluates the quality of the extraction result based on defined criteria such as uniqueness, plausibility and consistency with the source text.
Prompts can be organised into field groups (base, team, details), which are displayed differently in the profile. In addition, the system supports single-level dependencies: a prompt can use the result of another prompt as context – for example, the department can often only be extracted meaningfully once the institution is known.
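A prompt pair with its field group, threshold and dependency might be configured roughly like this. The shape shown here is an illustrative assumption – the field names, prompt texts and keys are not the system's actual schema:

```python
# Illustrative shape of a prompt-pair configuration; all keys, field names
# and thresholds here are assumptions, not the system's real schema.
prompt_config = {
    "field": "project_lead",
    "group": "team",                # field group shown in the profile
    "depends_on": "institution",    # single-level dependency used as context
    "confidence_threshold": 0.6,
    "extract_prompt": (
        "Extract the name of the project lead from the source text. "
        "Correct: 'Prof. Dr. Jane Doe'. Incorrect: a department name."
    ),
    "validate_prompt": (
        "Rate the extracted project lead for uniqueness, plausibility and "
        "consistency with the source text. Answer with HIGH/MEDIUM/LOW/"
        "INSUFFICIENT, a score from 0.0 to 1.0, and optional comments."
    ),
}
```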
Two-phase extraction with quality assurance#
The extraction process runs in two phases – the central quality feature of the system:
Phase 1 (Extract): An LLM extracts the raw data from the crawled Markdown. The temperature is set low (0.1) to achieve consistent, deterministic results. The prompt is precisely formulated and contains positive and negative examples.
Phase 2 (Validate): A separate LLM call evaluates the quality of the result from phase 1. The validation prompt receives the raw result and the original context. The result is a structured format:
- Quality level: HIGH, MEDIUM, LOW or INSUFFICIENT
- Numerical score: 0.0 to 1.0
- Optional comments: e.g. ‘Too vague’, ‘Contradictory’
- Cleaned result or ‘INSUFFICIENT’
The final confidence is calculated as the minimum of the raw confidence from phase 1 and the validation score from phase 2. Fields below a configurable threshold (default: 0.6, for critical fields such as project names: 0.8) are automatically placed in a review queue for manual verification.
This two-phase architecture proved to be crucial for data quality. The validation phase systematically catches errors that arise in the extraction phase – such as vague wording, mix-ups or hallucinations.
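The confidence logic described above is simple enough to state directly. The threshold values match the defaults given in the text; the function names and the set of critical fields are illustrative assumptions.

```python
# Minimal sketch of the two-phase confidence logic; thresholds are the
# defaults from the text, function names are illustrative.
DEFAULT_THRESHOLD = 0.6
CRITICAL_THRESHOLD = 0.8
CRITICAL_FIELDS = {"project_name"}

def final_confidence(raw_confidence: float, validation_score: float) -> float:
    """Final confidence is the minimum of the phase-1 and phase-2 scores."""
    return min(raw_confidence, validation_score)

def needs_review(field: str, raw: float, validated: float) -> bool:
    """Fields below their threshold go to the manual review queue."""
    threshold = CRITICAL_THRESHOLD if field in CRITICAL_FIELDS else DEFAULT_THRESHOLD
    return final_confidence(raw, validated) < threshold
```

Taking the minimum means a confident extraction cannot mask a poor validation verdict, which is why the validation phase acts as a hard quality gate rather than a soft signal.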
Entity normalisation#
The system automatically normalises entities such as university names. The challenge: LLMs extract institution names inconsistently. ‘TU Berlin’, ‘Technische Universität Berlin’ and ‘TUB’ refer to the same institution, but appear as different entries.
The solution also uses LLM calls. The system manages a database of canonical names with known variants. Each time an entity field is extracted, an LLM call checks whether the extracted text corresponds to a known entity.
The decision logic is confidence-based:
- Confidence > 0.9: Automatic linking
- Confidence 0.6-0.9: Review queue, admin decides
- Confidence < 0.6: Ignore
Once the assignment has been confirmed, the new spelling is saved as a variant so that future extractions are automatically assigned. Entity normalisation is currently implemented for universities and is also planned for locations, people and technologies in the future.
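The three-way decision logic above maps directly to code. This is a sketch of that mapping only; the function name is an assumption, and the thresholds are the ones stated in the text.

```python
# Sketch of the confidence-based linking decision for entity normalisation;
# thresholds are from the text, the function name is illustrative.
def entity_decision(confidence: float) -> str:
    """Map a match confidence to one of the three actions described above."""
    if confidence > 0.9:
        return "auto_link"       # automatic linking
    if confidence >= 0.6:
        return "review_queue"    # admin decides
    return "ignore"
```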
Technical architecture#
Selected stack#
The architecture is based on proven components:
- FastAPI as the backend framework for admin API, public API and WebSocket communication
- PostgreSQL for persistent data storage (sources, prompts, extractions, entities)
- Redis for the job queue with priority handling
- Worker pool for parallel processing (default: 3 workers, 5 parallel LLM calls)
The admin interface uses htmx + Alpine.js – a conscious decision for simplicity. No build step, no complex state management, yet reactive UI with inline editing and live status updates.
The public website is generated as static HTML pages, with client-side search via a JSON index. This enables fast performance without backend dependency for read access.
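Generating such a JSON search index is straightforward. The sketch below is hypothetical – the profile field names (`project_name`, `institution`, `url`) are illustrative assumptions, not the system's actual schema:

```python
import json

# Hypothetical sketch of generating the static site's client-side search
# index; the profile field names are illustrative assumptions.
def build_search_index(profiles: list[dict]) -> str:
    """Reduce each profile to the few fields needed for client-side search."""
    index = [
        {
            "title": p["project_name"],
            "institution": p.get("institution", ""),
            "url": p["url"],
        }
        for p in profiles
    ]
    return json.dumps(index, ensure_ascii=False)
```

Because the index is a plain static file, the browser can search it without any backend round trip – the property the text credits for the fast read performance.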
Robustness pattern#
An LLM-intensive system places special demands on error handling and resilience:
Circuit breaker: In the event of repeated LLM errors (default: 10 consecutive errors), the circuit breaker opens and pauses all requests for a configurable time (default: 5 minutes). After the timeout, a test request is sent; if successful, the breaker closes again. This pattern prevents an overloaded or failed LLM backend from blocking the entire system.
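A minimal circuit breaker following the defaults in the text (10 consecutive errors open it, 5-minute pause, then one test request) could look like this. The class and method names are illustrative, not the system's actual API:

```python
import time

# Minimal circuit-breaker sketch; defaults (10 errors, 300 s) are from the
# text, class and method names are illustrative assumptions.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 10, reset_timeout: float = 300.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        """Requests pass while closed; after the timeout a test request
        is allowed through (half-open state)."""
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # test request succeeded: close the breaker

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open: pause all requests
```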
Retry management: Failed extractions are repeated with exponential backoff (2s, 4s, 8s). After three attempts, the field is marked as ‘failed’ and ends up in the review queue. An automatic retry occurs after one hour, up to a maximum of three times per day.
Parallel processing: The worker pool processes jobs from the Redis queue according to priority. A semaphore limits the number of simultaneous LLM calls to prevent the backend from becoming overloaded. Processing a source with 20 fields and dependencies typically takes 10-60 seconds.
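The interplay of the semaphore limit and the retry backoff can be sketched with `asyncio`. The numbers (5 parallel calls; 2 s, 4 s, 8 s backoff) are from the text; `call_llm` is a placeholder for the real backend call, and the function names are assumptions:

```python
import asyncio

# Sketch of semaphore-limited parallel extraction with exponential backoff
# (2 s, 4 s, 8 s as in the text); `call_llm` stands in for the real backend.
MAX_PARALLEL_LLM_CALLS = 5
BACKOFF_SECONDS = [2, 4, 8]

async def extract_field(field: str, semaphore: asyncio.Semaphore, call_llm) -> str:
    """Run one extraction, retrying with backoff before marking it failed."""
    async with semaphore:
        for delay in BACKOFF_SECONDS:
            try:
                return await call_llm(field)
            except Exception:
                await asyncio.sleep(delay)
        return "failed"  # three attempts exhausted: goes to the review queue

async def process_source(fields: list[str], call_llm) -> dict[str, str]:
    """Extract all fields of a source, at most 5 LLM calls in flight."""
    semaphore = asyncio.Semaphore(MAX_PARALLEL_LLM_CALLS)
    results = await asyncio.gather(
        *(extract_field(f, semaphore, call_llm) for f in fields)
    )
    return dict(zip(fields, results))
```

The semaphore caps concurrency per worker so the LLM backend is never flooded, while `gather` still lets independent fields of one source run in parallel.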
Data model#
The data model is tailored to the requirements of prompt-based extraction:
- Sources: Crawled web pages with Markdown content and status
- Prompts: Extract/validate pairs with field groups and confidence thresholds
- Categories: Profile types with template and active field groups
- Extractions: Results with raw and validated values, confidence scores, entity links
- Entities: Canonical names with variants and metadata
- Job queue: Prioritised processing jobs with retry tracking
Development process with LLM#
The phase model#
The development process followed a strict phase model, which proved to be crucial to the success of the project:
Phase 1 – Functional discussion: Intensive discussions about requirements and possible solutions. Various architectures were discussed to address the shortcomings of the previous system. This phase only ended when clear concepts for all core functions had been defined.
Phase 2 – Architecture discussion: Technical implementation options were discussed and evaluated. Which stack? How will parallelisation be implemented? What will the data model look like? These discussions sharpened the concepts and prevented hasty decisions.
Phase 3 – Rough implementation: With the architecture defined, large parts were implemented in one go. The detailed specification enabled the LLM to generate coherent code.
Phase 4 – Modular refinement: Individual sub-areas were refined in separate sessions with the LLM. Each session had a clear focus (e.g. entity normalisation, review queue) and a partial specification.
Specification depth#
The specification documents created comprise three main documents totalling over 50 pages:
- Functional specification: MVP scope, data flows, quality assurance workflow, admin workflow, success criteria
- Technical specification: Architecture diagrams, complete database schemas with indexes, service classes with method signatures, code examples
- User view specification: page structure, wireframes as ASCII art, JavaScript logic for client-side search, SEO requirements
The specifications contain explicit MVP boundaries: what is included in the scope, what is explicitly not included, and what can be added after validation. This clarity prevented scope creep during implementation.
Dealing with overengineering#
LLMs tend to suggest overly complex solutions – more levels of abstraction, more features, more flexibility than necessary. This was counteracted in several ways:
- Explicit KISS principles in the specifications
- Active questioning of every architecture proposal: ‘Do we really need this?’
- Clear guidelines for recognisable overengineering: ‘Simpler solution preferred’
- Conscious decisions in favour of simpler alternatives (htmx instead of React, rule-based instead of LLM-based link prioritisation)
LLM as an architecture consultant#
Interestingly, some robust patterns originated from LLM suggestions. The circuit breaker, for example, was introduced by the LLM during the architecture discussion when the question of how to deal with LLM backend failures arose.
This highlights an important insight: LLMs can not only generate code, but also contribute architectural best practices from their training corpus. The trick is to critically evaluate these suggestions and only adopt those that are truly useful.
Methodological insights#
Transferable principles#
Specification before implementation: The time invested in detailed specification pays off in the form of more error-free implementation. The clearer and more complete the specifications, the better the code quality. The effort shifts from debugging to planning.
From rough to fine: For large projects, LLM coding works when you progress from the overall architecture to modules to functions. The architecture must be robust before details are implemented. Otherwise, inconsistent modules will be created that will require time-consuming integration later on.
Modular refinement: After the rough implementation, sub-areas are refined in separate sessions. Each session has a clear focus and a partial specification. This keeps the context manageable and enables focused improvements.
Active control: The developer must actively control, question and correct. LLMs are tools, not autonomous developers. All architectures and concepts must be critically discussed – otherwise, hasty decisions will be made.
Limitations of the approach#
Scaling: The approach reaches its limits with very large projects. Context windows are limited, and coherence across many modules requires careful planning. Modularisation helps, but does not completely solve the problem.
LLM response quality: The challenge of managing many LLM calls and the variability of response quality was significant. Two-phase extraction addresses the problem but requires additional complexity.
Debugging: Manual intervention is still necessary for complex errors that arise across multiple modules. LLMs can help with debugging, but the developer retains the overall view.
Workflow change#
The experience with this project has changed the development workflow. The approach now follows modules and functional areas rather than linear development. The specification phase is given significantly more weight, while the pure implementation phase is shorter and less error-prone.
Metrics and results#
Project scope#
The project comprises:
- 95 files with a total of 43,346 lines
- 37 Python files with approx. 22,000 lines (16,000 lines of code)
- 20 HTML templates with approx. 8,700 lines
- 3 JavaScript files for client-side functionality
- SQL schemas, YAML configuration, documentation
Development effort#
- Total duration: approx. 2 months
- Number of sessions: approx. 10
- Main iterations: 3 (fundamental architecture revisions)
- Phases: Specification → Implementation → Deployment → Fine-tuning
Production data#
- Configured prompts: approx. 100 in 4 categories
- Processed sources: approx. 100 (validation phase)
- Processing time per source: 10-60 seconds
Quality results#
The two-phase extraction meets the defined success criteria:
- Over 80% of fields are extracted with confidence >0.7
- The QA layer reliably filters out unreliable extractions
- Entity normalisation achieves high auto-link rates for well-known universities
- According to the developers, the system works ‘better than expected’
The system is currently undergoing intensive evaluation and is proving successful across various categories. The quality assurance layer in particular is proving its worth in practice.
Reflection: What would be done differently?#
In retrospect, some decisions have proven particularly valuable, while other areas show room for improvement.
Proven decisions:
The early decision to use two-phase extraction was the right one. Initially, the double LLM call seemed like overhead, but the improvement in quality justifies the effort. Without the validation phase, the amount of manual rework would be considerably higher.
The choice of htmx + Alpine.js for the admin interface also proved to be a good one. The simplicity accelerated development and reduced sources of error. No complex front-end framework is necessary for an internal tool.
Room for improvement:
Prompt management could be more structured. With 100 prompts in four categories, it becomes difficult to maintain an overview. A versioning system for prompts and better grouping logic would be helpful.
Test coverage could be improved. Development focused on functionality; automated tests were not created systematically. This should be rectified for a productive system.
Conclusion and outlook#
The experiment demonstrates that LLM-supported development also works for large-scale, architecturally demanding projects. The success factors are:
**Resilient architecture:** The overall structure must be clearly defined before implementation. Subsequent architectural changes are costly.
**Detailed specification:** The more precise the specifications, the better the generated code quality. ASCII diagrams, code examples and explicit delimitations help the LLM to generate coherent code.
**Modular approach:** From rough to fine, with separate refinement sessions per module. This keeps the context manageable.
**Active control:** Continuous questioning of LLM suggestions and explicit specifications against overengineering are necessary.
Two-phase extraction with integrated quality assurance has proven to be a robust concept. The validation phase is not just a technical feature, but a central design principle that increases the reliability of the overall system.
For future LLM coding projects, the recommendation is to invest more time in the specification phase and critically discuss architectural decisions before implementation begins. The effort pays off in the form of more efficient and error-free implementation.