From rigid mapping tools to flexible configuration: a system for information extraction#
We have developed a system that extracts structured information from websites and is configured through prompts instead of code. This solves a common problem: previously, each new requirement had to be programmed individually.
What was the challenge?#
An existing mapping system for AI initiatives at German universities was too inflexible. Each new type of profile, whether for services, handouts, or research projects, would have required separate programming.
We wanted a system that could support different types of mapping, with the flexibility coming from structured configuration rather than programming.
What can the system do?#
- Crawl websites and extract information (from university pages, project documentation, service descriptions)
- Configure profiles via prompt (Markdown templates with variables instead of hard-coded fields)
- Two-phase quality assurance (first extraction, then automatic evaluation with confidence score)
- Automatic review routing (uncertain results are placed in a queue for manual review)
- Normalise entities (e.g. ‘TU Berlin’, ‘Technische Universität Berlin’ and ‘TUB’ are standardised to one name)
This creates a consistent database without manual rework.
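The two-phase quality assurance, confidence-based routing, and entity normalisation described above can be sketched roughly as follows. This is an illustrative sketch only: the alias table, the 0.8 threshold, and all names are assumptions for this example, not the system's actual code.

```python
from dataclasses import dataclass

# Hypothetical alias table; the real normalisation data is not shown in the article.
ALIASES = {
    "tu berlin": "Technische Universität Berlin",
    "technische universität berlin": "Technische Universität Berlin",
    "tub": "Technische Universität Berlin",
}

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off for automatic acceptance


def normalise_entity(name: str) -> str:
    """Map known aliases to one canonical name; pass unknown names through."""
    return ALIASES.get(name.strip().lower(), name.strip())


@dataclass
class ExtractionResult:
    field_name: str
    value: str
    confidence: float  # score assigned in the second (evaluation) phase


def route(results):
    """Split extraction results into auto-accepted fields and a manual-review queue."""
    accepted, review_queue = [], []
    for r in results:
        r.value = normalise_entity(r.value)
        (accepted if r.confidence >= CONFIDENCE_THRESHOLD else review_queue).append(r)
    return accepted, review_queue
```

Under these assumptions, an extraction of 'TUB' with confidence 0.93 would be normalised to 'Technische Universität Berlin' and accepted automatically, while a field scored 0.41 would land in the manual-review queue.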
How did we develop it?#
We followed a strict phased approach:
- Phase 1 (several sessions): Functionality discussed until clear solution concepts emerged
- Phase 2 (parallel): Technical architecture evaluated and implementation variants compared
- Phase 3 (after architecture approval): Implementation with LLM support
Total effort: Approximately 10 sessions over two months, with 3 major architecture iterations
Result: 95 files with 43,000 lines of code, including 37 Python files with approximately 22,000 lines
Why did it go so well?#
Because we invested a lot of time in detailed specifications.
The documents we created contained ASCII diagrams for architectures, complete database schemas, and code examples for key services. This level of detail meant that the code generation by the LLM was largely error-free.
Initial tests show that the system is performing better than expected. Over 80 per cent of the fields are extracted with sufficient confidence. The quality assurance layer in particular is proving its worth – it reliably filters out unreliable extractions.
Key findings#
- Work from coarse to fine
For large-scale projects, LLM-supported development works when you progress from the overall architecture to modules to functions. We first developed large parts as an overall specification, then refined them modularly in separate sessions.
- Architecture discussions protect against overengineering
LLMs tend to suggest overly complex solutions. Intensive discussions and explicit guidelines for simplicity counteracted this. The choice of htmx + Alpine.js instead of an SPA framework was a conscious decision in favour of simplicity.
- LLMs introduce architectural best practices
Some robust patterns, such as the circuit breaker (a protection mechanism against backend overload), emerged as LLM suggestions. This shows that LLMs can not only generate code, but also introduce patterns that you might not immediately have on your radar.
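As a reference point for that last finding, a minimal circuit breaker can be sketched in a few lines of Python. This is a generic illustration of the pattern, not the system's actual implementation; the class name, thresholds, and timeout are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive errors the
    circuit 'opens' and calls fail fast until reset_timeout seconds pass."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened, or None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hammering an already struggling backend.
                raise RuntimeError("circuit open: backend temporarily blocked")
            # Timeout elapsed: half-open, allow one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure counter
        return result
```

Wrapping backend calls this way means that a failing service degrades into fast, explicit errors rather than a pile-up of hanging requests.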
What can others learn from this?#
- The architecture must be sound before details are implemented.
- Detailed specifications pay off through error-free implementation.
- Actively counteract overengineering; LLMs tend to suggest overly complex solutions.
- Take LLM suggestions for architecture patterns seriously and review them.
Conclusion#
✔ LLM-supported development also works for large projects, provided the architecture is clear in advance
✔ Investing time in specification pays off through error-free implementation
✔ A modular approach from coarse to fine is key
This is part of a series on insights from LLM-supported development projects. The focus is on what can be learned from such projects – not just on the technical results.