# LLM coding experiment: A generic information extraction system
Part of a series on methodological findings from LLM-supported development projects
## Initial situation and objectives
As part of a series of experiments on LLM-supported software development, a generic system for extracting information from websites was created. The project was not purely a learning experiment, but addressed a specific problem: an existing mapping system for AI initiatives at German universities was too inflexible to generate different types of profiles. Each new requirement – whether AI guidelines, services or research projects – would have required individual programming effort.
The central requirement was therefore a system that would enable different types of mapping, configurable by prompts rather than code. Flexibility was to be achieved not through programming, but through structured configuration.
## What the system does
The developed system crawls websites, extracts structured information and generates profiles from it. At its core is a prompt-based configuration concept: instead of writing code, a profile is defined as a Markdown template with variables. Each variable has an extraction prompt that describes what information is sought and a quality-assurance prompt that evaluates the result.
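As a rough illustration of this configuration concept, such a profile might be expressed as a Markdown template plus a prompt pair per variable. The field names, prompt wording and data layout below are assumptions for the sketch; the article does not show the project's actual format.

```python
# Hypothetical sketch of a prompt-configured profile. All names and the
# structure of the prompt table are illustrative assumptions.

PROFILE_TEMPLATE = """\
# {university_name}

**Contact:** {contact_email}
"""

FIELD_PROMPTS = {
    "university_name": {
        "extraction": (
            "Extract the official name of the university from the text. "
            "Correct: 'Technische Universität Berlin'. Incorrect: 'the TU'."
        ),
        "quality_assurance": (
            "Rate from 0 to 1 how unambiguous and plausible the extracted "
            "name is, given the source text."
        ),
    },
    "contact_email": {
        "extraction": "Extract the contact e-mail address for AI matters.",
        "quality_assurance": (
            "Rate from 0 to 1 whether this is a valid, relevant address."
        ),
    },
}

def render_profile(values: dict) -> str:
    """Fill the Markdown template with extracted field values."""
    return PROFILE_TEMPLATE.format(**values)
```

Adding a new profile type then means writing a new template and prompt table, not new code.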
The extraction process takes place in two phases. In the first phase, an LLM extracts the raw data from the crawled Markdown; the prompt contains precise instructions and examples of correct and incorrect extractions. In the second phase, a separate LLM call evaluates the quality of the result, checking criteria such as unambiguity, plausibility and consistency with the source text. The result is a confidence score between 0 and 1.
Fields below a configurable threshold automatically end up in a review queue for manual checking. This ensures that only high-quality extractions are included in the profiles.
In addition, the system normalises entities such as university names. The variants ‘TU Berlin’, ‘Technische Universität Berlin’ and ‘TUB’ are automatically mapped to a canonical name. This enables comprehensive searches and consistent links.
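A simple way to picture this normalisation is a lookup from case-folded variants to one canonical name. The alias table and pass-through behaviour below are assumptions, not the project's actual implementation.

```python
# Illustrative sketch of entity normalisation. The alias table and the
# lookup strategy are assumptions for this example.

CANONICAL_NAMES = {
    "tu berlin": "Technische Universität Berlin",
    "technische universität berlin": "Technische Universität Berlin",
    "tub": "Technische Universität Berlin",
}

def normalise_university(name: str) -> str:
    """Map a name variant to its canonical form; pass unknowns through."""
    return CANONICAL_NAMES.get(name.strip().lower(), name.strip())
```

Because every variant resolves to the same canonical string, searches and cross-profile links stay consistent.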
## The development process
The development process followed a strict phase model. First, functionalities were discussed intensively until clear solution concepts were defined. This was followed by technical architecture discussions in which various implementation variants were evaluated. Only when the overall architecture was in place did the actual implementation begin.
The specification documents produced are remarkably detailed: ASCII diagrams of the architecture, complete database schemas, code examples for central services and clear MVP boundaries. This level of detail meant that code generation by the LLM was largely error-free.
## Methodological findings
**From rough to fine:** For large projects, LLM coding works when you progress from the overall architecture to modules to functions. Large parts were first developed as an overall specification and then refined in separate sessions on a modular basis. The architecture must be robust before details are implemented.
**Architecture discussion as protection against overengineering:** LLMs tend to suggest overly complex solutions. Intensive discussion of architectural decisions and explicit guidelines favouring simplicity counteracted this. Choosing htmx + Alpine.js over an SPA framework was a conscious decision in favour of simplicity.
**LLM-initiated patterns:** Some robust architectural patterns, such as the circuit breaker for the LLM backend, emerged as LLM suggestions during the discussion. This shows that LLMs can not only generate code but also contribute architectural best practices that you might not immediately have on your radar.
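For readers unfamiliar with the pattern, a circuit breaker in front of an LLM backend stops hammering a failing service: after repeated failures it "opens" and rejects calls until a cool-down elapses. The sketch below is a generic textbook version with assumed threshold and timeout values, not the project's actual code.

```python
# Generic circuit-breaker sketch. Threshold, timeout and the half-open
# behaviour are illustrative assumptions.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of calling the broken backend.
                raise RuntimeError("circuit open: LLM backend unavailable")
            # Cool-down elapsed: half-open, allow one trial call.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Wrapping every backend call in `breaker.call(...)` keeps a flaky LLM service from stalling the whole extraction pipeline.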
## Scope and results
The project comprises 95 files with a total of 43,000 lines, including 37 Python files with approximately 22,000 lines and 20 HTML templates. Development was spread over approximately 10 sessions over two months, with three major architecture iterations.
Currently, about 100 prompts are configured in four categories. The system has processed about 100 sources so far, with each source taking between 10 and 60 seconds depending on its complexity.
The two-phase extraction achieves the desired quality targets: over 80 per cent of the fields are extracted with sufficient confidence. According to the developers, the system works ‘better than expected’. The QA layer in particular has proven its worth – it reliably filters out unreliable extractions.
## Conclusion
The experiment shows that LLM-supported development also works for large-scale projects if three conditions are met: a robust architecture before implementation begins, a modular approach from rough to fine, and active control against overengineering. The time invested in specification and architecture discussion pays off in the form of largely error-free implementation.