PROMPT-AI User Documentation

Purpose

PROMPT-AI is a web-based information extraction system that automatically extracts structured data from websites and converts it into uniform profiles. The system is aimed at institutions and projects that want to systematically collect and process large amounts of web content.

The basic principle is based on a three-step approach: First, you define a category that describes the type of content you want to capture. Next, you formulate prompts that tell the integrated AI which specific information fields to extract. Finally, you enter the URLs to be analysed. The system crawls the web pages, extracts the desired information using a large language model, and generates structured profiles from it.

Typical application scenarios include mapping research projects, recording service offerings, or systematically documenting institutional activities.

Range of functions

The system offers the following core functions:

  • Web crawling with multi-page support: Automatic reading of web pages, including up to five linked subpages, in order to obtain the most complete information possible.

  • Two-phase information extraction: The AI first extracts the desired information and then validates it in a separate step to ensure the quality of the results.

  • Prompt-based configuration: New information fields can be added through simple prompt definitions without the need for programming.

  • Entity normalisation: The system recognises different spellings of the same entity and reduces them to a canonical form.

  • Review queue: Extractions with low confidence are automatically flagged for manual review.

  • Profile generation: Formatted Markdown profiles are automatically generated from the extracted data according to configurable templates.

  • Static website generation: All published profiles can be exported as a complete static website with a search function.
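The entity normalisation named above can be pictured as a lookup from known spelling variants to one canonical form. The following sketch is purely illustrative — the variant table, the function name, and the example entries are invented and do not reflect PROMPT-AI's internal implementation:

```python
# Illustrative sketch of entity normalisation: map known spelling
# variants (case-insensitive) to one canonical form. The entries
# below are examples, not part of PROMPT-AI.
CANONICAL = {
    "dfg": "Deutsche Forschungsgemeinschaft",
    "deutsche forschungsgemeinschaft": "Deutsche Forschungsgemeinschaft",
}

def normalise(entity: str) -> str:
    # Unknown entities pass through unchanged
    return CANONICAL.get(entity.strip().lower(), entity)

print(normalise("DFG"))
print(normalise("Example Foundation"))
```

In practice you maintain such variants in the entity maintenance area described below, so that "DFG" and "Deutsche Forschungsgemeinschaft" are counted as the same funding body.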

Operation

Overview of the work steps

The typical workflow is divided into five phases: create a category, define prompts, crawl URLs, check the results, and publish the profiles.

Create category

Navigate to Admin UI > Categories > New Category. A category defines the type of content to be captured and the format of the resulting profiles.

You must fill in the following fields:

  • Internal Name: A unique technical identifier in lowercase letters.
  • Display Name: The display name that appears in the user interface.
  • Profile Template: A Markdown template with placeholders in the form $fieldname, which will later be replaced by the extracted values.
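The $fieldname placeholder convention can be illustrated with Python's string.Template, which uses the same $name syntax; note this is only an analogy for how substitution behaves — the template, field names, and values below are invented for illustration:

```python
from string import Template

# Hypothetical profile template using the $fieldname placeholder syntax
template = Template(
    "# $project_name\n\n"
    "**Institution:** $institution\n"
    "**Funding period:** $funding_period\n"
)

# Extracted values, keyed by the prompts' Internal Names
extracted = {
    "project_name": "AI Diagnostics",
    "institution": "Example University",
    "funding_period": "2023-2026",
}

profile = template.substitute(extracted)
print(profile)
```

Every placeholder in the template must have a prompt whose Internal Name matches it exactly; otherwise the corresponding field cannot be filled.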

Define prompts

Navigate to Admin UI > Prompts > New Prompt. A prompt instructs the AI which specific information to extract from the website text.

Each prompt requires:

  • Internal Name: Must match the corresponding placeholder in the profile template exactly.
  • Display Name: Descriptive name for the user interface.
  • Extract Prompt: The instruction to the AI for extraction.
  • Validate Prompt: The instruction to the AI for quality checking.
  • Field Group: Assigns the prompt to a thematic group.
  • Required Confidence: The minimum confidence value at which an extraction is considered sufficient.
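A filled-in prompt definition might look like the following. All values are illustrative, including the confidence format, which may differ in your installation:

```
Internal Name:       funding_body
Display Name:        Funding body
Extract Prompt:      Extract the name of the organisation funding this
                     project. Answer only with the organisation's name.
Validate Prompt:     Does the given value name a funding organisation
                     that is actually mentioned in the text? Answer
                     only with yes or no.
Field Group:         Funding
Required Confidence: 0.8
```

Note how both prompts constrain the expected answer; this keeps the extracted values free of surrounding prose.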

Then assign the prompts you have created to the appropriate category via Admin UI > Categories > [Category] > Manage Prompts.

Crawl URLs

Navigate to Admin UI > Sources > New Source. Enter the URL of the website to be analysed, select the appropriate category and activate multi-page crawling if required.

After starting, the processing goes through the following status transitions:

  1. pending – The request is waiting in the queue
  2. crawled – The website has been read
  3. extracting – The AI is extracting the information
  4. completed – Processing is complete

Processing typically takes 30 seconds to two minutes per source.

Check and correct results

Under Admin UI > Sources > [Your source] > Extractions, you can see all extracted information with the corresponding confidence value and quality rating.

Extractions with low confidence automatically appear in the review queue (Admin UI > Review). There you can edit the values directly. Manually corrected entries automatically receive a confidence rating of 100%.

Publish profile

Once all extractions have been validated, you can generate and publish the profile via Admin UI > Sources > [Your source] > Publish.

Special notes

  • The maximum number of crawled subpages is limited to five.
  • The Markdown size is limited to 200,000 characters.
  • The system respects robots.txt restrictions on the target websites.
  • For optimal results, prompts should be precise and unambiguous.

Application example

Initial situation

A university library wants to systematically record all AI-related research projects at its own university and make them accessible in a web portal. Previously, the research was done manually via faculty websites, which was time-consuming and prone to errors.

Implementation with PROMPT-AI

First, a category researchproject is created with a template containing the project name, lead institution, project lead, funding body, funding period and description.
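The profile template for this category could look like the following; the exact layout is up to you, as long as each placeholder matches a prompt's Internal Name:

```markdown
# $project_name

**Institution:** $institution
**Project lead:** $project_lead
**Funding body:** $funding_body
**Funding period:** $funding_period

$short_description
```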

Six prompts are then defined:

  • project_name: Extracts the official project name
  • institution: Extracts the lead institution
  • project_lead: Extracts the project lead, including academic title
  • funding_body: Extracts the funding body
  • funding_period: Extracts the funding period
  • short_description: Generates a concise summary

The employee now enters the URLs of the project websites, for example https://www.uni-musterstadt.de/ki-diagnostik. The system crawls the main page and linked subpages such as team or publication pages and extracts the defined information.

For three out of ten projects, the funding body appears in the review queue because it was only mentioned as an acronym on the website. The employee adds the full name manually.

Result

After about an hour of work, structured profiles are available for all ten projects. Static website generation creates a searchable portal that can be made available on the library’s web space without any additional server infrastructure.

Recommendations for efficient use

Prompt formulation

  • Formulate prompts clearly and unambiguously with specific instructions.
  • Provide examples of expected formats.
  • Use phrases such as ‘Answer only with…’ to avoid superfluous text elements.
  • Avoid vague instructions such as ‘find something about…’.
  • Limit each prompt to exactly one piece of information.
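The contrast between a vague and a precise prompt, using an invented example:

```
Vague:    Find something about the funding of the project.

Precise:  Extract the name of the organisation funding this project.
          Answer only with the organisation's name, without any
          additional text.
```

The precise version names exactly one piece of information, states the expected format, and forbids surrounding prose.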

Categories and field groups

  • Start with a manageable number of five to ten prompts per category.
  • Organise prompts into logical field groups.
  • Split categories with more than 20 prompts into several categories.

Entity maintenance

  • Create canonical entries for frequently occurring entities.
  • Continuously maintain variants to improve the recognition rate.
  • Use the LLM suggestions for efficient variant recognition.

Quality assurance

  • Check the first extractions of newly created prompts particularly carefully.
  • Adjust prompts in case of systematic errors.
  • Use the re-extract function after prompt optimisations.

System limitations

The system is subject to the following technical and conceptual limitations:

  • No JavaScript processing: Websites that load their content dynamically via JavaScript cannot be fully captured. The system only processes the HTML content initially delivered.

  • Limited crawl depth: A maximum of five subpages are captured per source. Complex website structures may require several separate source entries.

  • No authentication: Content behind login areas or with access restrictions is not accessible.

  • Text-based extraction: The system only extracts text content. Information from images, PDFs or tables is not captured.

  • No real-time updates: Changes to the source websites are not automatically detected. A new crawl must be initiated manually.

  • Language dependency: The extraction quality depends on the match between the prompt language and the website language.

  • No fully automatic categorisation: The assignment of a URL to a category is done manually by the user.

Summary

PROMPT-AI enables the automated extraction of structured information from websites through the use of a large language model. The system combines web crawling, AI-supported information extraction with two-stage quality checking, and entity normalisation into an integrated workflow.

The main strength of the system lies in its flexibility: by defining prompts, you can specify which information should be extracted without any programming knowledge. At the same time, the system requires your active participation in quality assurance. The automated extractions form a working basis that is refined through manual checking and correction to produce reliable results.

PROMPT-AI does not replace the user’s content expertise, but automates the time-consuming data entry work, thus creating space for the qualitative evaluation and curation of the information obtained.