Architecture overview#

Introduction#

AI mapping at universities is a system for recording, analysing and visualising AI activities at German universities. The application lets users submit the URLs of AI projects; it then automatically extracts the relevant information, creates structured profiles and presents them in a clear web interface.

System architecture#

The application is designed as a modern web application with a clear separation between backend and frontend. It follows an API-first approach with static site generation for the presentation of data.

Architecture diagram#

graph TD
    User -->|Submits URL| API[FastAPI backend]
    Admin -->|Manages data| API
    
    API -->|Stores data| DB[(SQLite/PostgreSQL)]
    API -->|Extracts text| Crawler[Web crawler]
    Crawler -->|Processes content| LLM[LLM client]
    
    API -->|Generates profile| LLM
    API -->|Generates static pages| Generator[Site generator]
    
    Generator -->|Generates HTML| Static[Static website]
    
    User -->|Views results| Static
    
    subgraph Backend
        API
        Crawler
        LLM
        Generator
    end
    
    subgraph Database
        DB
    end
    
    subgraph Frontend
        Static
    end

Main components#

1. FastAPI backend#

The backend is implemented with FastAPI and forms the core of the application. It provides RESTful API endpoints for:

  • Submitting URLs
  • Extracting and analysing website content
  • Managing submissions and projects
  • Generating profiles
  • Generating the static website
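
As a rough sketch of the first of these endpoints, the backend has to validate submitted URLs before storing them. The function and endpoint names below are assumptions for illustration, not the actual API:

```python
from urllib.parse import urlparse

def validate_submission_url(url: str) -> str:
    """Validate a submitted project URL; raise ValueError if unusable.

    A simplified stand-in for the checks the FastAPI endpoint would
    perform before persisting a submission.
    """
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        raise ValueError("Only http(s) URLs are accepted")
    if not parsed.netloc:
        raise ValueError("URL has no host")
    return parsed.geturl()

# In the real backend this would back a POST endpoint such as
# /api/submissions (path assumed); here we call it directly:
print(validate_submission_url("https://example-university.de/ki-projekt"))
```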

2. Web crawler & extractor#

This component is responsible for extracting relevant information from submitted web pages:

  • SimpleExtractor: Extracts basic text and metadata from web pages
  • FeatureAwareExtractor: Advanced extraction with a focus on AI-specific content
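
To illustrate what the simpler of the two extractors does, here is a dependency-free sketch using the standard library's HTML parser. The real SimpleExtractor builds on BeautifulSoup/Trafilatura; the class below only demonstrates the idea of pulling out a title and the visible text:

```python
from html.parser import HTMLParser

class MinimalTextExtractor(HTMLParser):
    """Illustrative stand-in for SimpleExtractor: collects the page
    title and visible body text, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text_parts.append(data.strip())

extractor = MinimalTextExtractor()
extractor.feed("<html><head><title>KI-Projekt</title></head>"
               "<body><p>Ein Projekt zur KI-Forschung.</p></body></html>")
print(extractor.title)                  # KI-Projekt
print(" ".join(extractor.text_parts))   # Ein Projekt zur KI-Forschung.
```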

3. LLM integration#

The application uses locally hosted large language models (LLMs) via an OpenAI-compatible client to:

  • Analyse the extracted texts
  • Identify AI-relevant information
  • Generate structured profiles
  • Recognise relationships between different projects
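
The profile-generation step boils down to turning extracted text into a structured-output request for the model. The prompt wording, JSON keys, base URL and model name below are all assumptions used for illustration:

```python
import json

def build_profile_prompt(page_text: str) -> list[dict]:
    """Build chat messages asking the model for a structured profile.

    Prompt text and profile schema are illustrative, not the
    application's actual prompt.
    """
    return [
        {"role": "system",
         "content": "You extract structured profiles of AI projects. "
                    "Answer with JSON containing the keys "
                    "'title', 'summary' and 'ai_topics'."},
        {"role": "user", "content": page_text[:8000]},  # truncate long pages
    ]

# With the OpenAI-compatible client, the call might look like this
# (base_url and model name are placeholders for the local deployment):
#
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
#   resp = client.chat.completions.create(
#       model="local-model",
#       messages=build_profile_prompt(text),
#   )
#   profile = json.loads(resp.choices[0].message.content)

messages = build_profile_prompt("Ein Forschungsprojekt zu maschinellem Lernen.")
print(json.dumps(messages, indent=2)[:60])
```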

4. Database#

The application uses SQLite for development and PostgreSQL for production:

  • Submissions: Stores submitted URLs and their status
  • Projects: Stores approved and structured project data
  • Users: Manages administrator accounts for the backend
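
The three tables can be sketched directly in SQLite, which the application uses in development. Column names are illustrative; the real schema is defined through SQLAlchemy models and Alembic migrations:

```python
import sqlite3

# In-memory database standing in for the development SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (
    id     INTEGER PRIMARY KEY,
    url    TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'  -- e.g. pending/approved/rejected
);
CREATE TABLE projects (
    id            INTEGER PRIMARY KEY,
    submission_id INTEGER REFERENCES submissions(id),
    title         TEXT NOT NULL,
    profile_json  TEXT                      -- structured profile from the LLM
);
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    username TEXT UNIQUE NOT NULL,
    api_key  TEXT NOT NULL                  -- admin authentication
);
""")
conn.execute("INSERT INTO submissions (url) VALUES (?)",
             ("https://example-university.de/ki-projekt",))
status = conn.execute("SELECT status FROM submissions").fetchone()[0]
print(status)  # pending
```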

5. Site generator#

The site generator:

  • Generates static HTML pages from the project data
  • Uses Jinja2 templates for consistent design
  • Creates index, project list and detail pages
  • Generates metadata for search engines
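
In essence, the generator renders each project through a template into an HTML file. The real implementation uses Jinja2; the sketch below substitutes the standard library's `string.Template` to stay dependency-free, and the template text and file naming are assumptions:

```python
from pathlib import Path
from string import Template
import tempfile

DETAIL_TEMPLATE = Template(
    "<html><head><title>$title</title>"
    '<meta name="description" content="$summary"></head>'
    "<body><h1>$title</h1><p>$summary</p></body></html>"
)

def generate_site(projects: list[dict], out_dir: Path) -> list[Path]:
    """Write one static detail page per project and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    for project in projects:
        page = out_dir / f"project-{project['id']}.html"
        page.write_text(DETAIL_TEMPLATE.substitute(project), encoding="utf-8")
        pages.append(page)
    return pages

with tempfile.TemporaryDirectory() as tmp:
    pages = generate_site(
        [{"id": 1, "title": "KI-Projekt", "summary": "Ein Beispielprojekt."}],
        Path(tmp),
    )
    print(pages[0].name)  # project-1.html
```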

6. Frontend#

The frontend consists of:

  • Static HTML pages with CSS and JavaScript
  • User-friendly forms for submitting URLs
  • Visualisations and filters for project data
  • Responsive design for various end devices

Technology stack#

Backend#

  • Python 3.11+: Core programming language
  • FastAPI: Web framework for modern API development
  • SQLAlchemy: ORM for database access
  • Pydantic: Data validation and conversion
  • Alembic: Database migration tool
  • OpenAI client: Communication with local LLMs
  • BeautifulSoup/Trafilatura: HTML parsing and text extraction

Frontend#

  • HTML5/CSS3: Markup and styling
  • JavaScript: Client-side interactivity
  • Jinja2: Template engine for site generation

Database#

  • SQLite: For development and smaller deployments
  • PostgreSQL: For production environments
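
The SQLite/PostgreSQL split is typically driven by a single connection-string setting. A minimal sketch, assuming an environment variable named `DATABASE_URL` and a default SQLite path (both names are assumptions):

```python
import os

def database_url() -> str:
    """Return the SQLAlchemy connection string.

    Falls back to a local SQLite file for development; production
    deployments would set DATABASE_URL to a PostgreSQL DSN, e.g.
    postgresql+psycopg://user:pass@db:5432/ai_map.
    """
    return os.environ.get("DATABASE_URL", "sqlite:///./ai_map.db")

print(database_url())
```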

Deployment#

  • Docker: Containerisation
  • Docker Compose: Container orchestration

Data flow#

  1. User submits the URL of an AI project via the web form
  2. The URL is validated and stored as a submission in the database
  3. The crawler extracts text and metadata from the URL
  4. The FeatureAwareExtractor analyses the content and identifies relevant structures
  5. The LLM client generates a structured profile based on the extracted text
  6. An administrator reviews and approves the submission
  7. The approved project is stored in the database
  8. The site generator creates updated static HTML pages
  9. Users can view and search the projects on the website
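
The life cycle a submission passes through in these steps can be sketched as a small state machine. The status names below are assumptions, not the application's actual values:

```python
# Allowed status transitions for a submission as it moves through the
# pipeline described above (status names are illustrative).
TRANSITIONS = {
    "submitted": {"extracted", "failed"},   # steps 2-3: crawler ran
    "extracted": {"profiled", "failed"},    # steps 4-5: LLM profile created
    "profiled":  {"approved", "rejected"},  # step 6: admin review
    "approved":  {"published"},             # steps 7-8: stored + site rebuilt
}

def advance(status: str, new_status: str) -> str:
    """Move a submission to new_status, enforcing the pipeline order."""
    if new_status not in TRANSITIONS.get(status, set()):
        raise ValueError(f"cannot go from {status!r} to {new_status!r}")
    return new_status

s = "submitted"
for nxt in ("extracted", "profiled", "approved", "published"):
    s = advance(s, nxt)
print(s)  # published
```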

Security and authentication concept#

  • API key-based authentication for admin endpoints
  • Validation and sanitisation of all user input
  • CORS configuration to protect the frontend
  • Server-side CSRF tokens

Scaling concept#

The application is designed for different scaling levels:

  1. Simple deployment: Single server with SQLite for smaller instances
  2. Medium scaling: PostgreSQL database and Docker deployment
  3. Advanced scaling: Distributed crawlers and LLM processing for high request volumes

Because the frontend is delivered as static pages, the website itself can be scaled easily via CDNs or static hosting services.