Architecture

The architecture follows a three-layer model: a management layer (Python application with web UI), a container orchestration layer (Docker API, shared bridge network), and an inference layer (one vLLM container per model). All long-running components — reverse proxy, API proxy, database, monitoring — run as Docker containers in the same bridge network and are controlled by the management layer through the Docker API. The management layer itself runs on the host outside the container network; inference requests flow exclusively through the reverse proxy.

At a glance

  • Single-file management application in Python with a Gradio UI; bound exclusively to 127.0.0.1:7860.
  • Docker bridge network llm-network as the shared communication layer of all containers, with Docker-internal DNS.
  • Caddy as the central HTTPS reverse proxy for external access; all other containers run without host port mapping.
  • LiteLLM proxy as an OpenAI-compatible mediator for chat, vision, and embedding; direct routes to vLLM containers for speech-to-text, text-to-speech, and image generation.
  • PostgreSQL for virtual keys, with no port exposure and reachable only inside the Docker network.
  • Optional monitoring stack consisting of Prometheus, Grafana, and the NVIDIA DCGM Exporter.
  • A single central JSON file holds the entire state; all further configuration files are generated from it.

Architecture description

The management application runs as a systemd service on the host and communicates with the Docker daemon over the Docker Unix socket. It is the only component outside the container network and holds the global state: detected GPUs, MIG configuration, registered models, API keys, image versions, and HTTPS settings. Whenever this state changes, the configuration files for LiteLLM, Caddy, Prometheus, and Grafana are regenerated, and the affected containers are restarted or reloaded.
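The regeneration step can be pictured as a pure function from the central state to one generated configuration. The following is a minimal sketch for the LiteLLM file only, not the application's actual code: the state field names (models, type, hf_id, container), the container names, the vLLM port 8000, and the openai/ model prefix are assumptions made for illustration.

```python
import json

# Model types that are served through LiteLLM; the other types use direct routes.
LITELLM_TYPES = {"chat", "vision", "embedding"}


def build_litellm_config(state: dict) -> dict:
    """Derive a LiteLLM-style config from the central state (field names are hypothetical)."""
    model_list = []
    for model in state.get("models", []):
        if model["type"] not in LITELLM_TYPES:
            continue
        model_list.append({
            "model_name": model["name"],
            "litellm_params": {
                # Assumption: vLLM is addressed as a generic OpenAI-compatible backend.
                "model": f"openai/{model['hf_id']}",
                # Docker-internal DNS: the container name resolves inside llm-network.
                "api_base": f"http://{model['container']}:8000/v1",
            },
        })
    return {
        "model_list": model_list,
        "router_settings": {"routing_strategy": "least-busy"},
    }


# Hypothetical state with one chat model and one speech-to-text model.
state = {
    "models": [
        {"name": "llama-chat", "type": "chat",
         "hf_id": "meta-llama/Llama-3.1-8B-Instruct", "container": "vllm-llama-chat"},
        {"name": "whisper", "type": "stt",
         "hf_id": "openai/whisper-large-v3", "container": "vllm-whisper"},
    ]
}
config = build_litellm_config(state)
print(json.dumps(config, indent=2))
```

Because JSON is a subset of YAML, the printed output could in principle be written to litellm-config.yaml as it stands; a real implementation would more likely use a YAML library.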

Components and data flow

flowchart TB
    Client[Client / Browser / API]

    subgraph Host[Host system]
        Manager[LLM-Manager<br/>Gradio Web UI<br/>127.0.0.1:7860]
        NVSMI[nvidia-smi]
        Config[(config.json)]
        HFCache[(Hugging Face Cache)]

        subgraph DockerNet[Docker network llm-network]
            Caddy[Caddy<br/>HTTPS Reverse Proxy<br/>:443]
            LiteLLM[LiteLLM Proxy<br/>:4000]
            Postgres[(PostgreSQL<br/>Virtual Keys)]

            subgraph vLLMs[vLLM containers per model]
                vLLM1[Chat / Vision / Embedding]
                vLLM2[Speech-to-Text]
                vLLM3[Text-to-Speech / Image Generation]
            end

            subgraph Monitoring[Monitoring optional]
                Prom[Prometheus]
                Grafana[Grafana]
                DCGM[DCGM Exporter]
            end
        end
    end

    Client -->|HTTPS| Caddy

    Manager -->|Docker API| Caddy
    Manager -->|Docker API| LiteLLM
    Manager -->|Docker API| vLLMs
    Manager -->|nvidia-smi| NVSMI
    Manager <--> Config

    Caddy -->|"/v1/*"| LiteLLM
    Caddy -->|"/models/{name}/*"| vLLM2
    Caddy -->|"/models/{name}/*"| vLLM3
    Caddy -->|"/open/{name}/*"| LiteLLM
    Caddy -->|"/grafana/*"| Grafana

    LiteLLM --> vLLM1
    LiteLLM --> Postgres

    vLLM1 -.-> HFCache
    vLLM2 -.-> HFCache
    vLLM3 -.-> HFCache

    Prom --> vLLM1
    Prom --> vLLM2
    Prom --> vLLM3
    Prom --> LiteLLM
    Prom --> DCGM
    Grafana --> Prom

Diagram explanation

Incoming requests arrive exclusively at Caddy. Caddy distinguishes three routing paths: requests to /v1/* are forwarded to LiteLLM; LiteLLM authenticates via the master key or a virtual key (against PostgreSQL) and forwards the request to the appropriate vLLM container, using a least-busy strategy when multiple replicas are present. Requests to /models/<n>/v1/* are routed directly to the named vLLM container; Caddy strips the prefix and verifies the Authorization header through an expression matcher against the master key and all configured direct-route keys. Requests to /open/<n>/v1/* (when enabled) accept any non-empty bearer token, replace it with the master key, and forward the request to LiteLLM.
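The three routing paths can be modelled in a few lines of Python. This is only a sketch of the decision logic described above; the real routing lives in the generated Caddyfile. The function name, the return shape, the example keys, and the assumption that the /open prefix is stripped in the same way as the /models prefix are all illustrative.

```python
import re

MODEL_ROUTE = re.compile(r"^/models/([^/]+)(/v1/.*)$")
OPEN_ROUTE = re.compile(r"^/open/([^/]+)(/v1/.*)$")


def route(path, authorization, master_key, direct_keys, open_enabled=False):
    """Return (upstream, forwarded_path, forwarded_authorization), or None to reject."""
    token = ""
    if authorization.startswith("Bearer "):
        token = authorization[len("Bearer "):].strip()

    if path.startswith("/v1/"):
        # LiteLLM authenticates the request itself (master key or virtual key).
        return ("litellm", path, authorization)

    match = MODEL_ROUTE.match(path)
    if match:
        # Direct route: the proxy checks the key, then strips the /models/<name> prefix.
        if token and token in {master_key, *direct_keys}:
            return (match.group(1), match.group(2), authorization)
        return None

    match = OPEN_ROUTE.match(path)
    if match and open_enabled and token:
        # Open route: any non-empty bearer token is replaced with the master key.
        return ("litellm", match.group(2), f"Bearer {master_key}")

    return None


print(route("/models/whisper/v1/audio/transcriptions", "Bearer sk-direct",
            "sk-master", ["sk-direct"]))
```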

The management application itself is not part of the inference data flow. It reads GPU information via nvidia-smi, controls the container lifecycle through the Docker API, communicates with the LiteLLM management API for key generation, and writes all configuration files from the central JSON state. The Hugging Face cache is a shared volume mounted into all vLLM containers, so that model files are stored only once.
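Reading GPU information via nvidia-smi typically means running it in CSV query mode and parsing the output. The sketch below shows one way to do that; the chosen query fields, the function names, and the sample output are assumptions, and the application may query different properties.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpus(output: str) -> list:
    """Parse the CSV output of the query above into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(output), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, name, memory_mib = row
        gpus.append({
            "index": int(index),
            "name": name.strip(),
            "memory_mib": int(memory_mib),
        })
    return gpus


def discover_gpus() -> list:
    """Run nvidia-smi on the host; raises if the tool is missing or fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(result.stdout)


# Illustrative sample output, so the parser can be tried without a GPU.
sample = "0, NVIDIA A100-SXM4-80GB, 81920\n1, NVIDIA A100-SXM4-80GB, 81920\n"
gpus = parse_gpus(sample)
print(gpus)
```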

Container layout

All containers run in the bridge network llm-network and communicate through Docker-internal DNS. Only Caddy (ports 443 and 80) and the web UI (port 7860, exclusively 127.0.0.1) are reachable on the host; PostgreSQL, LiteLLM, all vLLM containers, and the monitoring stack are configured without host port mapping. Containers are tagged with labels (managed-by, model-name, model-type) so that containers maintained by the management application can be distinguished from others.
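With the Docker SDK for Python, this layout comes down to the arguments passed when a container is created: the shared network, the labels, and the absence of any port mapping. The sketch below builds such arguments for a vLLM container. The label value llm-manager, the vllm- name prefix, and the cache mount path inside the container are assumptions; GPU assignment and the vLLM command line are omitted.

```python
MANAGED_BY = "llm-manager"  # hypothetical label value


def vllm_run_kwargs(model_name: str, model_type: str, image: str, hf_cache_dir: str) -> dict:
    """Keyword arguments for docker.from_env().containers.run(...)."""
    return {
        "image": image,
        "name": f"vllm-{model_name}",
        "detach": True,
        "network": "llm-network",
        "labels": {
            "managed-by": MANAGED_BY,
            "model-name": model_name,
            "model-type": model_type,
        },
        # Shared Hugging Face cache, so model files are stored only once.
        "volumes": {hf_cache_dir: {"bind": "/root/.cache/huggingface", "mode": "rw"}},
        # Deliberately no "ports" entry: the container is reachable only inside llm-network.
    }


def managed_filter() -> dict:
    """Filter for client.containers.list(all=True, filters=...) that matches managed containers."""
    return {"label": f"managed-by={MANAGED_BY}"}


def start_container(kwargs: dict):
    """Start the container; the Docker SDK is imported lazily so the helpers above work without it."""
    import docker

    client = docker.from_env()
    return client.containers.run(**kwargs)


kwargs = vllm_run_kwargs("llama-chat", "chat", "vllm/vllm-openai:latest", "/srv/hf-cache")
print(kwargs["labels"], managed_filter())
```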

Image selection per model follows a fixed priority: an optional model-level override takes effect first, followed by a recipe image (such as for speech-to-text with Whisper or audio dependencies), then the vLLM-Omni image (for text-to-speech and image generation), and finally the standard vLLM image (chat, vision, embedding). Recipe images can be built directly from the interface; the corresponding packages are stored as recipes.
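The priority order can be written as a short chain of checks. In this sketch the image tags, the model type names, and the function signature are placeholders rather than the values the application uses.

```python
from typing import Optional

# Placeholder image tags; the real tags come from the application's image settings.
VLLM_IMAGE = "vllm/vllm-openai:latest"
OMNI_IMAGE = "vllm-omni:latest"

OMNI_TYPES = {"tts", "image-generation"}  # hypothetical type names


def select_image(model_type: str,
                 override: Optional[str] = None,
                 recipe_image: Optional[str] = None) -> str:
    """Pick the container image for a model, highest priority first."""
    if override:                  # 1. model-level override
        return override
    if recipe_image:              # 2. recipe image (for example Whisper with audio dependencies)
        return recipe_image
    if model_type in OMNI_TYPES:  # 3. vLLM-Omni for text-to-speech and image generation
        return OMNI_IMAGE
    return VLLM_IMAGE             # 4. standard vLLM image for chat, vision, embedding


print(select_image("stt", recipe_image="recipe-whisper:latest"))
print(select_image("tts"))
print(select_image("chat"))
```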

Management application layers

The Python application is internally organized into several layers:

  • Hardware layer — GPU discovery, NVLink and MIG detection, MIG management, and MIG persistence.
  • Data model layer — model and application configuration as typed dataclasses with migration paths for older configurations.
  • Orchestration layer — container lifecycle (create, start, stop, logs), image selection, and image build logic for recipe images.
  • Configuration layer — persistence and generation of the LiteLLM, Caddy, and Prometheus configurations from the central JSON state.
  • Monitoring layer — energy sampling via nvidia-smi, session tracking, and CSV export.
  • UI layer — Gradio interface with tab-based structure, real-time updates, and event handlers.
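As an illustration of the data model layer, the following sketch shows a typed dataclass with a migration path for an older configuration format. The field names, the version numbers, and the specific migration (a single gpu entry becoming a gpu_ids list) are invented for the example and do not describe the application's real schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ModelConfig:
    name: str
    model_type: str = "chat"
    gpu_ids: List[int] = field(default_factory=list)
    status: str = "stopped"


def migrate(raw: dict) -> dict:
    """Bring an older raw dict up to the current layout (hypothetical v1 -> v2 step)."""
    data = dict(raw)
    version = data.pop("version", 1)
    if version < 2 and "gpu" in data:
        # v1 stored a single GPU index; v2 stores a list of indices.
        data["gpu_ids"] = [data.pop("gpu")]
    return data


def load_model(raw: dict) -> ModelConfig:
    """Migrate, drop unknown keys, and construct the typed object."""
    data = migrate(raw)
    known = {f.name for f in fields(ModelConfig)}
    return ModelConfig(**{k: v for k, v in data.items() if k in known})


old = load_model({"name": "llama-chat", "gpu": 0})
new = load_model({"version": 2, "name": "whisper", "model_type": "stt", "gpu_ids": [1]})
print(old)
print(new)
```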

Concurrency and robustness

Safety-critical operations are designed so that they cannot leave the system in an inconsistent state. PostgreSQL containers are stopped but never removed, and the corresponding volume is never deleted. LiteLLM restarts wait a fixed period between stop and start and do not touch PostgreSQL or Caddy. On the first PostgreSQL start, database migrations are protected by a health check with generous warm-up time.

Restoration after a reboot runs as a five-step procedure: restoring MIG partitions, starting PostgreSQL, starting all models with status "running", starting LiteLLM, and finally updating the Caddy configuration. If the monitoring stack is active, it follows as a sixth step.
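The ordering can be made explicit as a plan derived from the saved state. This is a sketch only: the step identifiers and the state fields (models, status, monitoring_enabled) are hypothetical, and each step would call into the hardware or orchestration layer in a real implementation.

```python
def restore_plan(state: dict) -> list:
    """Return the ordered restoration steps for the given state."""
    running = [m["name"] for m in state.get("models", []) if m.get("status") == "running"]
    plan = [
        "restore-mig",                               # 1. restore MIG partitions
        "start-postgres",                            # 2. start PostgreSQL
        *[f"start-model:{name}" for name in running],  # 3. models marked "running"
        "start-litellm",                             # 4. start LiteLLM
        "update-caddy",                              # 5. update the Caddy configuration
    ]
    if state.get("monitoring_enabled"):
        plan.append("start-monitoring")              # 6. optional monitoring stack
    return plan


state = {
    "models": [
        {"name": "llama-chat", "status": "running"},
        {"name": "whisper", "status": "stopped"},
    ],
    "monitoring_enabled": True,
}
plan = restore_plan(state)
print(plan)
```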

Configuration and deployment

The application is operated as a systemd service; its configuration resides in a fixed directory on the host, which can be overridden via an environment variable. Whenever the central config.json changes, litellm-config.yaml, Caddyfile, prometheus.yml, and the Grafana provisioning files are regenerated from it. Sensitive data, namely config.json with its keys and the database password, is stored with restricted file permissions. The DATABASE_URL is passed only as a container environment variable and never written to configuration files.
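A minimal sketch of how such a file could be written on a POSIX host follows. The environment variable name LLM_MANAGER_CONFIG_DIR, the default directory, and the mode 0o600 are assumptions; the source only states that the directory can be overridden and that permissions are restricted.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def config_dir() -> Path:
    """Resolve the configuration directory (variable name and default are hypothetical)."""
    return Path(os.environ.get("LLM_MANAGER_CONFIG_DIR", "/etc/llm-manager"))


def write_private_json(path: Path, data: dict) -> None:
    """Write JSON so that only the owning user can read it, replacing the file atomically."""
    tmp = path.with_name(path.name + ".tmp")
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2)
    os.chmod(tmp, 0o600)   # enforce the mode even if a stale temp file already existed
    os.replace(tmp, path)  # atomic rename on the same filesystem


# Demonstration in a temporary directory instead of the real configuration directory.
with tempfile.TemporaryDirectory() as directory:
    target = Path(directory) / "config.json"
    write_private_json(target, {"master_key": "sk-example"})
    mode = stat.S_IMODE(target.stat().st_mode)
    content = json.loads(target.read_text(encoding="utf-8"))

print(oct(mode), content)
```

Writing to a temporary file and renaming it avoids leaving a half-written config.json behind if the process is interrupted.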

Technology overview

Layer                      Component
Web UI                     Gradio
Programming language       Python
Container orchestration    Docker, Docker SDK for Python, NVIDIA Container Toolkit
Inference backends         vLLM, vLLM-Omni
API proxy                  LiteLLM
HTTPS reverse proxy        Caddy
Database                   PostgreSQL
Monitoring                 Prometheus, Grafana, NVIDIA DCGM Exporter
Model retrieval            Hugging Face Hub (Hugging Face CLI)
Configuration formats      JSON, YAML
Data processing            pandas
Deployment                 systemd, Docker bridge network