Features
The functionality is organized into hardware and GPU management, model management, routing and API provisioning, key management, quality-assurance functions, and monitoring and energy tracking. All of it is operated through a tab-based web interface.
Use cases
- Chat, vision, and embedding models for research groups. On a server with several NVIDIA GPUs, a larger chat model (spread across multiple GPUs via tensor parallelism) and a smaller embedding model (on a MIG instance) are made available in parallel. Group members access them through virtual keys with individual rate limits and model restrictions.
- Speech recognition and speech synthesis as internal services. Speech-to-text and text-to-speech models run on MIG instances or smaller GPUs and are exposed through the direct endpoints /models/<n>/v1/* to internal applications such as transcription tools or read-aloud features.
- Image generation as an internal service. Image generation models are deployed on a dedicated GPU and bound to an internal application through a direct-route key, without that application participating in the virtual-key system.
- Multi-tenant operation with usage accounting. Several working groups share the same server. A virtual key with RPM limits, model restrictions, and spend tracking is issued per group or person; usage data can be evaluated through the LiteLLM management API.
- Provision within a campus network without distributing keys. In firewall-protected networks, selected chat, vision, and embedding models are made reachable via the open-access route /open/<n>/v1/* with any bearer token. Access control is enforced by the upstream firewall; a kill switch disables all open-access routes immediately.
- Performance comparison before model selection. Before adopting a model for regular operation, TTFT percentiles (P50/P95/P99), end-to-end latency, and throughput are measured directly against the vLLM container, without proxy overhead. Several models can be tested sequentially under identical conditions.
At a glance
- Automatic detection of all NVIDIA GPUs (count, type, VRAM, NVLink, MIG status) and persistence of the MIG configuration across reboots.
- Seven model types with type-specific defaults for Docker image, arguments, routing, and endpoints.
- Two API routing paths (LiteLLM-mediated and direct) behind a shared HTTPS reverse proxy.
- Three-tier API key system (master, direct-route keys, virtual keys) with separate scopes.
- Replicas with transparent load balancing and live status display (e.g. 4/4, 3/4 on partial failure).
- Built-in Prometheus and Grafana monitoring plus a dedicated energy monitor with CSV export.
- Restoration of the previous operating state after a reboot through a single restore procedure.
Hardware and GPU management
At startup, all NVIDIA GPUs are detected and listed in the interface — including VRAM, NVLink and NVSwitch links, and MIG status. Heterogeneous configurations (such as one A100 alongside several H100s) are supported. The MIG section partitions GPUs into isolated instances (profiles from 1g.45gb to 4g.180gb, up to seven instances per GPU); small models run on MIG instances while large models occupy whole GPUs. The MIG configuration is persisted in a dedicated file and restored automatically after a reboot.
A built-in GPU memory calculator estimates VRAM requirements (model weights, KV cache, activations, overhead) from parameter count, quantization, context length, and architecture details, and proposes a tensor-parallel configuration matching the NVLink topology.
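The following is a minimal sketch of how such an estimate could be assembled, not the calculator's actual formula. It assumes a common back-of-envelope model: weights scale with parameter count and bytes per parameter, the KV cache scales with layers, KV heads, head dimension, and context length, and activations plus runtime overhead are lumped into a flat fraction. All names (`ModelShape`, `estimate_vram_gib`, `smallest_tensor_parallel`), the bytes-per-parameter table, and the 10 % overhead are illustrative assumptions. The NVLink topology is ignored here.

```python
from dataclasses import dataclass

# Approximate bytes per parameter (assumption; 4-bit formats ignore scale/zero-point overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "awq": 0.5, "gptq": 0.5}


@dataclass
class ModelShape:
    params_billion: float  # total parameters, in billions
    num_layers: int        # transformer layers
    num_kv_heads: int      # key/value heads (fewer than attention heads with GQA)
    head_dim: int          # dimension per head


def estimate_vram_gib(shape, quantization="bf16", context_length=8192,
                      concurrent_sequences=1, kv_bytes=2.0, overhead_fraction=0.10):
    """Return a rough VRAM breakdown in GiB."""
    weights = shape.params_billion * 1e9 * BYTES_PER_PARAM[quantization]
    # One key and one value vector per layer and KV head for every cached token.
    kv_per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * kv_bytes
    kv_cache = kv_per_token * context_length * concurrent_sequences
    # Activations and runtime overhead are lumped into one flat fraction (assumption).
    total = (weights + kv_cache) * (1 + overhead_fraction)
    gib = 1024 ** 3
    return {
        "weights_gib": weights / gib,
        "kv_cache_gib": kv_cache / gib,
        "total_gib": total / gib,
    }


def smallest_tensor_parallel(total_gib, vram_per_gpu_gib, allowed=(1, 2, 4, 8)):
    """Smallest tensor-parallel degree whose per-GPU share fits, or None."""
    for tp in allowed:
        if total_gib / tp <= vram_per_gpu_gib:
            return tp
    return None
```

For example, `estimate_vram_gib(ModelShape(8, 32, 8, 128))` describes an 8-billion-parameter model with 32 layers and 8 KV heads of dimension 128 at an 8,192-token context; the resulting `total_gib` can then be passed to `smallest_tensor_parallel` together with the VRAM of one GPU.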
Model management
Models are added by Hugging Face ID; model type, GPU or MIG assignment, quantization (auto, fp8, awq, gptq, bitsandbytes), maximum context length, and additional vLLM arguments and environment variables can be set per model. Each model runs in its own Docker container; container logs are viewable directly from the interface. Optionally, a custom Jinja2 chat template, an alternate Docker image, and multiple replicas (1–8) can be configured per model. With replicas > 1, LLM-Manager creates identical containers across the assigned GPUs; validation requires a GPU count divisible by the replica count.
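The replica rule can be illustrated with a short sketch. The divisibility check and the 1 to 8 range come from the description above; the function name `split_gpus_across_replicas` and the choice to give each replica a contiguous slice of the assigned GPUs are assumptions made for illustration, since the actual assignment strategy is not documented here.

```python
def split_gpus_across_replicas(gpu_ids: list[int], replicas: int) -> list[list[int]]:
    """Validate the replica count and split the assigned GPUs evenly."""
    if not 1 <= replicas <= 8:
        raise ValueError("replica count must be between 1 and 8")
    if not gpu_ids:
        raise ValueError("at least one GPU must be assigned")
    if len(gpu_ids) % replicas != 0:
        raise ValueError(
            f"{len(gpu_ids)} GPUs cannot be divided evenly among {replicas} replicas"
        )
    per_replica = len(gpu_ids) // replicas
    # Contiguous slices are an assumption; each slice would back one container.
    return [gpu_ids[i * per_replica:(i + 1) * per_replica] for i in range(replicas)]
```

With four GPUs and two replicas, each replica receives two GPUs; requesting three replicas on four GPUs is rejected.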
A built-in pre-download via the Hugging Face CLI loads models into the local cache without occupying GPUs, so that very large models can be staged outside of actual deployment. A cache overview shows downloaded models with their size and supports selective cleanup.
Connectors and external services
- Hugging Face Hub — source for all models. Access is performed through the official Hugging Face CLI in a temporary container; a Hugging Face token can be configured for gated models. Downloaded models reside in a shared cache and are mounted into all vLLM containers as a volume.
- Docker Hub — source for the official vLLM images, the PostgreSQL image, the Caddy image, the Prometheus image, and the Grafana image.
- GitHub Container Registry (ghcr.io) — source for the LiteLLM images (standard variant and database variant with Prisma).
- ACME server (Let's Encrypt or self-hosted) — source for TLS certificates, which Caddy obtains and renews automatically. Endpoint and email address are configurable.
Routing and API provisioning
Two routing paths cover all model types. Chat, vision, and embedding models are exposed at /v1/chat/completions, /v1/embeddings, and /v1/models through LiteLLM; LiteLLM authenticates via the master key or a virtual key and routes internally to the respective vLLM container. Speech-to-text, text-to-speech, and image generation models are routed directly to the vLLM container at /models/<n>/v1/*; Caddy strips the prefix and verifies the Authorization header through an expression matcher against the master key and all configured direct-route keys.
Optionally, open-access mode can be enabled per chat-, vision-, or embedding-capable model, which exposes a route /open/<n>/v1/* accepting any bearer token. Caddy replaces the token with the actual master key and forwards the request to LiteLLM. A kill switch in the HTTPS section disables all open-access routes simultaneously.
The model table displays the full client URL per model, so that endpoints can be taken directly from the interface.
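As a rough illustration of the routing rules above, the sketch below derives the client base URL for a model from its type. The path patterns (/v1, /models/<n>/v1, /open/<n>/v1) and the reserved names come from this document; the type identifiers, the function name `client_base_urls`, and the host name used in the example are hypothetical.

```python
# Type identifiers are illustrative; the actual internal names are not documented here.
LITELLM_TYPES = {"chat", "vision", "embedding"}
DIRECT_TYPES = {"speech-to-text", "text-to-speech", "image-generation"}
# Model names listed as reserved in this document.
RESERVED_NAMES = {"open", "models", "v1"}


def client_base_urls(host: str, model_name: str, model_type: str,
                     open_access: bool = False) -> dict[str, str]:
    """Return the base URLs under which a model would be reachable."""
    if model_name in RESERVED_NAMES:
        raise ValueError(f"'{model_name}' is a reserved model name")
    if model_type in LITELLM_TYPES:
        # LiteLLM-mediated path: the model is selected in the request body.
        urls = {"litellm": f"https://{host}/v1"}
        if open_access:
            urls["open"] = f"https://{host}/open/{model_name}/v1"
        return urls
    if model_type in DIRECT_TYPES:
        # Direct path: the reverse proxy strips the /models/<n> prefix.
        return {"direct": f"https://{host}/models/{model_name}/v1"}
    raise ValueError(f"unknown model type: {model_type}")
```

A call such as `client_base_urls("llm.example.org", "whisper", "speech-to-text")` yields the direct route only, while a chat model with open access enabled yields both the LiteLLM route and the open route.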
API key management
Three key types with different scopes:
- Master key — applies to all routes, is stored in the configuration file, and is accepted by both LiteLLM and Caddy.
- Direct-route keys: an arbitrary number of keys that apply only to /models/<n>/v1/* and are intended for internal applications. They are validated through a Caddy expression matcher.
- Virtual keys: generated through the LiteLLM management API and persisted in PostgreSQL; each key can carry RPM limits, model restrictions, and spend tracking. PostgreSQL is set up automatically when the first virtual key is created; no separate configuration step is required.
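The scopes of the three key types can be summarized in a small decision function. This is a conceptual sketch only: in the real system the checks are split between Caddy (direct routes, open-access routes) and LiteLLM (virtual keys, looked up in PostgreSQL), whereas here everything is collapsed into one hypothetical function `is_authorized` with in-memory key lists.

```python
import hmac


def _matches_any(token: str, candidates) -> bool:
    """Constant-time comparison against every candidate key."""
    matched = False
    for candidate in candidates:
        if hmac.compare_digest(token.encode(), candidate.encode()):
            matched = True
    return matched


def is_authorized(path: str, token: str, *, master_key: str,
                  direct_keys=(), virtual_keys=(),
                  open_access_enabled: bool = False) -> bool:
    """Decide whether a bearer token may use the given route."""
    if not token:
        return False
    if path.startswith("/open/"):
        # Any bearer token is accepted unless the kill switch is active.
        return open_access_enabled
    if path.startswith("/models/"):
        # Direct routes: master key or one of the direct-route keys.
        return _matches_any(token, [master_key, *direct_keys])
    if path.startswith("/v1/"):
        # LiteLLM routes: master key or a virtual key.
        return _matches_any(token, [master_key, *virtual_keys])
    return False
```

The sketch makes the separation visible: a direct-route key is rejected on /v1/*, and a virtual key is rejected on /models/<n>/v1/*, while the master key passes on both.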
Quality-assurance functions
Several mechanisms safeguard consistency, robustness, and traceability:
- Configuration validation. Replicas require a GPU count divisible by the replica count; reserved model names (open, models, v1) are rejected; MIG changes are only possible when no containers are running on the affected GPU.
- Safe container operations. The PostgreSQL stop procedure never removes the container or the volume; the LiteLLM restart waits three seconds between stop and start and never touches PostgreSQL or Caddy; health checks allow 60 seconds of warm-up time for database migrations.
- Persistence and restoration. Model and MIG configuration are stored in local files; a restore procedure recreates the previous operating state in five steps after a reboot — MIG partitions, PostgreSQL, models, LiteLLM, Caddy. Optionally, the monitoring stack follows as a sixth step.
- Live logs and live status. Container logs are viewable directly from the interface; the status of running models, including replica counts, is displayed in the model table.
- Reproducible image selection. Image selection follows a fixed priority (override, recipe image, vLLM-Omni image, standard vLLM image); recipe images for speech-to-text models are built reproducibly from declared package lists.
- Benchmark function. Performance measurements are taken directly against the vLLM container, without proxy overhead, with configurable request count, concurrency, and prompt length. Each measurement run is preceded by a warm-up request. The results report TTFT percentiles (P50/P95/P99), end-to-end latency, tokens per second, and requests per second.
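To make the reported benchmark figures concrete, the sketch below computes TTFT percentiles and throughput from collected samples. It uses the nearest-rank percentile definition, which is an assumption; the benchmark function may interpolate differently. The names `percentile` and `summarize` are illustrative, and the warm-up request is assumed to have been excluded from the samples beforehand.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p % of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(ttft_seconds: list[float], total_tokens: int, wall_seconds: float) -> dict:
    """Aggregate one benchmark run (warm-up request already excluded)."""
    return {
        "ttft_p50": percentile(ttft_seconds, 50),
        "ttft_p95": percentile(ttft_seconds, 95),
        "ttft_p99": percentile(ttft_seconds, 99),
        "tokens_per_second": total_tokens / wall_seconds,
        "requests_per_second": len(ttft_seconds) / wall_seconds,
    }
```

Passing the per-request TTFT values, the total number of generated tokens, and the wall-clock duration of the run to `summarize` returns the five figures in one dictionary.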
Monitoring and energy tracking
Prometheus, Grafana, and the NVIDIA DCGM Exporter are started, stopped, and configured as additional containers through the interface. Prometheus scrape targets are updated automatically whenever a model is started or stopped; retention is configurable between 30 and 365 days. Grafana is reachable at https://host/grafana/, protected by an IP whitelist, or through an SSH tunnel to port 3000.
Four preconfigured dashboards are provided: an overview dashboard (token usage, throughput, TTFT P95, GPU power, KV cache, requests per second), a per-model detail dashboard (latency percentiles, queue wait time, prefix-cache hit rate), a comparison dashboard (multi-select across models), and an energy and GPU dashboard (GPU power, temperature, VRAM, efficiency in Wh / 1,000 tokens). Labels model_name and model_type on all metrics allow aggregation across model changes.
Independently of the Grafana stack, a lightweight energy monitor measures power draw per GPU (via nvidia-smi, every two seconds) and assigns it to sessions. Session token counts and efficiency metrics can be exported as CSV.
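The arithmetic behind a session's energy figure can be sketched as follows. The two-second sampling interval comes from the description above; treating each sample as constant over its interval (a simple rectangle sum) and expressing efficiency in Wh per 1,000 tokens are assumptions, as are the function names.

```python
def energy_wh(power_samples_w: list[float], interval_s: float = 2.0) -> float:
    """Integrate power samples (watts) taken at a fixed interval into watt-hours."""
    # Each sample is treated as constant over its interval; 3600 converts joules to Wh.
    return sum(power_samples_w) * interval_s / 3600.0


def wh_per_1000_tokens(session_energy_wh: float, tokens: int) -> float:
    """Energy efficiency of a session in Wh per 1,000 tokens."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return session_energy_wh * 1000.0 / tokens
```

A GPU drawing a steady 300 W for one hour produces 1,800 samples and 300 Wh; if 600,000 tokens were generated in that session, the efficiency is 0.5 Wh per 1,000 tokens.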
Configuration and operation
Configuration is held in a single JSON file; the LiteLLM configuration, Caddyfile, Prometheus configuration, and Grafana provisioning files are generated from this central file and updated whenever models or routes change. The web interface binds exclusively to 127.0.0.1:7860; access is from the server itself or via SSH tunnel. External reachability is only available through Caddy on port 443; the admin UI can additionally be protected by an IP whitelist.
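How a derived file might be generated from the central configuration is sketched below for the LiteLLM case. The schema of the central file is not documented here, so the keys `models`, `name`, `type`, `hf_id`, and `backends` are hypothetical, as are the container URLs in the usage note. The output follows LiteLLM's `model_list` layout with `model_name` and `litellm_params`; using the `hosted_vllm/` provider prefix and emitting one entry per replica are assumptions about how the generator could work, not a description of the actual one.

```python
import json

# Only these types are routed through LiteLLM (type identifiers are illustrative).
LITELLM_TYPES = {"chat", "vision", "embedding"}


def build_litellm_config(central: dict) -> dict:
    """Derive a LiteLLM-style model_list from a (hypothetical) central config."""
    model_list = []
    for model in central["models"]:
        if model["type"] not in LITELLM_TYPES:
            continue  # direct-route models bypass LiteLLM entirely
        # One entry per replica; entries sharing a model_name form one load-balanced group.
        for backend in model["backends"]:
            model_list.append({
                "model_name": model["name"],
                "litellm_params": {
                    "model": f"hosted_vllm/{model['hf_id']}",
                    "api_base": f"{backend}/v1",
                },
            })
    return {"model_list": model_list}


def render(config: dict) -> str:
    """Serialize as JSON, which is also valid YAML, to avoid a YAML dependency."""
    return json.dumps(config, indent=2)
```

Given a central configuration with one chat model on two backends (for example `http://vllm-chat-0:8000` and `http://vllm-chat-1:8000`) and one text-to-speech model, the result contains two entries for the chat model and none for the text-to-speech model.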
Import and export formats
- Model import: Hugging Face IDs (public or token-gated); models are loaded into a shared Hugging Face cache and mounted into all containers as a volume.
- Configuration import and export: backup and restore via the configuration directory (e.g. with tar); MIG configuration and model definitions are included. The central config.json is the only file that needs to be retained.
- Energy export: CSV per session with timestamp, GPU power, token count, and efficiency.
- Generated configurations: litellm-config.yaml, Caddyfile, prometheus.yml, and Grafana dashboard and datasource JSONs are regenerated on every change and serve as auditable snapshots of the current routing.
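A backup of the configuration directory, as mentioned in the list above, could be scripted along the following lines with Python's standard `tarfile` module instead of the tar command. The function name and the paths in the usage note are hypothetical; the location of the configuration directory is not specified in this document.

```python
import tarfile
from pathlib import Path


def backup_config_dir(config_dir: Path, archive_path: Path) -> list[str]:
    """Write a gzip-compressed tar archive of config_dir and return its member names."""
    # The archive should live outside config_dir so it does not include itself.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(config_dir, arcname=config_dir.name)
    # Re-open the archive to confirm what was actually written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Calling `backup_config_dir(Path("config"), Path("backup.tar.gz"))` would archive a directory named config in the working directory; the returned list can be checked for config/config.json before the archive is moved to safe storage.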