LLM-Manager

LLM-Manager is a web-based management interface for vLLM containers running on servers with NVIDIA GPUs. The application covers seven model types — chat language models, vision-language models, embedding models, speech-to-text, real-time speech-to-text, text-to-speech, and image generation — and exposes them behind a unified, OpenAI-compatible API. Container orchestration is handled directly through the Docker API; no Kubernetes cluster or external cloud services are required.

At a glance

Run a range of multimodal AI models on a single machine, from chat language models through to image generation.
Use available NVIDIA GPUs without manual hardware configuration; count, type, VRAM, NVLink topology, and MIG status are detected automatically at startup.
Select models from the Hugging Face Hub, parameterize them, and start or stop them as containers from the web UI.
Expose models behind a unified, OpenAI-compatible API and protect them with tiered keys appropriate to each use case.
Track and compare token throughput, latency, GPU utilization, and energy consumption per model.
Scale a single model name across multiple GPUs — replicas with transparent load balancing appear to clients as one model.
Restore the previous operating state after a reboot in a single step.
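
The energy comparison in the list above comes down to simple arithmetic: LLM-Manager reports efficiency as Wh per 1,000 tokens. The following is a minimal sketch of that calculation, not the application's own code. It assumes GPU power is sampled at a fixed interval; the function name `wh_per_1000_tokens` and its parameters are hypothetical.

```python
def wh_per_1000_tokens(power_samples_w, interval_s, tokens):
    """Return energy efficiency in Wh per 1,000 tokens.

    power_samples_w: GPU power readings in watts (assumed evenly spaced).
    interval_s:      seconds between two readings (assumption of this sketch).
    tokens:          tokens produced during the sampled session.
    """
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    # Each sample stands for `interval_s` seconds at that power level.
    # Watt-seconds (joules) divided by 3600 gives watt-hours.
    energy_wh = sum(power_samples_w) * interval_s / 3600.0
    return energy_wh / (tokens / 1000.0)


# One minute at a steady 300 W while producing 2,500 tokens:
# 300 W * 60 s = 18,000 Ws = 5 Wh, spread over 2.5 thousand tokens.
efficiency = wh_per_1000_tokens([300.0] * 60, interval_s=1.0, tokens=2500)
```

A lower value means less energy per unit of output, which makes sessions with different models or GPU layouts directly comparable.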

Highlights

Compared with a manually assembled vLLM setup, LLM-Manager takes over GPU detection, container layout, API routing, and key issuance, and consolidates seven model types behind a unified API. This removes common sources of error — wrong image selection, missing audio libraries, inconsistent endpoints, unprotected direct routes — and makes operation reproducible even on heterogeneous machines.

Automatic hardware detection — at startup, count, type, VRAM, NVLink and NVSwitch topology, and MIG status of all NVIDIA GPUs are read out, alongside system RAM. The configuration interface marks free GPUs in color and provides hints for NVLink-compatible tensor-parallel configurations.
Seven model types, one interface — chat, vision-language, embedding, batch speech-to-text, real-time speech-to-text, text-to-speech, and image generation are managed with type-specific defaults for Docker images, arguments, and API endpoints.
Two routing paths behind one URL — chat-, vision-, and embedding-capable models are served through LiteLLM at /v1/*; speech-to-text, text-to-speech, and image generation models are served directly at /models/<n>/v1/*. Both paths use the OpenAI API without client-side modifications.
Three-tier API key system — a master key, an arbitrary number of direct-route keys (for internal applications), and virtual keys (with RPM limits, model restrictions, and spend tracking, persisted in PostgreSQL) can be issued and revoked separately.
MIG partitioning — individual GPUs can be split into up to seven isolated instances; small models (speech-to-text, text-to-speech, embedding) occupy MIG instances while large models use whole GPUs. The configuration survives reboots.
Replicas with load balancing — the same model name can run on multiple GPUs; LiteLLM distributes requests using a least-busy strategy and detects partial replica failures automatically. Clients still see a single model name.
Connections to four external sources and services — Hugging Face Hub for models, Docker Hub and the GitHub Container Registry for container images, and ACME servers (such as Let's Encrypt) for TLS certificates.
Built-in monitoring — Prometheus, Grafana, and the NVIDIA DCGM Exporter ship as containers; four preconfigured dashboards aggregate token usage, latency percentiles, throughput, queue wait time, KV cache, prefix-cache hit rate, and GPU metrics across model changes.
Session-level energy tracking — power draw per GPU, token throughput, and efficiency (Wh / 1,000 tokens) are recorded and can be exported as CSV.
Data-minimizing operation without cloud dependencies — the web interface binds exclusively to 127.0.0.1, loads no Google Fonts, and sends no telemetry. All models, keys, and logs remain on the machine.
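
The two routing paths described above can be sketched as a small helper that picks the base URL for a model. This is an illustration under stated assumptions, not part of LLM-Manager: the function name `base_url`, the type labels, the host `https://llm.example.internal`, and the model name `my-tts` are hypothetical, and the `<n>` placeholder is assumed to stand for the model's name. Real-time speech-to-text is not assigned to either path in the text, so the sketch leaves it out.

```python
# Model types served through LiteLLM at /v1/* (as listed above).
LITELLM_TYPES = {"chat", "vision-language", "embedding"}

# Model types served directly at /models/<n>/v1/* (as listed above).
DIRECT_TYPES = {"speech-to-text", "text-to-speech", "image generation"}


def base_url(host, model_type, model_name):
    """Return the OpenAI-style base URL for a model (hypothetical helper)."""
    host = host.rstrip("/")
    if model_type in LITELLM_TYPES:
        # Shared endpoint; the model is chosen by name in the request body.
        return f"{host}/v1"
    if model_type in DIRECT_TYPES:
        # Direct route; assumes <n> is the model's name.
        return f"{host}/models/{model_name}/v1"
    raise ValueError(f"unknown model type: {model_type!r}")


chat_url = base_url("https://llm.example.internal", "chat", "my-chat")
tts_url = base_url("https://llm.example.internal", "text-to-speech", "my-tts")
```

Because both paths speak the OpenAI API, the resulting value can be handed to an OpenAI-compatible client as its base URL, together with a key valid for that route.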