LLM-Manager
LLM-Manager is a web-based management interface for vLLM containers running on servers with NVIDIA GPUs. The application covers seven model types — chat language models, vision-language models, embedding models, speech-to-text, real-time speech-to-text, text-to-speech, and image generation — and exposes them behind a unified, OpenAI-compatible API. Container orchestration is handled directly through the Docker API; no Kubernetes cluster or external cloud services are required.
At a glance
- Run a range of multimodal AI models on a single machine, from chat language models through to image generation.
- Use available NVIDIA GPUs without manual hardware configuration; count, type, VRAM, NVLink topology, and MIG status are detected automatically at startup.
- Select models from the Hugging Face Hub, parameterize them, and start or stop them as containers from the web UI.
- Expose models behind a unified, OpenAI-compatible API and protect them with tiered keys appropriate to each use case.
- Track and compare token throughput, latency, GPU utilization, and energy consumption per model.
- Scale a single model name across multiple GPUs — replicas with transparent load balancing appear to clients as one model.
- Restore the previous operating state after a reboot in a single step.
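To illustrate the automatic GPU detection mentioned above, here is a minimal sketch of one way to read GPU count, type, and VRAM at startup. It assumes the `nvidia-smi` command-line tool is on the PATH and uses its CSV query output; how LLM-Manager performs detection internally is not stated here, and the names `Gpu`, `parse_gpu_csv`, and `detect_gpus` are hypothetical. NVLink topology and MIG status are left out of the sketch.

```python
import subprocess
from dataclasses import dataclass

# nvidia-smi query: one CSV line per GPU, memory reported in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


@dataclass(frozen=True)
class Gpu:
    index: int
    name: str
    vram_mib: int


def parse_gpu_csv(text: str) -> list[Gpu]:
    """Parse lines of the form '0, NVIDIA A100-SXM4-80GB, 81920'."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split off the first and last fields so a comma in the name is tolerated.
        index, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        gpus.append(Gpu(int(index), name.strip(), int(mem)))
    return gpus


def detect_gpus() -> list[Gpu]:
    """Run nvidia-smi and return the detected GPUs (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Keeping the parser separate from the subprocess call means the parsing logic can be exercised with a sample string on a machine that has no GPU.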
Highlights
Compared with a manually assembled vLLM setup, LLM-Manager takes over GPU detection, container layout, API routing, and key issuance, and consolidates seven model types behind a unified API. This removes common sources of error — wrong image selection, missing audio libraries, inconsistent endpoints, unprotected direct routes — and makes operation reproducible even on heterogeneous machines.
- Automatic hardware detection — at startup, the count, type, VRAM, NVLink and NVSwitch topology, and MIG status of all NVIDIA GPUs are detected, along with system RAM. The configuration interface color-codes free GPUs and suggests NVLink-compatible tensor-parallel configurations.
- Seven model types, one interface — chat, vision-language, embedding, batch speech-to-text, real-time speech-to-text, text-to-speech, and image generation are managed with type-specific defaults for Docker images, arguments, and API endpoints.
- Two routing paths behind one URL — chat-, vision-, and embedding-capable models are served through LiteLLM at /v1/*; speech-to-text, text-to-speech, and image generation models are served directly at /models/<n>/v1/*. Both paths use the OpenAI API without client-side modifications.
- Three-tier API key system — a master key, an arbitrary number of direct-route keys (for internal applications), and virtual keys (with RPM limits, model restrictions, and spend tracking, persisted in PostgreSQL) can be issued and revoked separately.
- MIG partitioning — individual GPUs can be split into up to seven isolated instances; small models (speech-to-text, text-to-speech, embedding) occupy MIG instances while large models use whole GPUs. The configuration survives reboots.
- Replicas with load balancing — the same model name can run on multiple GPUs; LiteLLM distributes requests using a least-busy strategy and detects partial replica failures automatically. Clients still see a single model name.
- Connections to four external sources and services — Hugging Face Hub for models, Docker Hub and the GitHub Container Registry for container images, and ACME servers (such as Let's Encrypt) for TLS certificates.
- Built-in monitoring — Prometheus, Grafana, and the NVIDIA DCGM Exporter ship as containers; four preconfigured dashboards aggregate token usage, latency percentiles, throughput, queue wait time, KV cache, prefix-cache hit rate, and GPU metrics across model changes.
- Session-level energy tracking — power draw per GPU, token throughput, and efficiency (Wh / 1,000 tokens) are recorded and can be exported as CSV.
- Data-minimizing operation without cloud dependencies — the web interface binds exclusively to 127.0.0.1, loads no Google Fonts, and sends no telemetry. All models, keys, and logs remain on the machine.
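The two routing paths described in the highlights can be sketched as a small client-side helper that picks the base path for a model. This is an illustration only: the function name `base_path` and the model-type strings are hypothetical, and it assumes that `<n>` in the direct route stands for the model's name. The list does not say which path real-time speech-to-text uses, so the sketch deliberately leaves that type unmapped.

```python
# Model types served through LiteLLM at /v1/* (per the routing bullet).
LITELLM_TYPES = {"chat", "vision-language", "embedding"}

# Model types served directly at /models/<n>/v1/* (per the routing bullet).
DIRECT_TYPES = {"speech-to-text", "text-to-speech", "image-generation"}


def base_path(model_type: str, model_name: str) -> str:
    """Return the OpenAI-compatible base path for a model.

    Assumption: <n> in the direct route is the model name.
    """
    if model_type in LITELLM_TYPES:
        return "/v1"
    if model_type in DIRECT_TYPES:
        return f"/models/{model_name}/v1"
    # Types not covered by the routing description are rejected, not guessed.
    raise ValueError(f"no documented route for model type {model_type!r}")
```

A client would append the usual OpenAI endpoint suffix to the returned base path; apart from the base path, the request itself stays the same on both routes.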
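The efficiency metric from the energy-tracking highlight (Wh per 1,000 tokens) comes down to simple arithmetic. The sketch below assumes power is sampled at a fixed interval and that energy is approximated as the sum of the samples times that interval. The sampling scheme and the function name `wh_per_1000_tokens` are assumptions for illustration, not a description of how LLM-Manager records its sessions.

```python
def wh_per_1000_tokens(power_samples_w, interval_s, tokens):
    """Estimate energy efficiency in Wh per 1,000 tokens.

    power_samples_w: GPU power draw readings in watts, taken every interval_s seconds.
    tokens: number of tokens generated during the sampled session.
    """
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    # Watts * seconds = joules (watt-seconds); 3600 watt-seconds = 1 Wh.
    energy_wh = sum(power_samples_w) * interval_s / 3600.0
    return energy_wh / tokens * 1000.0


# Example: a constant 300 W for 60 one-second samples is 5 Wh;
# spread over 10,000 tokens, that is 0.5 Wh per 1,000 tokens.
efficiency = wh_per_1000_tokens([300.0] * 60, 1.0, 10_000)
```

Lower values mean less energy per generated token, so models or configurations can be compared on the same scale regardless of session length.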