LLM-Manager

LLM-Manager is a web-based management interface for vLLM containers running on servers with NVIDIA GPUs. The application covers seven model types — chat language models, vision-language models, embedding models, speech-to-text, real-time speech-to-text, text-to-speech, and image generation — and exposes them behind a unified, OpenAI-compatible API. Container orchestration is handled directly through the Docker API; no Kubernetes cluster or external cloud services are required.

At a glance

  • Run a range of multimodal AI models on a single machine, from chat language models through to image generation.
  • Use available NVIDIA GPUs without manual hardware configuration; count, type, VRAM, NVLink topology, and MIG status are detected automatically at startup.
  • Select models from the Hugging Face Hub, parameterize them, and start or stop them as containers from the web UI.
  • Expose models behind a unified, OpenAI-compatible API and protect them with tiered keys appropriate to each use case.
  • Track and compare token throughput, latency, GPU utilization, and energy consumption per model.
  • Scale a single model name across multiple GPUs — replicas with transparent load balancing appear to clients as one model.
  • Restore the previous operating state after a reboot in a single step.
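
To compare energy consumption across models, power readings can be normalized by the number of tokens produced. The following is a minimal sketch, not LLM-Manager's implementation: it assumes evenly spaced GPU power samples in watts, and the function name `wh_per_1000_tokens` and its parameters are hypothetical.

```python
from typing import Sequence


def wh_per_1000_tokens(power_watts: Sequence[float],
                       interval_s: float,
                       tokens: int) -> float:
    """Return energy efficiency in watt-hours per 1,000 tokens.

    Assumption: power_watts holds samples taken every interval_s seconds,
    so energy is approximated as sum(samples) * interval (a Riemann sum).
    """
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    energy_joules = sum(power_watts) * interval_s
    energy_wh = energy_joules / 3600.0  # 1 Wh = 3600 J
    return energy_wh * 1000.0 / tokens


# Example: one minute at a constant 300 W while generating 10,000 tokens.
efficiency = wh_per_1000_tokens([300.0] * 60, interval_s=1.0, tokens=10_000)
print(f"{efficiency:.3f} Wh / 1,000 tokens")
```

A lower value means less energy per generated token, which makes models of different sizes comparable on the same hardware.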

Highlights

Compared with a manually assembled vLLM setup, LLM-Manager takes over GPU detection, container layout, API routing, and key issuance, and consolidates seven model types behind a unified API. This removes common sources of error — wrong image selection, missing audio libraries, inconsistent endpoints, unprotected direct routes — and makes operation reproducible even on heterogeneous machines.

  • Automatic hardware detection — at startup, LLM-Manager detects the count, type, VRAM, NVLink and NVSwitch topology, and MIG status of all NVIDIA GPUs, as well as the system RAM. The configuration interface color-codes free GPUs and suggests NVLink-compatible tensor-parallel configurations.
  • Seven model types, one interface — chat, vision-language, embedding, batch speech-to-text, real-time speech-to-text, text-to-speech, and image generation are managed with type-specific defaults for Docker images, arguments, and API endpoints.
  • Two routing paths behind one URL — chat, vision-language, and embedding models are served through LiteLLM at /v1/*; speech-to-text, text-to-speech, and image generation models are served directly at /models/<n>/v1/*. Both paths speak the OpenAI API, so clients need no modifications.
  • Three-tier API key system — the tiers are a master key, any number of direct-route keys (for internal applications), and virtual keys (with RPM limits, model restrictions, and spend tracking, persisted in PostgreSQL). Keys in each tier can be issued and revoked independently.
  • MIG partitioning — individual GPUs can be split into up to seven isolated instances; small models (speech-to-text, text-to-speech, embedding) occupy MIG instances while large models use whole GPUs. The configuration survives reboots.
  • Replicas with load balancing — the same model name can run on multiple GPUs; LiteLLM distributes requests using a least-busy strategy and detects partial replica failures automatically. Clients still see a single model name.
  • Connections to four external sources and services — Hugging Face Hub for models, Docker Hub and the GitHub Container Registry for container images, and ACME servers (such as Let's Encrypt) for TLS certificates.
  • Built-in monitoring — Prometheus, Grafana, and the NVIDIA DCGM Exporter ship as containers; four preconfigured dashboards aggregate token usage, latency percentiles, throughput, queue wait time, KV cache, prefix-cache hit rate, and GPU metrics across model changes.
  • Session-level energy tracking — power draw per GPU, token throughput, and efficiency (Wh / 1,000 tokens) are recorded and can be exported as CSV.
  • Data-minimizing operation without cloud dependencies — the web interface binds exclusively to 127.0.0.1, loads no Google Fonts, and sends no telemetry. All models, keys, and logs remain on the machine.
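
The two routing paths described above can be sketched as a small client-side helper that picks the base path for a model. This is an illustration under stated assumptions, not part of LLM-Manager: the type labels, the function name `base_path`, and the reading of `<n>` as the model name are hypothetical, and real-time speech-to-text is assumed to use the direct route like batch speech-to-text.

```python
from urllib.parse import quote

# Model types served through LiteLLM at /v1/* (per the list above).
LITELLM_TYPES = {"chat", "vision-language", "embedding"}

# Model types served directly at /models/<n>/v1/*.
# Assumption: real-time speech-to-text follows the direct route as well.
DIRECT_TYPES = {
    "speech-to-text",
    "realtime-speech-to-text",
    "text-to-speech",
    "image-generation",
}


def base_path(model_type: str, model_name: str) -> str:
    """Return the OpenAI-compatible base path for a model.

    Assumption: <n> in the direct route stands for the model name.
    """
    if model_type in LITELLM_TYPES:
        return "/v1"
    if model_type in DIRECT_TYPES:
        # Percent-encode the name so it is safe as a single path segment.
        return f"/models/{quote(model_name, safe='')}/v1"
    raise ValueError(f"unknown model type: {model_type!r}")


# Hypothetical model names, for illustration only.
print(base_path("chat", "my-chat-model"))
print(base_path("text-to-speech", "my-tts-model"))
```

An OpenAI-compatible client would then append the usual endpoint (for example `/chat/completions` or `/audio/speech`) to the returned base path and send the appropriate API key.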