A localhost control plane that turns a multi-GPU workstation into a set of stable, OpenAI-compatible model endpoints. It launches and supervises local Ollama workers — each pinned to a specific GPU via CUDA_VISIBLE_DEVICES — exposes every worker behind a single OpenAI-compatible proxy, and renders live GPU/VRAM telemetry, health probes, and in-flight requests. Apps change one line — the base_url — and stop caring about the runtime.
Section 02 / Why it exists
Local, multi-GPU inference has no built-in orchestration or observability layer. Once you go past "one model on one GPU," several real problems show up at once.
Runtimes load onto whatever card they please — or shard across cards — wrecking isolation. Pinning must be set at process launch, not retrofitted.
Each worker is just a port. Apps hardcode host/port and reimplement health checks, retries, and model-presence logic.
A model that "fits" can still OOM on a long prompt from KV-cache growth. Without per-GPU accounting you find out by crashing.
The runtime doesn't expose job-level telemetry, so "is this GPU doing work right now, and for whom?" is genuinely hard to answer.
Section 03 / The flow
FastAPI backend on :8765, a Vite + React dashboard on :17320, Ollama as the worker runtime, and nvidia-smi + psutil for telemetry. Single host, localhost-bound.
Section 04 / Live snapshot
Per-GPU utilization, VRAM used/free, power, and temperature — sampled from nvidia-smi and keyed by GPU UUID, with listening PIDs and connection counts from psutil.
— illustrative snapshot · "likely inferencing" and token counts are heuristics, not exact measurements —
Section 05 / Implemented today
Everything here is built and working. Lead differentiators first: per-GPU pinning, the OpenAI-compatible proxy, and live telemetry with guardrails.
Each worker spawns with CUDA_VISIBLE_DEVICES set to its intended GPU UUID — pinned at launch, then verified against observed GPU processes.
Every worker is reachable at /proxy/{id}/v1 — chat, completions, models, plus native Ollama passthrough. Apps integrate by changing only the base_url.
Per-GPU utilization, memory, power, temperature, and compute processes from nvidia-smi; PIDs and connection counts from psutil — assembled into one snapshot.
Load with keep_alive, num_ctx, and an exclusive mode. A pre/post-load guard blocks loads that would spill across a GPU, and a KV-cache estimator projects per-request VRAM cost.
Per-endpoint polling with latency and missing-model detection, plus a warnings engine: endpoint down, unmanaged worker, cross-GPU memory, near-full GPU, pinned-to-wrong-GPU.
Start one or all workers, ensure-running idempotently, restart-pinned, stop by PID. Live request visibility and app registration show who owns each worker.
Section 06 / Roadmap
Honest about the edges. Today it's a localhost, single-user tool — these are the directions that would take it further.
None today — localhost-only with open CORS by design. Authentication and per-user isolation are a planned direction, not a current claim.
The worker runtime is Ollama-specific today. vLLM and llama.cpp-style servers are a roadmap item, not supported yet.
Everything assumes one local host. Networked workers across machines is a future direction.
Tooling and launchers are Windows-first today. Portable packaging for other platforms is on the list.
Section 07 / Gallery
The live control plane. Real screenshots drop into these slots — captured at a consistent window size, with at least two workers up so pinning and routing are visible.