TypeInfrastructure / Control plane
Year2026
RoleSolo — backend, dashboard, telemetry
Status● Running behind a real consumer (Need Job)
Personal infrastructure · 2026

Inference Lab —
a control plane for local GPUs.

A localhost control plane that turns a multi-GPU workstation into a set of stable, OpenAI-compatible model endpoints. It launches and supervises local Ollama workers — each pinned to a specific GPU via CUDA_VISIBLE_DEVICES — exposes every worker behind a single OpenAI-compatible proxy, and renders live GPU/VRAM telemetry, health probes, and in-flight requests. Apps change one line — the base_url — and stop caring about the runtime.

PythonFastAPIReactTypeScriptOllamanvidia-smiCUDA
● DASHBOARD WALKTHROUGH · 45–75sLOCALHOST · SINGLE HOST
The live control plane.
Start workers → pin to GPUs → load a model → call the proxy · video slot

The problem.

Section 02 / Why it exists

Local, multi-GPU inference has no built-in orchestration or observability layer. Once you go past "one model on one GPU," several real problems show up at once.

Running several local model servers across multiple GPUs sounds simple until you actually do it — the runtime won't pin itself, won't account for VRAM, and won't give your apps a stable address to call.Inference Lab is the layer that owns all of that: predictable placement, a stable endpoint contract, and enough observability to trust what's happening on each card.
Failure 01

Implicit GPU placement.

Runtimes load onto whatever card they please — or shard across cards — wrecking isolation. Pinning must be set at process launch, not retrofitted.

Failure 02

No stable contract.

Each worker is just a port. Apps hardcode host/port and reimplement health checks, retries, and model-presence logic.

Failure 03

Silent VRAM / OOM.

A model that "fits" can still OOM on a long prompt from KV-cache growth. Without per-GPU accounting you find out by crashing.

Failure 04

Thin telemetry.

The runtime doesn't expose job-level telemetry, so "is this GPU doing work right now, and for whom?" is genuinely hard to answer.

How it works, six steps.

Section 03 / The flow

FastAPI backend on :8765, a Vite + React dashboard on :17320, Ollama as the worker runtime, and nvidia-smi + psutil for telemetry. Single host, localhost-bound.

01
Configure
Each worker is a named endpoint with a URL, intended GPU UUID, role, and required models — persisted to local JSON.
→ config
02
Launch & pin
Start ollama serve per worker with the right OLLAMA_HOST + CUDA_VISIBLE_DEVICES, then verify it landed on the intended GPU.
→ pinned
03
Expose
Every worker is mounted behind one OpenAI-compatible proxy. Apps call …/proxy/{id}/v1 and never touch raw ports.
→ /proxy
04
Observe
A sampler reads nvidia-smi (GPU/VRAM/power/temp/processes) and psutil (PIDs, connections); the proxy records every request.
→ snapshot
05
Control
Load/unload models with single-GPU guards, probe health, restart-pinned, register apps — all over HTTP.
→ commands
06
Render
The React dashboard polls the snapshot (~1.5s) and shows endpoint cards, a GPU table, live inference, and warnings.
→ dashboard

Telemetry, per card.

Section 04 / Live snapshot

Per-GPU utilization, VRAM used/free, power, and temperature — sampled from nvidia-smi and keyed by GPU UUID, with listening PIDs and connection counts from psutil.

gpu.snapshot · illustrativeSchematic
GPUWORKER · MODELUTILVRAMPOWERTEMP
GPU 0worker-a · qwen2.5:14b
11.8 / 24G241W64°C
GPU 1worker-b · llama3.1:8b
9.2 / 24G163W57°C

— illustrative snapshot · "likely inferencing" and token counts are heuristics, not exact measurements —

What it does.

Section 05 / Implemented today

Everything here is built and working. Lead differentiators first: per-GPU pinning, the OpenAI-compatible proxy, and live telemetry with guardrails.

01 · Placement

Per-GPU pinning.

Each worker spawns with CUDA_VISIBLE_DEVICES set to its intended GPU UUID — pinned at launch, then verified against observed GPU processes.

02 · Contract

OpenAI-compatible proxy.

Every worker is reachable at /proxy/{id}/v1 — chat, completions, models, plus native Ollama passthrough. Apps integrate by changing only the base_url.

03 · Telemetry

Live GPU + VRAM.

Per-GPU utilization, memory, power, temperature, and compute processes from nvidia-smi; PIDs and connection counts from psutil — assembled into one snapshot.

04 · Safety

Single-GPU guardrails.

Load with keep_alive, num_ctx, and an exclusive mode. A pre/post-load guard blocks loads that would spill across a GPU, and a KV-cache estimator projects per-request VRAM cost.

05 · Health

Probes & warnings.

Per-endpoint polling with latency and missing-model detection, plus a warnings engine: endpoint down, unmanaged worker, cross-GPU memory, near-full GPU, pinned-to-wrong-GPU.

06 · Lifecycle

Start, restart-pinned.

Start one or all workers, ensure-running idempotently, restart-pinned, stop by PID. Live request visibility and app registration show who owns each worker.

What's next.

Section 06 / Roadmap

Honest about the edges. Today it's a localhost, single-user tool — these are the directions that would take it further.

Planned

Auth & multi-user.

None today — localhost-only with open CORS by design. Authentication and per-user isolation are a planned direction, not a current claim.

Planned

Non-Ollama backends.

The worker runtime is Ollama-specific today. vLLM and llama.cpp-style servers are a roadmap item, not supported yet.

Planned

Multi-machine.

Everything assumes one local host. Networked workers across machines is a future direction.

Planned

Cross-platform packaging.

Tooling and launchers are Windows-first today. Portable packaging for other platforms is on the list.

What it is — and isn't
A localhost control plane, not a production platform. Single-user, no authentication, open CORS by design. The worker runtime is Ollama-specific; tooling is Windows-first; the "likely inferencing" signal and token counts are heuristics, not exact measurements. No multi-machine, no cloud, no invented metrics — just a real, working control plane for one workstation.
Its first real consumerInference Lab powers the inference behind Need JobA local-first AI job-application pipeline that scrapes roles, grades fit honestly, and generates tailored documents entirely on local GPUs — driving two pinned workers through Inference Lab in production. See more.