llm-serve

Scripts to run a self-hosted, OpenAI-compatible LLM inference server on your local network using vLLM and Docker. Optimized for NVFP4-quantized models on NVIDIA Blackwell GPUs.

Features

OpenAI-compatible API (/v1/chat/completions) — works with any client that supports OpenAI
FP8 KV cache quantization — cuts KV memory ~50% with negligible quality loss
Prefix caching — reuses KV cache across requests sharing a common prefix (big win for agentic workloads with repeated system prompts)
Continuous batching — handles concurrent requests automatically
Daemon mode with auto-restart on crash

Prerequisites

Docker Engine — install guide
NVIDIA Container Toolkit 1.17+ — install guide
NVIDIA GPU — Blackwell (sm_120) or later recommended for NVFP4 models
A HuggingFace account with a read token

Docker daemon config (required)

Docker 25+ enables the containerd image store by default, which has a bug pulling images with zstd:chunked layer compression. Disable it before running install.sh:

echo '{ "features": { "containerd-snapshotter": false } }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker

NVIDIA Container Toolkit — CDI mode

Toolkit 1.17+ defaults to CDI mode. Verify your CDI devices are registered:

nvidia-ctk cdi list
# should show: nvidia.com/gpu=0  nvidia.com/gpu=all

If the list is empty, generate the specs:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Setup

# 1. Pull the image and create .env
./install.sh

# 2. Edit .env — set HF_TOKEN and MODEL at minimum
#    For gated models, accept the license on huggingface.co first
nano .env

# 3. Start
./start.sh          # foreground (use with screen or tmux)
./start.sh -d       # background daemon, auto-restarts on crash

Usage

./status.sh         # container state, health check, model list, recent logs
./stop.sh           # stop and remove the container

The API is available at http://<host-ip>:8000 from any machine on the network.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Gemma-4-31B-IT-NVFP4",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Configuration

All settings live in .env (created from .env.example by install.sh).

Variable	Default	Description
`HF_TOKEN`	—	HuggingFace read token (required)
`HF_CACHE`	`~/.cache/huggingface`	Local path for cached model weights
`MODEL`	—	HuggingFace model ID (required)
`MODEL_ALIAS`	—	Override the model name exposed in the API
`PORT`	`8000`	Host port to expose the API on
`TENSOR_PARALLEL`	`1`	Number of GPUs for tensor parallelism
`GPU_MEM_UTIL`	`0.92`	Fraction of VRAM to use
`MAX_MODEL_LEN`	`262144`	Max context length in tokens (256K)
`KV_CACHE_DTYPE`	`fp8`	KV cache precision (`fp8` or `auto`)
`MAX_NUM_SEQS`	`256`	Max concurrent sequences
`ENABLE_PREFIX_CACHING`	`true`	Cache KV for shared prompt prefixes

Docker image

vllm/vllm-openai:gemma4 is used instead of latest because it is the build explicitly validated for Gemma 4 architecture support (MoE, multimodal, tool use). The latest tag moves forward and may not maintain Gemma 4 compatibility.

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llm-serve

Features

Prerequisites

Docker daemon config (required)

NVIDIA Container Toolkit — CDI mode

Setup

Usage

Configuration

Docker image

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
install.sh		install.sh
start.sh		start.sh
status.sh		status.sh
stop.sh		stop.sh

Folders and files

Latest commit

History

Repository files navigation

llm-serve

Features

Prerequisites

Docker daemon config (required)

NVIDIA Container Toolkit — CDI mode

Setup

Usage

Configuration

Docker image

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages