GPUStack inference service deployment

GPUStack · LLM inference · Cloudflare Tunnel · Docker · NVIDIA

Overview

This turns a single GPU host into an inference service for LLM, VLM, embedding, and rerank models. The GPUStack v2.1.2 server and worker run as Docker Compose containers, exposed through a Cloudflare Tunnel so the host opens no inbound ports. The public interface is OpenAI-compatible, so existing code using the openai SDK connects with almost no changes.

Core features

server and worker as two containers, brought up with a single docker compose up
Cloudflare Tunnel for public access; the host firewall opens no inbound ports
Two hostnames, split by purpose: the API uses a CF Access Service Token, the Web UI uses an email allowlist
OpenAI-compatible endpoint (/v1/chat/completions), usable directly with the openai SDK
Models deployed from the Web UI, sourced from HuggingFace, Ollama, or a local path

Architecture

The server and worker talk over an internal bridge network, and inference containers attach to the worker via host networking, so the outside only ever reaches the Cloudflare layer.

Quick start

Prerequisites (fresh machine only)

sudo bash scripts/prereq.sh

Checks the GPU, installs Docker Engine and the NVIDIA Container Toolkit, and creates the data directories.

Set environment variables

cp .env.example .env

Fill in the admin password, the shared server/worker token, the HuggingFace token, and the Cloudflare Tunnel token.

Configure the Cloudflare Tunnel

In Zero Trust → Tunnels, create a tunnel and point the Public Hostname service URL at the container name http://gpustack:80, not localhost.

Start and verify

docker compose up -d
docker compose logs -f gpustack

Notes

For the API hostname that uses a Service Token, leave the Public Hostname Access field unbound to any Application. If you bind one, cloudflared re-validates at the origin, drops the Service Token JWT, and the CF edge returns 502. The .env file holds passwords and tokens, so keep it out of version control.

In practice

This fits a lab machine or any host without a public IP where you want self-hosted LLM/VLM inference, a programmatic API for other services to call, and no need to open inbound firewall ports or rent cloud GPUs.

Overview

AI & Computer Vision

Self-hosting & Networking

IoT & Cloud

GPUStack inference service deployment

Overview

Core features

Architecture

Quick start

Notes

In practice

Links

​Overview

​Core features

​Architecture

​Quick start

​Notes

​In practice

​Links

Overview

Core features

Architecture

Quick start

Notes

In practice

Links