GGUF Cloud

GGUF Cloud lets you deploy any GGUF (llama.cpp) model on a dedicated, single-tenant GPU. Each deployment becomes its own private API endpoint, backed by llama.cpp (llama-server), and speaks both the OpenAI and Anthropic protocols natively — so you can point the OpenAI SDK, the Anthropic SDK, Claude Code, or any compatible client at it with just a base URL change. Unlike the shared LLM API, a GGUF Cloud deployment runs only your model on your own GPU. There are no neighbors, the model stays loaded, and the endpoint is reachable exclusively with your ModelsLab API key.

How it works

Pick a GGUF model

Choose any GGUF model (from Hugging Face, your own quantization, or one of our presets) at modelslab.com/gguf-cloud.

Deploy to a dedicated GPU

We provision a single-tenant GPU pod running llama-server with your model loaded. Your deployment gets a unique deployment_id.

Call your endpoint

Point the OpenAI SDK, Anthropic SDK, or Claude Code at your deployment’s base URL using your existing ModelsLab API key.

Create a deployment, then find its deployment_id on the dashboard at modelslab.com/gguf-cloud.

Base URL

Every deployment has its own base URL. The {deployment_id} is shown on the deployment’s dashboard page:

https://modelslab.com/api/gguf/{deployment_id}

OpenAI SDKs use the base URL with /v1 appended → https://modelslab.com/api/gguf/{deployment_id}/v1
Anthropic SDKs / Claude Code use the base URL as-is → https://modelslab.com/api/gguf/{deployment_id}

See Authentication for the full details on base URLs and API keys.
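
As a minimal sketch, the two URL forms above can be derived from a deployment_id with a small helper. The function name `gguf_base_urls` and the placeholder id `abc123` are illustrative assumptions, not part of the ModelsLab API.

```python
GGUF_API_ROOT = "https://modelslab.com/api/gguf"  # root of the base URL shown above


def gguf_base_urls(deployment_id: str) -> dict:
    """Return the base URL each SDK family expects (hypothetical helper)."""
    deployment_id = deployment_id.strip().strip("/")
    if not deployment_id:
        raise ValueError("deployment_id must not be empty")
    base = f"{GGUF_API_ROOT}/{deployment_id}"
    return {
        "anthropic": base,       # Anthropic SDK / Claude Code: base URL as-is
        "openai": f"{base}/v1",  # OpenAI SDK: base URL with /v1 appended
    }


urls = gguf_base_urls("abc123")  # "abc123" is a made-up deployment id
print(urls["openai"])
print(urls["anthropic"])
```

Keeping both forms in one place makes it harder to pass the /v1 variant to an Anthropic client, or the bare variant to an OpenAI client, by accident.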

Quickstart

Authenticate with your existing ModelsLab API key. Because the deployment serves a single model, the model field can be "local" (or the model id you deployed) — it’s always routed to your deployment’s model.

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELSLAB_API_KEY",
    base_url="https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1",
)

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)

Python (Anthropic SDK)

from anthropic import Anthropic

client = Anthropic(
    api_key="YOUR_MODELSLAB_API_KEY",
    base_url="https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID",
)

message = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)

print(message.content[0].text)

cURL

curl -X POST "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/chat/completions" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
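
The cURL example covers the OpenAI-style route. For the Anthropic-style route, a raw request might look like the sketch below, which uses only the Python standard library. It assumes the gateway accepts the `x-api-key` header that the Anthropic SDK sends by default; the `anthropic-version` value is the one the SDK commonly sends and may not be required here. The helper name `build_messages_request` is hypothetical.

```python
import json
import os
import urllib.request


def build_messages_request(deployment_id: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the deployment's /v1/messages route."""
    url = f"https://modelslab.com/api/gguf/{deployment_id}/v1/messages"
    body = {
        "model": "local",  # always routed to the deployment's single model
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Content-Type": "application/json",
        "x-api-key": api_key,               # assumption: same header the Anthropic SDK uses
        "anthropic-version": "2023-06-01",  # assumption: may be optional for this gateway
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


api_key = os.environ.get("MODELSLAB_API_KEY", "YOUR_MODELSLAB_API_KEY")
req = build_messages_request("YOUR_DEPLOYMENT_ID", api_key, "Hello!")
print(req.get_method(), req.full_url)

# To actually send it (needs a live deployment and a valid key):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["content"][0]["text"])
```

Building the request separately from sending it keeps the sketch runnable without network access and makes the URL and headers easy to inspect.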

Explore

Authentication

Base URLs, your deployment_id, and the three accepted auth headers.

Chat Completions

OpenAI-compatible /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models.

Messages

Anthropic-compatible /v1/messages. Works with the Anthropic SDK and Claude Code.

Errors

Gateway error reference: 401, 404, 502, and 503, and how to handle each.
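
As a rough illustration of handling these status codes on the client side, the sketch below retries 502 and 503 with exponential backoff and fails fast on 401 and 404. This follows the usual HTTP meanings (transient upstream failure versus a bad key or wrong URL) and is an assumption; the Errors page is the authority on what each code means for this gateway. All names here (`should_retry`, `backoff_delay`, `call_with_retry`) are hypothetical.

```python
import random
import time

RETRYABLE = {502, 503}  # assumed transient, e.g. the upstream server is not ready


def should_retry(status: int) -> bool:
    """401 and 404 are treated as permanent until the key or URL is fixed."""
    return status in RETRYABLE


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; attempt starts at 0."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retry(send, max_attempts: int = 5, sleep=time.sleep):
    """Call `send`, a zero-argument callable returning (status, body)."""
    for attempt in range(max_attempts):
        status, body = send()
        if status < 400:
            return body
        if not should_retry(status) or attempt == max_attempts - 1:
            raise RuntimeError(f"request failed with HTTP {status}")
        sleep(backoff_delay(attempt))
```

Here `send` would wrap whichever client call you use and return the HTTP status alongside the parsed body; passing `sleep` as a parameter keeps the loop easy to test without real delays.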
