llama.cpp (llama-server), and speaks both the OpenAI and Anthropic protocols natively — so you can point the OpenAI SDK, the Anthropic SDK, Claude Code, or any compatible client at it with just a base URL change.
Unlike the shared LLM API, a GGUF Cloud deployment runs only your model on your own GPU. There are no neighbors, the model stays loaded, and the endpoint is reachable exclusively with your ModelsLab API key.
How it works
Pick a GGUF model
Choose any GGUF model (from Hugging Face, your own quantization, or one of our presets) at modelslab.com/gguf-cloud.
Deploy to a dedicated GPU
We provision a single-tenant GPU pod running
llama-server with your model loaded. Your deployment gets a unique deployment_id.deployment_id on the dashboard at modelslab.com/gguf-cloud.
Base URL
Every deployment has its own base URL. The{deployment_id} is shown on the deployment’s dashboard page:
- OpenAI SDKs use the base URL with
/v1appended →https://modelslab.com/api/gguf/{deployment_id}/v1 - Anthropic SDKs / Claude Code use the base URL as-is →
https://modelslab.com/api/gguf/{deployment_id}
Quickstart
Authenticate with your existing ModelsLab API key. Because the deployment serves a single model, themodel field can be "local" (or the model id you deployed) — it’s always routed to your deployment’s model.
Explore
Authentication
Base URLs, your
deployment_id, and the three accepted auth headers.Chat Completions
OpenAI-compatible
/v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models.Messages
Anthropic-compatible
/v1/messages. Works with the Anthropic SDK and Claude Code.Errors
Gateway error reference: 401, 404, 503, and 502 and how to handle them.

