Your GGUF Cloud deployment exposes the standard OpenAI-compatible routes, backed by your model’s llama-server. Point any OpenAI SDK or compatible client at your deployment’s base URL with /v1 appended.

Endpoints

Method  Path                  Description
POST    /v1/chat/completions  Chat-style completions (messages array).
POST    /v1/completions       Classic text completions (single prompt).
POST    /v1/embeddings        Embeddings for the deployed model (when supported).
GET     /v1/models            List the model served by this deployment.
All paths are relative to your deployment base URL:
https://modelslab.com/api/gguf/{deployment_id}
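As a small illustration, the routes listed above can be joined to the deployment base URL with a helper like the one below. This is a sketch only: the names endpoint_url, sdk_base_url, and ROUTES are hypothetical and chosen for this example, while the URL pattern and the paths are the ones given on this page.

```python
BASE = "https://modelslab.com/api/gguf"

# Documented routes, keyed by short aliases that are hypothetical
# names used only in this sketch.
ROUTES = {
    "chat": "/v1/chat/completions",
    "completions": "/v1/completions",
    "embeddings": "/v1/embeddings",
    "models": "/v1/models",
}


def endpoint_url(deployment_id: str, route: str) -> str:
    """Return the full URL for one of the documented routes."""
    if not deployment_id:
        raise ValueError("deployment_id must not be empty")
    return f"{BASE}/{deployment_id}{ROUTES[route]}"


def sdk_base_url(deployment_id: str) -> str:
    """Return the value to pass as base_url to an OpenAI-compatible SDK."""
    if not deployment_id:
        raise ValueError("deployment_id must not be empty")
    return f"{BASE}/{deployment_id}/v1"
```

SDK clients take the value from sdk_base_url, since they append the route themselves; raw HTTP clients need the full URL from endpoint_url.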

Request

POST https://modelslab.com/api/gguf/{deployment_id}/v1/chat/completions
Authenticate with your ModelsLab API key (see Authentication).
curl -X POST "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/chat/completions" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
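The same request can be sketched in Python using only the standard library, which may help when the OpenAI SDK is not available. The function names build_chat_request and send are hypothetical; the URL, headers, and body fields mirror the curl command above.

```python
import json
import urllib.request


def build_chat_request(deployment_id: str, api_key: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    url = f"https://modelslab.com/api/gguf/{deployment_id}/v1/chat/completions"
    body = {
        "model": "local",
        "messages": messages,
        "max_tokens": 256,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling send(build_chat_request(deployment_id, api_key, messages)) performs the network call; building the request separately makes it easy to inspect the URL and headers first.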

Body

{
  "model": "local",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "stop": ["\n\n"],
  "stream": false
}

Body Attributes

model
string
default:"local"
The model to use. A deployment serves a single model, so this can be "local" or the model id you deployed — either way the request is routed to your deployment’s model.
messages
array
required
Array of message objects, each with a role (system, user, or assistant) and content. Used by /v1/chat/completions.
prompt
string
A single text prompt. Used by the /v1/completions endpoint instead of messages.
max_tokens
integer
default:"256"
Maximum number of tokens to generate. Range: 1 to the model’s context limit.
temperature
number
default:"0.8"
Sampling temperature. Lower values (0.1 to 0.3) produce more focused, near-deterministic output; higher values increase creativity. Range: 0.0 to 2.0.
top_p
number
default:"0.95"
Nucleus sampling: sample only from the smallest set of most likely tokens whose cumulative probability reaches this threshold. Range: 0.0 to 1.0. It is usually best to adjust either temperature or top_p, not both.
top_k
integer
Only sample from the top K most likely tokens (a llama.cpp sampling option).
repeat_penalty
number
default:"1.1"
Penalty applied to repeated tokens. Values > 1 discourage repetition (a llama.cpp sampling option).
presence_penalty
number
default:"0"
Penalizes tokens that have already appeared, encouraging new topics. Range: -2 to 2.
frequency_penalty
number
default:"0"
Penalizes tokens proportionally to how often they have appeared. Range: -2 to 2.
stop
string or array
One or more sequences where generation stops. The stop sequence is not included in the output.
seed
integer
Seed for reproducible sampling. The same seed, input, and sampling parameters should produce the same output.
stream
boolean
default:"false"
When true, responses are streamed as Server-Sent Events (text/event-stream). See Streaming.
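To make the top_k and top_p attributes above more concrete, here is a minimal sketch of how the two filters narrow a token distribution before sampling. It is an illustration under simplifying assumptions, not llama.cpp's actual implementation: the function name top_k_top_p_filter and the toy probabilities are hypothetical.

```python
def top_k_top_p_filter(probs: dict, top_k: int = 0, top_p: float = 1.0) -> dict:
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize the rest."""
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # top_k: keep only the K most likely tokens (0 disables the filter here).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # top_p: keep tokens until the cumulative probability reaches the threshold.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the final candidate set.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (made-up numbers for illustration).
toy = {"Paris": 0.6, "Lyon": 0.25, "Nice": 0.1, "Rome": 0.05}
print(sorted(top_k_top_p_filter(toy, top_p=0.8)))  # ['Lyon', 'Paris']
```

With top_p=0.8, "Paris" alone (0.6) does not reach the threshold, so "Lyon" is added (0.85) and the remaining tokens are dropped.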

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1712345678,
  "model": "local",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}

Response Fields

id
string
Unique identifier for the completion.
object
string
The object type, e.g. chat.completion (or chat.completion.chunk while streaming).
model
string
The model that produced the response (your deployment’s model).
choices
array
The generated choices. Each item contains an index, a message (with role and content), and a finish_reason.
usage
object
Token accounting: prompt_tokens, completion_tokens, and total_tokens.
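As a brief sketch of reading these fields, the snippet below pulls the generated text, finish reason, and token total out of a response body like the one shown above. The function name summarize_completion is hypothetical, and treating a finish_reason of "length" as truncation follows the OpenAI convention, which is assumed here rather than stated on this page.

```python
import json


def summarize_completion(raw: str) -> dict:
    """Extract the commonly used fields from a chat completion response."""
    data = json.loads(raw)
    choice = data["choices"][0]
    usage = data.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # Assumption: "length" means max_tokens was hit (OpenAI convention).
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": usage.get("total_tokens"),
    }


# The sample response from this page, compacted.
RAW = """{
  "id": "chatcmpl-abc123", "object": "chat.completion",
  "created": 1712345678, "model": "local",
  "choices": [{"index": 0,
               "message": {"role": "assistant",
                           "content": "The capital of France is Paris."},
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33}
}"""

print(summarize_completion(RAW)["text"])  # The capital of France is Paris.
```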

Streaming

Set "stream": true to receive Server-Sent Events (SSE) as tokens are generated:
curl -X POST "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/chat/completions" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Write a haiku"}],
    "stream": true
  }'
Each SSE event carries a chat.completion.chunk object; the stream ends with data: [DONE]:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Silent"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" snow"},"finish_reason":null}]}
data: [DONE]
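For clients that read the event stream directly instead of going through an SDK, the lines above can be decoded roughly as follows. This is a minimal sketch: the function name iter_deltas is hypothetical, and it assumes each event arrives as a single "data:" line, as in the sample.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the content fragments from a stream of SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


# The sample events from this page.
SAMPLE_LINES = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Silent"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" snow"},"finish_reason":null}]}',
    "data: [DONE]",
]

print("".join(iter_deltas(SAMPLE_LINES)))  # Silent snow
```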

SDK Examples

This endpoint is a drop-in replacement for the OpenAI API. Just change the base_url and api_key:
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELSLAB_API_KEY",
    base_url="https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1",
)

# Non-streaming
response = client.chat.completions.create(
    model="local",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Other OpenAI routes

List the model served by your deployment:
curl "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/models" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY"
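The response body for /v1/models is not shown on this page. Assuming it follows the OpenAI list format (an object with a data array of entries that each carry an id), the served model id could be read with a sketch like this; the function name served_model_ids and the sample body are hypothetical.

```python
import json


def served_model_ids(raw: str) -> list:
    """Return the model ids from a /v1/models response body."""
    data = json.loads(raw)
    # Assumption: OpenAI-style list response with a "data" array.
    return [entry["id"] for entry in data.get("data", [])]


# Hypothetical response body, assuming the OpenAI list format.
SAMPLE_MODELS = '{"object": "list", "data": [{"id": "local", "object": "model"}]}'

print(served_model_ids(SAMPLE_MODELS))  # ['local']
```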
/v1/embeddings is only available when the deployed model supports embeddings. If it does not, the endpoint returns an error from llama-server.