# Chat Completions

> OpenAI-compatible chat completions, completions, embeddings, and models endpoints served by your dedicated GGUF Cloud deployment.

Your GGUF Cloud deployment exposes the standard **OpenAI-compatible** routes, backed by your model's `llama-server`. Point any OpenAI SDK or compatible client at your deployment's base URL with `/v1` appended.

## Endpoints

| Method | Path                   | Description                                         |
| ------ | ---------------------- | --------------------------------------------------- |
| `POST` | `/v1/chat/completions` | Chat-style completions (messages array).            |
| `POST` | `/v1/completions`      | Classic text completions (single `prompt`).         |
| `POST` | `/v1/embeddings`       | Embeddings for the deployed model (when supported). |
| `GET`  | `/v1/models`           | List the model served by this deployment.           |

All paths are relative to your deployment base URL:

```
https://modelslab.com/api/gguf/{deployment_id}
```

## Request

```bash
POST https://modelslab.com/api/gguf/{deployment_id}/v1/chat/completions
```

Authenticate with your ModelsLab API key (see [Authentication](/gguf-cloud/authentication)).
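As a rough illustration of how the base URL, the route, and the API key fit together, the sketch below assembles a chat-completions request with Python's standard library only. The helper name `build_chat_request` and the constant `BASE` are hypothetical names chosen for this example, and the `Bearer` scheme is taken from the curl examples on this page. The request is built but not sent.

```python
import json
import urllib.request

# Base URL pattern from this page; the deployment id is appended per request.
BASE = "https://modelslab.com/api/gguf"


def build_chat_request(deployment_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a chat-completions request."""
    url = f"{BASE}/{deployment_id}/v1/chat/completions"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "YOUR_DEPLOYMENT_ID",
    "YOUR_MODELSLAB_API_KEY",
    {"model": "local", "messages": [{"role": "user", "content": "Hello!"}]},
)
print(req.get_method(), req.full_url)
# To send it, pass the request to urllib.request.urlopen(req).
```

The curl command that follows sends an equivalent request from the shell.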
```bash
curl -X POST "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/chat/completions" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

## Body

```json
{
  "model": "local",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "stop": ["\n\n"],
  "stream": false
}
```

## Body Attributes

- `model`: The model to use. A deployment serves a **single** model, so this can be `"local"` or the id of the model you deployed; either way, the request is routed to your deployment's model.
- `messages`: Array of message objects, each with a `role` (`system`, `user`, or `assistant`) and `content`. Used by `/v1/chat/completions`.
- `prompt`: A single text prompt. Used by `/v1/completions` instead of `messages`.
- `max_tokens`: Maximum number of tokens to generate. Range: `1` to the model's context limit.
- `temperature`: Sampling temperature. Lower values (`0.1`–`0.3`) produce more focused, more predictable output; higher values increase randomness and variety. Range: `0.0`–`2.0`.
- `top_p`: Nucleus sampling: sample only from the smallest set of most likely tokens whose cumulative probability reaches this threshold. Range: `0.0`–`1.0`. Adjust either `temperature` or `top_p`, not both.
- `top_k`: Sample only from the top K most likely tokens (a llama.cpp sampling option).
- `repeat_penalty`: Penalty applied to repeated tokens. Values `> 1` discourage repetition (a llama.cpp sampling option).
- `presence_penalty`: Penalizes tokens that have already appeared at least once, regardless of how often, which encourages new topics. Range: `-2`–`2`.
- `frequency_penalty`: Penalizes tokens in proportion to how often they have already appeared. Range: `-2`–`2`.
- `stop`: One or more sequences at which generation stops. The stop sequence is not included in the output.
- `seed`: Seed for reproducible sampling. The same seed, input, and sampling parameters should produce the same output.

- `stream`: When `true`, responses are streamed as Server-Sent Events (`text/event-stream`). See [Streaming](#streaming).

## Response

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1712345678,
  "model": "local",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
```

## Response Fields

- `id`: Unique identifier for the completion.
- `object`: The object type, e.g. `chat.completion` (or `chat.completion.chunk` while streaming).
- `model`: The model that produced the response (your deployment's model).
- `choices`: The generated choices. Each item contains an `index`, a `message` (with `role` and `content`), and a `finish_reason`.
- `usage`: Token accounting: `prompt_tokens`, `completion_tokens`, and `total_tokens`.

## Streaming

Set `"stream": true` to receive Server-Sent Events (SSE) as tokens are generated:

```bash
curl -X POST "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/chat/completions" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Write a haiku"}],
    "stream": true
  }'
```

Each SSE event carries a `chat.completion.chunk` object, and the stream is terminated by `data: [DONE]`:

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Silent"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" snow"},"finish_reason":null}]}

data: [DONE]
```

## SDK Examples

This endpoint is a drop-in replacement for the OpenAI API.
Change only the `base_url` and `api_key`:

```python Python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELSLAB_API_KEY",
    base_url="https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1",
)

# Non-streaming
response = client.chat.completions.create(
    model="local",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
)
for chunk in stream:
    # Skip chunks that carry no choices or no text (e.g. role-only or final chunks).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

```javascript JavaScript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'YOUR_MODELSLAB_API_KEY',
  baseURL: 'https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1',
});

const response = await client.chat.completions.create({
  model: 'local',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' },
  ],
});

console.log(response.choices[0].message.content);
```

```bash cURL
curl -X POST "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/chat/completions" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

## Other OpenAI routes

```bash List models
curl "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/models" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY"
```

```bash Text completions
curl -X POST "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/completions" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "prompt": "Once upon a time",
    "max_tokens": 128
  }'
```

```bash Embeddings
curl -X POST "https://modelslab.com/api/gguf/YOUR_DEPLOYMENT_ID/v1/embeddings" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "input": "The quick brown fox"
  }'
```

`/v1/embeddings` is only available when the deployed model supports embeddings. If it does not, the endpoint returns an error from `llama-server`.
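For clients that read the raw event stream instead of using an SDK, the format described under Streaming can be decoded along these lines. This is a minimal sketch, assuming each event arrives as a single `data:` line shaped like the samples on this page; the function name `iter_stream_text` is hypothetical, and the input here is a hard-coded list rather than a live HTTP response.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by a chat-completion SSE stream."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and any non-data SSE lines.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Sample events copied from the Streaming section above.
sample = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Silent"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" snow"},"finish_reason":null}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Silent snow
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read incrementally so that text can be shown as it arrives.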