llama-server. Point any OpenAI SDK or compatible client at your deployment’s base URL with /v1 appended.
Endpoints
| Method | Path | Description |
|---|---|---|
POST | /v1/chat/completions | Chat-style completions (messages array). |
POST | /v1/completions | Classic text completions (single prompt). |
POST | /v1/embeddings | Embeddings for the deployed model (when supported). |
GET | /v1/models | List the model served by this deployment. |
Request
Body
Body Attributes
The model to use. A deployment serves a single model, so this can be
"local" or the model id you deployed — either way the request is routed to your deployment’s model.Array of message objects, each with a
role (system, user, or assistant) and content. Used by /v1/chat/completions.A single text prompt. Used by the
/v1/completions endpoint instead of messages.Maximum number of tokens to generate. Range:
1 to the model’s context limit.Sampling temperature. Lower values (
0.1–0.3) produce focused, deterministic output; higher values increase creativity. Range: 0.0–2.0.Nucleus sampling — only consider tokens with cumulative probability above this threshold. Range:
0.0–1.0. Use either temperature or top_p.Only sample from the top K most likely tokens (a llama.cpp sampling option).
Penalty applied to repeated tokens. Values
> 1 discourage repetition (a llama.cpp sampling option).Penalizes tokens that have already appeared, encouraging new topics. Range:
-2–2.Penalizes tokens proportionally to how often they have appeared. Range:
-2–2.One or more sequences where generation stops. The stop sequence is not included in the output.
Seed for reproducible sampling. The same seed and input produce the same output.
When
true, responses are streamed as Server-Sent Events (text/event-stream). See Streaming.Response
Response Fields
Unique identifier for the completion.
The object type, e.g.
chat.completion (or chat.completion.chunk while streaming).The model that produced the response (your deployment’s model).
The generated choices. Each item contains an
index, a message (with role and content), and a finish_reason.Token accounting:
prompt_tokens, completion_tokens, and total_tokens.Streaming
Set"stream": true to receive Server-Sent Events (SSE) as tokens are generated:
chat.completion.chunk object, terminated by data: [DONE]:
SDK Examples
This endpoint is a drop-in replacement for the OpenAI API. Just change thebase_url and api_key:
Other OpenAI routes
/v1/embeddings is only available when the deployed model supports embeddings. If it does not, the endpoint returns an error from llama-server.
