Ollama's API allows you to run and interact with models programmatically.
> Note: Ollama's API docs are moving to https://docs.ollama.com/api
## Get started
If you're just getting started, follow the [quickstart](/quickstart) documentation to get up and running with Ollama's API.

## Endpoints
- [Generate a completion](#generate-a-completion)
- [Generate a chat completion](#generate-a-chat-completion)
- [Create a Model](#create-a-model)
- [List Local Models](#list-local-models)
- [Show Model Information](#show-model-information)
- [Copy a Model](#copy-a-model)
- [Delete a Model](#delete-a-model)
- [Pull a Model](#pull-a-model)
- [Push a Model](#push-a-model)
- [Generate Embeddings](#generate-embeddings)
- [List Running Models](#list-running-models)
- [Version](#version)
## Base URL
After installation, Ollama's API is served by default at `http://localhost:11434/api`. For running cloud models on **ollama.com**, the same API is available at the base URL `https://ollama.com/api`.

## Conventions
### Model names
Model names follow a `model:tag` format, where `model` can have an optional namespace such as `example/model`. Some examples are `orca-mini:3b-q8_0` and `llama3:70b`. The tag is optional and, if not provided, will default to `latest`. The tag is used to identify a specific version.
### Durations
All durations are returned in nanoseconds.
### Streaming responses
Certain endpoints stream responses as JSON objects. Streaming can be disabled by providing `{"stream": false}` for these endpoints.
## Generate a completion
```
POST /api/generate
```
Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request.
### Parameters
- `model`: (required) the [model name](#model-names)
- `prompt`: the prompt to generate a response for
- `suffix`: the text after the model response
- `images`: (optional) a list of base64-encoded images (for multimodal models such as `llava`)
- `think`: (for thinking models) should the model think before responding?

Advanced parameters (optional):

- `format`: the format to return a response in. Format can be `json` or a JSON schema
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
- `system`: system message (overrides what is defined in the `Modelfile`)
- `template`: the prompt template to use (overrides what is defined in the `Modelfile`)
- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
- `raw`: if `true` no formatting will be applied to the prompt. You may choose to use the `raw` parameter if you are specifying a full templated prompt in your request to the API
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
- `context` (deprecated): the context parameter returned from a previous request to `/generate`; this can be used to keep a short conversational memory
#### Structured outputs
Structured outputs are supported by providing a JSON schema in the `format` parameter. The model will generate a response that matches the schema. See the [structured outputs](#request-structured-outputs) example below.
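As a sketch (the prompt and schema are illustrative), such a request might look like the following and constrains the output to an object with `age` and `available` fields:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Ollama is 22 years old and is busy saving the world. Respond using JSON",
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "age": { "type": "integer" },
      "available": { "type": "boolean" }
    },
    "required": ["age", "available"]
  }
}'
```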
#### JSON mode
Enable JSON mode by setting the `format` parameter to `json`. This will structure the response as a valid JSON object. See the JSON mode [example](#request-json-mode) below.
> [!IMPORTANT]
> It's important to instruct the model to use JSON in the `prompt`. Otherwise, the model may generate large amounts of whitespace.
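As an illustrative sketch (the prompt shown is an assumption), a JSON mode request might look like:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What color is the sky at different times of the day? Respond using JSON",
  "format": "json",
  "stream": false
}'
```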
### Examples
#### Generate request (Streaming)
##### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?"
}'
```
##### Response

A stream of JSON objects is returned. Each object carries a fragment of the generated text in `response`; the final object has `"done": true` and includes statistics and additional data from the request:

```json
{
"model":"llama3.2",
"response":"",
"done":true,
"context":[1,2,3],
"total_duration":2938432250,
"load_duration":2559292,
"prompt_eval_count":1,
"prompt_eval_duration":2195557000,
"eval_count":44,
"eval_duration":736432000
}
```
#### Request (Raw Mode)
In some cases, you may wish to bypass the templating system and provide a full prompt. In this case, you can use the `raw` parameter to disable templating. Also note that raw mode will not return a context.
##### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "[INST] why is the sky blue? [/INST]",
"raw": true,
"stream": false
}'
```
#### Request (Reproducible outputs)
For reproducible outputs, set `seed` to a number:
##### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Why is the sky blue?",
"options": {
"seed": 123
}
}'
```
##### Response
```json
{
"model":"mistral",
"created_at":"2023-11-03T15:36:02.583064Z",
"response":" The sky appears blue because of a phenomenon called Rayleigh scattering.",
"done":true,
"total_duration":8493852375,
"load_duration":6589624375,
"prompt_eval_count":14,
"prompt_eval_duration":119039000,
"eval_count":110,
"eval_duration":1779061000
}
```
#### Generate request (With options)
If you want to set custom options for the model at runtime rather than in the Modelfile, you can do so with the `options` parameter. This example sets every available option, but you can set any of them individually and omit the ones you do not want to override.
##### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?",
"stream": false,
"options": {
"num_keep": 5,
"seed": 42,
"num_predict": 100,
"top_k": 20,
"top_p": 0.9,
"min_p": 0.0,
"typical_p": 0.7,
"repeat_last_n": 33,
"temperature": 0.8,
"repeat_penalty": 1.2,
"presence_penalty": 1.5,
"frequency_penalty": 1.0,
"penalize_newline": true,
"stop": ["\n", "user:"],
"numa": false,
"num_ctx": 1024,
"num_batch": 2,
"num_gpu": 1,
"main_gpu": 0,
"use_mmap": true,
"num_thread": 8
}
}'
```
##### Response
```json
{
"model":"llama3.2",
"created_at":"2023-08-04T19:22:45.499127Z",
"response":"The sky is blue because it is the color of the sky.",
"done":true,
"context":[1,2,3],
"total_duration":4935886791,
"load_duration":534986708,
"prompt_eval_count":26,
"prompt_eval_duration":107345000,
"eval_count":237,
"eval_duration":4289432000
}
```
#### Load a model
If an empty prompt is provided, the model will be loaded into memory.
##### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2"
}'
```
##### Response
A single JSON object is returned:
```json
{
"model":"llama3.2",
"created_at":"2023-12-18T19:52:07.071755Z",
"response":"",
"done":true
}
```
#### Unload a model
If an empty prompt is provided and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.
##### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"keep_alive": 0
}'
```
##### Response
A single JSON object is returned:
```json
{
"model":"llama3.2",
"created_at":"2024-09-12T03:54:03.516566Z",
"response":"",
"done":true,
"done_reason":"unload"
}
```
## Generate a chat completion
```
POST /api/chat
```
Generate the next message in a chat with a provided model. This is a streaming endpoint, so there will be a series of responses. Streaming can be disabled using `"stream": false`. The final response object will include statistics and additional data from the request.
### Parameters
- `model`: (required) the [model name](#model-names)
- `messages`: the messages of the chat; this can be used to keep a chat memory
- `tools`: list of tools in JSON for the model to use if supported
- `think`: (for thinking models) should the model think before responding?

The `message` object has the following fields:

- `role`: the role of the message, either `system`, `user`, `assistant`, or `tool`
- `content`: the content of the message
- `thinking`: (for thinking models) the model's thinking process
- `images` (optional): a list of images to include in the message (for multimodal models such as `llava`)
- `tool_calls` (optional): a list of tools in JSON that the model wants to use
- `tool_name` (optional): the name of the tool that was executed, to inform the model of the result

Advanced parameters (optional):

- `format`: the format to return a response in. Format can be `json` or a JSON schema.
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
### Tool calling
Tool calling is supported by providing a list of tools in the `tools` parameter. The model will generate a response that includes a list of tool calls. See the [Chat request (Streaming with tools)](#chat-request-streaming-with-tools) example below.
Models can also explain the result of the tool call in the response. See the [Chat request (With history, with tools)](#chat-request-with-history-with-tools) example below.
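As a sketch of that flow (the function name, arguments, and result are illustrative), the tool's output is sent back as a message with `role` set to `tool` and the tool's name in `tool_name`, and the model then answers using that result:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "stream": false,
  "messages": [
    { "role": "user", "content": "What is the weather today in Paris?" },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        { "function": { "name": "get_current_weather", "arguments": { "location": "Paris, FR", "format": "celsius" } } }
      ]
    },
    { "role": "tool", "tool_name": "get_current_weather", "content": "22 degrees celsius" }
  ]
}'
```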
[See models with tool calling capabilities](https://ollama.com/search?c=tool).
### Structured outputs
Structured outputs are supported by providing a JSON schema in the `format` parameter. The model will generate a response that matches the schema. See the [Chat request (Structured outputs)](#chat-request-structured-outputs) example below.
### Examples

#### Chat request (Structured outputs)

##### Request

```shell
curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Ollama is 22 years old and busy saving the world. Return a JSON object with the age and availability."}],
"stream": false,
"format": {
"type": "object",
"properties": {
"age": {
"type": "integer"
},
"available": {
"type": "boolean"
}
},
"required": [
"age",
"available"
]
},
"options": {
"temperature": 0
}
}'
```
##### Response
```json
{
"model":"llama3.1",
"created_at":"2024-12-06T00:46:58.265747Z",
"message":{
"role":"assistant",
"content":"{\"age\": 22, \"available\": false}"
},
"done_reason":"stop",
"done":true,
"total_duration":2254970291,
"load_duration":574751416,
"prompt_eval_count":34,
"prompt_eval_duration":1502000000,
"eval_count":12,
"eval_duration":175000000
}
```
#### Chat request (With History)
Send a chat message with a conversation history. You can use this same approach to start the conversation using multi-shot or chain-of-thought prompting.
##### Request
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
},
{
"role": "assistant",
"content": "due to rayleigh scattering."
},
{
"role": "user",
"content": "how is that different than mie scattering?"
"content":" The image features a cute, little pig with an angry facial expression. It's wearing a heart on its shirt and is waving in the air. This scene appears to be part of a drawing or sketching project.",
"images":null
},
"done":true,
"total_duration":1668506709,
"load_duration":1986209,
"prompt_eval_count":26,
"prompt_eval_duration":359682000,
"eval_count":83,
"eval_duration":1303285000
}
```
#### Chat request (Reproducible outputs)
##### Request
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "Hello!"
}
],
"options": {
"seed": 101,
"temperature": 0
}
}'
```
##### Response
```json
{
"model":"llama3.2",
"created_at":"2023-12-12T14:13:43.416799Z",
"message":{
"role":"assistant",
"content":"Hello! How are you today?"
},
"done":true,
"total_duration":5191566416,
"load_duration":2154458,
"prompt_eval_count":26,
"prompt_eval_duration":383809000,
"eval_count":298,
"eval_duration":4799921000
}
```
#### Chat request (with tools)
##### Request
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "What is the weather today in Paris?"
}
],
"stream": false,
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to get the weather for, e.g. San Francisco, CA"
},
"format": {
"type": "string",
"description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location", "format"]
}
}
}
]
}'
```
##### Response
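The model's requested call is returned in `message.tool_calls` rather than in `content`. A representative response (timing fields omitted, values illustrative):

```json
{
  "model": "llama3.2",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": {
            "format": "celsius",
            "location": "Paris, FR"
          }
        }
      }
    ]
  },
  "done_reason": "stop",
  "done": true
}
```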
#### Load a model

If the messages array is empty, the model will be loaded into memory.

##### Request
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": []
}'
```
##### Response
```json
{
"model":"llama3.2",
"created_at":"2024-09-12T21:17:29.110811Z",
"message":{
"role":"assistant",
"content":""
},
"done_reason":"load",
"done":true
}
```
#### Unload a model
If the messages array is empty and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.
##### Request
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [],
"keep_alive": 0
}'
```
##### Response
A single JSON object is returned:
```json
{
"model":"llama3.2",
"created_at":"2024-09-12T21:33:17.547535Z",
"message":{
"role":"assistant",
"content":""
},
"done_reason":"unload",
"done":true
}
```
## Create a Model
```
POST /api/create
```
Create a model from:
- another model;
- a safetensors directory; or
- a GGUF file.
If you are creating a model from a safetensors directory or from a GGUF file, you must [create a blob](#create-a-blob) for each of the files and then use the file name and SHA256 digest associated with each blob in the `files` field.
### Parameters
- `model`: name of the model to create
- `from`: (optional) name of an existing model to create the new model from
- `files`: (optional) a dictionary of file names to SHA256 digests of blobs to create the model from
- `adapters`: (optional) a dictionary of file names to SHA256 digests of blobs for LoRA adapters
- `template`: (optional) the prompt template for the model
- `license`: (optional) a string or list of strings containing the license or licenses for the model
- `system`: (optional) a string containing the system prompt for the model
- `parameters`: (optional) a dictionary of parameters for the model (see [Modelfile](./modelfile.md#valid-parameters-and-values) for a list of parameters)
- `messages`: (optional) a list of message objects used to create a conversation
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
- `quantize` (optional): quantize a non-quantized (e.g. float16) model
#### Quantization types
| Type | Recommended |
| ------ | :---------: |
| q4_K_M | \* |
| q4_K_S | |
| q8_0 | \* |
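For example, a request along these lines (model names are illustrative) would create a `q4_K_M`-quantized copy of a float16 model:

```shell
curl http://localhost:11434/api/create -d '{
  "model": "llama3.2:quantized",
  "from": "llama3.2:3b-instruct-fp16",
  "quantize": "q4_K_M"
}'
```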
### Examples
#### Create a new model
Create a new model from an existing model.
##### Request
```shell
curl http://localhost:11434/api/create -d '{
"model": "mario",
"from": "llama3.2",
"system": "You are Mario from Super Mario Bros."
}'
```
##### Response
A stream of JSON objects is returned:
```json
{"status":"reading model metadata"}
{"status":"creating system layer"}
{"status":"using already created layer sha256:22f7f8ef5f4c791c1b03d7eb414399294764d7cc82c7e94aa81a1feb80a983a2"}
{"status":"using already created layer sha256:8c17c2ebb0ea011be9981cc3922db8ca8fa61e828c5d3f44cb6ae342bf80460b"}
{"status":"using already created layer sha256:7c23fb36d80141c4ab8cdbb61ee4790102ebd2bf7aeff414453177d4f2110e5d"}
{"status":"using already created layer sha256:2e0493f67d0c8c9c68a8aeacdf6a38a2151cb3c4c1d42accf296e19810527988"}
{"status":"using already created layer sha256:2759286baa875dc22de5394b4a925701b1896a7e3f8e53275c36f75a877a82c9"}
```

#### Create a model from GGUF

Create a model from a GGUF file. The `files` parameter should be filled out with the file name and SHA256 digest of the GGUF file you wish to use. Use [/api/blobs/:digest](#push-a-blob) to push the GGUF file to the server before calling this API.
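A request might look like the following; the model name is illustrative and the digest is a placeholder for the SHA256 of the GGUF blob you pushed:

```shell
curl http://localhost:11434/api/create -d '{
  "model": "my-gguf-model",
  "files": {
    "model.gguf": "sha256:<digest-of-the-pushed-gguf-file>"
  }
}'
```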
#### Create a model from a Safetensors directory

The `files` parameter should include a dictionary of files for the safetensors model which includes the file names and SHA256 digest of each file. Use [/api/blobs/:digest](#push-a-blob) to first push each of the files to the server before calling this API. Files will remain in the cache until the Ollama server is restarted.
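A sketch of such a request, with a placeholder model name and placeholder file names and digests standing in for the safetensors directory contents:

```shell
curl http://localhost:11434/api/create -d '{
  "model": "my-safetensors-model",
  "files": {
    "config.json": "sha256:<digest>",
    "tokenizer.json": "sha256:<digest>",
    "model-00001-of-00002.safetensors": "sha256:<digest>",
    "model-00002-of-00002.safetensors": "sha256:<digest>"
  }
}'
```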
## Show Model Information

Show information about a model including details, modelfile, template, parameters, license, and system prompt.
### Parameters
- `model`: name of the model to show
- `verbose`: (optional) if set to `true`, returns full data for verbose response fields
### Examples
#### Request
```shell
curl http://localhost:11434/api/show -d '{
"model": "llava"
}'
```
#### Response
```json5
{
modelfile: '# Modelfile generated by "ollama show"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE """{{ .System }}\nUSER: {{ .Prompt }}\nASSISTANT: """\nPARAMETER num_ctx 4096\nPARAMETER stop "\u003c/s\u003e"\nPARAMETER stop "USER:"\nPARAMETER stop "ASSISTANT:"',
template: "{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>",
}
```

## Delete a Model

Delete a model and its data. Returns a 200 OK if successful, 404 Not Found if the model to be deleted doesn't exist.

### Parameters

- `model`: model name to delete
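For example, a request to delete a model looks like the following (the model name is illustrative); the endpoint uses the HTTP `DELETE` method:

```shell
curl -X DELETE http://localhost:11434/api/delete -d '{
  "model": "llama3.2"
}'
```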
## Pull a Model
```
POST /api/pull
```
Download a model from the ollama library. Cancelled pulls are resumed from where they left off, and multiple calls will share the same download progress.
### Parameters
- `model`: name of the model to pull
- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development.
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
### Examples
#### Request
```shell
curl http://localhost:11434/api/pull -d '{
"model": "llama3.2"
}'
```
#### Response
If `stream` is not specified, or set to `true`, a stream of JSON objects is returned:
The first object is the manifest:
```json
{
"status":"pulling manifest"
}
```
Then there is a series of downloading responses. Until any of the downloads is completed, the `completed` key may not be included. The number of files to be downloaded depends on the number of layers specified in the manifest.
```json
{
"status":"pulling digestname",
"digest":"digestname",
"total":2142590208,
"completed":241970
}
```
After all the files are downloaded, the final responses are:
```json
{
"status":"verifying sha256 digest"
}
{
"status":"writing manifest"
}
{
"status":"removing any unused layers"
}
{
"status":"success"
}
```
If `stream` is set to `false`, then the response is a single JSON object:
```json
{
"status":"success"
}
```
## Push a Model
```
POST /api/push
```
Upload a model to a model library. Requires registering for ollama.ai and adding a public key first.
### Parameters
- `model`: name of the model to push in the form of `<namespace>/<model>:<tag>`
- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development.
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
### Examples
#### Request
```shell
curl http://localhost:11434/api/push -d '{
"model": "mattw/pygmalion:latest"
}'
```
#### Response
If `stream` is not specified, or set to `true`, a stream of JSON objects is returned:
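The stream looks roughly like the following; the digest and byte counts are placeholders:

```json
{"status":"retrieving manifest"}
{"status":"starting upload","digest":"sha256:<digest>","total":1928429856}
{"status":"pushing manifest"}
{"status":"success"}
```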
If `stream` is set to `false`, then the response is a single JSON object:
```json
{"status":"success"}
```
## Generate Embeddings
```
POST /api/embed
```
Generate embeddings from a model
### Parameters
- `model`: name of model to generate embeddings from
- `input`: text or list of text to generate embeddings for

Advanced parameters:

- `truncate`: truncates the end of each input to fit within context length. Returns error if `false` and context length is exceeded. Defaults to `true`
- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
- `dimensions`: number of dimensions for the embedding
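For example, a minimal request looks like the following (the model name is illustrative). The response contains an `embeddings` array with one embedding (a list of floats) per input:

```shell
curl http://localhost:11434/api/embed -d '{
  "model": "all-minilm",
  "input": "Why is the sky blue?"
}'
```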
## Generate Embedding

> Note: this endpoint has been superseded by `/api/embed`
```
POST /api/embeddings
```
Generate embeddings from a model
### Parameters
- `model`: name of model to generate embeddings from
- `prompt`: text to generate embeddings for

Advanced parameters:

- `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
## Libraries

Ollama has official libraries for Python and JavaScript. Several community-maintained libraries are also available for Ollama; for a full list, see the [Ollama GitHub repository](https://github.com/ollama/ollama?tab=readme-ov-file#libraries-1).

## Versioning

Ollama's API isn't strictly versioned, but the API is expected to be stable and backwards compatible. Deprecations are rare and will be announced in the [release notes](https://github.com/ollama/ollama/releases).
## Benchmarks

The Ollama repository includes Go benchmark tests that measure end-to-end performance of a running Ollama server. Run these tests to evaluate model inference performance on your hardware and measure the impact of code changes.
## When to use
Run these benchmarks when:
- Making changes to the model inference engine
- Modifying model loading/unloading logic
- Changing prompt processing or token generation code
- Implementing a new model architecture
- Testing performance across different hardware setups
## Prerequisites
- Ollama server running locally with `ollama serve` on `127.0.0.1:11434`
## Usage and Examples
> Note: All commands must be run from the root directory of the Ollama project.
Basic syntax:
```bash
go test -bench=. ./benchmark/... -m $MODEL_NAME
```
Required flags:
- `-bench=.`: Run all benchmarks
- `-m`: Model name to benchmark
Optional flags:
- `-count N`: Number of times to run the benchmark (useful for statistical analysis)
- `-timeout T`: Maximum time for the benchmark to run (e.g. "10m" for 10 minutes)
Common usage patterns:
Single benchmark run with a model specified:
```bash
go test -bench=. ./benchmark/... -m llama3.3
```
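Multiple runs for statistical analysis, with a longer timeout (flag values are illustrative):

```bash
go test -bench=. ./benchmark/... -m llama3.3 -count 6 -timeout 30m
```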
## Output metrics
The benchmark reports several key metrics:
- `gen_tok/s`: Generated tokens per second
- `prompt_tok/s`: Prompt processing tokens per second
- `ttft_ms`: Time to first token in milliseconds
- `load_ms`: Model load time in milliseconds
- `gen_tokens`: Total tokens generated
- `prompt_tokens`: Total prompt tokens processed
Each benchmark runs two scenarios:
- Cold start: Model is loaded from disk for each test
- Warm start: Model is pre-loaded in memory
Three prompt lengths are tested for each scenario: