system_message="You are an advanced AI, tasked to assist the user by calling functions in JSON format. The following are the available functions and their parameters and types:\n\n"+documentation
system_message="You are an advanced AI, tasked to assist the user by calling functions in JSON format. The following are the available functions and their parameters and types:\n\n"+documentation
system_message="You are an advanced AI, tasked to create a dataset entry in JSON for a Book. The following is the expected output model:\n\n"+documentation
text="""The Feynman Lectures on Physics is a physics textbook based on some lectures by Richard Feynman, a Nobel laureate who has sometimes been called "The Great Explainer". The lectures were presented before undergraduate students at the California Institute of Technology (Caltech), during 1961–1963. The book's co-authors are Feynman, Robert B. Leighton, and Matthew Sands."""
system_message="You are an advanced AI assistant. You are interacting with the user and with your environment by calling functions. You call functions by writing JSON objects, which represent specific function calls.\nBelow is a list of your available function calls:\n\n"+documentation
text="""Get the date and time, get the current weather in celsius in London and solve the following calculation: 42 * 42"""
When running the larger models, make sure you have enough disk space to store all the intermediate files.
## Memory/Disk Requirements
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
| Model | Original size | Quantized size (Q4_0) |
|------:|--------------:|----------------------:|
| 7B | 13 GB | 3.9 GB |
| 13B | 24 GB | 7.8 GB |
| 30B | 60 GB | 19.5 GB |
| 65B | 120 GB | 38.5 GB |
## Quantization
Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
@@ -5,7 +5,7 @@ Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/
...
@@ -5,7 +5,7 @@ Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/
Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
**Features:**
**Features:**
* LLM inference of F16 and quantum models on GPU and CPU
* LLM inference of F16 and quantized models on GPU and CPU
*[OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
*[OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
* Parallel decoding with multi-user support
* Parallel decoding with multi-user support
* Continuous batching
* Continuous batching
...
@@ -15,91 +15,261 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
...
@@ -15,91 +15,261 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
**Command line options:**
## Usage
-`-v`, `--verbose`: Enable verbose server output. When using the `/completion` endpoint, this includes the tokenized prompt, the full request and the full response.
```
-`-t N`, `--threads N`: Set the number of threads to use by CPU layers during generation. Not used by model layers that are offloaded to GPU. This option has no effect when using the maximum number of GPU layers. Default: `std::thread::hardware_concurrency()` (number of CPU cores).
usage: ./llama-server [options]
-`-tb N, --threads-batch N`: Set the number of threads to use by CPU layers during batch and prompt processing (>= 32 tokens). This option has no effect if a GPU is available. Default: `--threads`.
-`--threads-http N`: Number of threads in the http server pool to process requests. Default: `max(std::thread::hardware_concurrency() - 1, --parallel N + 2)`
general:
-`-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.gguf`).
-`-mu MODEL_URL --model-url MODEL_URL`: Specify a remote http url to download the file. Default: unused
-h, --help, --usage print usage and exit
-`-hfr REPO, --hf-repo REPO`: Hugging Face model repository. Default: unused
--version show version and build info
-`-hff FILE, --hf-file FILE`: Hugging Face model file. Default: unused
-v, --verbose print verbose information
-`-a ALIAS`, `--alias ALIAS`: Set an alias for the model. The alias will be returned in API responses.
--verbosity N set specific verbosity level (default: 0)
-`-c N`, `--ctx-size N`: Set the size of the prompt context. The default is `512`, but LLaMA models were built with a context of `2048`, which will provide better results for longer input/inference. The size may differ in other models, for example, baichuan models were build with a context of `4096`.
--verbose-prompt print a verbose prompt before generation (default: false)
-`-ngl N`, `--n-gpu-layers N`: When compiled with GPU support, this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
--no-display-prompt don't print prompt at generation (default: false)
-`-mg i, --main-gpu i`: When using multiple GPUs, this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default, GPU `0` is used.
-co, --color colorise output to distinguish prompt and user input from generations (default: false)
-`-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs, this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default, the data is split in proportion to VRAM, but this may not be optimal for performance.
-s, --seed SEED RNG seed (default: -1, use random seed for < 0)
-`-b N`, `--batch-size N`: Set the batch size for prompt processing. Default: `2048`
-t, --threads N number of threads to use during generation (default: 8)
-`-ub N`, `--ubatch-size N`: Physical maximum batch size. Default: `512`
-tb, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)
-`--mlock`: Lock the model in memory, preventing it from being swapped out when memory-mapped.
-td, --threads-draft N number of threads to use during generation (default: same as --threads)
-`--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed.
-tbd, --threads-batch-draft N number of threads to use during batch and prompt processing (default: same as --threads-draft)
-`--numa STRATEGY`: Attempt one of the below optimization strategies that may help on some NUMA systems
--draft N number of tokens to draft for speculative decoding (default: 5)
-`--numa distribute`: Spread execution evenly over all nodes
-ps, --p-split N speculative decoding split probability (default: 0.1)
-`--numa isolate`: Only spawn threads on CPUs on the node that execution started on
-lcs, --lookup-cache-static FNAME
-`--numa numactl`: Use the CPU map provided by numactl. If run without this previously, it is recommended to drop the system page cache before using this. See https://github.com/ggerganov/llama.cpp/issues/1437
path to static lookup cache to use for lookup decoding (not updated by generation)
-`--numa`: Attempt optimizations that may help on some NUMA systems.
-lcd, --lookup-cache-dynamic FNAME
-`--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). This allows you to adapt the pretrained model to specific tasks or domains.
path to dynamic lookup cache to use for lookup decoding (updated by generation)
-`--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.
-c, --ctx-size N size of the prompt context (default: 0, 0 = loaded from model)
-`-to N`, `--timeout N`: Server read/write timeout in seconds. Default `600`
-n, --predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
-`--host`: Set the hostname or ip address to listen. Default `127.0.0.1`
-b, --batch-size N logical maximum batch size (default: 2048)
-`--port`: Set the port to listen. Default: `8080`
-ub, --ubatch-size N physical maximum batch size (default: 512)
-`--path`: Path from which to serve static files. Default: disabled
--keep N number of tokens to keep from the initial prompt (default: 0, -1 = all)
-`--api-key`: Set an api key for request authorization. By default, the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token. May be used multiple times to enable multiple valid keys.
--chunks N max number of chunks to process (default: -1, -1 = all)
-`--api-key-file`: Path to file containing api keys delimited by new lines. If set, requests must include one of the keys for access. May be used in conjunction with `--api-key`s.
-`--embeddings`: Enable embedding vector output and the OAI compatible endpoint /v1/embeddings. Physical batch size (`--ubatch-size`) must be carefully defined. Default: disabled
-p, --prompt PROMPT prompt to start generation with
-`-np N`, `--parallel N`: Set the number of slots for process requests. Default: `1`. Values > 1 will allow for higher throughput with multiple parallel requests but the results will **not** be deterministic due to differences in rounding error.
in conversation mode, this will be used as system prompt
-`-spf FNAME`, `--system-prompt-file FNAME` Set a file to load a system prompt (initial prompt of all slots). This is useful for chat applications. [See more](#change-system-prompt-on-runtime)
-f, --file FNAME a file containing the prompt (default: none)
-`--mmproj MMPROJ_FILE`: Path to a multimodal projector file for LLaVA.
--in-file FNAME an input file (repeat to specify multiple files)
-`--grp-attn-n`: Set the group attention factor to extend context size through self-extend. Used together with group attention width `--grp-attn-w`. Default: `1`, which is disabled.
-bf, --binary-file FNAME binary file containing the prompt (default: none)
-`--grp-attn-w`: Set the group attention width to extend context size through self-extend. Used together with group attention factor `--grp-attn-n`. Default: `512`
--prompt-cache FNAME file to cache prompt state for faster startup (default: none)
-`--slot-save-path PATH`: Specifies the path where the state of slots (the prompt cache) can be stored. If not provided, the slot management endpoints will be disabled.
--prompt-cache-all if specified, saves user input and generations to cache as well
-`--chat-template JINJA_TEMPLATE`: Set custom jinja chat template. This parameter accepts a string, not a file name. Default: template taken from model's metadata. We only support [some pre-defined templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
not supported with --interactive or other interactive options
-`--log-disable`: Output logs to stdout only, not to `llama.log`. Default: enabled
--prompt-cache-ro if specified, uses the prompt cache but does not update it
-`--log-format FORMAT`: Define the log output to FORMAT: json or text Default: `json`
-r, --reverse-prompt PROMPT halt generation at PROMPT, return control in interactive mode
-`--rope-scaling` : RoPE scaling method. Defaults to linear unless otherwise specified by the model. Options are `none`, `linear`, `yarn`
can be specified more than once for multiple prompts
-`--rope-freq-base N` : RoPE frequency base (default: loaded from model)
-sp, --special special tokens output enabled (default: false)
-`--rope-freq-scale N`: RoPE frequency scaling factor, expands context by a factor of 1/N (e.g. 0.25)
-cnv, --conversation run in conversation mode, does not print special tokens and suffix/prefix
--in-prefix STRING string to prefix user inputs with (default: empty)
-`-ctk TYPE`, `--cache-type-k TYPE` : KV cache data type for K (default: `f16`, options `f32`, `f16`, `q8_0`, `q4_0`, `q4_1`, `iq4_nl`, `q5_0`, or `q5_1`)
--in-suffix STRING string to suffix after user inputs with (default: empty)
-`-ctv TYPE`, `--cache-type-v TYPE` : KV cache type for V (default `f16`, see `-ctk` for options)
--spm-infill use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this. (default: disabled)
**If compiled with `LLAMA_SERVER_SSL=ON`**
sampling:
-`--ssl-key-file FNAME`: path to file a PEM-encoded SSL private key
-`--ssl-cert-file FNAME`: path to file a PEM-encoded SSL certificate
--samplers SAMPLERS samplers that will be used for generation in the order, separated by ';'
The above command will start a server that by default listens on `127.0.0.1:8080`.
The above command will start a server that by default listens on `127.0.0.1:8080`.
...
@@ -198,7 +368,8 @@ node index.js
...
@@ -198,7 +368,8 @@ node index.js
## API Endpoints
## API Endpoints
-**GET**`/health`: Returns the current state of the server:
### GET `/health`: Returns the current state of the server
- 503 -> `{"status": "loading model"}` if the model is still being loaded.
- 503 -> `{"status": "loading model"}` if the model is still being loaded.
- 500 -> `{"status": "error"}` if the model failed to load.
- 500 -> `{"status": "error"}` if the model failed to load.
- 200 -> `{"status": "ok", "slots_idle": 1, "slots_processing": 2 }` if the model is successfully loaded and the server is ready for further requests mentioned below.
- 200 -> `{"status": "ok", "slots_idle": 1, "slots_processing": 2 }` if the model is successfully loaded and the server is ready for further requests mentioned below.
...
@@ -207,7 +378,7 @@ node index.js
...
@@ -207,7 +378,7 @@ node index.js
If the query parameter `include_slots` is passed, `slots` field will contain internal slots data except if `--slots-endpoint-disable` is set.
If the query parameter `include_slots` is passed, `slots` field will contain internal slots data except if `--slots-endpoint-disable` is set.
-**POST**`/completion`: Given a `prompt`, it returns the predicted completion.
### POST `/completion`: Given a `prompt`, it returns the predicted completion.
*Options:*
*Options:*
...
@@ -231,7 +402,7 @@ node index.js
...
@@ -231,7 +402,7 @@ node index.js
`n_predict`: Set the maximum number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. Default: `-1`, where `-1` is infinity.
`n_predict`: Set the maximum number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. Default: `-1`, where `-1` is infinity.
`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded.
`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token.
By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
`stream`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.
`stream`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.
...
@@ -279,13 +450,13 @@ node index.js
...
@@ -279,13 +450,13 @@ node index.js
`id_slot`: Assign the completion task to an specific slot. If is -1 the task will be assigned to a Idle slot. Default: `-1`
`id_slot`: Assign the completion task to an specific slot. If is -1 the task will be assigned to a Idle slot. Default: `-1`
`cache_prompt`: Re-use previously cached prompt from the last request if possible. This may prevent re-caching the prompt from scratch. Default: `false`
`cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: `false`
`system_prompt`: Change the system prompt (initial prompt of all slots), this is useful for chat applications. [See more](#change-system-prompt-on-runtime)
`system_prompt`: Change the system prompt (initial prompt of all slots), this is useful for chat applications. [See more](#change-system-prompt-on-runtime)
`samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"]` - these are all the available values.
`samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"]` - these are all the available values.
### Result JSON
**Response format**
- Note: When using streaming mode (`stream`), only `content` and `stop` will be returned until end of completion.
- Note: When using streaming mode (`stream`), only `content` and `stop` will be returned until end of completion.
...
@@ -324,7 +495,7 @@ Notice that each `probs` is an array of length `n_probs`.
...
@@ -324,7 +495,7 @@ Notice that each `probs` is an array of length `n_probs`.
-`tokens_evaluated`: Number of tokens evaluated in total from the prompt
-`tokens_evaluated`: Number of tokens evaluated in total from the prompt
-`truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)
-`truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)
-**POST**`/tokenize`: Tokenize a given text.
### POST `/tokenize`: Tokenize a given text
*Options:*
*Options:*
...
@@ -332,13 +503,15 @@ Notice that each `probs` is an array of length `n_probs`.
...
@@ -332,13 +503,15 @@ Notice that each `probs` is an array of length `n_probs`.
`add_special`: Boolean indicating if special tokens, i.e. `BOS`, should be inserted. Default: `false`
`add_special`: Boolean indicating if special tokens, i.e. `BOS`, should be inserted. Default: `false`
-**POST**`/detokenize`: Convert tokens to text.
### POST `/detokenize`: Convert tokens to text
*Options:*
*Options:*
`tokens`: Set the tokens to detokenize.
`tokens`: Set the tokens to detokenize.
-**POST**`/embedding`: Generate embedding of a given text just as [the embedding example](../embedding) does.
### POST `/embedding`: Generate embedding of a given text
The same as [the embedding example](../embedding) does.
*Options:*
*Options:*
...
@@ -346,7 +519,9 @@ Notice that each `probs` is an array of length `n_probs`.
...
@@ -346,7 +519,9 @@ Notice that each `probs` is an array of length `n_probs`.
`image_data`: An array of objects to hold base64-encoded image `data` and its `id`s to be reference in `content`. You can determine the place of the image in the content as in the following: `Image: [img-21].\nCaption: This is a picture of a house`. In this case, `[img-21]` will be replaced by the embeddings of the image with id `21` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 21}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.
`image_data`: An array of objects to hold base64-encoded image `data` and its `id`s to be reference in `content`. You can determine the place of the image in the content as in the following: `Image: [img-21].\nCaption: This is a picture of a house`. In this case, `[img-21]` will be replaced by the embeddings of the image with id `21` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 21}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.
-**POST**`/infill`: For code infilling. Takes a prefix and a suffix and returns the predicted completion as stream.
### POST `/infill`: For code infilling.
Takes a prefix and a suffix and returns the predicted completion as stream.
*Options:*
*Options:*
...
@@ -358,14 +533,15 @@ Notice that each `probs` is an array of length `n_probs`.
...
@@ -358,14 +533,15 @@ Notice that each `probs` is an array of length `n_probs`.
-**GET**`/props`: Return current server settings.
-**GET**`/props`: Return current server settings.
### Result JSON
**Response format**
```json
```json
{
{
"assistant_name":"",
"assistant_name":"",
"user_name":"",
"user_name":"",
"default_generation_settings":{...},
"default_generation_settings":{...},
"total_slots":1
"total_slots":1,
"chat_template":""
}
}
```
```
...
@@ -373,8 +549,11 @@ Notice that each `probs` is an array of length `n_probs`.
...
@@ -373,8 +549,11 @@ Notice that each `probs` is an array of length `n_probs`.
-`user_name` - the required anti-prompt to generate the prompt in case you have specified a system prompt for all slots.
-`user_name` - the required anti-prompt to generate the prompt in case you have specified a system prompt for all slots.
-`default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
-`default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
-`total_slots` - the total number of slots for process requests (defined by `--parallel` option)
-`total_slots` - the total number of slots for process requests (defined by `--parallel` option)
-`chat_template` - the model's original Jinja2 prompt template
### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
-**POST**`/v1/chat/completions`: OpenAI-compatible Chat Completions API. Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only model with [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, ChatML template will be used.
Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
*Options:*
*Options:*
...
@@ -426,7 +605,7 @@ Notice that each `probs` is an array of length `n_probs`.
...
@@ -426,7 +605,7 @@ Notice that each `probs` is an array of length `n_probs`.
### POST `/v1/embeddings`: OpenAI-compatible embeddings API
*Options:*
*Options:*
...
@@ -460,9 +639,9 @@ Notice that each `probs` is an array of length `n_probs`.
...
@@ -460,9 +639,9 @@ Notice that each `probs` is an array of length `n_probs`.
}'
}'
```
```
-**GET**`/slots`: Returns the current slots processing state. Can be disabled with `--slots-endpoint-disable`.
### GET `/slots`: Returns the current slots processing state. Can be disabled with `--slots-endpoint-disable`.
### Result JSON
**Response format**
```json
```json
[
[
...
@@ -523,7 +702,7 @@ Notice that each `probs` is an array of length `n_probs`.
...
@@ -523,7 +702,7 @@ Notice that each `probs` is an array of length `n_probs`.
]
]
```
```
-**GET**`/metrics`: [Prometheus](https://prometheus.io/) compatible metrics exporter endpoint if `--metrics` is enabled:
### GET `/metrics`: Prometheus compatible metrics exporter endpoint if `--metrics` is enabled:
Available metrics:
Available metrics:
-`llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
-`llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
...
@@ -535,13 +714,13 @@ Available metrics:
...
@@ -535,13 +714,13 @@ Available metrics:
-`llamacpp:requests_processing`: Number of requests processing.
-`llamacpp:requests_processing`: Number of requests processing.
-`llamacpp:requests_deferred`: Number of requests deferred.
-`llamacpp:requests_deferred`: Number of requests deferred.
-**POST**`/slots/{id_slot}?action=save`: Save the prompt cache of the specified slot to a file.
### POST `/slots/{id_slot}?action=save`: Save the prompt cache of the specified slot to a file.
*Options:*
*Options:*
`filename`: Name of the file to save the slot's prompt cache. The file will be saved in the directory specified by the `--slot-save-path` server parameter.
`filename`: Name of the file to save the slot's prompt cache. The file will be saved in the directory specified by the `--slot-save-path` server parameter.
### Result JSON
**Response format**
```json
```json
{
{
...
@@ -555,13 +734,13 @@ Available metrics:
...
@@ -555,13 +734,13 @@ Available metrics:
}
}
```
```
-**POST**`/slots/{id_slot}?action=restore`: Restore the prompt cache of the specified slot from a file.
### POST `/slots/{id_slot}?action=restore`: Restore the prompt cache of the specified slot from a file.
*Options:*
*Options:*
`filename`: Name of the file to restore the slot's prompt cache from. The file should be located in the directory specified by the `--slot-save-path` server parameter.
`filename`: Name of the file to restore the slot's prompt cache from. The file should be located in the directory specified by the `--slot-save-path` server parameter.
### Result JSON
**Response format**
```json
```json
{
{
...
@@ -575,9 +754,9 @@ Available metrics:
...
@@ -575,9 +754,9 @@ Available metrics:
}
}
```
```
-**POST**`/slots/{id_slot}?action=erase`: Erase the prompt cache of the specified slot.
### POST `/slots/{id_slot}?action=erase`: Erase the prompt cache of the specified slot.
### Result JSON
**Response format**
```json
```json
{
{
...
@@ -586,6 +765,42 @@ Available metrics:
...
@@ -586,6 +765,42 @@ Available metrics:
}
}
```
```
### GET `/lora-adapters`: Get list of all LoRA adapters
If an adapter is disabled, the scale will be set to 0.
**Response format**
```json
[
{
"id":0,
"path":"my_adapter_1.gguf",
"scale":0.0
},
{
"id":1,
"path":"my_adapter_2.gguf",
"scale":0.0
}
]
```
### POST `/lora-adapters`: Set list of LoRA adapters
To disable an adapter, either remove it from the list below, or set scale to 0.
**Request format**
To know the `id` of the adapter, use GET `/lora-adapters`
```json
[
{"id":0,"scale":0.2},
{"id":1,"scale":0.8}
]
```
## More examples
## More examples
### Change system prompt on runtime
### Change system prompt on runtime
...
@@ -629,11 +844,11 @@ bash chat.sh
...
@@ -629,11 +844,11 @@ bash chat.sh
### OAI-like API
### OAI-like API
The HTTP `server` supports an OAI-like API: https://github.com/openai/openai-openapi
The HTTP `llama-server` supports an OAI-like API: https://github.com/openai/openai-openapi
### API errors
### API errors
`server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi
`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi