- Using the commands in the [`tests`](tests) folder. For instance, running the `./tests/test-backend-ops` command tests different backend implementations of the GGML library (see the example invocations after this list)
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- Please rate the complexity of your PR (i.e. `Review Complexity : Low`, `Review Complexity : Medium`, `Review Complexity : High`). This makes it easier for maintainers to triage the PRs.
- The PR template has a series of review complexity checkboxes `[ ]` that [you can mark as](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/about-task-lists) `[X]` for your convenience
- Consider allowing write access to your branch for faster review
- If your PR becomes stale, don't hesitate to ping the maintainers in the comments
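For reference, the invocations look roughly like this (paths assume an in-tree build; the exact `test-backend-ops` sub-commands and the CI output directories are described in the respective READMEs, so treat this as a sketch):
```bash
# Run the backend operator tests (use "perf" instead of "test" for performance measurements)
./tests/test-backend-ops test

# Run the full CI locally, writing results and downloaded models/data to ./tmp
mkdir -p tmp
bash ./ci/run.sh ./tmp/results ./tmp/mnt
```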
# Pull requests (for collaborators)
- Squash-merge PRs
- Use the following format for the squashed commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : fix typo in utils.py (#1234)`
- Optionally, pick a `<module>` from here: https://github.com/ggerganov/llama.cpp/wiki/Modules
# Coding guidelines
- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
- Naming usually optimizes for common prefix (see https://github.com/ggerganov/ggml/pull/302#discussion_r1243240963)
- Tensors store data in row-major order. We refer to dimension 0 as columns, 1 as rows, 2 as matrices
- Matrix multiplication is unconventional: [`C = ggml_mul_mat(ctx, A, B)`](https://github.com/ggerganov/llama.cpp/blob/880e352277fc017df4d5794f0c21c44e1eae2b84/ggml.h#L1058-L1064) means $C^T = A B^T \Leftrightarrow C = B A^T.$
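Spelled out with explicit shapes (this is only a restatement of the formula above in conventional row-by-column notation): if $A$ is $n \times k$ and $B$ is $m \times k$, then

$$
C = \operatorname{ggml\_mul\_mat}(A, B) \in \mathbb{R}^{m \times n}, \qquad C_{ji} = \sum_{l=1}^{k} A_{il} B_{jl},
$$

i.e. $C = B A^T$ and therefore $C^T = A B^T$.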
# Legacy build targets that were renamed in #7809, but we want to build binaries for them that output a deprecation warning if people try to use them.
# We don't want to clutter things too much, so we only build replacements for the most commonly used binaries.
LEGACY_TARGETS_BUILD= main quantize perplexity embedding server
warn:=$(warning Your arch is announced as x86_64, but it seems to actually be ARM64. Not fixing that can lead to bad performance. For more info see: https://github.com/ggerganov/whisper.cpp/issues/66\#issuecomment-1282546789)
$(error I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via environment variable CUDA_DOCKER_ARCH, e.g. by running "export CUDA_DOCKER_ARCH=compute_XX" on Unix-like systems, where XX is the minimum compute capability that the code needs to run on. A list with compute capabilities can be found here: https://developer.nvidia.com/cuda-gpus )
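For example, on a Turing card with compute capability 7.5 the fix is to export the variable before building (the `7.5` value here is only illustrative, and the `compute_cap` query needs a reasonably recent NVIDIA driver):
```bash
# Look up the compute capability of the installed GPU (supported by recent nvidia-smi versions)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Tell the CUDA build which architecture to target
export CUDA_DOCKER_ARCH=compute_75
```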
Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++
> [!IMPORTANT]
> [2024 Jun 12] Binaries have been renamed w/ a `llama-` prefix. `main` is now `llama-cli`, `server` is `llama-server`, etc. (https://github.com/ggerganov/llama.cpp/pull/7809)
## Recent API changes
- [2024 Jun 26] The source code and CMake build scripts have been restructured https://github.com/ggerganov/llama.cpp/pull/8006
- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens https://github.com/ggerganov/llama.cpp/pull/6807
- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` https://github.com/ggerganov/llama.cpp/pull/6341
- [2024 Mar 26] Logits and embeddings API updated for compactness https://github.com/ggerganov/llama.cpp/pull/6122
- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` https://github.com/ggerganov/llama.cpp/pull/6017
- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) https://github.com/ggerganov/llama.cpp/pull/5328
- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
## Hot topics
- **`convert.py` has been deprecated and moved to `examples/convert_legacy_llama.py`; please use `convert_hf_to_gguf.py` instead** https://github.com/ggerganov/llama.cpp/pull/7430
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [AIKit](https://github.com/sozercan/aikit) (MIT)
- [LARS - The LLM & Advanced Referencing Solution](https://github.com/abgulati/LARS) (AGPL)
*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*
**Tools:**
- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
**Infrastructure:**
- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
**Games:**
- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.
## Demo
<details>
<summary>Typical run using LLaMA v2 13B on M2 Ultra</summary>
```
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’s an image or text file (like PDFs). In order for someone else’s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings: load time = 576.45 ms
llama_print_timings: sample time = 283.10 ms / 400 runs ( 0.71 ms per token, 1412.91 tokens per second)
llama_print_timings: prompt eval time = 599.83 ms / 19 tokens ( 31.57 ms per token, 31.68 tokens per second)
llama_print_timings: eval time = 24513.59 ms / 399 runs ( 61.44 ms per token, 16.28 tokens per second)
llama_print_timings: total time = 25431.49 ms
```
</details>
<details>
<summary>Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook</summary>
And here is another demo of running both LLaMA-7B and [whisper.cpp](https://github.com/ggerganov/whisper.cpp) on a single M1 Pro MacBook.
</details>
## Usage
Here are the end-to-end binary build and model conversion steps for most supported models.
### Basic usage
Firstly, you need to get the binary. There are different methods that you can follow:
- Method 1: Clone this repository and build locally, see [how to build](./docs/build.md)
- Method 2: If you are using MacOS or Linux, you can install llama.cpp via [brew, flox or nix](./docs/install.md)
- Method 3: Use a Docker image, see [documentation for Docker](./docs/docker.md)
- Method 4: Download pre-built binary from [releases](https://github.com/ggerganov/llama.cpp/releases)
You can run a basic completion using this command:
```bash
llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
```
See [this page](./examples/main/README.md) for a full list of parameters.
### Conversation mode
If you want a more ChatGPT-like experience, you can run in conversation mode by passing `-cnv` as a parameter:
```bash
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!
```
By default, the chat template will be taken from the input model. If you want to use another chat template, pass `--chat-template NAME` as a parameter. See the list of [supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
```bash
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
```
You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
```bash
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
```
### Web server
[llama.cpp web server](./examples/server/README.md) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
Example usage:
```bash
./llama-server -m your_model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
```
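Since the server speaks the OpenAI API, a plain `curl` request against its `/v1/chat/completions` route (documented in the server README) is enough for a quick check; the host and port match the command above:
```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user",   "content": "Hello!"}
          ]
        }'
```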
### Interactive mode
> If you prefer basic usage, please consider using conversation mode instead of interactive mode
In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.
Here is an example of a few-shot interaction, invoked with the command
Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `llama-cli` example program.
The prompt, user inputs, and model generations can be saved and resumed across calls to `./llama-cli` by leveraging `--prompt-cache` and `--prompt-cache-all`. The `./examples/chat-persistent.sh` script demonstrates this with support for long-running, resumable chat sessions. To use this example, you must provide a file to cache the initial chat prompt and a directory to save the chat session, and may optionally provide the same variables as `chat-13B.sh`. The same prompt cache can be reused for new chat sessions. Note that both prompt cache and chat directory are tied to the initial prompt (`PROMPT_TEMPLATE`) and the model file.
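A minimal sketch of such a resumable session, using placeholder model and cache file names:
```bash
# First call: evaluate the prompt and store its state (plus the generations) in prompt.cache
./llama-cli -m your_model.gguf --prompt-cache prompt.cache --prompt-cache-all \
    -e -p "Transcript of a chat between Alice and Bob.\nAlice:" -n 64

# Later calls with the same prompt and cache file skip re-evaluating the cached prefix
./llama-cli -m your_model.gguf --prompt-cache prompt.cache --prompt-cache-all \
    -e -p "Transcript of a chat between Alice and Bob.\nAlice:" -n 64
```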
`llama.cpp` supports grammars to constrain model output. For example, you can force the model to output JSON only:
```bash
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
```
The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).
For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
## Build
Please refer to [Build llama.cpp locally](./docs/build.md)
## Supported backends
| Backend | Target devices |
| --- | --- |
| [Metal](./docs/build.md#metal-build) | Apple Silicon |
| [BLAS](./docs/build.md#blas-build) | All |
| [BLIS](./docs/backend/BLIS.md) | All |
| [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
| [MUSA](./docs/build.md#musa) | Moore Threads GPU |
| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
| [Vulkan](./docs/build.md#vulkan) | GPU |
## Tools
### Prepare and Quantize
> [!NOTE]
> You can use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to quantise your model weights without any setup too. It is synced from `llama.cpp` main every 6 hours.
To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.
Note: `convert.py` has been moved to `examples/convert_legacy_llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
It does not support LLaMA 3; you can use `convert_hf_to_gguf.py` with LLaMA 3 downloaded from Hugging Face.
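A rough sketch of the convert-then-quantize flow (the paths, the output file name, and the `Q4_K_M` quantization type are placeholders; the quantize README linked below is authoritative):
```bash
# Convert Hugging Face weights in ./models/mymodel/ to a GGUF file (F16 by default);
# the exact output file name depends on the script version
python3 convert_hf_to_gguf.py ./models/mymodel/

# Quantize the F16 GGUF down to 4 bits (Q4_K_M) to reduce memory use
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
```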
To learn more about quantizing models, [read this documentation](./examples/quantize/README.md)
### Perplexity (measuring model quality)
You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
To learn more about how to measure perplexity using llama.cpp, [read this documentation](./examples/perplexity/README.md)
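For example, a typical run looks roughly like this (the model file and the wikitext-2 test set path are placeholders you supply yourself):
```bash
# Lower perplexity on the same text indicates better model quality
./llama-perplexity -m ./models/mymodel/ggml-model-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
```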
## Contributing
- Contributors can open PRs
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Collaborators will be invited based on contributions
- Any help with managing issues and PRs is very appreciated!
- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
- [**Reporting a vulnerability**](#reporting-a-vulnerability)
## Using llama.cpp securely
### Untrusted models
Be careful when running untrusted models. This classification includes models created by unknown developers or utilizing data obtained from unknown sources.
*Always execute untrusted models within a secure, isolated environment such as a sandbox* (e.g., containers, virtual machines). This helps protect your system from potentially malicious code.
> [!NOTE]
> The trustworthiness of a model is not binary. You must always determine the proper level of caution depending on the specific model and how it matches your use case and risk tolerance.
### Untrusted inputs
Some models accept various input formats (text, images, audio, etc.). The libraries converting these inputs have varying security levels, so it's crucial to isolate the model and carefully pre-process inputs to mitigate script injection risks.
For maximum security when handling untrusted inputs, you may need to employ the following:
* Sandboxing: Isolate the environment where the inference happens (see the sketch after this list).
* Pre-analysis: Check how the model performs by default when exposed to prompt injection (e.g. using [fuzzing for prompt injection](https://github.com/FonduAI/awesome-prompt-injection?tab=readme-ov-file#tools)). This will give you leads on how hard you will have to work on the next topics.
* Updates: Keep both LLaMA C++ and your libraries updated with the latest security patches.
* Input Sanitization: Before feeding data to the model, sanitize inputs rigorously. This involves techniques such as:
* Validation: Enforce strict rules on allowed characters and data types.
* Filtering: Remove potentially malicious scripts or code fragments.
* Encoding: Convert special characters into safe representations.
* Verification: Run tooling that identifies potential script injections (e.g. [models that detect prompt injection attempts](https://python.langchain.com/docs/guides/safety/hugging_face_prompt_injection)).
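One possible way to apply the sandboxing advice above is to run inference in a container with no network access and only the model directory mounted. A sketch, assuming one of the published Docker images (see `docs/docker.md` for the image names that are actually available):
```bash
# No network, read-only model mount; the container only sees /models
docker run --rm --network none \
    -v /path/to/models:/models:ro \
    ghcr.io/ggerganov/llama.cpp:light \
    -m /models/your_model.gguf -p "I believe the meaning of life is" -n 64
```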
### Data privacy
To protect sensitive data from potential leaks or unauthorized access, it is crucial to sandbox the model execution. This means running the model in a secure, isolated environment, which helps mitigate many attack vectors.
### Untrusted environments or networks
If you can't run your models in a secure and isolated environment or if it must be exposed to an untrusted network, make sure to take the following security precautions:
* Confirm the hash of any downloaded artifact (e.g. pre-trained model weights) matches a known-good value (see the example after this list).
* Encrypt your data if sending it over the network.
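For the hash check, something along these lines works with GNU coreutils (the hash value is a placeholder for the checksum published by the model provider):
```bash
# The checksum format is "<sha256>  <filename>" (two spaces between the fields)
echo "<published-sha256-hash>  your_model.gguf" | sha256sum --check
```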
### Multi-Tenant environments
If you intend to run multiple models in parallel with shared memory, it is your responsibility to ensure the models do not interact or access each other's data. The primary areas of concern are tenant isolation, resource allocation, model sharing and hardware attacks.
1. Tenant Isolation: Models should run separately with strong isolation methods to prevent unwanted data access. Separating networks is crucial for isolation, as it prevents unauthorized access to data or models and malicious users from sending graphs to execute under another tenant's identity.
2. Resource Allocation: A denial of service caused by one model can impact the overall system health. Implement safeguards like rate limits, access controls, and health monitoring.
3. Model Sharing: In a multitenant model sharing design, tenants and users must understand the security risks of running code provided by others. Since there are no reliable methods to detect malicious models, sandboxing the model execution is the recommended approach to mitigate the risk.
4. Hardware Attacks: GPUs or TPUs can also be attacked. [Research](https://scholar.google.com/scholar?q=gpu+side+channel) has shown that side-channel attacks on GPUs are possible, which can leak data from other models or processes running on the same system at the same time.
## Reporting a vulnerability
Beware that none of the topics under [Using llama.cpp securely](#using-llamacpp-securely) are considered vulnerabilities of LLaMA C++.
However, if you have discovered a security vulnerability in this project, please report it privately. **Do not disclose it as a public issue.** This gives us time to work with you to fix the issue before public exposure, reducing the chance that the exploit will be used before a patch is released.
Please disclose it as a private [security advisory](https://github.com/ggerganov/llama.cpp/security/advisories/new).
A team of volunteers maintains this project on a reasonable-effort basis. As such, please give us at least 90 days to work on a fix before public exposure.
(time ./bin/llama-cli --model ${model_f16} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli --model ${model_q8_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli --model ${model_q4_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli --model ${model_q4_1} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli --model ${model_q5_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli --model ${model_q5_1} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli --model ${model_q2_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli --model ${model_q3_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli --model ${model_q4_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli --model ${model_q5_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli --model ${model_q6_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
message(WARNING "Git index not found in git repository.")
set(GIT_INDEX "")
endif()
else()
message(WARNING "Git repository not found; to enable automatic generation of build info, make sure Git is installed and the project is a Git repository.")
set(GIT_INDEX "")
endif()
# Add a custom command to rebuild build-info.cpp when .git/index changes
options.push_back({ "*", " --verbosity N", "set specific verbosity level (default: %d)", params.verbosity });
options.push_back({ "*", " --verbose-prompt", "print a verbose prompt before generation (default: %s)", params.verbose_prompt ? "true" : "false" });
options.push_back({ "*", " --no-display-prompt", "don't print prompt at generation (default: %s)", !params.display_prompt ? "true" : "false" });
options.push_back({ "*", "-co, --color", "colorise output to distinguish prompt and user input from generations (default: %s)", params.use_color ? "true" : "false" });
options.push_back({ "*", "-s, --seed SEED", "RNG seed (default: %d, use random seed for < 0)", params.seed });
options.push_back({ "*", "-t, --threads N", "number of threads to use during generation (default: %d)", params.n_threads });
options.push_back({ "*", "-tb, --threads-batch N", "number of threads to use during batch and prompt processing (default: same as --threads)" });
options.push_back({ "speculative", "-td, --threads-draft N", "number of threads to use during generation (default: same as --threads)" });
options.push_back({ "main infill", "-i, --interactive", "run in interactive mode (default: %s)", params.interactive ? "true" : "false" });
options.push_back({ "main infill", "-if, --interactive-first", "run in interactive mode and wait for input right away (default: %s)", params.interactive_first ? "true" : "false" });
options.push_back({ "main infill", "-mli, --multiline-input", "allows you to write or paste multiple lines without ending each in '\\'" });
options.push_back({ "main infill", " --in-prefix-bos", "prefix BOS to user inputs, preceding the `--in-prefix` string" });
options.push_back({ "main infill", " --in-prefix STRING", "string to prefix user inputs with (default: empty)" });
options.push_back({ "main infill", " --in-suffix STRING", "string to suffix after user inputs with (default: empty)" });
options.push_back({ "main", " --no-warmup", "skip warming up the model with an empty run" });
options.push_back({ "server infill",
                    " --spm-infill", "use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this. (default: %s)", params.spm_infill ? "enabled" : "disabled" });
options.push_back({ "sampling" });
options.push_back({ "*", " --samplers SAMPLERS", "samplers that will be used for generation in the order, separated by \';\'\n"
options.push_back({ "*", " --grammar GRAMMAR", "BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '%s')", sparams.grammar.c_str() });
options.push_back({ "*", " --grammar-file FNAME", "file to read grammar from" });
options.push_back({ "*", "-ctk, --cache-type-k TYPE", "KV cache data type for K (default: %s)", params.cache_type_k.c_str() });
options.push_back({ "*", "-ctv, --cache-type-v TYPE", "KV cache data type for V (default: %s)", params.cache_type_v.c_str() });
options.push_back({ "perplexity" });
options.push_back({ "perplexity", " --all-logits", "return logits for all tokens in the batch (default: %s)", params.logits_all ? "true" : "false" });
options.push_back({ "perplexity", " --hellaswag", "compute HellaSwag score over random tasks from datafile supplied with -f" });
options.push_back({ "perplexity", " --hellaswag-tasks N", "number of tasks to use when computing the HellaSwag score (default: %zu)", params.hellaswag_tasks });
options.push_back({ "perplexity", " --winogrande", "compute Winogrande score over random tasks from datafile supplied with -f" });
options.push_back({ "perplexity", " --winogrande-tasks N", "number of tasks to use when computing the Winogrande score (default: %zu)", params.winogrande_tasks });
options.push_back({ "perplexity", " --multiple-choice", "compute multiple choice score over random tasks from datafile supplied with -f" });
options.push_back({ "embedding", " --embd-separator", "separator of embeddings (default \\n) for example \"<#sep#>\"" });
options.push_back({ "server" });
options.push_back({ "server", " --host HOST", "ip address to listen (default: %s)", params.hostname.c_str() });
options.push_back({ "server", " --port PORT", "port to listen (default: %d)", params.port });
options.push_back({ "server", " --path PATH", "path to serve static files from (default: %s)", params.public_path.c_str() });
options.push_back({ "server", " --embedding(s)", "restrict to only support embedding use case; use only with dedicated embedding models (default: %s)", params.embedding ? "enabled" : "disabled" });
options.push_back({ "server", " --api-key KEY", "API key to use for authentication (default: none)" });
options.push_back({ "server", " --api-key-file FNAME", "path to file containing API keys (default: none)" });
options.push_back({ "server", " --ssl-key-file FNAME", "path to file a PEM-encoded SSL private key" });
options.push_back({ "server", " --ssl-cert-file FNAME", "path to file a PEM-encoded SSL certificate" });
options.push_back({ "server", " --timeout N", "server read/write timeout in seconds (default: %d)", params.timeout_read });
options.push_back({ "server", " --threads-http N", "number of threads used to process HTTP requests (default: %d)", params.n_threads_http });
    "how much the prompt of a request must match the prompt of a slot in order to use that slot (default: %.2f, 0.0 = disabled)\n", params.slot_prompt_similarity });
options.push_back({ "server", " --lora-init-without-apply", "load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: %s)", params.lora_init_without_apply ? "enabled" : "disabled" });
#ifndef LOG_DISABLE_LOGS
options.push_back({"logging"});
options.push_back({ "*", " --simple-io", "use basic IO for better compatibility in subprocesses and limited consoles" });
options.push_back({ "*", "-ld, --logdir LOGDIR", "path under which to save YAML logs (no logging if unset)" });
options.push_back({ "cvector", " --positive-file FNAME", "positive prompts file, one prompt per line (default: '%s')", params.cvector_positive_file.c_str() });
options.push_back({ "cvector", " --negative-file FNAME", "negative prompts file, one prompt per line (default: '%s')", params.cvector_negative_file.c_str() });
options.push_back({ "cvector", " --pca-batch N", "batch size used for PCA. Larger batch runs faster, but uses more memory (default: %d)", params.n_pca_batch });
options.push_back({ "cvector", " --pca-iter N", "number of iterations used for PCA (default: %d)", params.n_pca_iterations });
options.push_back({ "cvector", " --method {pca,mean}", "dimensionality reduction method to be used (default: pca)" });
options.push_back({ "export-lora" });
options.push_back({ "export-lora", "-m, --model", "model path from which to load base model (default '%s')", params.model.c_str() });
options.push_back({ "export-lora", " --lora FNAME", "path to LoRA adapter (can be repeated to use multiple adapters)" });
options.push_back({ "export-lora", " --lora-scaled FNAME S", "path to LoRA adapter with user defined scaling S (can be repeated to use multiple adapters)" });
options.push_back({ "*", "-t, --threads N", "number of threads to use during computation (default: %d)", params.n_threads });
fprintf(stderr, "%s: Last-Modified header is different (%s != %s): triggering a new download\n", __func__, last_modified.c_str(), headers.last_modified.c_str());