# llama.cpp engine for Dynamo

Usage:
```bash
# Install ai-dynamo llama.cpp backend (CPU Mode)
pip install "ai-dynamo[llama_cpp]"

# [Optional] To build llama.cpp for CUDA (needs a recent pip)
pip install --force-reinstall -r requirements.gpu.txt

python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]
```

## Request Migration

You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
python -m dynamo.llama_cpp ... --migration-limit=3
```

This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
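
As a rough sketch, the two flags shown in this README can be combined into a single launch line. The variable names `MODEL_PATH` and `MIGRATION_LIMIT` and the helper `dynamo_llama_cpp_cmd` are illustrative assumptions, not part of Dynamo; the helper only prints the command so it can be reviewed before starting a worker.

```bash
#!/usr/bin/env bash
# Illustrative defaults taken from the examples in this README.
# These variable names are assumptions, not Dynamo settings.
MODEL_PATH="${MODEL_PATH:-/data/models/Qwen3-0.6B-Q8_0.gguf}"
MIGRATION_LIMIT="${MIGRATION_LIMIT:-3}"

# Hypothetical helper: prints the launch command instead of running it.
#   $1 = path to the GGUF model file
#   $2 = maximum number of migrations per request
dynamo_llama_cpp_cmd() {
  printf 'python -m dynamo.llama_cpp --model-path %s --migration-limit=%s\n' \
    "$1" "$2"
}

dynamo_llama_cpp_cmd "$MODEL_PATH" "$MIGRATION_LIMIT"
```

Copy the printed line into a shell to start the worker. The helper does not quote its arguments, so a model path containing spaces would need quoting by hand.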