<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Enable SGLang Hierarchical Cache (HiCache)

This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.

## 1) Start the SGLang worker with HiCache enabled

```bash
python -m dynamo.sglang \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --host 0.0.0.0 --port 8000 \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-size 30 \
  --hicache-write-policy write_through \
  --hicache-storage-backend nixl \
  --log-level debug \
  --skip-tokenizer-init
```

- **--enable-hierarchical-cache**: Enables the hierarchical KV cache, which offloads KV cache entries from GPU memory to host memory
- **--hicache-size**: HiCache capacity in GB of pinned host memory; this is the upper bound on the KV cache offloaded to the CPU
- **--hicache-write-policy**: Write policy (e.g., `write_through` for synchronous host writes)
- **--hicache-storage-backend**: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see [PR #8488](https://github.com/sgl-project/sglang/pull/8488)
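
To get a feel for what `--hicache-size 30` buys with `--page-size 64`, the sketch below estimates how many tokens and pages of KV cache fit in the host pool. The model dimensions are assumptions (a Llama-3.1-8B-style layout with 32 layers, 8 KV heads, head dimension 128, and a 16-bit KV cache); check the model's `config.json` for the real values. It also assumes 1 GB means 2^30 bytes and ignores allocator overhead, so treat the result as a rough upper bound.

```bash
# Rough HiCache sizing sketch. The model numbers are assumptions, not values
# read from the checkpoint.
num_layers=32      # assumed num_hidden_layers
num_kv_heads=8     # assumed num_key_value_heads
head_dim=128       # assumed per-head dimension
bytes_per_elem=2   # assumed 16-bit KV cache (bf16/fp16)
page_size=64       # matches --page-size 64
hicache_gb=30      # matches --hicache-size 30

# K and V each hold num_layers * num_kv_heads * head_dim elements per token.
bytes_per_token=$((2 * num_layers * num_kv_heads * head_dim * bytes_per_elem))
# Assumption: 1 GB is treated as 2^30 bytes.
host_bytes=$((hicache_gb * 1024 * 1024 * 1024))
max_tokens=$((host_bytes / bytes_per_token))
max_pages=$((max_tokens / page_size))

echo "bytes_per_token=$bytes_per_token"
echo "max_tokens=$max_tokens"
echo "max_pages=$max_pages"
```

Under these assumptions each token costs 128 KiB of KV cache, so the 30 GB pool holds roughly 245,760 tokens, or 3,840 pages of 64 tokens.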


Then, start the frontend:
```bash
python -m dynamo.frontend --http-port 8000
```
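
Before sending requests, it can help to wait until the frontend answers. The helper below is a minimal sketch: the function name `wait_for_frontend` is hypothetical, and it assumes the frontend exposes an OpenAI-style `/v1/models` route on the HTTP port used above.

```bash
# Hypothetical helper: poll a URL once per second until it responds or the
# retry budget runs out. Returns 0 on success, 1 on timeout.
wait_for_frontend() {
  url="${1:-http://localhost:8000/v1/models}"  # assumed readiness route
  retries="${2:-30}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -s silences progress output, -f makes HTTP errors a non-zero exit.
    if curl -sf "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage (after starting the worker and the frontend):
#   wait_for_frontend && echo "frontend is up"
```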

## 2) Send a single request

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
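
To check whether a repeated prompt benefits from cached KV, one option is to send the same request twice and compare the latency curl reports. The sketch below only defines helpers; `build_payload` and `timed_request` are hypothetical names, not part of Dynamo or SGLang, and the payload builder assumes the prompt contains no double quotes or backslashes.

```bash
MODEL="deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
ENDPOINT="http://localhost:8000/v1/chat/completions"

# Hypothetical helper: print a chat-completions payload for a simple prompt
# (no double quotes or backslashes in the prompt text).
build_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "stream": false, "max_tokens": 30}' \
    "$MODEL" "$1"
}

# Hypothetical helper: send one request, discard the response body, and
# print the total request time in seconds as measured by curl.
timed_request() {
  build_payload "$1" | curl -s -o /dev/null -w '%{time_total}\n' \
    "$ENDPOINT" -H "Content-Type: application/json" -d @-
}

# Usage (requires the worker and frontend from step 1 to be running):
#   prompt="Explain why Roger Federer is considered one of the greatest tennis players of all time"
#   timed_request "$prompt"
#   timed_request "$prompt"
```

With a prompt this short, most of the time is likely spent generating tokens, so any difference between the two runs may be within noise. A longer shared prefix should make reuse easier to observe.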

## 3) (Optional) Benchmarking

Run the perf script:
```bash
bash -x /workspace/benchmarks/llm/perf.sh \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --tensor-parallelism 1 \
  --data-parallelism 1 \
  --concurrency "2,4,8" \
  --input-sequence-length 2048 \
  --output-sequence-length 256
```