sgl-hicache-example.md 1.85 KB
Newer Older
1
<!--
2
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3
4
5
6
7
8
9
10
11
12
SPDX-License-Identifier: Apache-2.0
-->

# Enable SGLang Hierarchical Cache (HiCache)

This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.

## 1) Start the SGLang worker with HiCache enabled

```bash
13
python -m dynamo.sglang \
14
  --model-path Qwen/Qwen3-0.6B \
15
16
17
  --host 0.0.0.0 --port 8000 \
  --page-size 64 \
  --enable-hierarchical-cache \
18
  --hicache-ratio 2 \
19
20
21
22
23
24
25
  --hicache-write-policy write_through \
  --hicache-storage-backend nixl \
  --log-level debug \
  --skip-tokenizer-init
```

- **--enable-hierarchical-cache**: Enables hierarchical KV cache/offload
26
- **--hicache-ratio**: The ratio of the size of host KV cache memory pool to the size of device pool. Lower this number if your machine has less CPU memory.
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
- **--hicache-write-policy**: Write policy (e.g., `write_through` for synchronous host writes)
- **--hicache-storage-backend**: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see [PR #8488](https://github.com/sgl-project/sglang/pull/8488)


Then, start the frontend:
```bash
python -m dynamo.frontend --http-port 8000
```

## 2) Send a single request

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
42
    "model": "Qwen/Qwen3-0.6B",
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```

## 3) (Optional) Benchmarking

Run the perf script:
```bash
58
bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
59
  --model Qwen/Qwen3-0.6B \
60
61
62
63
64
65
  --tensor-parallelism 1 \
  --data-parallelism 1 \
  --concurrency "2,4,8" \
  --input-sequence-length 2048 \
  --output-sequence-length 256
```