# Dynamo model serving recipes

| Model family  | Backend | Mode                | Deployment | Benchmark |
|---------------|---------|---------------------|------------|-----------|
| llama-3-70b   | vllm    | agg                 |     ✓      |     ✓     |
| llama-3-70b   | vllm    | disagg-multi-node   |     ✓      |     ✓     |
| llama-3-70b   | vllm    | disagg-single-node  |     ✓      |     ✓     |
| oss-gpt       | trtllm  | aggregated          |     ✓      |     ✓     |
| DeepSeek-R1   | sglang  | disaggregated       |     ✓      |    🚧     |

## Prerequisites

1. Create a namespace and set the `NAMESPACE` environment variable.
Later steps use this variable to deploy and perf-test the model.

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
```

2. **Dynamo Cloud Platform installed** - Follow the [Quickstart Guide](../docs/guides/dynamo_deploy/README.md)

3. **Kubernetes cluster with GPU support**

4. **Container registry access** for vLLM runtime images

5. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
Update the `hf_hub_secret/hf_hub_secret.yaml` file with your HuggingFace token, then apply it:

```bash
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
```

6. (Optional) Create a shared model cache PVC to store the model weights.
Choose a storage class for the PVC, then set its name in the `storageClass` field of `model-cache/model-cache.yaml`. List the available storage classes with:

```bash
kubectl get storageclass
```
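Before deploying, it can help to confirm that the namespace and the token secret from the steps above are in place. The following is a minimal sketch, not part of the recipes: `check_prereqs` is a hypothetical helper, and it assumes the applied secret is named `hf-token-secret`, matching the `envFromSecret` reference above.

```bash
# Hypothetical preflight helper; not part of the recipes repository.
# Prints one line per missing prerequisite and returns non-zero if any is missing.
check_prereqs() {
  # Fall back to "dynamo", the default that run.sh documents for NAMESPACE.
  local ns="${NAMESPACE:-dynamo}"
  local missing=0

  if ! kubectl get namespace "${ns}" >/dev/null 2>&1; then
    echo "missing: namespace ${ns}"
    missing=1
  fi

  # Assumption: the secret is named hf-token-secret.
  if ! kubectl get secret hf-token-secret -n "${ns}" >/dev/null 2>&1; then
    echo "missing: secret hf-token-secret in ${ns}"
    missing=1
  fi

  return "${missing}"
}
```

Calling `check_prereqs && echo "prerequisites look fine"` after the steps above reports anything that still needs to be created.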

## Running the recipes

Run the recipe to deploy a model:

```bash
./run.sh --model <model> --framework <framework> <deployment-type>
```

Arguments:

- `<deployment-type>`: Deployment type (e.g., `agg`, `disagg-single-node`, `disagg-multi-node`)

Required options:

- `--model <model>`: Model name (e.g., `llama-3-70b`)
- `--framework <framework>`: Framework, one of `VLLM`, `TRTLLM`, `SGLANG` (default: `VLLM`)

Optional:

- `--skip-model-cache`: Skip model downloading (assumes the model cache already exists)
- `-h`, `--help`: Show the help message

Environment variables:

- `NAMESPACE`: Kubernetes namespace (default: `dynamo`)

Examples:

```bash
./run.sh --model llama-3-70b --framework vllm agg
./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg
./run.sh --model llama-3-70b --framework trtllm disagg-single-node
```
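To make the argument layout concrete, here is a small sketch of a parser that accepts the same shape of command line: options in any order, plus one positional deployment type. It is illustrative only. `parse_recipe_args` is a hypothetical function and does not reflect how `run.sh` is actually implemented.

```bash
# Illustrative only: a hypothetical parser mirroring the documented interface.
# It is not the actual run.sh implementation.
parse_recipe_args() {
  local model="" framework="VLLM" deployment="" skip_cache="false" dry_run="false"

  while [ "$#" -gt 0 ]; do
    case "$1" in
      --model|--framework)
        # Both options take a value; refuse to continue if it is absent.
        if [ "$#" -lt 2 ]; then
          echo "missing value for $1" >&2
          return 1
        fi
        if [ "$1" = "--model" ]; then model="$2"; else framework="$2"; fi
        shift 2
        ;;
      --skip-model-cache) skip_cache="true"; shift ;;
      --dry-run)          dry_run="true"; shift ;;
      -*)
        echo "unknown option: $1" >&2
        return 1
        ;;
      *)
        # Anything that is not an option is treated as the deployment type.
        deployment="$1"
        shift
        ;;
    esac
  done

  if [ -z "${model}" ] || [ -z "${deployment}" ]; then
    echo "usage: --model <model> [--framework <fw>] <deployment-type>" >&2
    return 1
  fi

  echo "model=${model} framework=${framework} deployment=${deployment} skip_cache=${skip_cache} dry_run=${dry_run}"
}
```

With this sketch, `parse_recipe_args --model llama-3-70b --framework vllm agg` and `parse_recipe_args agg --framework vllm --model llama-3-70b` resolve to the same settings, which is the behavior the usage line suggests.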


## Dry run mode

To dry run the recipe, add the `--dry-run` flag.

```bash
./run.sh --dry-run --model llama-3-70b --framework vllm agg
```
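A dry run is also a convenient way to check several deployment types in one pass. The sketch below loops over the three `llama-3-70b` vLLM modes listed in the table at the top. `dry_run_all_modes` and the `RUN_SH` override are hypothetical names introduced here for illustration; they are not part of the recipes.

```bash
# Hypothetical helper: dry-run every llama-3-70b vLLM mode from the table.
dry_run_all_modes() {
  # RUN_SH is an assumed override that lets you point at a different script.
  local run_sh="${RUN_SH:-./run.sh}"
  local mode

  for mode in agg disagg-single-node disagg-multi-node; do
    # Stop at the first mode whose dry run fails.
    "${run_sh}" --dry-run --model llama-3-70b --framework vllm "${mode}" || return 1
  done
}
```

Setting `RUN_SH=echo` before calling `dry_run_all_modes` prints the arguments that would be passed for each mode without invoking `run.sh` at all.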

## (Optional) Running the recipes with model cache
To avoid downloading the model weights on every deployment, cache them on a PVC (see the [Prerequisites](#prerequisites) section). Once the cache exists, pass `--skip-model-cache` to skip the model download:

```bash
./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg
```