# DeepSeek-R1 Recipes

Production-ready deployments for **DeepSeek-R1** (671B MoE) across multiple backends and hardware configurations.

## Available Configurations

| Configuration | GPUs | Backend | Mode | Description |
|--------------|------|---------|------|-------------|
| [**sglang/disagg-8gpu**](sglang/disagg-8gpu/) | 16x H200 | SGLang | Disaggregated WideEP | TP=8 per worker, single-node |
| [**sglang/disagg-16gpu**](sglang/disagg-16gpu/) | 32x H200 | SGLang | Disaggregated WideEP | TP=16 per worker, multi-node |
| [**trtllm/disagg/wide_ep/gb200**](trtllm/disagg/wide_ep/gb200/) | 36x GB200 | TensorRT-LLM | Disaggregated WideEP | 8 decode + 1 prefill nodes |
| [**vllm/disagg**](vllm/disagg/) | 32x H200 | vLLM | Disaggregated DEP16 | Multi-node, data-expert parallel |

## Prerequisites

1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with H200 or GB200 GPUs matching the configuration requirements
3. **HuggingFace token** with access to DeepSeek models
4. **High-bandwidth networking** — InfiniBand or RoCE recommended for multi-node deployments
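
Before deploying, it can help to confirm that the cluster advertises enough GPUs for the chosen configuration. The sketch below is one way to do that. It assumes GPUs are exposed through the NVIDIA device plugin as the `nvidia.com/gpu` resource, and the helper names `sum_gpus` and `check_gpu_capacity` are illustrative, not part of these recipes.

```bash
# Sum per-node GPU counts read from stdin (one value per line).
# Non-numeric entries such as "<none>" (nodes without GPUs) are skipped.
sum_gpus() {
  local total=0 n
  while read -r n; do
    case "$n" in
      ''|*[!0-9]*) continue ;;
    esac
    total=$((total + n))
  done
  echo "$total"
}

# Compare the cluster's allocatable GPU count against a required count.
# Assumption: GPUs are advertised as the "nvidia.com/gpu" resource.
check_gpu_capacity() {
  local required="$1" available
  available=$(kubectl get nodes --no-headers \
    -o custom-columns='GPUS:.status.allocatable.nvidia\.com/gpu' | sum_gpus)
  echo "Allocatable GPUs: ${available} (need ${required})"
  [ "$available" -ge "$required" ]
}

# Example: check_gpu_capacity 16   # for sglang/disagg-8gpu
```

This only counts GPUs the scheduler could hand out in principle. It does not check the GPU model or whether other workloads already hold them.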

## Quick Start

```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Create the model cache (update storageClassName in model-cache.yaml first!)
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}

# Start the model download job (run ONE of the following)
# For SGLang deployments:
kubectl apply -f model-cache/model-download-sglang.yaml -n ${NAMESPACE}
# For vLLM/TRT-LLM deployments:
# kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}

# Wait for the download to finish (this is a large model; it may take 1+ hours)
# For vLLM/TRT-LLM, wait on job/model-download instead
kubectl wait --for=condition=Complete job/model-download-sglang -n ${NAMESPACE} --timeout=7200s

# Deploy (choose one configuration)
kubectl apply -f sglang/disagg-8gpu/deploy.yaml -n ${NAMESPACE}
```
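
Because the download job name depends on the backend, a small wrapper can keep the apply and wait steps consistent. The following is a minimal sketch: the job names come from the Quick Start above, while the function names `download_job_for` and `wait_for_download` and the backend keywords (`sglang`, `vllm`, `trtllm`) are hypothetical.

```bash
# Map a backend keyword to the download job name used in the Quick Start.
download_job_for() {
  case "$1" in
    sglang)      echo "model-download-sglang" ;;
    vllm|trtllm) echo "model-download" ;;
    *)           echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Block until the matching download job completes (2 hour timeout, as above).
# The namespace defaults to the NAMESPACE variable exported earlier.
wait_for_download() {
  local backend="$1" namespace="${2:-${NAMESPACE}}" job
  job=$(download_job_for "$backend") || return 1
  kubectl wait --for=condition=Complete "job/${job}" -n "${namespace}" --timeout=7200s
}

# Example: wait_for_download vllm
```

While waiting, `kubectl logs -f job/<job-name> -n ${NAMESPACE}` shows the download progress from another terminal.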

## Test the Deployment

```bash
# Port-forward the frontend (service name varies by deployment)
kubectl port-forward svc/sgl-dsr1-8gpu-frontend 8000:8000 -n ${NAMESPACE}

# Send a test request (from a second terminal, since port-forward keeps running)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
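
To try several prompts without editing the JSON by hand, the request body can be generated by a helper. This is a sketch rather than part of the recipes: `chat_request_body` and `send_chat` are hypothetical names, the endpoint and model name are taken from the example above, and the escaping only handles backslashes and double quotes (not newlines or control characters).

```bash
# Build a chat completions request body for a single user prompt.
# Only backslashes and double quotes in the prompt are escaped.
chat_request_body() {
  local prompt="$1" max_tokens="${2:-100}" escaped
  escaped=$(printf '%s' "$prompt" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}' \
    "$escaped" "$max_tokens"
}

# Send the request to the port-forwarded frontend.
send_chat() {
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(chat_request_body "$@")"
}

# Example: send_chat "Hello!" 100
```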

## Model Details

- **Model**: `deepseek-ai/DeepSeek-R1`
- **Architecture**: 671B parameter Mixture-of-Experts (MoE)
- **Active parameters**: ~37B per token
- **Recommended**: FP8 quantization for production deployments

## Hardware Requirements

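DeepSeek-R1 has 671B parameters, so the weights alone occupy hundreds of gigabytes even when quantized to FP8. Each configuration therefore spreads the model across many GPUs: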

| Configuration | Min GPU Memory | Recommended |
|--------------|----------------|-------------|
| 16x H200 (SGLang TP=8) | 1.1TB per 8-GPU worker | H200 SXM (141GB each) |
| 32x H200 (SGLang TP=16, vLLM) | 2.2TB per 16-GPU worker | H200 SXM (141GB each) |
| 36x GB200 (TRT-LLM) | ~2.5TB total | GB200 NVL72 |
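
The 1.1TB and 2.2TB figures for H200 line up with 8 and 16 GPUs at 141GB each, which are the TP=8 and TP=16 worker sizes. The arithmetic can be sketched as follows. The helper names are illustrative, and the weight estimate is a rough lower bound that assumes one byte per parameter at FP8 and ignores KV cache and activations.

```bash
# Aggregate GPU memory in GB for a group of identical GPUs.
total_gpu_mem_gb() {
  local gpus="$1" gb_per_gpu="$2"
  echo $((gpus * gb_per_gpu))
}

# Approximate weight footprint in GB:
# parameters (in billions) x bits per parameter / 8.
weights_gb() {
  local params_b="$1" bits="$2"
  echo $((params_b * bits / 8))
}

total_gpu_mem_gb 8 141    # 1128 GB, about 1.1TB
total_gpu_mem_gb 16 141   # 2256 GB, about 2.2TB
weights_gb 671 8          # 671 GB of FP8 weights
```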

## Notes

- **Model download time**: DeepSeek-R1 is ~1.3TB; expect 1-2 hours for download
- **NCCL errors**: Often a symptom of GPU out-of-memory (OOM). On SGLang workers, try lowering `--mem-fraction-static` in the worker args
- **Multi-node**: Requires InfiniBand/IBGDA enabled. See [vLLM EP docs](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/)
- **Storage class**: Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
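
For the storage class note, the edit can be scripted instead of done by hand. The sketch below assumes `model-cache/model-cache.yaml` contains a line of the form `storageClassName: <value>`; the function name `set_storage_class` and the class name in the example are hypothetical.

```bash
# List the storage classes available in the cluster first:
#   kubectl get storageclass

# Rewrite every "storageClassName:" line in a manifest, keeping its indentation.
# Assumption: the class name contains no "|" characters.
set_storage_class() {
  local file="$1" class="$2" tmp
  tmp=$(mktemp) || return 1
  sed -E "s|^([[:space:]]*storageClassName:).*|\1 ${class}|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example: set_storage_class model-cache/model-cache.yaml my-storage-class
```

Writing through a temporary file avoids relying on `sed -i`, whose syntax differs between GNU and BSD versions.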

## Backend-Specific Notes

### SGLang
- Uses WideEP (Wide Expert Parallel) for efficient MoE inference
- See [sglang/README.md](sglang/README.md) for SGLang-specific configuration

### TensorRT-LLM
- Requires FP4 quantized checkpoint
- GB200-specific optimizations

### vLLM
- Uses DEP (Data-Expert Parallel) with hybrid load balancing
- See [vllm/disagg/README.md](vllm/disagg/README.md) for detailed setup