Commit 26dc6281, authored by Hongkuan Zhou, committed by GitHub

chore: sglang k8s health/live, update doc (#2272)

parent 6fed066b
...@@ -88,14 +88,14 @@ docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.3.2 ...@@ -88,14 +88,14 @@ docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.3.2
### Aggregated Serving ### Aggregated Serving
```bash ```bash
cd $DYNAMO_ROOT/components/backends/sglang cd $DYNAMO_HOME/components/backends/sglang
./launch/agg.sh ./launch/agg.sh
``` ```
### Aggregated Serving with KV Routing ### Aggregated Serving with KV Routing
```bash ```bash
cd $DYNAMO_ROOT/components/backends/sglang cd $DYNAMO_HOME/components/backends/sglang
./launch/agg_router.sh ./launch/agg_router.sh
``` ```
...@@ -119,7 +119,7 @@ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead ...@@ -119,7 +119,7 @@ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead
> Disaggregated serving in SGLang currently requires each worker to have the same tensor parallel size [unless you are using an MLA based model](https://github.com/sgl-project/sglang/pull/5922) > Disaggregated serving in SGLang currently requires each worker to have the same tensor parallel size [unless you are using an MLA based model](https://github.com/sgl-project/sglang/pull/5922)
```bash ```bash
cd $DYNAMO_ROOT/components/backends/sglang cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg.sh ./launch/disagg.sh
``` ```
...@@ -129,12 +129,32 @@ You can use this configuration to test out disaggregated serving with dp attenti ...@@ -129,12 +129,32 @@ You can use this configuration to test out disaggregated serving with dp attenti
```bash ```bash
# note this will require 4 GPUs # note this will require 4 GPUs
cd $DYNAMO_ROOT/components/backends/sglang cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg_dp_attn.sh ./launch/disagg_dp_attn.sh
``` ```
When using MoE models, you can also use our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets up the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md). When using MoE models, you can also use our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets up the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md).
### Testing the Deployment
Send a test request to verify your deployment:
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [
{
"role": "user",
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": false,
"max_tokens": 30
}'
```
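The same test request can be issued from Python when `curl` is not convenient. The following is a minimal sketch using only the standard library. It assumes the frontend is reachable at `localhost:8000` and serves the model named in the curl example; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of Dynamo.

```python
import json
import urllib.request

# Assumed values, copied from the curl example above; adjust for your deployment.
BASE_URL = "http://localhost:8000"
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"


def build_chat_request(prompt: str, max_tokens: int = 30) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(prompt: str, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return json.load(resp)
```

With a deployment running, `send_chat_request("Explain why Roger Federer is considered one of the greatest tennis players of all time")` should return the parsed completion response.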
## Request Migration ## Request Migration
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker: You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
......
...@@ -21,7 +21,7 @@ spec: ...@@ -21,7 +21,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
- "exit 0" - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
initialDelaySeconds: 60 initialDelaySeconds: 60
periodSeconds: 60 periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
...@@ -31,11 +31,11 @@ spec: ...@@ -31,11 +31,11 @@ spec:
replicas: 1 replicas: 1
resources: resources:
requests: requests:
cpu: "5" cpu: "10"
memory: "10Gi" memory: "10Gi"
limits: limits:
cpu: "5" cpu: "32"
memory: "10Gi" memory: "40Gi"
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: my-registry/sglang-runtime:my-tag
...@@ -46,24 +46,20 @@ spec: ...@@ -46,24 +46,20 @@ spec:
SGLangDecodeWorker: SGLangDecodeWorker:
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
livenessProbe: livenessProbe:
exec: httpGet:
command: path: /live
- /bin/sh port: 9090
- -c periodSeconds: 5
- "exit 0"
periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
failureThreshold: 10 failureThreshold: 1
readinessProbe: readinessProbe:
exec: exec:
command: httpGet:
- /bin/sh path: /health
- -c port: 9090
- "exit 0" periodSeconds: 10
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
failureThreshold: 10 failureThreshold: 60
dynamoNamespace: sglang-agg dynamoNamespace: sglang-agg
componentType: worker componentType: worker
replicas: 1 replicas: 1
...@@ -73,11 +69,24 @@ spec: ...@@ -73,11 +69,24 @@ spec:
memory: "20Gi" memory: "20Gi"
gpu: "1" gpu: "1"
limits: limits:
cpu: "10" cpu: "32"
memory: "20Gi" memory: "80Gi"
gpu: "1" gpu: "1"
envs:
- name: DYN_SYSTEM_ENABLED
value: "true"
- name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
value: "[\"generate\"]"
- name: DYN_SYSTEM_PORT
value: "9090"
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
startupProbe:
httpGet:
path: /live
port: 9090
periodSeconds: 10
failureThreshold: 60
image: my-registry/sglang-runtime:my-tag image: my-registry/sglang-runtime:my-tag
workingDir: /workspace/components/backends/sglang workingDir: /workspace/components/backends/sglang
args: args:
......
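The new frontend probe command pipes the `/health` response through `jq -e ".status == \"healthy\""`, so the probe succeeds only when the filter evaluates to true. A rough Python equivalent of that pass/fail decision is sketched below. The response shape (a JSON object with a `status` field) is inferred from the jq filter rather than from the Dynamo source, and `health_exit_code` is a hypothetical helper name.

```python
import json


def health_exit_code(body: str) -> int:
    """Return 0 if the JSON body reports status "healthy", otherwise 1."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return 1  # unparseable output counts as a failed probe
    if isinstance(doc, dict) and doc.get("status") == "healthy":
        return 0
    return 1
```

An exec probe treats exit code 0 as success and any other exit code as failure, which is why the previous `exit 0` command could never fail.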
...@@ -21,7 +21,7 @@ spec: ...@@ -21,7 +21,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
- "exit 0" - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
initialDelaySeconds: 60 initialDelaySeconds: 60
periodSeconds: 60 periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
...@@ -31,11 +31,11 @@ spec: ...@@ -31,11 +31,11 @@ spec:
replicas: 1 replicas: 1
resources: resources:
requests: requests:
cpu: "5" cpu: "10"
memory: "10Gi" memory: "10Gi"
limits: limits:
cpu: "5" cpu: "32"
memory: "10Gi" memory: "40Gi"
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: my-registry/sglang-runtime:my-tag
...@@ -46,24 +46,20 @@ spec: ...@@ -46,24 +46,20 @@ spec:
SGLangDecodeWorker: SGLangDecodeWorker:
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
livenessProbe: livenessProbe:
exec: httpGet:
command: path: /live
- /bin/sh port: 9090
- -c periodSeconds: 5
- "exit 0"
periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
failureThreshold: 10 failureThreshold: 1
readinessProbe: readinessProbe:
exec: exec:
command: httpGet:
- /bin/sh path: /health
- -c port: 9090
- "exit 0" periodSeconds: 10
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
failureThreshold: 10 failureThreshold: 60
dynamoNamespace: sglang-agg-router dynamoNamespace: sglang-agg-router
componentType: worker componentType: worker
replicas: 1 replicas: 1
...@@ -73,11 +69,24 @@ spec: ...@@ -73,11 +69,24 @@ spec:
memory: "20Gi" memory: "20Gi"
gpu: "1" gpu: "1"
limits: limits:
cpu: "10" cpu: "32"
memory: "20Gi" memory: "80Gi"
gpu: "1" gpu: "1"
envs:
- name: DYN_SYSTEM_ENABLED
value: "true"
- name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
value: "[\"generate\"]"
- name: DYN_SYSTEM_PORT
value: "9090"
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
startupProbe:
httpGet:
path: /live
port: 9090
periodSeconds: 10
failureThreshold: 60
image: my-registry/sglang-runtime:my-tag image: my-registry/sglang-runtime:my-tag
workingDir: /workspace/components/backends/sglang workingDir: /workspace/components/backends/sglang
args: args:
......
...@@ -21,7 +21,7 @@ spec: ...@@ -21,7 +21,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
- "exit 0" - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
initialDelaySeconds: 60 initialDelaySeconds: 60
periodSeconds: 60 periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
...@@ -31,14 +31,14 @@ spec: ...@@ -31,14 +31,14 @@ spec:
replicas: 1 replicas: 1
resources: resources:
requests: requests:
cpu: "5" cpu: "10"
memory: "10Gi" memory: "10Gi"
limits: limits:
cpu: "5" cpu: "32"
memory: "10Gi" memory: "40Gi"
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag image: nvcr.io/nvidian/nim-llm-dev/sglang-runtime:hzhou-0804
workingDir: /workspace/components/backends/sglang workingDir: /workspace/components/backends/sglang
command: ["sh", "-c"] command: ["sh", "-c"]
args: args:
...@@ -46,24 +46,20 @@ spec: ...@@ -46,24 +46,20 @@ spec:
SGLangDecodeWorker: SGLangDecodeWorker:
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
livenessProbe: livenessProbe:
exec: httpGet:
command: path: /live
- /bin/sh port: 9090
- -c periodSeconds: 5
- "exit 0"
periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
failureThreshold: 10 failureThreshold: 1
readinessProbe: readinessProbe:
exec: exec:
command: httpGet:
- /bin/sh path: /health
- -c port: 9090
- "exit 0" periodSeconds: 10
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
failureThreshold: 10 failureThreshold: 60
dynamoNamespace: sglang-disagg dynamoNamespace: sglang-disagg
componentType: worker componentType: worker
replicas: 1 replicas: 1
...@@ -73,12 +69,25 @@ spec: ...@@ -73,12 +69,25 @@ spec:
memory: "20Gi" memory: "20Gi"
gpu: "1" gpu: "1"
limits: limits:
cpu: "10" cpu: "32"
memory: "20Gi" memory: "80Gi"
gpu: "1" gpu: "1"
envs:
- name: DYN_SYSTEM_ENABLED
value: "true"
- name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
value: "[\"generate\"]"
- name: DYN_SYSTEM_PORT
value: "9090"
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag startupProbe:
httpGet:
path: /live
port: 9090
periodSeconds: 10
failureThreshold: 60
image: nvcr.io/nvidian/nim-llm-dev/sglang-runtime:hzhou-0804
workingDir: /workspace/components/backends/sglang workingDir: /workspace/components/backends/sglang
args: args:
- "python3" - "python3"
...@@ -101,24 +110,20 @@ spec: ...@@ -101,24 +110,20 @@ spec:
SGLangPrefillWorker: SGLangPrefillWorker:
envFromSecret: hf-token-secret envFromSecret: hf-token-secret
livenessProbe: livenessProbe:
exec: httpGet:
command: path: /live
- /bin/sh port: 9090
- -c periodSeconds: 5
- "exit 0"
periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
failureThreshold: 10 failureThreshold: 1
readinessProbe: readinessProbe:
exec: exec:
command: httpGet:
- /bin/sh path: /health
- -c port: 9090
- "exit 0" periodSeconds: 10
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 30 timeoutSeconds: 30
failureThreshold: 10 failureThreshold: 60
dynamoNamespace: sglang-disagg dynamoNamespace: sglang-disagg
componentType: worker componentType: worker
replicas: 1 replicas: 1
...@@ -128,12 +133,25 @@ spec: ...@@ -128,12 +133,25 @@ spec:
memory: "20Gi" memory: "20Gi"
gpu: "1" gpu: "1"
limits: limits:
cpu: "10" cpu: "32"
memory: "20Gi" memory: "80Gi"
gpu: "1" gpu: "1"
envs:
- name: DYN_SYSTEM_ENABLED
value: "true"
- name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
value: "[\"generate\"]"
- name: DYN_SYSTEM_PORT
value: "9090"
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: my-registry/sglang-runtime:my-tag startupProbe:
httpGet:
path: /health
port: 9090
periodSeconds: 10
failureThreshold: 60
image: nvcr.io/nvidian/nim-llm-dev/sglang-runtime:hzhou-0804
workingDir: /workspace/components/backends/sglang workingDir: /workspace/components/backends/sglang
args: args:
- "python3" - "python3"
......
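The worker probe settings in these manifests trade startup tolerance against restart speed. As a rough guide, Kubernetes gives up on a probe after `failureThreshold` consecutive failures spaced `periodSeconds` apart, so the product of the two approximates the time budget. The sketch below computes that budget for the values in the diff. It ignores `timeoutSeconds`, and the `Probe` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    period_seconds: int
    failure_threshold: int
    initial_delay_seconds: int = 0

    def failure_budget_seconds(self) -> int:
        """Approximate time before the probe is considered failed."""
        return self.initial_delay_seconds + self.period_seconds * self.failure_threshold


# Values taken from the new side of the worker manifests above.
startup = Probe(period_seconds=10, failure_threshold=60)
liveness = Probe(period_seconds=5, failure_threshold=1)
readiness = Probe(period_seconds=10, failure_threshold=60)

for name, probe in [("startup", startup), ("liveness", liveness), ("readiness", readiness)]:
    print(f"{name}: ~{probe.failure_budget_seconds()}s")
```

Under these assumptions a worker has roughly ten minutes to pass its startup probe, while a single failed liveness check after startup is enough to trigger a restart.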