Unverified commit 6e568d45, authored by kYLe, committed by GitHub

docs: Modify AI configurator command in README (#5820)


Signed-off-by: kYLe <kylhuang@nvidia.com>
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
parent df4cd191
@@ -37,7 +37,7 @@ kubectl apply -f agg_router.yaml --namespace ${NAMESPACE}
 4. Testing the deployment and run benchmarks
 After deployment, forward the frontend service to access the API:
 ```sh
-kubectl port-forward deployment/vllm-agg-router-frontend 8000:8000 -n ${NAMESPACE}
+kubectl port-forward svc/vllm-agg-router-frontend 8000:8000 -n ${NAMESPACE}
 ```
 and use following request to test the deployed model
 ```sh
@@ -65,9 +65,9 @@ pip3 install aiconfigurator
 ```
 2. Assume we have 2 GPU nodes with 16 H200 in total, and we want to deploy Llama 3.1-70B-Instruct model with an optimal disaggregated serving configuration. Run AI configurator for this model
 ```sh
-aiconfigurator cli --model LLAMA3.1_70B --total_gpus 16 --system h200_sxm
+aiconfigurator cli default --model LLAMA3.1_70B --total_gpus 16 --system h200_sxm
 ```
-and from the output, you can see the Pareto curve with suggest P/D settings
+and from the output, you can see the Pareto curve with the suggested P/D settings
 ![text](images/pareto.png)
 3. Start the serving with 1 prefill worker with tensor parallelism 4 and 1 decoding worker with tensor parallelism 8 as AI Configurator suggested. Update the `my-tag` in `disagg_router.yaml` with the latest Dynamo version and your local cache folder path and run following command.
 ![text](images/settings.png)
......
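
Note on the layout in step 3: a quick sanity check that the suggested prefill/decode split fits the 16-GPU budget from step 2 can be sketched in shell. This is only an illustration, not part of the README or of AI Configurator. The variable names are hypothetical, and it assumes each worker occupies exactly as many GPUs as its tensor-parallel degree (no additional pipeline or data parallelism).

```sh
# Hypothetical variables; the values are taken from steps 2 and 3 of the README.
total_gpus=16       # 2 nodes, 16 H200 in total
prefill_workers=1
prefill_tp=4        # tensor parallelism per prefill worker
decode_workers=1
decode_tp=8         # tensor parallelism per decode worker

# Assumption: GPUs per worker equals its tensor-parallel degree.
used_gpus=$(( prefill_workers * prefill_tp + decode_workers * decode_tp ))
free_gpus=$(( total_gpus - used_gpus ))

if [ "${used_gpus}" -le "${total_gpus}" ]; then
  fits=yes
else
  fits=no
fi

echo "GPUs used: ${used_gpus} of ${total_gpus} (free: ${free_gpus}, fits: ${fits})"
```

Under that assumption the layout uses 12 of the 16 GPUs and leaves 4 idle. Adjust the worker counts and parallelism degrees to whatever the tool reports for your own setup.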