1. `submit_disagg.sh` - Main entry point for submitting benchmark jobs for disaggregated configurations. This includes WideEP optimization for DEP>=16.
2. `submit_agg.sh` - Main entry point for submitting benchmark jobs for aggregated configurations.
3. `post_process.py` - Scan the genai-perf results to produce a json with entries to each config point.
3. `post_process.py` - Scan the aiperf results to produce a json with entries to each config point.
4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
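The "pareto line" mentioned for `plot_performance_comparison.py` can be illustrated with a minimal sketch. This is not the script's actual code: the function name `pareto_frontier` and the assumption that each config point reduces to an `(x, y)` pair where larger is better on both axes (for example, per-user and per-GPU throughput) are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def pareto_frontier(points: Iterable[Point]) -> List[Point]:
    """Return the points that no other point beats on both axes.

    Assumes larger values are better on both x and y (an assumption
    made for this sketch, not taken from the plotting script).
    """
    # Visit points from largest x to smallest; ties are broken by larger y first.
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    frontier: List[Point] = []
    best_y = float("-inf")
    for x, y in ordered:
        # A point is kept only if it improves on the best y seen at any larger-or-equal x.
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    # Return in ascending x order, which is convenient for drawing a line.
    return sorted(frontier)


if __name__ == "__main__":
    sample = [(10, 100), (20, 90), (15, 80), (30, 50), (25, 40)]
    print(pareto_frontier(sample))
```

Points such as `(15, 80)` are dropped because another configuration, `(20, 90)`, is at least as good on both axes.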
For finer-grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer to [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide shares similar assumptions with the multinode examples guide.
The above jobs use genAI-perf tool to benchmark each configuration point across different concurrency values. These get stored in `dynamo_disagg-bm-8150-1024/<config-setup>/genai_perf_artifacts` and `dynamo_agg-bm-8150-1024/<config-setup>/genai_perf_artifacts` for disaggregated and aggregated respectively.
The above jobs use aiperf tool to benchmark each configuration point across different concurrency values. These get stored in `dynamo_disagg-bm-8150-1024/<config-setup>/aiperf_artifacts` and `dynamo_agg-bm-8150-1024/<config-setup>/aiperf_artifacts` for disaggregated and aggregated respectively.
After your benchmarking jobs have completed, you can use the `post_process.py` script to aggregate and summarize the results from the generated genai_perf_artifacts.
After your benchmarking jobs have completed, you can use the `post_process.py` script to aggregate and summarize the results from the generated aiperf_artifacts.
To run the post-processing script, use:
...
@@ -149,6 +149,6 @@ Refer to [Beyond the Buzz: A Pragmatic Take on Inference Disaggregation](https:/
## Known Issues
- Some jobs may time out if genai-perf requires more time to complete all concurrency levels.
- Some jobs may time out if aiperf requires more time to complete all concurrency levels.
- Workers may encounter out-of-memory (OOM) errors during inference, especially with larger configurations.
- Configurations affected by these issues will result in missing data points on the performance plot.
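As a first pass at spotting configurations affected by these issues before plotting, the sketch below lists config directories whose artifacts directory is absent or empty. It assumes the `<sweep-root>/<config-setup>/aiperf_artifacts` layout described above; the function name `configs_missing_artifacts` is hypothetical, and a job that timed out partway may still leave partial files that this check does not detect.

```python
from pathlib import Path
from typing import List


def configs_missing_artifacts(
    sweep_root: Path, artifacts_dirname: str = "aiperf_artifacts"
) -> List[str]:
    """Return names of config directories with no files under their artifacts directory."""
    missing: List[str] = []
    for config_dir in sorted(p for p in sweep_root.iterdir() if p.is_dir()):
        artifacts = config_dir / artifacts_dirname
        # Treat a missing directory and a directory without any files the same way.
        has_files = artifacts.is_dir() and any(
            f.is_file() for f in artifacts.rglob("*")
        )
        if not has_files:
            missing.append(config_dir.name)
    return missing


if __name__ == "__main__":
    # Example sweep root taken from the paths mentioned in this guide.
    root = Path("dynamo_disagg-bm-8150-1024")
    if root.is_dir():
        for name in configs_missing_artifacts(root):
            print(f"no artifacts found for config: {name}")
```

For sweeps produced with the older tool, passing `artifacts_dirname="genai_perf_artifacts"` would apply the same check to the previous directory name.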
The Dynamo container includes [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html), NVIDIA's tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment.
The Dynamo container includes [AIPerf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/aiperf/README.html), NVIDIA's tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment.
**Run the following benchmark from inside the container** (after completing the deployment steps above):
...
@@ -413,7 +413,7 @@ The Dynamo container includes [GenAI-Perf](https://docs.nvidia.com/deeplearning/
mkdir -p /tmp/benchmark-results
# Run the benchmark - this command tests the deployment with high-concurrency synthetic workload
genai-perf profile \
aiperf profile \
--model openai/gpt-oss-120b \
--tokenizer /model \
--endpoint-type chat \
...
@@ -434,9 +434,7 @@ genai-perf profile \
--num-dataset-entries 8000 \
--random-seed 100 \
--artifact-dir /tmp/benchmark-results \
-- \
-v \
--max-threads 500 \
-H 'Authorization: Bearer NOT USED' \
-H 'Accept: text/event-stream'
```
...
@@ -457,13 +455,13 @@ Key parameters you can adjust:
- `--output-tokens-mean`: Average output length (tests decode throughput)
- `--request-count`: Total number of requests for the benchmark
### Installing GenAI-Perf Outside the Container
### Installing AIPerf Outside the Container
If you prefer to run benchmarks from outside the container:
```bash
# Install GenAI-Perf
# Install AIPerf
pip install genai-perf
pip install aiperf
# Then run the same benchmark command, adjusting the tokenizer path if needed
```
...
@@ -520,4 +518,4 @@ flowchart TD
- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../../examples/basics/multinode/README.md)
- **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
- **Monitoring**: Set up Prometheus and Grafana for production monitoring
- **Performance Benchmarking**: Use GenAI-Perf to measure and optimize your deployment performance
- **Performance Benchmarking**: Use AIPerf to measure and optimize your deployment performance