Commit a7badb85 authored by Harshini Komali, committed by GitHub

feat: Replace genai-perf with aiperf in components/backends (#3528)


Signed-off-by: lkomali <lkomali@nvidia.com>
parent 13fc3c65
...@@ -30,14 +30,14 @@ set -e ...@@ -30,14 +30,14 @@ set -e
warmup_model $head_node $head_port $SERVED_MODEL_NAME $MODEL_PATH "${chosen_isl}x${chosen_osl}x10000x10000x250" warmup_model $head_node $head_port $SERVED_MODEL_NAME $MODEL_PATH "${chosen_isl}x${chosen_osl}x10000x10000x250"
set +e set +e
genai_perf_warmup_workers=$(python3 -c "print(max(${DP:-0}, ${prefill_workers:-0}, ${decode_workers:-0}))") aiperf_warmup_workers=$(python3 -c "print(max(${DP:-0}, ${prefill_workers:-0}, ${decode_workers:-0}))")
IFS='x' read -r -a concurrency_list <<< "$chosen_concurrencies" IFS='x' read -r -a concurrency_list <<< "$chosen_concurrencies"
profile_folder="/logs/gap_isl_${chosen_isl}_osl_${chosen_osl}" profile_folder="/logs/gap_isl_${chosen_isl}_osl_${chosen_osl}"
mkdir -p $profile_folder mkdir -p $profile_folder
tmp_work_dir=$(mktemp -d -t genai-perf-XXXXXXXX) tmp_work_dir=$(mktemp -d -t aiperf-XXXXXXXX)
for concurrency in ${concurrency_list[@]}; do for concurrency in ${concurrency_list[@]}; do
export_folder="${tmp_work_dir}/concurrency_${concurrency}" export_folder="${tmp_work_dir}/concurrency_${concurrency}"
mkdir -p $export_folder mkdir -p $export_folder
...@@ -46,7 +46,7 @@ for concurrency in ${concurrency_list[@]}; do ...@@ -46,7 +46,7 @@ for concurrency in ${concurrency_list[@]}; do
echo "Run benchmark for concurrency $concurrency; ISL $chosen_isl; OSL $chosen_osl" echo "Run benchmark for concurrency $concurrency; ISL $chosen_isl; OSL $chosen_osl"
command=( command=(
genai-perf profile aiperf profile
-m ${SERVED_MODEL_NAME} -m ${SERVED_MODEL_NAME}
--tokenizer ${MODEL_PATH} --tokenizer ${MODEL_PATH}
--endpoint-type chat --endpoint-type chat
...@@ -55,7 +55,7 @@ for concurrency in ${concurrency_list[@]}; do ...@@ -55,7 +55,7 @@ for concurrency in ${concurrency_list[@]}; do
--streaming --streaming
--concurrency ${concurrency} --concurrency ${concurrency}
--warmup-request-count $(( 2*genai_perf_warmup_workers )) --warmup-request-count $(( 2*aiperf_warmup_workers ))
--request-count $(( 5*concurrency )) --request-count $(( 5*concurrency ))
--synthetic-input-tokens-mean ${chosen_isl} --synthetic-input-tokens-stddev 0 --synthetic-input-tokens-mean ${chosen_isl} --synthetic-input-tokens-stddev 0
...@@ -69,13 +69,11 @@ for concurrency in ${concurrency_list[@]}; do ...@@ -69,13 +69,11 @@ for concurrency in ${concurrency_list[@]}; do
--tokenizer-trust-remote-code --tokenizer-trust-remote-code
--num-dataset-entries 3000 --num-dataset-entries 3000
--
--max-threads ${concurrency}
) )
set -e set -e
${command[@]} ${command[@]}
set +e set +e
cp $export_folder/*/*_genai_perf.json $profile_folder cp $export_folder/*/*_aiperf.json $profile_folder
done done
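The benchmark loop in the hunks above derives its per-run request counts from a few shell variables. A minimal Python sketch of that arithmetic follows, assuming the same conventions as the script: an `x`-separated concurrency string, warmup requests set to twice the largest worker count, and five requests per concurrency slot. The function names `parse_concurrencies` and `plan_runs` are hypothetical and are not part of the repository.

```python
from typing import Dict, List


def parse_concurrencies(chosen_concurrencies: str) -> List[int]:
    """Split an 'x'-separated string such as '1x8x64' into integers."""
    return [int(part) for part in chosen_concurrencies.split("x") if part]


def plan_runs(
    chosen_concurrencies: str,
    dp: int = 0,
    prefill_workers: int = 0,
    decode_workers: int = 0,
) -> List[Dict[str, int]]:
    """Mirror the shell arithmetic from the script.

    warmup_request_count = 2 * max(DP, prefill_workers, decode_workers)
    request_count        = 5 * concurrency
    """
    warmup_workers = max(dp, prefill_workers, decode_workers)
    return [
        {
            "concurrency": concurrency,
            "warmup_request_count": 2 * warmup_workers,
            "request_count": 5 * concurrency,
        }
        for concurrency in parse_concurrencies(chosen_concurrencies)
    ]


if __name__ == "__main__":
    # Illustrative values only; the real script reads these from the environment.
    for run in plan_runs("1x8x64", dp=4, prefill_workers=2, decode_workers=8):
        print(run)
```

Computing the plan up front like this makes it easy to check how many requests a sweep will issue before submitting a job, which may help when estimating whether a job fits its time limit.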
...@@ -271,7 +271,7 @@ args: ...@@ -271,7 +271,7 @@ args:
## Benchmarking ## Benchmarking
To benchmark your deployment with GenAI-Perf, see this utility script: [perf.sh](../../../../benchmarks/llm/perf.sh) To benchmark your deployment with AIPerf, see this utility script: [perf.sh](../../../../benchmarks/llm/perf.sh)
Configure the `model` name and `host` based on your deployment. Configure the `model` name and `host` based on your deployment.
......
...@@ -38,7 +38,7 @@ Please note that: ...@@ -38,7 +38,7 @@ Please note that:
1. `submit_disagg.sh` - Main entry point for submitting benchmark jobs for disaggregated configurations. This includes WideEP optimization for DEP>=16. 1. `submit_disagg.sh` - Main entry point for submitting benchmark jobs for disaggregated configurations. This includes WideEP optimization for DEP>=16.
2. `submit_agg.sh` - Main entry point for submitting benchmark jobs for aggregated configurations. 2. `submit_agg.sh` - Main entry point for submitting benchmark jobs for aggregated configurations.
3. `post_process.py` - Scan the genai-perf results to produce a json with entries to each config point. 3. `post_process.py` - Scan the aiperf results to produce a json with entries to each config point.
4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization. 4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide. For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
...@@ -117,9 +117,9 @@ export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4" ...@@ -117,9 +117,9 @@ export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
## Post-Processing Results ## Post-Processing Results
The above jobs use genAI-perf tool to benchmark each configuration point across different concurrency values. These get stored in `dynamo_disagg-bm-8150-1024/<config-setup>/genai_perf_artifacts` and `dynamo_agg-bm-8150-1024/<config-setup>/genai_perf_artifacts` for disaggregated and aggregated respectively. The above jobs use aiperf tool to benchmark each configuration point across different concurrency values. These get stored in `dynamo_disagg-bm-8150-1024/<config-setup>/aiperf_artifacts` and `dynamo_agg-bm-8150-1024/<config-setup>/aiperf_artifacts` for disaggregated and aggregated respectively.
After your benchmarking jobs have completed, you can use the `post_process.py` script to aggregate and summarize the results from the generated genai_perf_artifacts. After your benchmarking jobs have completed, you can use the `post_process.py` script to aggregate and summarize the results from the generated aiperf_artifacts.
To run the post-processing script, use: To run the post-processing script, use:
...@@ -149,6 +149,6 @@ Refer to [Beyond the Buzz: A Pragmatic Take on Inference Disaggregation](https:/ ...@@ -149,6 +149,6 @@ Refer to [Beyond the Buzz: A Pragmatic Take on Inference Disaggregation](https:/
## Known Issues ## Known Issues
- Some jobs may time out if genai-perf requires more time to complete all concurrency levels. - Some jobs may time out if aiperf requires more time to complete all concurrency levels.
- Workers may encounter out-of-memory (OOM) errors during inference, especially with larger configurations. - Workers may encounter out-of-memory (OOM) errors during inference, especially with larger configurations.
- Configurations affected by these issues will result in missing data points on the performance plot. - Configurations affected by these issues will result in missing data points on the performance plot.
...@@ -40,7 +40,7 @@ if [ "${enable_attention_dp}" = "false" ]; then ...@@ -40,7 +40,7 @@ if [ "${enable_attention_dp}" = "false" ]; then
fi fi
full_logdir=${sub_dir} full_logdir=${sub_dir}
artifacts_dir=${full_logdir}/genai_perf_artifacts artifacts_dir=${full_logdir}/aiperf_artifacts
mkdir -p ${artifacts_dir} mkdir -p ${artifacts_dir}
......
...@@ -124,7 +124,7 @@ def extract_throughput_data(csv_path: str) -> Tuple[Optional[float], Optional[fl ...@@ -124,7 +124,7 @@ def extract_throughput_data(csv_path: str) -> Tuple[Optional[float], Optional[fl
Extract throughput data from CSV file Extract throughput data from CSV file
Args: Args:
csv_path: Path to profile_export_genai_perf.csv csv_path: Path to profile_export_aiperf.csv
Returns: Returns:
Tuple of (output_token_throughput, output_token_throughput_per_user) Tuple of (output_token_throughput, output_token_throughput_per_user)
...@@ -184,10 +184,10 @@ def process_directory(dir_path: str) -> Optional[List[Dict[str, Any]]]: ...@@ -184,10 +184,10 @@ def process_directory(dir_path: str) -> Optional[List[Dict[str, Any]]]:
Dictionary containing extracted data, or None if processing failed Dictionary containing extracted data, or None if processing failed
""" """
dir_path_obj = Path(dir_path) dir_path_obj = Path(dir_path)
artifacts_path = dir_path_obj / "genai_perf_artifacts" artifacts_path = dir_path_obj / "aiperf_artifacts"
if not artifacts_path.exists(): if not artifacts_path.exists():
print(f"Warning: No genai_perf_artifacts directory found in {dir_path}") print(f"Warning: No aiperf_artifacts directory found in {dir_path}")
return None return None
# Parse deployment configuration # Parse deployment configuration
...@@ -205,7 +205,7 @@ def process_directory(dir_path: str) -> Optional[List[Dict[str, Any]]]: ...@@ -205,7 +205,7 @@ def process_directory(dir_path: str) -> Optional[List[Dict[str, Any]]]:
csv_files = [] csv_files = []
for item in artifacts_path.iterdir(): for item in artifacts_path.iterdir():
if item.is_dir(): if item.is_dir():
csv_path = item / "profile_export_genai_perf.csv" csv_path = item / "profile_export_aiperf.csv"
if csv_path.exists(): if csv_path.exists():
csv_files.append(str(csv_path)) csv_files.append(str(csv_path))
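The directory scan in `process_directory` can be sketched on its own as follows. This is a minimal illustration rather than the repository's implementation: `find_profile_csvs` is a hypothetical helper, and the sorting is added here only to make the result deterministic. The directory and file names (`aiperf_artifacts`, `profile_export_aiperf.csv`) are taken from the diff.

```python
from pathlib import Path
from typing import List, Optional


def find_profile_csvs(dir_path: str) -> Optional[List[str]]:
    """Return the per-run CSV paths under <dir_path>/aiperf_artifacts.

    Returns None when the artifacts directory is missing, matching the
    early-return behavior shown in the diff.
    """
    artifacts_path = Path(dir_path) / "aiperf_artifacts"
    if not artifacts_path.exists():
        print(f"Warning: No aiperf_artifacts directory found in {dir_path}")
        return None

    csv_files: List[str] = []
    # Each run is expected to live in its own subdirectory of the artifacts folder.
    for item in sorted(artifacts_path.iterdir()):
        if item.is_dir():
            csv_path = item / "profile_export_aiperf.csv"
            if csv_path.exists():
                csv_files.append(str(csv_path))
    return csv_files
```

Note that results produced before this change sit under `genai_perf_artifacts` with `profile_export_genai_perf.csv` files, so a scan written this way would not pick them up.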
......
...@@ -54,8 +54,8 @@ set -x ...@@ -54,8 +54,8 @@ set -x
config_file=${log_path}/config.yaml config_file=${log_path}/config.yaml
# install genai-perf # install aiperf
pip install genai-perf pip install aiperf
# Create artifacts root directory if it doesn't exist # Create artifacts root directory if it doesn't exist
if [ ! -d "${artifacts_dir}" ]; then if [ ! -d "${artifacts_dir}" ]; then
...@@ -153,7 +153,7 @@ for concurrency in ${concurrency_list}; do ...@@ -153,7 +153,7 @@ for concurrency in ${concurrency_list}; do
num_prompts=$((concurrency * multi_round)) num_prompts=$((concurrency * multi_round))
echo "Benchmarking with concurrency ${concurrency} ... ${num_prompts} prompts" echo "Benchmarking with concurrency ${concurrency} ... ${num_prompts} prompts"
mkdir -p ${log_path}/concurrency_${concurrency} mkdir -p ${log_path}/concurrency_${concurrency}
genai-perf profile \ aiperf profile \
--model ${model} \ --model ${model} \
--tokenizer ${model_path} \ --tokenizer ${model_path} \
--endpoint-type chat \ --endpoint-type chat \
...@@ -174,9 +174,7 @@ for concurrency in ${concurrency_list}; do ...@@ -174,9 +174,7 @@ for concurrency in ${concurrency_list}; do
--num-dataset-entries ${num_prompts} \ --num-dataset-entries ${num_prompts} \
--random-seed 100 \ --random-seed 100 \
--artifact-dir ${artifacts_dir} \ --artifact-dir ${artifacts_dir} \
-- \
-v \ -v \
--max-threads ${concurrency} \
-H 'Authorization: Bearer NOT USED' \ -H 'Authorization: Bearer NOT USED' \
-H 'Accept: text/event-stream' -H 'Accept: text/event-stream'
echo "Benchmark with concurrency ${concurrency} done" echo "Benchmark with concurrency ${concurrency} done"
......
...@@ -196,7 +196,7 @@ NOTE: To send a request to a multi-node deployment, target the node which is run ...@@ -196,7 +196,7 @@ NOTE: To send a request to a multi-node deployment, target the node which is run
### Benchmarking ### Benchmarking
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh) `model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
...@@ -236,7 +236,7 @@ NOTE: To send a request to a multi-node deployment, target the node which is run ...@@ -236,7 +236,7 @@ NOTE: To send a request to a multi-node deployment, target the node which is run
## Benchmarking ## Benchmarking
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh) `model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
## Multimodal support ## Multimodal support
......
...@@ -402,9 +402,9 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" ...@@ -402,9 +402,9 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
``` ```
## Benchmarking ## Benchmarking
### Performance Testing with GenAI-Perf ### Performance Testing with AIPerf
The Dynamo container includes [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html), NVIDIA's tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment. The Dynamo container includes [AIPerf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/aiperf/README.html), NVIDIA's tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment.
**Run the following benchmark from inside the container** (after completing the deployment steps above): **Run the following benchmark from inside the container** (after completing the deployment steps above):
...@@ -413,7 +413,7 @@ The Dynamo container includes [GenAI-Perf](https://docs.nvidia.com/deeplearning/ ...@@ -413,7 +413,7 @@ The Dynamo container includes [GenAI-Perf](https://docs.nvidia.com/deeplearning/
mkdir -p /tmp/benchmark-results mkdir -p /tmp/benchmark-results
# Run the benchmark - this command tests the deployment with high-concurrency synthetic workload # Run the benchmark - this command tests the deployment with high-concurrency synthetic workload
genai-perf profile \ aiperf profile \
--model openai/gpt-oss-120b \ --model openai/gpt-oss-120b \
--tokenizer /model \ --tokenizer /model \
--endpoint-type chat \ --endpoint-type chat \
...@@ -434,9 +434,7 @@ genai-perf profile \ ...@@ -434,9 +434,7 @@ genai-perf profile \
--num-dataset-entries 8000 \ --num-dataset-entries 8000 \
--random-seed 100 \ --random-seed 100 \
--artifact-dir /tmp/benchmark-results \ --artifact-dir /tmp/benchmark-results \
-- \
-v \ -v \
--max-threads 500 \
-H 'Authorization: Bearer NOT USED' \ -H 'Authorization: Bearer NOT USED' \
-H 'Accept: text/event-stream' -H 'Accept: text/event-stream'
``` ```
...@@ -457,13 +455,13 @@ Key parameters you can adjust: ...@@ -457,13 +455,13 @@ Key parameters you can adjust:
- `--output-tokens-mean`: Average output length (tests decode throughput) - `--output-tokens-mean`: Average output length (tests decode throughput)
- `--request-count`: Total number of requests for the benchmark - `--request-count`: Total number of requests for the benchmark
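To experiment with the parameters listed above, it can help to assemble the command programmatically. The sketch below only builds an argument list and prints it; it does not run anything. `build_aiperf_command` is a hypothetical helper, and the flags are limited to ones that appear in this diff; they have not been checked here against the AIPerf CLI reference.

```python
import shlex
from typing import List


def build_aiperf_command(
    model: str,
    tokenizer: str,
    concurrency: int,
    request_count: int,
    output_tokens_mean: int,
    artifact_dir: str,
) -> List[str]:
    """Assemble an `aiperf profile` argument list from the tunable parameters."""
    return [
        "aiperf", "profile",
        "--model", model,
        "--tokenizer", tokenizer,
        "--endpoint-type", "chat",
        "--streaming",
        "--concurrency", str(concurrency),
        "--request-count", str(request_count),
        "--output-tokens-mean", str(output_tokens_mean),
        "--artifact-dir", artifact_dir,
    ]


if __name__ == "__main__":
    # Illustrative values only; adjust them to match your deployment.
    cmd = build_aiperf_command(
        model="openai/gpt-oss-120b",
        tokenizer="/model",
        concurrency=8,
        request_count=40,
        output_tokens_mean=256,
        artifact_dir="/tmp/benchmark-results",
    )
    print(shlex.join(cmd))
```

The sketch deliberately omits the `--` separator and `--max-threads`, since this commit removes both from the benchmark commands.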
### Installing GenAI-Perf Outside the Container ### Installing AIPerf Outside the Container
If you prefer to run benchmarks from outside the container: If you prefer to run benchmarks from outside the container:
```bash ```bash
# Install GenAI-Perf # Install AIPerf
pip install genai-perf pip install aiperf
# Then run the same benchmark command, adjusting the tokenizer path if needed # Then run the same benchmark command, adjusting the tokenizer path if needed
``` ```
...@@ -520,4 +518,4 @@ flowchart TD ...@@ -520,4 +518,4 @@ flowchart TD
- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../../examples/basics/multinode/README.md) - **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../../examples/basics/multinode/README.md)
- **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization - **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
- **Monitoring**: Set up Prometheus and Grafana for production monitoring - **Monitoring**: Set up Prometheus and Grafana for production monitoring
- **Performance Benchmarking**: Use GenAI-Perf to measure and optimize your deployment performance - **Performance Benchmarking**: Use AIPerf to measure and optimize your deployment performance