Prometheus:# NOTE: this is set on Prometheus to ensure a service is created for the Prometheus component. This is a workaround and should be managed differently.
**User-friendly error messages**: If you forget the `/data/` prefix, the script will show a helpful error message with the correct path and example commands.
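A minimal sketch of the kind of prefix check described above. The function name `validate_data_path` and the wording of the message are hypothetical; the actual script's implementation and error text may differ.

```python
from pathlib import PurePosixPath

REQUIRED_PREFIX = "/data/"  # prefix named in the text above


def validate_data_path(path: str) -> str:
    """Return `path` unchanged if it starts with /data/, else raise with a hint.

    Hypothetical helper for illustration; not the script's real code.
    """
    if path.startswith(REQUIRED_PREFIX):
        return path
    # Suggest a corrected path by re-rooting the user's path under /data/.
    suggestion = str(PurePosixPath(REQUIRED_PREFIX) / path.lstrip("/"))
    raise ValueError(
        f"Path '{path}' must start with '{REQUIRED_PREFIX}'. "
        f"Did you mean '{suggestion}'?"
    )
```

Raising an exception that carries the suggested path keeps the check reusable for every path-valued argument, so each one reports the same kind of hint.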
@@ -33,7 +33,7 @@ The framework is a wrapper around `genai-perf` that:
**Default sequence lengths**: Input: 2000 tokens, Output: 256 tokens (configurable with `--isl` and `--osl`)
**Important**: The `--model` parameter configures GenAI-Perf for benchmarking and provides logging context. The actual model loaded is determined by your deployment manifests. Only one model can be benchmarked at a time across all inputs to ensure fair comparison. The default `--model` value in the benchmarking script is `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`, but it must match the model in the manifest(s) and the model deployed at the endpoint(s).
**Important**: The `--model` parameter configures GenAI-Perf for benchmarking and provides logging context. The actual model loaded is determined by your deployment manifests. Only one model can be benchmarked at a time across all inputs to ensure fair comparison. The default `--model` value in the benchmarking script is `Qwen/Qwen3-0.6B`, but it must match the model in the manifest(s) and the model deployed at the endpoint(s).
## Prerequisites
...
@@ -103,7 +103,7 @@ REQUIRED:
OPTIONS:
-h, --help Show help message and examples
-m, --model MODEL Model name for GenAI-Perf configuration and logging (default: deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
-m, --model MODEL Model name for GenAI-Perf configuration and logging (default: Qwen/Qwen3-0.6B)
NOTE: This must match the model configured in your deployment manifests and endpoints
-s, --std STDDEV Input sequence standard deviation (default: 10)
...
@@ -130,6 +130,23 @@ The script automatically:
4. **Generates** comparison plots using your custom labels in `./benchmarks/results/plots/`
5. **Cleans up** deployments when complete
### GPU Resource Usage
**Important**: Models are deployed and benchmarked **sequentially**, not in parallel. This means:
- **One deployment at a time**: Each DynamoGraphDeployment is deployed, benchmarked, and cleaned up before the next one starts
- **Full GPU access**: Each deployment gets exclusive access to all available GPUs during its benchmark run
- **Resource isolation**: No resource conflicts between different deployment configurations
- **Fair comparison**: Each configuration is tested under identical resource conditions
This sequential approach ensures:
- **Accurate performance measurements** without interference between deployments
- **Consistent resource allocation** for fair comparison across different configurations
- **Simplified resource management** without complex GPU scheduling
- **Reliable cleanup** between benchmark runs
If you need to benchmark multiple configurations simultaneously, consider using separate Kubernetes namespaces or running benchmarks on different clusters.
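The sequential deploy, benchmark, clean-up cycle described above can be sketched as follows. This is an illustration only: `run_sequentially` and its `deploy`, `benchmark`, and `cleanup` callables are hypothetical stand-ins, not functions from the benchmarking script.

```python
from typing import Callable, Dict, Iterable


def run_sequentially(
    labels: Iterable[str],
    deploy: Callable[[str], None],
    benchmark: Callable[[str], dict],
    cleanup: Callable[[str], None],
) -> Dict[str, dict]:
    """Deploy, benchmark, and clean up one configuration at a time.

    Hypothetical sketch: the three callables are placeholders supplied
    by the caller.
    """
    results: Dict[str, dict] = {}
    for label in labels:
        deploy(label)
        try:
            results[label] = benchmark(label)
        finally:
            # Release the GPUs even if the benchmark fails, so a partial
            # deployment never overlaps with the next one.
            cleanup(label)
    return results
```

Putting the clean-up in a `finally` block is one way to get the "reliable cleanup between benchmark runs" property: the next deployment never starts while the previous one still holds resources.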
### Results Clearing Behavior
**Important**: The benchmark script automatically clears the output directory before each run to ensure clean, reproducible results. This means:
...
@@ -155,7 +172,7 @@ For direct control over the benchmark workflow:
@@ -24,6 +24,21 @@ We assume there is no piggy-backed prefill requests in the decode engine. Even i
The script first detects the number of available GPUs on the current nodes (multi-node engines are not supported yet). It then profiles prefill and decode performance at different TP sizes. For prefill, since there is no in-flight batching (assuming the isl is long enough to saturate the GPU), the script directly measures the TTFT of a request with the given isl and no KV reuse. For decode, since the ITL (or iteration time) depends on how many requests are in flight, the script measures the ITL at different numbers of in-flight requests, ranging from 1 to the maximum number of requests that the engine's KV cache can hold. To measure the ITL without interference from piggy-backed prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring. Since the KV cache is large enough for all the requests, it retains the KV cache of the pre-computed prompts, so the prefill phase is skipped when the ITL is measured.
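As a rough illustration of the decode sweep described above, the sketch below derives an upper bound on in-flight requests from a KV cache capacity and picks concurrency levels between 1 and that bound. It rests on assumptions that are not from the profiling script: each request is taken to occupy `isl + osl` tokens of KV cache, the levels are evenly spaced, and all names are hypothetical.

```python
from typing import List


def max_in_flight_requests(kv_cache_tokens: int, isl: int, osl: int) -> int:
    """Upper bound on concurrent requests, assuming isl + osl tokens each."""
    per_request = isl + osl
    if per_request <= 0:
        raise ValueError("isl + osl must be positive")
    return kv_cache_tokens // per_request


def concurrency_sweep(max_requests: int, num_points: int = 8) -> List[int]:
    """Evenly spaced in-flight request counts from 1 to max_requests."""
    if max_requests < 1:
        return []
    if num_points < 2:
        return [1]
    step = (max_requests - 1) / (num_points - 1)
    # A set removes duplicates when max_requests is small.
    levels = {round(1 + i * step) for i in range(num_points)}
    return sorted(levels)
```

For example, with a hypothetical capacity of 100,000 KV tokens and the default 2000/256 sequence lengths mentioned elsewhere in this change, `max_in_flight_requests(100_000, 2000, 256)` gives 44, and the sweep runs from 1 to 44.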
### GPU Resource Usage
**Important**: Profiling tests different tensor parallelism (TP) configurations **sequentially**, not in parallel. This means:
- **One TP configuration at a time**: Each tensor parallelism size (TP1, TP2, TP4, TP8, etc.) is tested individually
- **Full GPU access**: Each TP configuration gets exclusive access to all available GPUs during its profiling run
- **Resource isolation**: No interference between different TP configurations during testing
- **Accurate measurements**: Each configuration is profiled under identical resource conditions
This sequential approach ensures:
- **Precise performance profiling** without resource conflicts
- **Consistent GPU allocation** for fair comparison across TP sizes
- **Reliable cleanup** between different TP configuration tests
- **Accurate SLA compliance verification** for each configuration
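To make the TP1, TP2, TP4, TP8 progression above concrete, here is a small sketch that lists candidate TP sizes for a given GPU count. It assumes the candidates are the powers of two that fit on the available GPUs; the profiling script may choose its candidates differently, and `candidate_tp_sizes` is a hypothetical name.

```python
from typing import List


def candidate_tp_sizes(num_gpus: int) -> List[int]:
    """Powers of two from 1 up to num_gpus (assumed candidate set)."""
    sizes: List[int] = []
    tp = 1
    while tp <= num_gpus:
        sizes.append(tp)
        tp *= 2
    return sizes


# Each size would then be profiled one after another, never concurrently.
for tp in candidate_tp_sizes(8):
    print(f"profiling TP{tp}")
```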
After the profiling finishes, two plots will be generated in the `output-dir`. For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:
> **Note**: All paths must start with `/data/` for security reasons. If you forget this prefix, the script will show a helpful error message with the correct path.
3. **Set the config path for the profiling job:**
```bash
```bash
export DGD_CONFIG_FILE=/workspace/profiling_results/disagg.yaml # or your custom path