description="Profile the TTFT and ITL of the Prefill and Decode engine with different parallelization mapping. When profiling prefill we mock/fix decode,when profiling decode we mock/fix prefill."
)
parser.add_argument(
"--namespace",
type=str,
default="dynamo-sla-profiler",
help="Kubernetes namespace to deploy the DynamoGraphDeployment",
)
parser.add_argument(
"--backend",
type=str,
default="vllm",
choices=["vllm","sglang","trtllm"],
help="backend type, currently support [vllm, sglang, trtllm]",
)
parser.add_argument(
"--config",
type=str,
required=True,
help="Path to the DynamoGraphDeployment config file",
)
parser.add_argument(
"--output-dir",
type=str,
default="profiling_results",
help="Path to the output results directory",
)
parser.add_argument(
"--min-num-gpus-per-engine",
type=int,
default=1,
help="minimum number of GPUs per engine",
)
parser.add_argument(
"--max-num-gpus-per-engine",
type=int,
default=8,
help="maximum number of GPUs per engine",
)
parser.add_argument(
"--skip-existing-results",
action="store_true",
help="Skip TP sizes that already have results in the output directory",
)
parser.add_argument(
"--force-rerun",
action="store_true",
help="Force re-running all tests even if results already exist (overrides --skip-existing-results)",
backend: String (backend type, currently support [vllm, sglang, trtllm], default: vllm)
config: String (path to the DynamoGraphDeployment config file)
max_context_length: Int (maximum context length supported by the served model, default: 16384)
is_moe_model: Boolean (enable MoE (Mixture of Experts) model support, use TEP for prefill and DEP for decode, default: False)
hardware:
min_num_gpus_per_engine: Int (minimum number of GPUs per engine, default: 1)
max_num_gpus_per_engine: Int (maximum number of GPUs per engine, default: 8)
num_gpus_per_node: Int (number of GPUs per node for MoE models - this will be the granularity when searching for the best TEP/DEP size, default: 8)
sweep:
skip_existing_results: Boolean (skip TP sizes that already have results in the output directory, default: False)
force_rerun: Boolean (force re-running all tests even if results already exist (overrides --skip-existing-results), default: False)
prefill_interpolation_granularity: Int (how many samples to benchmark to interpolate TTFT under different ISL, default: 16)
decode_interpolation_granularity: Int (how many samples to benchmark to interpolate ITL under different active kv cache size and decode context length, default: 6)
use_ai_configurator: Boolean (use ai-configurator to estimate benchmarking results instead of running actual deployment, default: False)
aic_system: String (target system for use with aiconfigurator, default: None)
aic_model_name: String (aiconfigurator name of the target model, default: None)
aic_backend: String (aiconfigurator backend of the target model, if not provided, will use args.backend, default: "")
aic_backend_version: String (specify backend version when using aiconfigurator to estimate perf, default: None)
dry_run: Boolean (dry run the profile job, default: False)
sla:
isl: Int (target input sequence length, default: 3000)
osl: Int (target output sequence length, default: 500)
ttft: Int (target Time To First Token in ms, default: 50)
itl: Int (target Inter Token Latency in ms, default: 10)
planner: (planner-bypass arguments, use hyphens or underscores)
i.e., planner-min-endpoint: 2 # or planner_min_endpoint: 2 (both work)
"""
# Step 1: Pre-parse to check if --profile-config is provided
# Step 3: Create main parser with config-aware defaults
parser=argparse.ArgumentParser(
description="Profile the TTFT and ITL of the Prefill and Decode engine with different parallelization mapping. When profiling prefill we mock/fix decode,when profiling decode we mock/fix prefill."
)
parser.add_argument(
"--profile-config",
type=str,
help="Configuration as Python dict literal, YAML, or JSON string. CLI args override config values. "
**For MoE models**, use `profile_sla_moe_job.yaml` with TEP/DEP configuration instead.
**For MoE models**, use `profile_sla_moe_job.yaml` with TEP/DEP configuration instead.
If you want to automatically deploy the optimized DGD with planner after profiling, add `--deploy-after-profile` to the profiling job. It will deploy the DGD with the engine of the optimized parallelization mapping found for the SLA targets.
### Advanced Configuration
### Advanced Configuration
-**Model caching**: For large models, create a multi-attach PVC to cache the model. See [recipes](../../recipes/README.md) for details.
-**Model caching**: For large models, create a multi-attach PVC to cache the model. See [recipes](../../recipes/README.md) for details.
For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` instead.
For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` instead.
To automatically deploy the optimized DGD with planner after profiling, add `--deploy-after-profile` to the profiling job. It will deploy the DGD with the engine of the optimized parallelization mapping found for the SLA targets.
> **Time Investment**: This profiling process is comprehensive and typically takes **2-4 hours** to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.
> **Time Investment**: This profiling process is comprehensive and typically takes **2-4 hours** to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.