feat: Add performance sweeps for DeepSeek R1 on GB200 (#2387)

1327e3bb · Tanmay Verma · GitHub · 7e4eec26 · 1327e3bb · 1327e3bb
Unverified Commit 1327e3bb authored Aug 12, 2025 by Tanmay Verma Committed by GitHub Aug 12, 2025
14 changed files
--- a/components/backends/trtllm/README.md
+++ b/components/backends/trtllm/README.md
@@ -43,6 +43,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 - [Client](#client)
 - [Benchmarking](#benchmarking)
 - [Multimodal Support](#multimodal-support)
+- [Performance Sweep](#performance-sweep)
 ## Feature Support Matrix
@@ -420,3 +421,7 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
 ### Supported Multimodal Models
 Multimodel models listed [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by dynamo.
+## Performance Sweep
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
--- a/components/backends/trtllm/performance_sweeps/README.md
+++ b/components/backends/trtllm/performance_sweeps/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TensorRT-LLM Benchmark Scripts for DeepSeek R1 model
+This directory contains scripts for benchmarking TensorRT-LLM performance with Dynamo using SLURM job scheduler.
+## ⚠️ DISCLAIMER
+**These scripts are currently not QA'ed and are provided for demonstration purposes only.**
+Please note that:
+- These scripts have not undergone formal quality assurance testing
+- They were executed on GB200 systems
+- They are intended for demonstration and educational purposes
+- Use at your own risk in production environments
+- Always review and test scripts thoroughly before running in your specific environment
+- We are actively working on refining the configuration sweeps.
+## Scripts Overview
+### Core Scripts
+1. `submit.sh` - Main entry point for submitting benchmark jobs for disaggregated configurations. This includes WideEP optimization for DEP>=16.
+2. `submit_agg.sh` - Main entry point for submitting benchmark jobs for aggregated configurations.
+3. `post_process.py` - Scan the genai-perf results to produce a json with entries to each config point.
+4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
+For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
+## Usage
+### Prerequisites
+Before running the scripts, ensure you have:
+1. Access to a SLURM cluster
+2. Container image of Dynamo with TensorRT-LLM built using instructions from [here](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker).
+3. Model files accessible on the cluster
+4. Required environment variables set
+### Setup
+Within the login node of the cluster, set the following variables
+```bash
+# Set partition manually based on your slurm cluster's partition names
+export SLURM_PARTITION=""
+# Set account manually if this command doesn't work on your cluster
+export SLURM_ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
+# Set a job name for your benchmarking runs
+export SLURM_JOB_NAME=""
+# NOTE: IMAGE must be set manually for now
+# To build an iamge, see the steps here:
+# https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker
+export IMAGE="<dynamo_trtllm_image>"
+# NOTE: In general, Deepseek R1 is very large, so it is recommended to
+# pre-download the model weights and save them in some shared location,
+# NFS storage, HF_CACHE, etc. and modify the `--model-path` below
+# to reuse the pre-downloaded weights instead.
+#
+# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
+# https://huggingface.co/nvidia/DeepSeek-R1-FP4
+#
+# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
+# https://huggingface.co/deepseek-ai/DeepSeek-R1
+export MODEL_PATH="<path_to_model_weights>"
+# The name the model will be served/queried under, matching what's
+# returned by the /v1/models endpoint.
+#
+# By default this is inferred from MODEL_PATH, but when using locally downloaded
+# model weights, it can be nice to have explicit control over the name.
+export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+```
+## Launching benchmarking sweeps for different configurations
+### Aggregated
+```bash
+# Queues the SLURM jobs for aggregated configurations for DeepSeek R1.
+./submit_agg.sh
+```
+### Disaggregated (Includes WideEP) - MTP off
+```bash
+# Queues the SLURM jobs for disaggregated configurations for DeepSeek R1 without MTP
+./submit.sh mtp=off all
+```
+### Disaggregated (Includes WideEP) - MTP on
+```bash
+# Queues the SLURM jobs for disaggregated configurations for DeepSeek R1 with MTP
+./submit.sh mtp=on all
+```
+## Post-Processing Results
+The above jobs use genAI-perf tool to benchmark each configuration point across different concurrency values. These get stored in `dynamo_disagg-bm-8150-1024/<config-setup>/genai_perf_artifacts` and `dynamo_agg-bm-8150-1024/<config-setup>/genai_perf_artifacts` for disaggregated and aggregated respectively.
+After your benchmarking jobs have completed, you can use the `post_process.py` script to aggregate and summarize the results from the generated genai_perf_artifacts.
+To run the post-processing script, use:
+### Aggregated
+```bash
+python3 post_process.py dynamo_agg-bm-8150-1024 --output-file agg_result.json
+```
+### Disaggregated
+```bash
+python3 post_process.py dynamo_disagg-bm-8150-1024 --output-file disagg_result.json
+```
+## Ploting Performance
+You can now use the `plot_performance_comparison.py` like below to observe the performance.
+```bash
+python3 plot_performance_comparison.py dynamo_agg-bm-8150-1024/agg_result.json dynamo_disagg-bm-8150-1024/disagg_result.json -o performance_plot.png
+```
+This script will produce a scatter plot of all the configuration points with each concurrency on a Output Throughput per GPU vs Output Throughput per User. It will also include the roofline pareto line for both aggregated and disaggregated setups.
+Refer to [Beyond the Buzz: A Pragmatic Take on Inference Disaggregation](https://arxiv.org/html/2506.05508v1) to learn how to interpret these plots.
+## Known Issues
+- Some jobs may time out if genai-perf requires more time to complete all concurrency levels.
+- Workers may encounter out-of-memory (OOM) errors during inference, especially with larger configurations.
+- Configurations affected by these issues will result in missing data points on the performance plot.
--- a/components/backends/trtllm/performance_sweeps/benchmark.slurm
+++ b/components/backends/trtllm/performance_sweeps/benchmark.slurm
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+MULTI_ROUND="${MULTI_ROUND:-8}"
+# set MOUNT_DIR
+MOUNT_DIR="${MOUNT_DIR:-${PWD}}"
+CONTAINER_NAME=disaggr-test
+STREAMING=true
+CTX_GPU_FRAC=0.75
+CACHE_TRANSCEIVER_MAX_NUM_TOKENS=8448
+num_ctx_servers=$1
+ctx_tp_size=$2
+ctx_batch_size=$3
+ctx_max_num_tokens=$4
+ctx_enable_attention_dp=$5
+num_gen_servers=$6
+gen_tp_size=$7
+gen_batch_size=$8
+gen_max_num_tokens=$9
+gen_enable_attention_dp=${10}
+gen_gpu_memory_fraction=${11}
+eplb_num_slots=${12}
+mtp_size=${13}
+concurrency_list=${14}
+gen_nodes=${15}
+kind=${16}
+model_path=${17}
+served_model_name=${18}
+image=${19}
+isl=${20}
+osl=${21}
+ctx_max_seq_len=$((${isl} + 203))
+gen_max_seq_len=$((${isl} + ${osl} + 203))
+WORK_DIR=${MOUNT_DIR}
+LOG_DIR=$WORK_DIR/${kind}-bm-${isl}-${osl}
+SCRIPTS_DIR=${WORK_DIR}/
+set_clock_cmd="bash ${SCRIPTS_DIR}/set_clock.sh"
+mkdir -p ${LOG_DIR}
+echo "trying to submit job"
+sub_dir=${LOG_DIR}/ctx${num_ctx_servers}_gen${num_gen_servers}_dep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
+echo "concurrency_list: ${concurrency_list}"
+ctx_gpus=$((num_ctx_servers * ctx_tp_size))
+gen_gpus=$((num_gen_servers * gen_tp_size))
+echo "enable_attention_dp: ${ctx_enable_attention_dp}, ${gen_enable_attention_dp}, gpu_memory_fraction: ${gen_gpu_memory_fraction}"
+enable_pdl=false
+if [ "${gen_enable_attention_dp}" = "false" ]; then
+    enable_pdl=true
+    echo "enable_pdl: ${enable_pdl}"
+    sub_dir=${LOG_DIR}/ctx${num_ctx_servers}_gen${num_gen_servers}_tep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
+fi
+full_logdir=${sub_dir}
+artifacts_dir=${full_logdir}/genai_perf_artifacts
+mkdir -p ${artifacts_dir}
+# Set clock
+srun ${set_clock_cmd}
+container_mounts=${MOUNT_DIR}:${MOUNT_DIR},${model_path}:${model_path}
+# start the container
+srun -l --container-image=${image} \
+        --container-name=${CONTAINER_NAME} \
+        --container-mounts=${container_mounts} \
+        --mpi=pmix \
+        echo "Container up."
+# generate the yaml file
+srun -l --container-name=${CONTAINER_NAME} \
+        --container-mounts=${container_mounts} \
+        --mpi=pmix --overlap \
+	-n 1 -N 1 \
+        python3 ${SCRIPTS_DIR}/scripts/gen_yaml.py --config ${full_logdir}/config.yaml \
+            --model ${model_path} \
+            --num_ctx_servers ${num_ctx_servers} \
+            --ctx_tp_size ${ctx_tp_size} \
+            --ctx_batch_size ${ctx_batch_size} \
+            --ctx_max_num_tokens ${ctx_max_num_tokens} \
+            --ctx_max_seq_len ${ctx_max_seq_len} \
+            --ctx_free_gpu_memory_fraction ${CTX_GPU_FRAC} \
+            --cache_transceiver_max_num_tokens ${CACHE_TRANSCEIVER_MAX_NUM_TOKENS} \
+            --num_gen_servers ${num_gen_servers} \
+            --gen_tp_size ${gen_tp_size} \
+            --gen_batch_size ${gen_batch_size} \
+            --gen_max_num_tokens ${gen_max_num_tokens} \
+            --gen_max_seq_len ${gen_max_seq_len} \
+            --gen_gpu_memory_fraction ${gen_gpu_memory_fraction} \
+            --eplb_num_slots ${eplb_num_slots} \
+            $(if [ "${gen_enable_attention_dp}" = "true" ]; then echo "--gen_enable_attention_dp"; fi) \
+            $(if [ "${ctx_enable_attention_dp}" = "true" ]; then echo "--ctx_enable_attention_dp"; fi) \
+            $(if [ "${mtp_size}" -gt 0 ]; then echo "--mtp_size ${mtp_size}"; fi)
+echo "YAML file generated."
+nsys_on=""
+# nsys_on=${full_logdir}
+nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
+export HEAD_NODE="${nodes[0]}"
+export HEAD_NODE_IP="$(hostname -i)"
+export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
+export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
+# Create a temporary file to store PIDs
+PID_FILE=$(mktemp)
+trap 'cleanup_and_exit' EXIT
+cleanup_and_exit() {
+    if [ -f "$PID_FILE" ]; then
+        echo "Cleaning up spawned processes..."
+        while read -r pid; do
+            if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+                echo "Sending TERM to process $pid"
+                kill -TERM "$pid" 2>/dev/null
+                sleep 2
+                if kill -0 "$pid" 2>/dev/null; then
+                    echo "Process $pid still running, sending KILL"
+                    kill -KILL "$pid" 2>/dev/null
+                fi
+            fi
+        done < "$PID_FILE"
+        rm -f "$PID_FILE"
+    fi
+}
+# start the server
+srun -l --container-name=${CONTAINER_NAME} \
+        --container-mounts=${container_mounts} \
+        --mpi=pmix --overlap -N 1 -n 1 \
+	--oversubscribe \
+	--overlap \
+	--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
+        -w ${nodes[0]} \
+        bash ${SCRIPTS_DIR}/scripts/start_frontend.sh &> ${full_logdir}/output_server.log &
+SERVER_PID=$!
+echo "$SERVER_PID" >> "$PID_FILE"
+# wait for the server to start
+sleep 10
+PREFILL_COUNT=$(grep 'prefill_count:' "${full_logdir}/instance_config.yaml" | awk '{print $2}')
+if [ -z "$PREFILL_COUNT" ]; then
+    echo "Error: Failed to extract prefill_count from instance_config.yaml"
+    exit 1
+fi
+echo "Prefill Count: $PREFILL_COUNT"
+# start the prefill workers
+prefill_pids=()
+for ((i=1; i<=PREFILL_COUNT; i++)); do
+  echo "Running Prefill Worker: ${i}"
+  node_idx=$((i-1))
+  echo "Running Prefill Nodes: ${nodes[node_idx]}"
+  srun -l --container-name=${CONTAINER_NAME} \
+      --container-mounts=${container_mounts} \
+      --mpi=pmix --overlap -w ${nodes[node_idx]} \
+      --oversubscribe \
+      --overlap \
+      --ntasks 4 \
+      --nodes 1 \
+      bash ${SCRIPTS_DIR}/scripts/start_worker.sh ${full_logdir}/prefill_config.yaml "${enable_pdl}" ${ctx_gpus} ${nsys_on} ${served_model_name} ${model_path} 'prefill' &> ${full_logdir}/output_workers.log &
+  prefill_pids+=($!)
+  echo "$!" >> "$PID_FILE"
+done
+DECODE_COUNT=$(grep 'decode_count:' "${full_logdir}/instance_config.yaml" | awk '{print $2}')
+if [ -z "$DECODE_COUNT" ]; then
+    echo "Error: Failed to extract decode_count from instance_config.yaml"
+    exit 1
+fi
+echo "Decode Count: $DECODE_COUNT"
+num_gen_nodes=$((gen_nodes/num_gen_servers))
+decode_start_idx=$PREFILL_COUNT
+for ((i=1; i<=DECODE_COUNT; i++)); do
+  echo "Running Decode Worker: ${i}"
+  decode_node_list=()
+  for ((j=0; j<num_gen_nodes; j++)); do
+    node_idx=$((decode_start_idx + (i-1)*num_gen_nodes + j))
+    decode_node_list+=("${nodes[node_idx]}")
+  done
+  decode_nodes_csv=$(IFS=, ; echo "${decode_node_list[*]}")
+  echo "Running Decode Nodes: ${decode_nodes_csv}"
+  srun -l --container-name=${CONTAINER_NAME} \
+      --container-mounts=${container_mounts} \
+      --mpi=pmix \
+      -w ${decode_nodes_csv} \
+      --nodes ${num_gen_nodes} \
+      --ntasks $gen_tp_size \
+      --oversubscribe \
+      --overlap \
+      bash ${SCRIPTS_DIR}/scripts/start_worker.sh ${full_logdir}/decode_config.yaml "${enable_pdl}" ${ctx_gpus} ${nsys_on} ${served_model_name} ${model_path} 'decode' &> ${full_logdir}/output_workers.log &
+  echo "$!" >> "$PID_FILE"
+done
+total_gpus=$((ctx_gpus + gen_gpus))
+# start the loadgen
+srun -l --container-name=${CONTAINER_NAME} \
+        --container-mounts=${container_mounts},${artifacts_dir}:${artifacts_dir} \
+        --mpi=pmix --overlap -N 1 -n 1 \
+	-w ${nodes[0]} \
+        bash ${SCRIPTS_DIR}/scripts/bench.sh ${served_model_name} ${MULTI_ROUND} ${num_gen_servers} "${concurrency_list}" ${STREAMING} ${full_logdir} ${total_gpus} ${artifacts_dir} ${model_path} ${isl} ${osl} ${kind} > ${full_logdir}/bench.log 2>&1
+# Wait for all background processes to complete
+wait
+# Cleanup will be handled by the EXIT trap
--- a/components/backends/trtllm/performance_sweeps/benchmark_agg.slurm
+++ b/components/backends/trtllm/performance_sweeps/benchmark_agg.slurm
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+MULTI_ROUND="${MULTI_ROUND:-8}"
+MOUNT_DIR="${MOUNT_DIR:-${PWD}}"
+CONTAINER_NAME=aggr-test
+STREAMING=true
+GPU_FRAC=0.8
+tp_size=$1
+ep_size=$2
+max_batch_size=$3
+max_num_tokens=$4
+enable_attention_dp=$5
+concurrency_list=$6
+mtp_size=$7
+kind=$8
+isl=$9
+osl=${10}
+model_path=${11}
+served_model_name=${12}
+image=${13}
+echo "tp_size=$tp_size ep_size=$ep_size max_batch_size=$max_batch_size max_num_tokens=$max_num_tokens enable_attention_dp=$enable_attention_dp concurrency_list=$concurrency_list mtp_size=$mtp_size kind=$kind isl=$isl osl=$osl model_path=$model_path served_model_name=$served_model_name image=$image"
+max_seq_len=$((${isl} + ${osl}))
+WORK_DIR=${MOUNT_DIR}
+LOG_DIR=$WORK_DIR/${kind}-bm-${isl}-${osl}
+SCRIPTS_DIR=${WORK_DIR}/
+set_clock_cmd="bash ${SCRIPTS_DIR}/set_clock.sh"
+mkdir -p ${LOG_DIR}
+echo "trying to submit job"
+sub_dir=${LOG_DIR}/ctx0_gen1_dep${tp_size}_batch${max_batch_size}_mtp${mtp_size}
+if [ "${enable_attention_dp}" = "false" ]; then
+    sub_dir=${LOG_DIR}/ctx0_gen1_tep${tp_size}_batch${max_batch_size}_mtp${mtp_size}
+fi
+full_logdir=${sub_dir}
+artifacts_dir=${full_logdir}/genai_perf_artifacts
+mkdir -p ${artifacts_dir}
+set_clock_cmd="bash ${SCRIPTS_DIR}/set_clock.sh"
+srun ${set_clock_cmd}
+container_mounts=${MOUNT_DIR}:${MOUNT_DIR},${model_path}:${model_path}
+# start the container
+srun -l --container-image=${image} \
+        --container-name=${CONTAINER_NAME} \
+        --container-mounts=${container_mounts} \
+        --mpi=pmix \
+        echo "Container up."
+nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
+export HEAD_NODE="${nodes[0]}"
+export HEAD_NODE_IP="$(hostname -i)"
+export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
+export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
+# Create a temporary file to store PIDs
+PID_FILE=$(mktemp)
+trap 'cleanup_and_exit' EXIT
+cleanup_and_exit() {
+    if [ -f "$PID_FILE" ]; then
+        echo "Cleaning up spawned processes..."
+        while read -r pid; do
+            if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+                echo "Sending TERM to process $pid"
+                kill -TERM "$pid" 2>/dev/null
+                sleep 2
+                if kill -0 "$pid" 2>/dev/null; then
+                    echo "Process $pid still running, sending KILL"
+                    kill -KILL "$pid" 2>/dev/null
+                fi
+            fi
+        done < "$PID_FILE"
+        rm -f "$PID_FILE"
+    fi
+}
+# start the server
+srun -l --container-name=${CONTAINER_NAME} \
+        --container-mounts=${container_mounts} \
+        --mpi=pmix --overlap -N 1 -n 1 \
+	--oversubscribe \
+	--overlap \
+	--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
+        -w ${nodes[0]} \
+        bash ${SCRIPTS_DIR}/scripts/start_frontend.sh  &> ${full_logdir}/output_server.log &
+SERVER_PID=$!
+echo "$SERVER_PID" >> "$PID_FILE"
+# wait for the server to start
+sleep 10
+# start the workers
+srun -l --container-name=${CONTAINER_NAME} \
+        --container-mounts=${container_mounts} \
+        --mpi=pmix --overlap \
+        --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
+        bash -x ${WORK_DIR}/scripts/start_agg_worker.sh ${model_path} ${max_batch_size} ${max_num_tokens} ${tp_size} ${ep_size} ${enable_attention_dp} ${GPU_FRAC} ${max_seq_len} ${mtp_size} ${served_model_name} &> ${full_logdir}/output_workers.log &
+WORKERS_PID=$!
+echo "$WORKERS_PID" >> "$PID_FILE"
+# start the loadgen
+srun -l --container-name=${CONTAINER_NAME} \
+        --container-mounts=${container_mounts},${artifacts_dir}:${artifacts_dir} \
+        --mpi=pmix --overlap -N 1 -n 1 \
+        --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
+	-w ${nodes[0]} \
+        bash ${SCRIPTS_DIR}/scripts/bench.sh ${served_model_name} ${MULTI_ROUND} 1 "${concurrency_list}" ${STREAMING} ${full_logdir} ${tp_size} ${artifacts_dir} ${model_path} ${isl} ${osl} ${kind} > ${full_logdir}/bench.log 2>&1
+# Wait for all background processes to complete
+wait
+# Cleanup will be handled by the EXIT trap
\ No newline at end of file
--- a/components/backends/trtllm/performance_sweeps/plot_performance_comparison.py
+++ b/components/backends/trtllm/performance_sweeps/plot_performance_comparison.py
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+Performance Comparison Plotter
+This script takes two JSON files containing performance data and creates a scatter plot
+comparing output_token_throughput_per_user vs output_token_throughput_per_gpu.
+Points from different files are colored differently, and Pareto lines are added for each dataset.
+"""
+import argparse
+import json
+from typing import Dict, List, Optional, Tuple
+import matplotlib.pyplot as plt
+def load_json_data(file_path: str) -> List[Dict]:
+    """Load JSON data from file."""
+    with open(file_path, "r") as f:
+        return json.load(f)
+def extract_plot_data(data: List[Dict]) -> Tuple[List[float], List[float]]:
+    """Extract x and y coordinates for plotting from JSON data."""
+    x_coords = [entry["output_token_throughput_per_user"] for entry in data]
+    y_coords = [entry["output_token_throughput_per_gpu"] for entry in data]
+    return x_coords, y_coords
+def compute_pareto_frontier(
+    x_coords: List[float], y_coords: List[float]
+) -> Tuple[List[float], List[float]]:
+    """
+    Compute the Pareto frontier for a set of points.
+    The Pareto frontier connects only the roofline points (actual optimal points from data).
+    """
+    if not x_coords or not y_coords:
+        return [], []
+    # Combine coordinates into points
+    points = list(zip(x_coords, y_coords))
+    # Find the true Pareto optimal points (non-dominated points)
+    pareto_points = []
+    for i, (x1, y1) in enumerate(points):
+        is_dominated = False
+        # Check if this point is dominated by any other point
+        for j, (x2, y2) in enumerate(points):
+            if i != j:
+                # Point 2 dominates point 1 if it's better in at least one dimension and not worse in any
+                if (x2 >= x1 and y2 > y1) or (x2 > x1 and y2 >= y1):
+                    is_dominated = True
+                    break
+        if not is_dominated:
+            pareto_points.append((x1, y1))
+    # Sort Pareto points by x-coordinate (user throughput)
+    pareto_points.sort(key=lambda p: p[0])
+    # Unzip the Pareto points
+    if pareto_points:
+        pareto_x, pareto_y = zip(*pareto_points)
+        return list(pareto_x), list(pareto_y)
+    else:
+        return [], []
+def find_max_difference_point(
+    pareto_x1: List[float],
+    pareto_y1: List[float],
+    pareto_x2: List[float],
+    pareto_y2: List[float],
+    user_throughput_threshold: Optional[float] = None,
+) -> Tuple[Optional[float], Optional[float], Optional[float], str]:
+    """
+    Find the point with maximum relative difference between the two rooflines.
+    If threshold is specified, only considers points above the threshold.
+    Returns the user throughput, the two GPU throughputs, and the difference multiplier.
+    """
+    if not pareto_x1 or not pareto_x2:
+        return None, None, None, ""
+    # Apply threshold if specified
+    if user_throughput_threshold is not None:
+        # Filter Pareto points above the threshold
+        pareto_points1 = [
+            (x, y)
+            for x, y in zip(pareto_x1, pareto_y1)
+            if x >= user_throughput_threshold
+        ]
+        pareto_points2 = [
+            (x, y)
+            for x, y in zip(pareto_x2, pareto_y2)
+            if x >= user_throughput_threshold
+        ]
+        if not pareto_points1 or not pareto_points2:
+            return None, None, None, ""
+        pareto_x1_filtered: List[float]
+        pareto_y1_filtered: List[float]
+        pareto_x2_filtered: List[float]
+        pareto_y2_filtered: List[float]
+        # Unzip the filtered points into separate x and y lists
+        pareto_x1_filtered = [point[0] for point in pareto_points1]
+        pareto_y1_filtered = [point[1] for point in pareto_points1]
+        pareto_x2_filtered = [point[0] for point in pareto_points2]
+        pareto_y2_filtered = [point[1] for point in pareto_points2]
+    else:
+        pareto_x1_filtered, pareto_y1_filtered = pareto_x1, pareto_y1
+        pareto_x2_filtered, pareto_y2_filtered = pareto_x2, pareto_y2
+    # Find the point with maximum relative difference between rooflines
+    max_diff_ratio: float = 0.0
+    max_diff_user = None
+    max_diff_gpu1 = None
+    max_diff_gpu2 = None
+    # For each point in the first roofline, find the closest point in the second roofline
+    # and calculate the relative difference
+    for i, (x1, y1) in enumerate(zip(pareto_x1_filtered, pareto_y1_filtered)):
+        # Find the closest point in the second roofline to this x-coordinate
+        closest_idx2 = 0
+        min_distance = float("inf")
+        for j, x2 in enumerate(pareto_x2_filtered):
+            distance = abs(x2 - x1)
+            if distance < min_distance:
+                min_distance = distance
+                closest_idx2 = j
+        y2 = pareto_y2_filtered[closest_idx2]
+        # Calculate the relative difference
+        if y2 > 0:  # Avoid division by zero
+            if y1 > y2:
+                ratio = y1 / y2
+            else:
+                ratio = y2 / y1
+            # Update if this is the maximum ratio found
+            if ratio > max_diff_ratio:
+                max_diff_ratio = ratio
+                max_diff_user = x1
+                max_diff_gpu1 = y1
+                max_diff_gpu2 = y2
+    if max_diff_user is None:
+        return None, None, None, ""
+    # Create the label
+    if max_diff_gpu1 is not None and max_diff_gpu2 is not None:
+        label = f"{max_diff_ratio:.1f}x better\nUser: {max_diff_user:.1f}\nGPU1: {max_diff_gpu1:.1f}\nGPU2: {max_diff_gpu2:.1f}"
+    else:
+        label = "No valid GPU data"
+    return max_diff_user, max_diff_gpu1, max_diff_gpu2, label
+def plot_performance_comparison(
+    file1_path: str,
+    file2_path: str,
+    output_path: Optional[str] = None,
+    user_throughput_threshold: Optional[float] = None,
+):
+    """Create the performance comparison plot."""
+    # Load data from both files
+    data1 = load_json_data(file1_path)
+    data2 = load_json_data(file2_path)
+    # Extract the "kind" field from the data to use as labels
+    kind1 = data1[0]["kind"] if data1 and "kind" in data1[0] else file1_path
+    kind2 = data2[0]["kind"] if data2 and "kind" in data2[0] else file2_path
+    # Extract plotting coordinates
+    x1, y1 = extract_plot_data(data1)
+    x2, y2 = extract_plot_data(data2)
+    # Compute Pareto frontiers
+    pareto_x1, pareto_y1 = compute_pareto_frontier(x1, y1)
+    pareto_x2, pareto_y2 = compute_pareto_frontier(x2, y2)
+    # Find the point where rooflines differ the most
+    max_diff_user, max_diff_gpu1, max_diff_gpu2, diff_label = find_max_difference_point(
+        pareto_x1, pareto_y1, pareto_x2, pareto_y2, user_throughput_threshold
+    )
+    # Create the plot
+    plt.figure(figsize=(12, 8))
+    # Plot scatter points
+    plt.scatter(
+        x1, y1, c="blue", alpha=0.6, s=40, label=f"{kind1} ({len(data1)} points)"
+    )
+    plt.scatter(
+        x2, y2, c="red", alpha=0.6, s=40, label=f"{kind2} ({len(data2)} points)"
+    )
+    # Plot Pareto lines (roofline)
+    if pareto_x1 and pareto_y1:
+        plt.plot(
+            pareto_x1,
+            pareto_y1,
+            "b-",
+            linewidth=3,
+            alpha=0.9,
+            label=f"{kind1} Roofline ({len(pareto_x1)} points)",
+        )
+        # Highlight Pareto points
+        plt.scatter(
+            pareto_x1,
+            pareto_y1,
+            c="blue",
+            s=80,
+            alpha=0.9,
+            edgecolors="white",
+            linewidth=1,
+            zorder=5,
+        )
+    if pareto_x2 and pareto_y2:
+        plt.plot(
+            pareto_x2,
+            pareto_y2,
+            "r-",
+            linewidth=3,
+            alpha=0.9,
+            label=f"{kind2} Roofline ({len(pareto_x2)} points)",
+        )
+        # Highlight Pareto points
+        plt.scatter(
+            pareto_x2,
+            pareto_y2,
+            c="red",
+            s=80,
+            alpha=0.9,
+            edgecolors="white",
+            linewidth=1,
+            zorder=5,
+        )
+    # Mark the point where rooflines differ the most
+    if (
+        max_diff_user is not None
+        and max_diff_gpu1 is not None
+        and max_diff_gpu2 is not None
+    ):
+        # Plot vertical line at the user throughput where difference is maximum
+        plt.axvline(
+            x=max_diff_user,
+            color="purple",
+            linestyle="--",
+            alpha=0.7,
+            linewidth=2,
+            label="Max Difference Point",
+        )
+        # Mark the points on both rooflines
+        plt.scatter(
+            max_diff_user,
+            max_diff_gpu1,
+            c="blue",
+            s=150,
+            alpha=1.0,
+            edgecolors="purple",
+            linewidth=3,
+            zorder=10,
+            marker="*",
+        )
+        plt.scatter(
+            max_diff_user,
+            max_diff_gpu2,
+            c="red",
+            s=150,
+            alpha=1.0,
+            edgecolors="purple",
+            linewidth=3,
+            zorder=10,
+            marker="*",
+        )
+        # Add annotation with the difference information
+        plt.annotate(
+            diff_label,
+            xy=(max_diff_user, max(max_diff_gpu1, max_diff_gpu2)),
+            xytext=(max_diff_user + 10, max(max_diff_gpu1, max_diff_gpu2) + 50),
+            arrowprops=dict(arrowstyle="->", color="purple", alpha=0.7),
+            bbox=dict(boxstyle="round,pad=0.5", facecolor="yellow", alpha=0.8),
+            fontsize=10,
+            fontweight="bold",
+        )
+    # Customize the plot
+    plt.xlabel("Output Token Throughput per User", fontsize=12)
+    plt.ylabel("Output Token Throughput per GPU", fontsize=12)
+    plt.title(
+        "Performance Comparison: Throughput per GPU vs Throughput per User",
+        fontsize=14,
+        fontweight="bold",
+    )
+    plt.legend(fontsize=10)
+    plt.grid(True, alpha=0.3)
+    # Add some statistics as text
+    if user_throughput_threshold is not None:
+        # Format the statistics with proper conditional handling
+        x1_max_str = f"{max(x1):.1f}" if len(x1) > 0 else "N/A"
+        y1_max_str = f"{max(y1):.1f}" if len(y1) > 0 else "N/A"
+        x2_max_str = f"{max(x2):.1f}" if len(x2) > 0 else "N/A"
+        y2_max_str = f"{max(y2):.1f}" if len(y2) > 0 else "N/A"
+        stats_text = f"""
+Statistics (Max Difference Point: User Throughput ≥ {user_throughput_threshold}):
+{kind1}: {len(data1)} points, max per-user: {x1_max_str}, max per-gpu: {y1_max_str}
+{kind2}: {len(data2)} points, max per-user: {x2_max_str}, max per-gpu: {y2_max_str}
+        """
+    else:
+        # Format the statistics with proper conditional handling
+        x1_max_str = f"{max(x1):.1f}" if len(x1) > 0 else "N/A"
+        y1_max_str = f"{max(y1):.1f}" if len(y1) > 0 else "N/A"
+        x2_max_str = f"{max(x2):.1f}" if len(x2) > 0 else "N/A"
+        y2_max_str = f"{max(y2):.1f}" if len(y2) > 0 else "N/A"
+        stats_text = f"""
+Statistics:
+{kind1}: {len(data1)} points, max per-user: {x1_max_str}, max per-gpu: {y1_max_str}
+{kind2}: {len(data2)} points, max per-user: {x2_max_str}, max per-gpu: {y2_max_str}
+        """
+    plt.text(
+        0.02,
+        0.02,
+        stats_text.strip(),
+        transform=plt.gca().transAxes,
+        fontsize=9,
+        verticalalignment="bottom",
+        bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.8),
+    )
+    # Adjust layout and save/show
+    plt.tight_layout()
+    if output_path:
+        plt.savefig(output_path, dpi=300, bbox_inches="tight")
+        print(f"Plot saved to {output_path}")
+    plt.show()
+def main():
+    parser = argparse.ArgumentParser(
+        description="Plot performance comparison between two JSON files"
+    )
+    parser.add_argument("file1", help="Path to first JSON file")
+    parser.add_argument("file2", help="Path to second JSON file")
+    parser.add_argument(
+        "--output", "-o", help="Output file path for the plot (optional)"
+    )
+    parser.add_argument(
+        "--threshold",
+        "-t",
+        type=float,
+        help="Minimum user throughput threshold (filters data points below this value)",
+    )
+    args = parser.parse_args()
+    try:
+        plot_performance_comparison(args.file1, args.file2, args.output, args.threshold)
+    except FileNotFoundError as e:
+        print(f"Error: File not found - {e}")
+    except json.JSONDecodeError as e:
+        print(f"Error: Invalid JSON format - {e}")
+    except Exception as e:
+        print(f"Error: {e}")
+if __name__ == "__main__":
+    main()
--- a/components/backends/trtllm/performance_sweeps/post_process.py
+++ b/components/backends/trtllm/performance_sweeps/post_process.py
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+Post-process script for performance sweep results.
+This script processes directories containing performance sweep results and extracts:
+- Output Token Throughput (tokens/sec)
+- Output Token Throughput Per User (tokens/sec/user)
+- Deployment configuration (kind, model, total_gpus)
+- Concurrency levels
+It creates a JSON file for each subdirectory with the pattern ctx*_gen*_*
+"""
+import argparse
+import csv
+import json
+import os
+import re
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+def parse_directory_config(dir_name: str) -> Dict[str, str]:
+    """
+    Parse configuration parameters from directory name
+    Args:
+        dir_name: Directory name like 'ctx1_gen3_tep4_batch128_eplb0_mtp0'
+    Returns:
+        Dictionary containing parsed configuration parameters
+    """
+    config = {}
+    # Parse ctx and gen workers
+    ctx_match = re.search(r"ctx(\d+)", dir_name)
+    if ctx_match:
+        config["ctx_workers"] = ctx_match.group(1)
+    gen_match = re.search(r"gen(\d+)", dir_name)
+    if gen_match:
+        config["gen_workers"] = gen_match.group(1)
+    # Parse batch size
+    batch_match = re.search(r"batch(\d+)", dir_name)
+    if batch_match:
+        config["batch_size"] = batch_match.group(1)
+    # Parse eplb (expert load balancing)
+    eplb_match = re.search(r"eplb(\d+)", dir_name)
+    if eplb_match:
+        config["eplb"] = eplb_match.group(1)
+    # Parse mtp mode
+    mtp_match = re.search(r"mtp(\d+)", dir_name)
+    if mtp_match:
+        config["mtp_mode"] = mtp_match.group(1)
+    # Parse tep (tensor expert parallel) mode
+    tep_match = re.search(r"tep(\d+)", dir_name)
+    if tep_match:
+        config["tep_mode"] = tep_match.group(1)
+    # Parse dep mode
+    dep_match = re.search(r"dep(\d+)", dir_name)
+    if dep_match:
+        config["dep_mode"] = dep_match.group(1)
+    return config
+def find_ctx_gen_directories(base_path: str) -> List[str]:
+    """
+    Find all subdirectories that match the pattern ctx*_gen*_*
+    Args:
+        base_path: Base directory to search in
+    Returns:
+        List of directory paths matching the pattern
+    """
+    directories: List[str] = []
+    base_path_obj = Path(base_path)
+    if not base_path_obj.exists():
+        print(f"Error: Base path {base_path_obj} does not exist")
+        return directories
+    for item in base_path_obj.iterdir():
+        if item.is_dir() and re.match(r"ctx\d+_gen\d+_.*", item.name):
+            directories.append(str(item))
+    return directories
+def parse_deployment_config(config_path: str) -> Dict[str, str]:
+    """
+    Parse deployment configuration from JSON file
+    Args:
+        config_path: Path to deployment_config.json
+    Returns:
+        Dictionary containing kind, model, and total_gpus
+    """
+    try:
+        with open(config_path, "r") as f:
+            config = json.load(f)
+        return {
+            "kind": config.get("kind", ""),
+            "model": config.get("model", ""),
+            "total_gpus": config.get("total_gpus", ""),
+        }
+    except (FileNotFoundError, json.JSONDecodeError) as e:
+        print(f"Warning: Could not parse deployment config at {config_path}: {e}")
+        return {"kind": "", "model": "", "total_gpus": ""}
+def extract_throughput_data(csv_path: str) -> Tuple[Optional[float], Optional[float]]:
+    """
+    Extract throughput data from CSV file
+    Args:
+        csv_path: Path to profile_export_genai_perf.csv
+    Returns:
+        Tuple of (output_token_throughput, output_token_throughput_per_user)
+    """
+    try:
+        with open(csv_path, "r") as f:
+            reader = csv.reader(f)
+            output_token_throughput = None
+            output_token_throughput_per_user = None
+            for row in reader:
+                if len(row) >= 2:
+                    if row[0] == "Output Token Throughput (tokens/sec)":
+                        # Handle comma-separated numbers in quotes
+                        value_str = row[1].strip('"').replace(",", "")
+                        output_token_throughput = float(value_str)
+                    elif row[0] == "Output Token Throughput Per User (tokens/sec/user)":
+                        # This metric appears in the first section with percentiles
+                        # We need to get the average value (second column)
+                        value_str = row[1].strip('"').replace(",", "")
+                        output_token_throughput_per_user = float(value_str)
+            return output_token_throughput, output_token_throughput_per_user
+    except (FileNotFoundError, ValueError, IndexError) as e:
+        print(f"Warning: Could not parse CSV at {csv_path}: {e}")
+        return None, None
+def extract_concurrency_from_path(dir_path: str) -> Optional[int]:
+    """
+    Extract concurrency value from directory path
+    Args:
+        dir_path: Path to directory containing concurrency in name
+    Returns:
+        Concurrency value as integer, or None if not found
+    """
+    # Extract the number after 'concurrency'
+    match = re.search(r"concurrency(\d+)", dir_path, re.IGNORECASE)
+    if match:
+        return int(match.group(1))
+    return None
+def process_directory(dir_path: str) -> Optional[List[Dict[str, Any]]]:
+    """
+    Process a single directory and extract all required data
+    Args:
+        dir_path: Path to the directory to process
+    Returns:
+        Dictionary containing extracted data, or None if processing failed
+    """
+    dir_path_obj = Path(dir_path)
+    artifacts_path = dir_path_obj / "genai_perf_artifacts"
+    if not artifacts_path.exists():
+        print(f"Warning: No genai_perf_artifacts directory found in {dir_path}")
+        return None
+    # Parse deployment configuration
+    config_path = artifacts_path / "deployment_config.json"
+    if not config_path.exists():
+        print(f"Warning: No deployment_config.json found in {artifacts_path}")
+        return None
+    deployment_config = parse_deployment_config(str(config_path))
+    # Parse directory configuration
+    dir_config = parse_directory_config(dir_path_obj.name)
+    # Find CSV files in subdirectories
+    csv_files = []
+    for item in artifacts_path.iterdir():
+        if item.is_dir():
+            csv_path = item / "profile_export_genai_perf.csv"
+            if csv_path.exists():
+                csv_files.append(str(csv_path))
+    if not csv_files:
+        print(f"Warning: No CSV files found in {artifacts_path}")
+        return None
+    # Extract throughput data from each CSV file
+    results = []
+    for csv_file in csv_files:
+        output_throughput, output_throughput_per_user = extract_throughput_data(
+            csv_file
+        )
+        # Extract concurrency from the CSV file path
+        csv_path_obj = Path(csv_file)
+        concurrency = extract_concurrency_from_path(csv_path_obj.parent.name)
+        if output_throughput is not None and concurrency is not None:
+            # Safely validate and convert total_gpus
+            total_gpus = 1  # safe default
+            try:
+                if "total_gpus" not in deployment_config:
+                    print(
+                        "Warning: 'total_gpus' key missing in deployment config, using default value 1"
+                    )
+                else:
+                    total_gpus = int(deployment_config["total_gpus"])
+                    if total_gpus <= 0:
+                        print(
+                            f"Warning: Invalid total_gpus value '{deployment_config['total_gpus']}', using default value 1"
+                        )
+                        total_gpus = 1
+            except (ValueError, TypeError) as e:
+                print(
+                    f"Warning: Could not convert total_gpus '{deployment_config.get('total_gpus', 'missing')}' to int: {e}, using default value 1"
+                )
+                total_gpus = 1
+            result = {
+                "concurrency": concurrency,
+                "output_token_throughput": output_throughput,
+                "output_token_throughput_per_user": output_throughput_per_user,
+                "output_token_throughput_per_gpu": output_throughput / total_gpus,
+                "model": deployment_config["model"],
+                "kind": deployment_config["kind"],
+                "total_gpus": deployment_config["total_gpus"],
+                "ctx_workers": dir_config.get("ctx_workers", ""),
+                "gen_workers": dir_config.get("gen_workers", ""),
+                "batch_size": dir_config.get("batch_size", ""),
+                "eplb": dir_config.get("eplb", ""),
+                "mtp_mode": dir_config.get("mtp_mode", ""),
+                "tep_mode": dir_config.get("tep_mode", ""),
+                "dep_mode": dir_config.get("dep_mode", ""),
+            }
+            results.append(result)
+    return results
+def main():
+    parser = argparse.ArgumentParser(
+        description="Post-process performance sweep results"
+    )
+    parser.add_argument(
+        "base_path", help="Base directory containing performance sweep results"
+    )
+    parser.add_argument(
+        "--output-dir", help="Output directory for JSON file (default: same as input)"
+    )
+    parser.add_argument(
+        "--output-file",
+        default="performance_sweep_results.json",
+        help="Output JSON filename (default: performance_sweep_results.json)",
+    )
+    args = parser.parse_args()
+    # Find all ctx*_gen*_* directories
+    directories = find_ctx_gen_directories(args.base_path)
+    if not directories:
+        print(
+            f"No directories matching pattern 'ctx*_gen*_*' found in {args.base_path}"
+        )
+        return
+    print(f"Found {len(directories)} directories to process:")
+    for dir_path in directories:
+        print(f"  - {os.path.basename(dir_path)}")
+    # Collect all results from all directories
+    all_results: List[Dict[str, Any]] = []
+    skipped_directories = []
+    # Process each directory
+    for dir_path in directories:
+        print(f"\nProcessing {os.path.basename(dir_path)}...")
+        results = process_directory(dir_path)
+        if results is None or not results:
+            print(f"  Skipping {os.path.basename(dir_path)} - no valid data found")
+            skipped_directories.append(os.path.basename(dir_path))
+            continue
+        # Add directory name to each result for identification
+        for result in results:
+            result["directory"] = os.path.basename(dir_path)
+        all_results.extend(results)
+        # Print summary for this directory
+        print(f"  Found {len(results)} results:")
+        for result in results:
+            print(
+                f"    Concurrency {result['concurrency']}: "
+                f"{result['output_token_throughput_per_gpu']:.2f} tokens/sec/gpu, "
+                f"{result['output_token_throughput_per_user']:.2f} tokens/sec/user"
+            )
+    if not all_results:
+        print("No valid data found in any directory")
+        return
+    # Create output directory and file
+    output_dir = args.output_dir if args.output_dir else args.base_path
+    os.makedirs(output_dir, exist_ok=True)
+    output_file = os.path.join(output_dir, args.output_file)
+    with open(output_file, "w") as f:
+        json.dump(all_results, f, indent=2)
+    print(
+        f"\nCreated {output_file} with {len(all_results)} total results from {len(directories)} directories"
+    )
+    # Print summary of skipped directories
+    if skipped_directories:
+        print(f"\nSkipped directories with no valid data ({len(skipped_directories)}):")
+        for skipped_dir in skipped_directories:
+            print(f"  - {skipped_dir}")
+    else:
+        print(f"\nAll {len(directories)} directories had valid data.")
+if __name__ == "__main__":
+    main()
--- a/components/backends/trtllm/performance_sweeps/scripts/bench.sh
+++ b/components/backends/trtllm/performance_sweeps/scripts/bench.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Add error handling
+set -e
+set -u
+trap 'echo "Error occurred at line $LINENO"; exit 1' ERR
+model=$1
+multi_round=$2
+num_gen_servers=$3
+concurrency_list=$4
+streaming=$5
+log_path=$6
+total_gpus=$7
+artifacts_dir=$8
+model_path=$9
+isl=${10}
+osl=${11}
+kind=${12}
+if [ "$#" -ne 12 ]; then
+    echo "Error: Expected 12 arguments, got $#"
+    echo "Usage: $0 <model> <multi_round> <num_gen_servers> <concurrency_list> <streaming> <log_path> <total_gpus> <artifacts_dir> <model_path> <isl> <osl> <kind>"
+    exit 1
+fi
+echo "Arguments:"
+echo "  model: $model"
+echo "  multi_round: $multi_round"
+echo "  num_gen_servers: $num_gen_servers"
+echo "  concurrency_list: $concurrency_list"
+echo "  streaming: $streaming"
+echo "  log_path: $log_path"
+echo "  total_gpus: $total_gpus"
+echo "  artifacts_dir: $artifacts_dir"
+echo "  model_path: $model_path"
+echo "  isl: $isl"
+echo "  osl: $osl"
+echo "  kind: $kind"
+# check process id is not 0
+if [[ ${SLURM_PROCID} != "0" ]]; then
+    echo "Process id is ${SLURM_PROCID} for loadgen, exiting"
+    exit 0
+fi
+set -x
+config_file=${log_path}/config.yaml
+# install genai-perf
+pip install genai-perf
+# Create artifacts root directory if it doesn't exist
+if [ ! -d "${artifacts_dir}" ]; then
+    mkdir -p "${artifacts_dir}"
+fi
+hostname=$HEAD_NODE_IP
+port=8000
+echo "Hostname: ${hostname}, Port: ${port}"
+apt update
+apt install curl
+# try client
+do_get_logs(){
+    worker_log_path=$1
+    output_folder=$2
+    grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" ${worker_log_path} > ${output_folder}/gen_only.txt || true
+    grep -a "'num_generation_tokens': 0" ${worker_log_path} > ${output_folder}/ctx_only.txt || true
+}
+# The configuration is dumped to a JSON file which hold details of the OAI service
+# being benchmarked.
+deployment_config=$(cat << EOF
+{
+  "kind": "${kind}",
+  "model": "${model}",
+  "total_gpus": "${total_gpus}"
+}
+EOF
+)
+mkdir -p "${artifacts_dir}"
+if [ -f "${artifacts_dir}/deployment_config.json" ]; then
+  echo "Deployment configuration already exists. Overwriting..."
+  rm -f "${artifacts_dir}/deployment_config.json"
+fi
+echo "${deployment_config}" > "${artifacts_dir}/deployment_config.json"
+# TODO: This is a temporary fix to check if the server is up.
+# We should use a more robust health check mechanism.
+# Loop up to 50 times
+for ((i=1; i<=50; i++)); do
+    # Run curl and capture response and HTTP code
+    response=$(curl -s -w "\n%{http_code}" "${hostname}:${port}/v1/chat/completions" \
+      -H "Content-Type: application/json" \
+      -d "{
+        \"model\": \"${model}\",
+        \"messages\": [
+           {
+            \"role\": \"user\",
+            \"content\": \"Tell me a story as if we were playing dungeons and dragons.\"
+           }
+        ],
+        \"stream\": true,
+        \"max_tokens\": 30
+      }")
+    # Extract HTTP code
+    http_code=$(echo "$response" | tail -n1)
+    if [ "$http_code" = "200" ]; then
+        echo "Success on attempt $i"
+        # Optional: Print the response body (excluding HTTP code)
+        echo "$response" | sed '$d'
+        break
+    else
+        echo "Attempt $i failed (HTTP $http_code)."
+        # Wait: 100 seconds after first failure, 10 seconds after subsequent
+        if [ "$i" -eq 1 ]; then
+            sleep 300
+        else
+            sleep 10
+        fi
+    fi
+done
+if [ "$http_code" != "200" ]; then
+    echo "Server did not respond correctly after 50 attempts."
+    exit 1
+fi
+curl -v  -w "%{http_code}" "${hostname}:${port}/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{
+  "model": "'${model}'",
+  "messages": [
+  {
+    "role": "user",
+    "content": "Tell me a story as if we were playing dungeons and dragons."
+  }
+  ],
+  "stream": true,
+  "max_tokens": 30
+}'
+cp ${log_path}/output_workers.log ${log_path}/workers_start.log
+echo "Starting benchmark..."
+for concurrency in ${concurrency_list}; do
+    concurrency=$((concurrency * num_gen_servers))
+    num_prompts=$((concurrency * multi_round))
+    echo "Benchmarking with concurrency ${concurrency} ... ${num_prompts} prompts"
+    mkdir -p ${log_path}/concurrency_${concurrency}
+    genai-perf profile \
+    	--model ${model} \
+    	--tokenizer ${model_path} \
+    	--endpoint-type chat \
+    	--endpoint /v1/chat/completions \
+    	--streaming \
+    	--url ${hostname}:${port} \
+    	--synthetic-input-tokens-mean ${isl} \
+    	--synthetic-input-tokens-stddev 0 \
+    	--output-tokens-mean ${osl} \
+    	--output-tokens-stddev 0 \
+    	--extra-inputs max_tokens:${osl} \
+    	--extra-inputs min_tokens:${osl} \
+    	--extra-inputs ignore_eos:true \
+	    --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
+    	--concurrency ${concurrency} \
+    	--request-count $(($concurrency*10)) \
+    	--warmup-request-count $(($concurrency*2)) \
+	    --num-dataset-entries ${num_prompts} \
+    	--random-seed 100 \
+    	--artifact-dir ${artifacts_dir} \
+    	-- \
+    	-v \
+    	--max-threads ${concurrency} \
+    	-H 'Authorization: Bearer NOT USED' \
+    	-H 'Accept: text/event-stream'
+    echo "Benchmark with concurrency ${concurrency} done"
+    do_get_logs ${log_path}/output_workers.log ${log_path}/concurrency_${concurrency}
+    echo -n "" > ${log_path}/output_workers.log
+done
+job_id=${SLURM_JOB_ID}
+if [ -n "${job_id}" ]; then
+    echo "${SLURM_JOB_NODELIST}" > ${log_path}/job_${job_id}.txt
+fi
--- a/components/backends/trtllm/performance_sweeps/scripts/gen_yaml.py
+++ b/components/backends/trtllm/performance_sweeps/scripts/gen_yaml.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+import argparse
+import os
+import re
+from typing import Any, Dict, List
+import yaml
+def process_node_and_task() -> tuple[int, List[str], List[str]]:
+    """
+    Process SLURM node and task environment variables.
+    Returns:
+        tuple: (max_tasks_per_node, nodes, task_nodes)
+    """
+    slurm_job_nodelist = os.getenv("SLURM_JOB_NODELIST", "")
+    print(f"SLURM_JOB_NODELIST: {slurm_job_nodelist}")
+    if not slurm_job_nodelist:
+        raise ValueError("Environment variable SLURM_JOB_NODELIST not found.")
+    slurm_tasks_per_node = os.getenv("SLURM_TASKS_PER_NODE", "")
+    print(f"SLURM_TASKS_PER_NODE: {slurm_tasks_per_node}")
+    if not slurm_tasks_per_node:
+        raise ValueError("Environment variable SLURM_TASKS_PER_NODE not found.")
+    # Generate list of nodes
+    if "[" in slurm_job_nodelist:
+        # Handle nodelist with range format (e.g., "ptyche[0065-0066]" or "nvl72029-T[15,18]")
+        node_prefix = slurm_job_nodelist.split("[")[0]  # Extract everything before '['
+        node_range_match = re.search(r"\[(.*?)\]", slurm_job_nodelist)
+        if node_range_match is None:
+            raise ValueError(
+                f"Invalid nodelist format: expected range in brackets but found '{slurm_job_nodelist}'"
+            )
+        node_range = node_range_match.group(1)
+        nodes = []
+        for part in node_range.split(","):
+            if "-" in part:
+                start, end = part.split("-")
+                # Get the width of the number format from the first number
+                width = len(start)
+                # Convert to integers after getting the width
+                start, end = int(start), int(end)
+                # Format numbers with leading zeros
+                nodes.extend(
+                    [
+                        f"{node_prefix}{str(i).zfill(width)}"
+                        for i in range(start, end + 1)
+                    ]
+                )
+            else:
+                # Preserve the original format for single numbers
+                nodes.append(f"{node_prefix}{part}")
+    else:
+        # Handle single node format (e.g., "ptyche0065")
+        nodes = [slurm_job_nodelist]
+    print(f"Nodes: {nodes}")
+    # Generate tasks per node
+    tasks_per_node = []
+    for part in slurm_tasks_per_node.split(","):
+        if "(x" in part:
+            count, repeat = map(int, re.findall(r"\d+", part))
+            tasks_per_node.extend([count] * repeat)
+        else:
+            tasks_per_node.append(int(part))
+    print(f"Tasks per node: {tasks_per_node}")
+    if len(tasks_per_node) != len(nodes):
+        raise ValueError(
+            f"Number of nodes and tasks per node do not match. Number of nodes: {len(nodes)}, Number of tasks per node: {len(tasks_per_node)}"
+        )
+    max_tasks_per_node = max(tasks_per_node)
+    task_nodes = []
+    for node, tasks in zip(nodes, tasks_per_node):
+        task_nodes.extend([node] * tasks)
+    return max_tasks_per_node, nodes, task_nodes
+def generate_urls(
+    ctx_or_gen: str,
+    num_instances: int,
+    tensor_parallel_size: int,
+    pipeline_parallel_size: int,
+    max_tasks_per_node: int,
+    nodes: List[str],
+    task_nodes: List[str],
+    node_to_port: Dict[str, int],
+    task_nodes_offset: int = 0,
+) -> tuple[List[str], int]:
+    """
+    Generate URLs for context or generation servers.
+    Returns:
+        tuple: (urls, updated_task_nodes_offset)
+    """
+    urls: List[str] = []
+    for instance in range(num_instances):
+        tasks_needed = tensor_parallel_size * pipeline_parallel_size
+        if (task_nodes_offset + tasks_needed) > len(task_nodes):
+            print(f"{ctx_or_gen} urls so far: {urls}")
+            raise ValueError(
+                f"For {ctx_or_gen} instance {instance}, there are not enough tasks available. task_nodes_offset: {task_nodes_offset}, tasks_needed: {tasks_needed}, len(task_nodes): {len(task_nodes)}"
+            )
+        min_node = (tasks_needed + max_tasks_per_node - 1) // max_tasks_per_node
+        instance_nodes = set(
+            task_nodes[task_nodes_offset : task_nodes_offset + tasks_needed]
+        )
+        if len(instance_nodes) > min_node:
+            raise ValueError(
+                f"Tasks for a instance {instance} of {ctx_or_gen} instances use more node than expected. Nodes used: {instance_nodes}, number of nodes expected: {min_node}, max_tasks_per_node: {max_tasks_per_node}"
+            )
+        node = task_nodes[task_nodes_offset]
+        port = node_to_port[node]
+        node_to_port[node] += 1
+        task_nodes_offset += tasks_needed
+        urls.append(f"{node}:{port}")
+    print(f"{ctx_or_gen} urls: {urls}")
+    return urls, task_nodes_offset
+def gen_config_file(
+    config_path: str,
+    decode_config_path: str,
+    instance_config_path: str,
+    model_path: str,
+    num_ctx_servers: int,
+    ctx_tp_size: int,
+    ctx_batch_size: int,
+    ctx_max_num_tokens: int,
+    ctx_max_seq_len: int,
+    ctx_free_gpu_memory_fraction: float,
+    ctx_enable_attention_dp: bool,
+    num_gen_servers: int,
+    gen_tp_size: int,
+    gen_batch_size: int,
+    gen_max_num_tokens: int,
+    gen_max_seq_len: int,
+    gen_enable_attention_dp: bool,
+    gen_gpu_memory_fraction: float,
+    eplb_num_slots: int,
+    mtp_size: int = 0,
+    worker_start_port: int = 8001,
+    server_port: int = 8000,
+    cache_transceiver_max_num_tokens: int = 4608,
+) -> None:
+    """
+    Generate configuration YAML file for disaggregated inference.
+    Args:
+        config_path: Path to save the config file
+        model_path: Path to the model
+        num_ctx_servers: Number of context servers
+        ctx_tp_size: Tensor parallel size for context servers
+        ctx_batch_size: Batch size for context servers
+        ctx_max_num_tokens: Max number of tokens for context servers
+        ctx_max_seq_len: Max sequence length for context servers
+        ctx_free_gpu_memory_fraction: Free GPU memory fraction for context servers
+        ctx_enable_attention_dp: Enable attention DP for context servers
+        num_gen_servers: Number of generation servers
+        gen_tp_size: Tensor parallel size for generation servers
+        gen_batch_size: Batch size for generation servers
+        gen_max_num_tokens: Max number of tokens for generation servers
+        gen_enable_attention_dp: Enable attention DP for generation servers
+        gen_gpu_memory_fraction: GPU memory fraction for generation servers
+        eplb_num_slots: Number of slots for eplb
+        worker_start_port: Start port for workers
+        server_port: Server port
+    """
+    gen_cuda_graph_batch_sizes = [
+        1,
+        2,
+        4,
+        8,
+        16,
+        32,
+        64,
+        128,
+        256,
+        512,
+        768,
+        1024,
+        2048,
+        gen_batch_size,
+    ]
+    gen_moe_backend = "CUTLASS"
+    if gen_tp_size >= 16 and gen_enable_attention_dp:
+        gen_moe_backend = "WIDEEP"
+    if not gen_enable_attention_dp:
+        gen_moe_backend = "TRTLLM"
+    prefill_config: Dict[str, Any] = {
+        "max_batch_size": ctx_batch_size,
+        "max_num_tokens": ctx_max_num_tokens,
+        "max_seq_len": ctx_max_seq_len,
+        "tensor_parallel_size": ctx_tp_size,
+        "moe_expert_parallel_size": ctx_tp_size,
+        "enable_attention_dp": ctx_enable_attention_dp,
+        "pipeline_parallel_size": 1,
+        "print_iter_log": True,
+        "disable_overlap_scheduler": True,
+        "kv_cache_config": {
+            "enable_block_reuse": False,
+            "free_gpu_memory_fraction": ctx_free_gpu_memory_fraction,
+            "dtype": "fp8",
+        },
+        "cache_transceiver_config": {
+            "max_tokens_in_buffer": cache_transceiver_max_num_tokens,
+            "backend": "default",
+        },
+    }
+    decode_config: Dict[str, Any] = {
+        "tensor_parallel_size": gen_tp_size,
+        "moe_expert_parallel_size": gen_tp_size,
+        "enable_attention_dp": gen_enable_attention_dp,
+        "pipeline_parallel_size": 1,
+        "max_batch_size": gen_batch_size,
+        "max_num_tokens": gen_max_num_tokens,
+        "max_seq_len": gen_max_seq_len,
+        "cuda_graph_config": {
+            "enable_padding": True,
+            "batch_sizes": gen_cuda_graph_batch_sizes,
+        },
+        "print_iter_log": True,
+        "kv_cache_config": {
+            "enable_block_reuse": False,
+            "free_gpu_memory_fraction": gen_gpu_memory_fraction,
+            "dtype": "fp8",
+        },
+        "moe_config": {
+            "backend": gen_moe_backend,
+        },
+        "cache_transceiver_config": {
+            "max_tokens_in_buffer": cache_transceiver_max_num_tokens,
+            "backend": "default",
+        },
+        "stream_interval": 20,
+    }
+    if gen_tp_size == 8 and not gen_enable_attention_dp:
+        decode_config["allreduce_strategy"] = "MNNVL"
+    if eplb_num_slots > 0:
+        moe_load_balancer_file = os.path.join(
+            os.path.dirname(config_path), "moe_load_balancer.yaml"
+        )
+        moe_load_balancer_config = {
+            "num_slots": eplb_num_slots,
+            "layer_updates_per_iter": 1,
+        }
+        with open(moe_load_balancer_file, "w") as f:
+            yaml.dump(
+                moe_load_balancer_config, f, default_flow_style=False, sort_keys=False
+            )
+        decode_config["moe_config"]["load_balancer"] = moe_load_balancer_file
+    if mtp_size > 0:
+        prefill_config["speculative_config"] = {
+            "decoding_type": "MTP",
+            "num_nextn_predict_layers": mtp_size,
+        }
+        decode_config["speculative_config"] = {
+            "decoding_type": "MTP",
+            "num_nextn_predict_layers": mtp_size,
+        }
+    counts = {"prefill_count": num_ctx_servers, "decode_count": num_gen_servers}
+    with open(instance_config_path, "w") as f:
+        yaml.dump(counts, f, default_flow_style=False, sort_keys=False)
+    # Write config to file
+    with open(config_path, "w") as f:
+        yaml.dump(prefill_config, f, default_flow_style=False, sort_keys=False)
+    with open(decode_config_path, "w") as f:
+        yaml.dump(decode_config, f, default_flow_style=False, sort_keys=False)
+# gen main and args
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--config", type=str, default="/tmp/config.yaml")
+    parser.add_argument("--model", type=str, required=True, help="Path to the model")
+    parser.add_argument(
+        "--num_ctx_servers", type=int, required=True, help="Number of context servers"
+    )
+    parser.add_argument(
+        "--ctx_tp_size",
+        type=int,
+        required=True,
+        help="Tensor parallel size for context servers",
+    )
+    parser.add_argument(
+        "--ctx_batch_size",
+        type=int,
+        required=True,
+        help="Batch size for context servers",
+    )
+    parser.add_argument(
+        "--ctx_max_num_tokens",
+        type=int,
+        required=True,
+        help="Max number of tokens for context servers",
+    )
+    parser.add_argument(
+        "--ctx_max_seq_len",
+        type=int,
+        required=True,
+        help="Max sequence length for context servers",
+    )
+    parser.add_argument(
+        "--ctx_free_gpu_memory_fraction",
+        type=float,
+        required=True,
+        help="Free GPU memory fraction for context servers",
+    )
+    parser.add_argument(
+        "--ctx_enable_attention_dp",
+        dest="ctx_enable_attention_dp",
+        action="store_true",
+        help="Enable attention DP for context servers",
+    )
+    parser.add_argument(
+        "--num_gen_servers",
+        type=int,
+        required=True,
+        help="Number of generation servers",
+    )
+    parser.add_argument(
+        "--gen_tp_size",
+        type=int,
+        required=True,
+        help="Tensor parallel size for generation servers",
+    )
+    parser.add_argument(
+        "--gen_batch_size",
+        type=int,
+        required=True,
+        help="Batch size for generation servers",
+    )
+    parser.add_argument(
+        "--gen_max_num_tokens",
+        type=int,
+        required=True,
+        help="Max number of tokens for generation servers",
+    )
+    parser.add_argument(
+        "--gen_max_seq_len",
+        type=int,
+        required=True,
+        help="Max sequence length for generation servers",
+    )
+    parser.add_argument(
+        "--gen_enable_attention_dp",
+        dest="gen_enable_attention_dp",
+        action="store_true",
+        help="Enable attention DP for generation servers",
+    )
+    parser.add_argument(
+        "--gen_gpu_memory_fraction",
+        type=float,
+        required=True,
+        help="GPU memory fraction for generation servers",
+    )
+    parser.add_argument(
+        "--eplb_num_slots", type=int, default=0, help="Number of slots for eplb"
+    )
+    parser.add_argument(
+        "--mtp_size", type=int, default=0, help="Number of nextn layers for MTP"
+    )
+    parser.add_argument(
+        "--worker_start_port", type=int, default=8336, help="Start port for workers"
+    )
+    parser.add_argument("--server_port", type=int, default=8333, help="Server port")
+    parser.add_argument(
+        "--cache_transceiver_max_num_tokens",
+        type=int,
+        default=4608,
+        help="Max number of tokens for cache transceiver",
+    )
+    args = parser.parse_args()
+    prefill_config = args.config.replace("config.yaml", "prefill_config.yaml")
+    decode_config = args.config.replace("config.yaml", "decode_config.yaml")
+    instance_config = args.config.replace("config.yaml", "instance_config.yaml")
+    gen_config_file(
+        prefill_config,
+        decode_config,
+        instance_config,
+        args.model,
+        args.num_ctx_servers,
+        args.ctx_tp_size,
+        args.ctx_batch_size,
+        args.ctx_max_num_tokens,
+        args.ctx_max_seq_len,
+        args.ctx_free_gpu_memory_fraction,
+        args.ctx_enable_attention_dp,
+        args.num_gen_servers,
+        args.gen_tp_size,
+        args.gen_batch_size,
+        args.gen_max_num_tokens,
+        args.gen_max_seq_len,
+        args.gen_enable_attention_dp,
+        args.gen_gpu_memory_fraction,
+        args.eplb_num_slots,
+        args.mtp_size,
+        args.worker_start_port,
+        args.server_port,
+        args.cache_transceiver_max_num_tokens,
+    )
--- a/components/backends/trtllm/performance_sweeps/scripts/start_agg_worker.sh
+++ b/components/backends/trtllm/performance_sweeps/scripts/start_agg_worker.sh
+#! /bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+model_path=$1
+max_batch=$2
+max_num_tokens=$3
+tp_size=$4
+ep_size=$5
+enable_attention_dp=$6
+gpu_fraction=$7
+max_seq_len=$8
+mtp=$9
+model_name=${10}
+# Validate all required parameters
+if [ -z "${model_path}" ] || [ -z "${max_batch}" ] || [ -z "${max_num_tokens}" ] || [ -z "${tp_size}" ] || [ -z "${ep_size}" ] || [ -z "${enable_attention_dp}" ] || [ -z "${gpu_fraction}" ] || [ -z "${max_seq_len}" ] || [ -z "${mtp}" ] || [ -z "${model_name}" ]; then
+    echo "Error: Missing required arguments"
+    echo "Usage: $0 model_path max_batch max_num_tokens tp_size ep_size enable_attention_dp gpu_fraction max_seq_len mtp model_name"
+    echo ""
+    echo "Parameters:"
+    echo "  model_path: Path to the model"
+    echo "  max_batch: Maximum batch size (integer)"
+    echo "  max_num_tokens: Maximum number of tokens (integer)"
+    echo "  tp_size: Tensor parallel size (integer)"
+    echo "  ep_size: Expert parallel size (integer)"
+    echo "  enable_attention_dp: Enable attention data parallel (true/false)"
+    echo "  gpu_fraction: GPU memory fraction (float 0.0-1.0)"
+    echo "  max_seq_len: Maximum sequence length (integer)"
+    echo "  mtp: MTP size (integer)"
+    echo "  model_name: Name of the model to serve"
+    exit 1
+fi
+# Validate numeric parameters
+if ! [[ "${max_batch}" =~ ^[0-9]+$ ]]; then
+    echo "Error: max_batch must be a positive integer, got: ${max_batch}"
+    exit 1
+fi
+if ! [[ "${max_num_tokens}" =~ ^[0-9]+$ ]]; then
+    echo "Error: max_num_tokens must be a positive integer, got: ${max_num_tokens}"
+    exit 1
+fi
+if ! [[ "${tp_size}" =~ ^[0-9]+$ ]]; then
+    echo "Error: tp_size must be a positive integer, got: ${tp_size}"
+    exit 1
+fi
+if ! [[ "${ep_size}" =~ ^[0-9]+$ ]]; then
+    echo "Error: ep_size must be a positive integer, got: ${ep_size}"
+    exit 1
+fi
+if ! [[ "${gpu_fraction}" =~ ^[0-9]*\.?[0-9]+$ ]] || (( $(echo "${gpu_fraction} <= 0" | bc -l) )) || (( $(echo "${gpu_fraction} > 1" | bc -l) )); then
+    echo "Error: gpu_fraction must be a float between 0.0 and 1.0, got: ${gpu_fraction}"
+    exit 1
+fi
+if ! [[ "${max_seq_len}" =~ ^[0-9]+$ ]]; then
+    echo "Error: max_seq_len must be a positive integer, got: ${max_seq_len}"
+    exit 1
+fi
+if ! [[ "${mtp}" =~ ^[0-9]+$ ]]; then
+    echo "Error: mtp must be a positive integer, got: ${mtp}"
+    exit 1
+fi
+# Validate enable_attention_dp is true or false
+if [ "${enable_attention_dp}" != "true" ] && [ "${enable_attention_dp}" != "false" ]; then
+    echo "Error: enable_attention_dp must be 'true' or 'false', got: ${enable_attention_dp}"
+    exit 1
+fi
+# echo all parameters
+echo "model_path: ${model_path}"
+echo "max_batch: ${max_batch}"
+echo "max_num_tokens: ${max_num_tokens}"
+echo "tp_size: ${tp_size}"
+echo "ep_size: ${ep_size}"
+echo "enable_attention_dp: ${enable_attention_dp}"
+echo "gpu_fraction: ${gpu_fraction}"
+echo "max_seq_len: ${max_seq_len}"
+echo "mtp: ${mtp}"
+# check enable_attention_dp is true or false
+if [ ${enable_attention_dp} == "true" ]; then
+    enable_attention_dp_flag="true"
+    moe_backend="CUTLASS"
+else
+    enable_attention_dp_flag="false"
+    moe_backend="TRTLLM"
+fi
+extra_llm_api_file=/tmp/extra-llm-api-config.yml
+if [ ${mtp} -gt 0 ]; then
+cat << EOF > ${extra_llm_api_file}
+tensor_parallel_size: ${tp_size}
+moe_expert_parallel_size: ${ep_size}
+max_batch_size: ${max_batch}
+max_num_tokens: ${max_num_tokens}
+max_seq_len: ${max_seq_len}
+trust_remote_code: true
+cuda_graph_config:
+    enable_padding: true
+    max_batch_size: ${max_batch}
+kv_cache_config:
+    dtype: fp8
+    free_gpu_memory_fraction: ${gpu_fraction}
+    enable_block_reuse: false
+print_iter_log: true
+enable_attention_dp: ${enable_attention_dp_flag}
+stream_interval: 10
+speculative_config:
+  decoding_type: MTP
+  num_nextn_predict_layers: ${mtp}
+moe_config:
+    backend: ${moe_backend}
+    max_num_tokens: 37376
+EOF
+else
+cat << EOF > ${extra_llm_api_file}
+tensor_parallel_size: ${tp_size}
+moe_expert_parallel_size: ${ep_size}
+max_batch_size: ${max_batch}
+max_num_tokens: ${max_num_tokens}
+max_seq_len: ${max_seq_len}
+trust_remote_code: true
+cuda_graph_config:
+    enable_padding: true
+    max_batch_size: ${max_batch}
+enable_attention_dp: ${enable_attention_dp_flag}
+print_iter_log: true
+kv_cache_config:
+    dtype: fp8
+    free_gpu_memory_fraction: ${gpu_fraction}
+    enable_block_reuse: false
+stream_interval: 10
+moe_config:
+    backend: ${moe_backend}
+    max_num_tokens: 37376
+EOF
+fi
+echo "extra_llm_api_file generated: ${extra_llm_api_file}"
+cat ${extra_llm_api_file}
+echo "TRT_LLM_VERSION: $TRT_LLM_VERSION"
+echo "TRT_LLM_GIT_COMMIT: $TRT_LLM_GIT_COMMIT"
+# start the server
+trtllm-llmapi-launch python3 -m dynamo.trtllm --model-path $model_path --served-model-name $model_name --extra-engine-args ${extra_llm_api_file}
--- a/components/backends/trtllm/performance_sweeps/scripts/start_frontend.sh
+++ b/components/backends/trtllm/performance_sweeps/scripts/start_frontend.sh
+#! /bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+echo "commit id: $TRT_LLM_GIT_COMMIT"
+echo "ucx info: $(ucx_info -v)"
+echo "hostname: $(hostname)"
+hostname=$(hostname)
+short_hostname=$(echo "$hostname" | awk -F'.' '{print $1}')
+echo "short_hostname: ${short_hostname}"
+# Start NATS
+nats-server -js &
+# Start etcd
+etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
+# Wait for NATS/etcd to startup
+sleep 3
+# Start OpenAI Frontend which will dynamically discover workers when they startup
+# NOTE: This is a blocking call.
+python3 -m dynamo.frontend --http-port 8000
--- a/components/backends/trtllm/performance_sweeps/scripts/start_worker.sh
+++ b/components/backends/trtllm/performance_sweeps/scripts/start_worker.sh
+#! /bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+config_file=$1
+enable_pdl=$2
+ctx_gpus=$3
+model_name=$4
+model_path=$5
+disaggregation_mode=$6
+unset UCX_TLS
+echo "config_file: ${config_file}, enable_pdl: ${enable_pdl}, ctx_gpus: ${ctx_gpus}, disaggregation_mode: ${disaggregation_mode}"
+export TLLM_LOG_LEVEL=INFO
+export TRTLLM_MOE_ENABLE_ALLTOALL_WITHOUT_ALLGATHER=1
+if [ "${enable_pdl}" = "true" ]; then
+    export TRTLLM_ENABLE_PDL=1
+fi
+trtllm-llmapi-launch python3 -m dynamo.trtllm --model-path $model_path --served-model-name $model_name --disaggregation-mode $disaggregation_mode  --extra-engine-args $config_file
--- a/components/backends/trtllm/performance_sweeps/set_clock.sh
+++ b/components/backends/trtllm/performance_sweeps/set_clock.sh
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+set -x
+# nvidia-smi -lgc 1965" ### This way doesn't work on GB200-NVL. bug 5069775
+hostname
+nvidia-smi
+MAX_GPU_CLOCK=$(nvidia-smi -q -d CLOCK | grep -m 1 -A 1 Max | awk '/Graphics/ {print $3}')
+MAX_MEM_CLOCK=$(nvidia-smi -q -d CLOCK | grep -m 1 -A 4 Max | awk '/Memory/ {print $3}')
+echo "Setting application clock to Mem Clock: $MAX_MEM_CLOCK and GPU Clock: $MAX_GPU_CLOCK."
+sudo nvidia-smi -rgc
+sudo nvidia-smi -ac $MAX_MEM_CLOCK,$MAX_GPU_CLOCK
\ No newline at end of file
--- a/components/backends/trtllm/performance_sweeps/submit.sh
+++ b/components/backends/trtllm/performance_sweeps/submit.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+if [[ -z ${MODEL_PATH} ]]; then
+    echo "ERROR: MODEL_PATH was not set."
+    echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \
+         "downloaded path to the model weights. Since Deepseek R1 is large, it is " \
+         "recommended to pre-download them to a shared location and provide the path."
+    exit 1
+fi
+if [[ -z ${SERVED_MODEL_NAME} ]]; then
+    echo "ERROR: SERVED_MODEL_NAME was not set."
+    exit 1
+fi
+IMAGE="${IMAGE:-""}"
+# For GB200, we use 4 tasks per node.
+NTASKS_PER_NODE="${NTASKS_PER_NODE:-4}"
+ISL="${ISL:-8150}"
+OSL="${OSL:-1024}"
+# Build slurm_args step-by-step with validation and defaults
+slurm_args="--time=04:00:00"
+# Add partition if set
+if [[ -n "${SLURM_PARTITION:-}" ]]; then
+    slurm_args="${slurm_args} --partition=${SLURM_PARTITION}"
+fi
+# Add account if set
+if [[ -n "${SLURM_ACCOUNT:-}" ]]; then
+    slurm_args="${slurm_args} --account=${SLURM_ACCOUNT}"
+fi
+# Add job name with sensible default
+if [[ -n "${SLURM_JOB_NAME:-}" ]]; then
+    slurm_args="${slurm_args} --job-name=${SLURM_JOB_NAME}"
+fi
+# Usage Instructions
+usage() {
+    echo "Usage: $0 <mtp_mode> <mode> [ctx_num] [gen_num] [gen_tp_size] [gen_batch_size] [gen_max_num_tokens] [gen_gpu_memory_fraction] [gen_eplb_num_slots] [gen_mtp_size] [gen_concurrency_list]"
+    echo ""
+    echo "MTP Modes:"
+    echo "  mtp=off - Run without Multi-Token Prediction (gen_mtp_size=0)"
+    echo "  mtp=on  - Run with Multi-Token Prediction (gen_mtp_size=1,2,3)"
+    echo ""
+    echo "Execution Modes:"
+    echo "  all - Run all predefined GPU configurations (4, 8, 16, 32 GPUs)"
+    echo "  tep - Run Tensor-Expert Parallel mode (attention_dp=false)"
+    echo "  dep - Run Data-Expert Parallel mode (attention_dp=true)"
+    echo "  4GPU, 8GPU, 16GPU, 32GPU - Run specific GPU configurations"
+    echo ""
+    echo "Parameters for tep/dep modes:"
+    echo "  ctx_num: Number of context nodes"
+    echo "  gen_num: Number of generation nodes"
+    echo "  gen_tp_size: Generation tensor parallel size"
+    echo "  gen_batch_size: Generation batch size"
+    echo "  gen_max_num_tokens: Generation max number of tokens"
+    echo "  gen_gpu_memory_fraction: GPU memory fraction (0.7-0.95)"
+    echo "  gen_mtp_size: Multi-Token Prediction size (0 for mtp=off, 1-3 for mtp=on)"
+    echo "  gen_eplb_num_slots: Expert load balancing slots (0, 256, 288)"
+    echo "  gen_concurrency_list: Concurrency values (space-separated, quoted)"
+    echo ""
+    echo "Examples:"
+    echo "  $0 mtp=off all                                    # Run all MTP0 predefined combinations"
+    echo "  $0 mtp=on all                                     # Run all MTP predefined combinations"
+    echo "  $0 mtp=off tep 1 3 4 128 128 0.9 0 0 \"1 2 4 8\" # Run MTP0 TEP with specific config"
+    echo "  $0 mtp=on dep 2 3 8 256 256 0.8 0 256 \"256 512 1024\" # Run MTP DEP with specific config"
+    exit 1
+}
+# Run single task
+run_single() {
+    local ctx_num=$1
+    local gen_num=$2
+    local gen_tp_size=$3
+    local gen_batch_size=$4
+    local gen_max_num_tokens=$5
+    local gen_enable_attention_dp=$6
+    local gen_gpu_memory_fraction=$7
+    local gen_mtp_size=$8
+    local gen_eplb_num_slots=$9
+    local gen_concurrency_list=${10}
+    # TODO: expose kind to the command line
+    local kind="dynamo_disagg"
+    gen_nodes=$(((gen_tp_size + 3)/4 * gen_num))
+    total_nodes=$((ctx_num + gen_nodes))
+    total_tasks=$((total_nodes * 4))
+    set -x
+    sbatch --nodes=${total_nodes} --ntasks=${total_tasks} --ntasks-per-node=${NTASKS_PER_NODE} --segment=${total_nodes} ${slurm_args} benchmark.slurm ${ctx_num} 4 1 8448 true ${gen_num} ${gen_tp_size} ${gen_batch_size} ${gen_max_num_tokens} ${gen_enable_attention_dp} ${gen_gpu_memory_fraction} ${gen_eplb_num_slots} ${gen_mtp_size} "${gen_concurrency_list}" ${gen_nodes} ${kind} ${MODEL_PATH} ${SERVED_MODEL_NAME} ${IMAGE} ${ISL} ${OSL}
+    set +x
+}
+# MTP0 Configuration (gen_mtp_size=0)
+run_4_gpus_mtp0() {
+    echo "Running 4 GPUs MTP0 combinations..."
+    run_single 1 4 4 16 16 false "0.9" 0 0 "1 2 4 8 16 24 "
+    run_single 1 3 4 32 32 false "0.9" 0 0 "32 48"
+    run_single 1 2 4 64 64 false "0.9" 0 0 "64 96"
+    run_single 2 3 4 128 128 false "0.9" 0 0 "128 192"
+    run_single 3 2 4 64 64 true "0.8" 0 0 "256 384"
+    run_single 2 1 4 128 128 true "0.8" 0 0 "512 768"
+}
+run_8_gpus_mtp0() {
+    echo "Running 8 GPUs MTP0 combinations..."
+    run_single 1 4 8 16 16 false "0.9" 0 0 "1 2 4 8 16 24"
+    run_single 1 2 8 32 32 false "0.9" 0 0 "32 48"
+    run_single 2 3 8 64 64 false "0.9" 0 0 "64 96"
+    run_single 1 1 8 128 128 false "0.9" 0 0 "128 192"
+    run_single 3 2 8 32 32 true "0.8" 0 0 "256 384"
+    run_single 3 1 8 64 64 true "0.8" 0 0 "512 768"
+    run_single 4 1 8 128 128 true "0.8" 0 0 "1024 1536"
+    run_single 6 1 8 256 256 true "0.8" 0 0 "2048 3072"
+}
+run_16_gpus_mtp0() {
+    echo "Running 16 GPUs MTP0 combinations..."
+    run_single 1 1 16 8 8 true "0.8" 0 0 "16 32 64 128 192" # 5
+    run_single 2 1 16 16 16 true "0.8" 0 0 "256 384"        # 6
+    run_single 4 1 16 32 32 true "0.8" 0 0 "512 768"        # 8
+    run_single 6 1 16 64 64 true "0.8" 0 0 "1024 1536"      # 10
+    run_single 9 1 16 128 128 true "0.8" 0 0 "2048 3072"    # 13
+    run_single 12 1 16 256 256 true "0.8" 0 288 "4096 6144" # 16
+}
+run_32_gpus_mtp0() {
+    echo "Running 32 GPUs MTP0 combinations..."
+    run_single 1 1 32 4 4 true "0.7" 0 0 "32 64 128 192"   # 9
+    run_single 2 1 32 8 8 true "0.7" 0 0 "256 384"         # 10
+    run_single 4 1 32 16 16 true "0.7" 0 0 "512 768"       # 12
+    run_single 7 1 32 32 32 true "0.7" 0 0 "1024 1536"     # 15
+}
+# MTP Configuration (gen_mtp_size=1,2,3)
+run_4_gpus_mtp() {
+    echo "Running 4 GPUs MTP combinations..."
+    run_single 1 4 4 8 32 false "0.9" 3 0 "1 2 4 8 12"
+    run_single 1 3 4 16 64 false "0.9" 3 0 "16 24"
+    run_single 1 2 4 32 128 false "0.9" 3 0 "32 48"
+    run_single 2 3 4 16 64 true "0.8" 3 0 "64 96"
+    run_single 1 1 4 32 128 true "0.8" 3 0 "128 192"
+    run_single 2 1 4 64 256 true "0.8" 2 0 "256 384"
+    run_single 5 2 4 128 512 true "0.8" 1 0 "512 768"
+}
+run_8_gpus_mtp() {
+    echo "Running 8 GPUs MTP combinations..."
+    run_single 1 4 8 8 32 false "0.9" 3 0 "1 2 4 8 12"
+    run_single 1 2 8 16 64 false "0.9" 3 0 "16 24"
+    run_single 1 1 8 32 128 false "0.9" 3 0 "32 48"
+    run_single 1 1 8 8 32 true "0.8" 3 0 "64 96"
+    run_single 3 2 8 16 64 true "0.8" 3 0 "128 192"
+    run_single 5 2 8 32 128 true "0.8" 3 0 "256 384"
+    run_single 4 1 8 64 256 true "0.8" 2 0 "512 768"
+    run_single 6 1 8 128 256 true "0.8" 1 0 "1024 1536"
+    run_single 7 1 8 256 512 true "0.8" 1 0 "2048 3072"
+}
+run_16_gpus_mtp() {
+    echo "Running 16 GPUs MTP combinations..."
+    run_single 1 1 16 4 16 true "0.8" 3 0 "16 32 64 96" # 5
+    run_single 2 1 16 8 32 true "0.8" 3 0 "128 192"       # 6
+    run_single 4 1 16 16 64 true "0.8" 3 0 "256 384"      # 8
+    run_single 6 1 16 32 128 true "0.8" 3 0 "512 768"    # 10
+    run_single 9 1 16 64 256 true "0.8" 2 256 "1024 1536" # 13
+    run_single 11 1 16 128 256 true "0.8" 1 288 "2048 3072" # 15
+}
+run_32_gpus_mtp() {
+    echo "Running 32 GPUs MTP combinations..."
+    run_single 1 1 32 16 64 true "0.7" 3 0 "32 48" # 9
+    run_single 2 1 32 16 64 true "0.7" 3 0 "64 96" # 10
+    run_single 3 1 32 4 16 true "0.7" 3 0 "128 192" # 11
+    run_single 5 1 32 8 32 true "0.7" 3 0 "256 384" # 13
+    run_single 8 1 32 16 64 true "0.7" 3 288 "512 768" # 16
+}
+# Main function
+main() {
+    local mtp_mode=$1
+    local mode=$2
+    # Validate MTP mode
+    if [[ "$mtp_mode" != "mtp=off" && "$mtp_mode" != "mtp=on" ]]; then
+        echo "Error: Invalid MTP mode '$mtp_mode'. Must be 'mtp=off' or 'mtp=on'"
+        usage
+    fi
+    case $mode in
+        "all")
+            echo "Running all GPU configurations for $mtp_mode mode..."
+            if [[ "$mtp_mode" == "mtp=off" ]]; then
+                run_4_gpus_mtp0
+                run_8_gpus_mtp0
+                run_16_gpus_mtp0
+                run_32_gpus_mtp0
+            else
+                run_4_gpus_mtp
+                run_8_gpus_mtp
+                run_16_gpus_mtp
+                run_32_gpus_mtp
+            fi
+            ;;
+        "4GPU")
+            echo "Running 4 GPUs combinations for $mtp_mode mode..."
+            if [[ "$mtp_mode" == "mtp=off" ]]; then
+                run_4_gpus_mtp0
+            else
+                run_4_gpus_mtp
+            fi
+            ;;
+        "8GPU")
+            echo "Running 8 GPUs combinations for $mtp_mode mode..."
+            if [[ "$mtp_mode" == "mtp=off" ]]; then
+                run_8_gpus_mtp0
+            else
+                run_8_gpus_mtp
+            fi
+            ;;
+        "16GPU")
+            echo "Running 16 GPUs combinations for $mtp_mode mode..."
+            if [[ "$mtp_mode" == "mtp=off" ]]; then
+                run_16_gpus_mtp0
+            else
+                run_16_gpus_mtp
+            fi
+            ;;
+        "32GPU")
+            echo "Running 32 GPUs combinations for $mtp_mode mode..."
+            if [[ "$mtp_mode" == "mtp=off" ]]; then
+                run_32_gpus_mtp0
+            else
+                run_32_gpus_mtp
+            fi
+            ;;
+        "tep")
+            if [ $# -ne 11 ]; then
+                echo "Error: TEP mode requires 11 additional parameters (including mtp_mode)"
+                usage
+            fi
+            local ctx_num=$3
+            local gen_num=$4
+            local gen_tp_size=$5
+            local gen_batch_size=$6
+            local gen_max_num_tokens=$7
+            local gen_gpu_memory_fraction=$8
+            local gen_mtp_size=$9
+            local gen_eplb_num_slots=${10}
+            local gen_concurrency_list=${11}
+            echo "Running TEP mode ($mtp_mode) with ctx_num=$ctx_num, gen_num=$gen_num, gen_tp_size=$gen_tp_size, gen_batch_size=$gen_batch_size, gen_max_num_tokens=$gen_max_num_tokens, gen_gpu_memory_fraction=$gen_gpu_memory_fraction, gen_mtp_size=$gen_mtp_size, gen_eplb_num_slots=$gen_eplb_num_slots, gen_concurrency_list=\"$gen_concurrency_list\""
+            # TEP mode: Use false to disable attention dp
+            run_single $ctx_num $gen_num $gen_tp_size $gen_batch_size $gen_max_num_tokens false $gen_gpu_memory_fraction $gen_mtp_size $gen_eplb_num_slots "$gen_concurrency_list"
+            ;;
+        "dep")
+            if [ $# -ne 11 ]; then
+                echo "Error: DEP mode requires 11 additional parameters (including mtp_mode)"
+                usage
+            fi
+            local ctx_num=$3
+            local gen_num=$4
+            local gen_tp_size=$5
+            local gen_batch_size=$6
+            local gen_max_num_tokens=$7
+            local gen_gpu_memory_fraction=$8
+            local gen_mtp_size=$9
+            local gen_eplb_num_slots=${10}
+            local gen_concurrency_list=${11}
+            echo "Running DEP mode ($mtp_mode) with ctx_num=$ctx_num, gen_num=$gen_num, gen_tp_size=$gen_tp_size, gen_batch_size=$gen_batch_size, gen_max_num_tokens=$gen_max_num_tokens, gen_gpu_memory_fraction=$gen_gpu_memory_fraction, gen_mtp_size=$gen_mtp_size, gen_eplb_num_slots=$gen_eplb_num_slots, gen_concurrency_list=\"$gen_concurrency_list\""
+            run_single $ctx_num $gen_num $gen_tp_size $gen_batch_size $gen_max_num_tokens true $gen_gpu_memory_fraction $gen_mtp_size $gen_eplb_num_slots "$gen_concurrency_list"
+            ;;
+        *)
+            echo "Error: Unknown mode '$mode'"
+            usage
+            ;;
+    esac
+}
+# Check parameters
+if [ $# -eq 0 ]; then
+    usage
+fi
+# Run main function
+main "$@"
\ No newline at end of file
--- a/components/backends/trtllm/performance_sweeps/submit_agg.sh
+++ b/components/backends/trtllm/performance_sweeps/submit_agg.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+if [[ -z ${MODEL_PATH} ]]; then
+    echo "ERROR: MODEL_PATH was not set."
+    echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \
+         "downloaded path to the model weights. Since Deepseek R1 is large, it is " \
+         "recommended to pre-download them to a shared location and provide the path."
+    exit 1
+fi
+if [[ -z ${SERVED_MODEL_NAME} ]]; then
+    echo "ERROR: SERVED_MODEL_NAME was not set."
+    exit 1
+fi
+IMAGE="${IMAGE:-""}"
+ISL="${ISL:-8150}"
+OSL="${OSL:-1024}"
+# For GB200, we use 4 tasks per node.
+NTASKS_PER_NODE="${NTASKS_PER_NODE:-4}"
+kind='dynamo_agg'
+common_args="${kind} ${ISL} ${OSL} ${MODEL_PATH} ${SERVED_MODEL_NAME} ${IMAGE}"
+# Build slurm_args step-by-step with validation and defaults
+slurm_args="--time=04:00:00"
+# Add partition if set
+if [[ -n "${SLURM_PARTITION:-}" ]]; then
+    slurm_args="${slurm_args} --partition=${SLURM_PARTITION}"
+fi
+# Add account if set
+if [[ -n "${SLURM_ACCOUNT:-}" ]]; then
+    slurm_args="${slurm_args} --account=${SLURM_ACCOUNT}"
+fi
+# Add job name if set
+if [[ -n "${SLURM_JOB_NAME:-}" ]]; then
+    slurm_args="${slurm_args} --job-name=${SLURM_JOB_NAME}"
+fi
+# tep4
+max_batch=1024
+tp_size=4
+ep_size=${tp_size}
+enable_attention_dp=false
+mtp=0
+nodes_count=$(( (tp_size + NTASKS_PER_NODE - 1) / NTASKS_PER_NODE ))
+concurrency_list="1 2 4 8 16 32 64 128 256 512 1024 2048"
+max_num_tokens=$(( ((mtp+1)*max_batch+ISL+128+63)/64*64 ))
+sbatch --nodes=${nodes_count} --ntasks=${tp_size} --ntasks-per-node=${NTASKS_PER_NODE} ${slurm_args} benchmark_agg.slurm ${tp_size} ${ep_size} ${max_batch} ${max_num_tokens} ${enable_attention_dp} "${concurrency_list}" ${mtp} ${common_args}
+# dep4
+max_batch=1024
+tp_size=4
+ep_size=${tp_size}
+enable_attention_dp=true
+mtp=0
+nodes_count=$((tp_size/NTASKS_PER_NODE))
+concurrency_list="32 64 128 256 512 1024"
+max_num_tokens=$(( ((mtp+1)*max_batch+ISL+128+63)/64*64 ))
+sbatch --nodes=${nodes_count} --ntasks=${tp_size} --ntasks-per-node=${NTASKS_PER_NODE} ${slurm_args} benchmark_agg.slurm ${tp_size} ${ep_size} ${max_batch} ${max_num_tokens} ${enable_attention_dp} "${concurrency_list}" ${mtp} ${common_args}
+concurrency_list="2048 4096"
+max_num_tokens=$(( ((mtp+1)*max_batch+ISL+128+63)/64*64 ))
+sbatch --nodes=${nodes_count} --ntasks=${tp_size} --ntasks-per-node=${NTASKS_PER_NODE} ${slurm_args} benchmark_agg.slurm ${tp_size} ${ep_size} ${max_batch} ${max_num_tokens} ${enable_attention_dp} "${concurrency_list}" ${mtp} ${common_args}
+# tep8
+max_batch=1024
+tp_size=8
+ep_size=${tp_size}
+enable_attention_dp=false
+mtp=0
+nodes_count=$((tp_size/NTASKS_PER_NODE))
+concurrency_list="1 2 4 8 16 32 64 128 256 512 1024 2048"
+max_num_tokens=$(( ((mtp+1)*max_batch+ISL+128+63)/64*64 ))
+sbatch --nodes=${nodes_count} --ntasks=${tp_size} --ntasks-per-node=${NTASKS_PER_NODE} ${slurm_args} benchmark_agg.slurm ${tp_size} ${ep_size} ${max_batch} ${max_num_tokens} ${enable_attention_dp} "${concurrency_list}" ${mtp} ${common_args}
+# dep8
+max_batch=1024
+tp_size=8
+ep_size=${tp_size}
+enable_attention_dp=true
+mtp=0
+nodes_count=$((tp_size/NTASKS_PER_NODE))
+concurrency_list="32 64 128 256 512 1024"
+max_num_tokens=$(( ((mtp+1)*max_batch+ISL+128+63)/64*64 ))
+sbatch --nodes=${nodes_count} --ntasks=${tp_size} --ntasks-per-node=${NTASKS_PER_NODE} ${slurm_args} benchmark_agg.slurm ${tp_size} ${ep_size} ${max_batch} ${max_num_tokens} ${enable_attention_dp} "${concurrency_list}" ${mtp} ${common_args}
+concurrency_list="2048 4096"
+max_num_tokens=$(( ((mtp+1)*max_batch+ISL+128+63)/64*64 ))
+sbatch --nodes=${nodes_count} --ntasks=${tp_size} --ntasks-per-node=${NTASKS_PER_NODE} ${slurm_args} benchmark_agg.slurm ${tp_size} ${ep_size} ${max_batch} ${max_num_tokens} ${enable_attention_dp} "${concurrency_list}" ${mtp} ${common_args}
+# New: dep8 concurrency greater than 4096 as a separate group
+concurrency_list="6144 8192"
+max_num_tokens=$(( ((mtp+1)*max_batch+ISL+128+63)/64*64 ))
+sbatch --nodes=${nodes_count} --ntasks=${tp_size} --ntasks-per-node=${NTASKS_PER_NODE} ${slurm_args} benchmark_agg.slurm  ${tp_size} ${ep_size} ${max_batch} ${max_num_tokens} ${enable_attention_dp} "${concurrency_list}" ${mtp} ${common_args}
\ No newline at end of file