Unverified Commit 88a6f9da authored by Xiaoyu Zhang, committed by GitHub

bench_serving support PD Disaggregation (#11542)

parent cb8ed2c0
...@@ -17,6 +17,10 @@ For the design details, please refer to [link](https://docs.google.com/document/
Currently, we support Mooncake and NIXL as transfer engines.
## Profiling in PD Disaggregation Mode
When you need to profile prefill or decode workers in PD disaggregation mode, please refer to the [Profile In PD Disaggregation Mode](https://docs.sglang.ai/developer_guide/benchmark_and_profiling.html#profile-in-pd-disaggregation-mode) section in the Benchmark and Profiling guide. Due to torch profiler limitations, prefill and decode workers must be profiled separately using dedicated command-line options.
## Router Integration
For deploying PD disaggregation at scale with load balancing and fault tolerance, SGLang provides a router. The router can distribute requests between prefill and decode instances using various routing policies. For detailed information on setting up routing with PD disaggregation, including configuration options and deployment patterns, see the [SGLang Router documentation](router.md#mode-3-prefill-decode-disaggregation).
......
...@@ -47,6 +47,48 @@ Please make sure that the `SGLANG_TORCH_PROFILER_DIR` should be set at both serv
For more details, please refer to [Bench Serving Guide](./bench_serving.md).
### Profile In PD Disaggregation Mode
When profiling in PD disaggregation mode, prefill and decode workers **must be profiled separately** due to torch profiler limitations. The `bench_serving` command provides dedicated options for this:
#### Profile Prefill Workers
```bash
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
# start prefill and decode servers (see PD disaggregation docs for setup)
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1
# start router
python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
# send profiling request targeting prefill workers
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile --pd-separated --profile-prefill-url http://127.0.0.1:30000
```
#### Profile Decode Workers
```bash
# send profiling request targeting decode workers
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile --pd-separated --profile-decode-url http://127.0.0.1:30001
```
#### Important Notes
- `--profile-prefill-url` and `--profile-decode-url` are **mutually exclusive**; you cannot profile prefill and decode workers in the same run
- Both options support multiple worker URLs for multi-instance setups:
```bash
# Profile multiple prefill workers
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --profile --pd-separated --profile-prefill-url http://127.0.0.1:30000 http://127.0.0.1:30002
# Profile multiple decode workers
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --profile --pd-separated --profile-decode-url http://127.0.0.1:30001 http://127.0.0.1:30003
```
- Make sure `SGLANG_TORCH_PROFILER_DIR` is set on all worker nodes before starting the servers
- For more details on setting up PD disaggregation, see [PD Disaggregation Guide](../advanced_features/pd_disaggregation.md)
### Profile a server with `sglang.bench_offline_throughput`
```bash
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
......
...@@ -622,6 +622,48 @@ async def async_request_profile(api_url: str) -> RequestFuncOutput:
    return output
def _build_profile_urls(
    profile_prefill_url: Optional[List[str]],
    profile_decode_url: Optional[List[str]],
) -> List[Tuple[str, str]]:
    """Build profile URLs list from prefill/decode URL arguments.

    Returns:
        List of (worker_type, url) tuples, e.g.,
        [("Prefill-0", "http://..."), ("Decode-0", "http://...")]
    """
    profile_urls = []
    if profile_prefill_url:
        for idx, url in enumerate(profile_prefill_url):
            profile_urls.append((f"Prefill-{idx}", url))
    if profile_decode_url:
        for idx, url in enumerate(profile_decode_url):
            profile_urls.append((f"Decode-{idx}", url))
    return profile_urls
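As a quick illustration of the labeling scheme, here is a standalone sketch that mirrors the `_build_profile_urls` helper above (renamed without the leading underscore to signal it is a sketch, with placeholder URLs):

```python
from typing import List, Optional, Tuple

def build_profile_urls(
    profile_prefill_url: Optional[List[str]],
    profile_decode_url: Optional[List[str]],
) -> List[Tuple[str, str]]:
    # Each worker URL gets a "Prefill-<idx>" or "Decode-<idx>" tag,
    # with prefill workers listed first, matching the diff above.
    urls: List[Tuple[str, str]] = []
    for idx, url in enumerate(profile_prefill_url or []):
        urls.append((f"Prefill-{idx}", url))
    for idx, url in enumerate(profile_decode_url or []):
        urls.append((f"Decode-{idx}", url))
    return urls

print(build_profile_urls(["http://127.0.0.1:30000", "http://127.0.0.1:30002"], None))
# [('Prefill-0', 'http://127.0.0.1:30000'), ('Prefill-1', 'http://127.0.0.1:30002')]
```

Passing `None` for both arguments yields an empty list, which is what triggers the "PD separated mode requires --profile-prefill-url or --profile-decode-url" warning later in the benchmark path.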
async def _call_profile_pd(profile_urls: List[Tuple[str, str]], mode: str) -> None:
    """Call profile endpoint (start/stop) on PD separated workers.

    Args:
        profile_urls: List of (worker_type, url) tuples
        mode: "start" or "stop"
    """
    endpoint = "/start_profile" if mode == "start" else "/stop_profile"
    action = "Starting" if mode == "start" else "Stopping"
    action_past = "started" if mode == "start" else "stopped"
    print(f"{action} profiler...")
    for worker_type, url in profile_urls:
        profile_output = await async_request_profile(api_url=url + endpoint)
        if profile_output.success:
            print(f"Profiler {action_past} for {worker_type} worker at {url}")
        else:
            print(
                f"Failed to {mode} profiler for {worker_type} worker at {url}: {profile_output.error}"
            )
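To see the start/stop flow without live servers, the control flow of `_call_profile_pd` can be exercised with a stub in place of `async_request_profile`; the stub, the `FakeOutput` type, and the URLs below are illustrative stand-ins, not part of bench_serving:

```python
import asyncio
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FakeOutput:
    success: bool = True
    error: str = ""

calls: List[str] = []

async def fake_request_profile(api_url: str) -> FakeOutput:
    # Stand-in for async_request_profile: record the endpoint instead of POSTing.
    calls.append(api_url)
    return FakeOutput()

async def call_profile_pd(profile_urls: List[Tuple[str, str]], mode: str) -> None:
    # Same control flow as _call_profile_pd in the diff above.
    endpoint = "/start_profile" if mode == "start" else "/stop_profile"
    for worker_type, url in profile_urls:
        out = await fake_request_profile(api_url=url + endpoint)
        if not out.success:
            print(f"Failed to {mode} profiler for {worker_type} worker at {url}: {out.error}")

urls = [("Prefill-0", "http://127.0.0.1:30000"), ("Decode-0", "http://127.0.0.1:30001")]
asyncio.run(call_profile_pd(urls, "start"))
asyncio.run(call_profile_pd(urls, "stop"))
print(calls)
# ['http://127.0.0.1:30000/start_profile', 'http://127.0.0.1:30001/start_profile',
#  'http://127.0.0.1:30000/stop_profile', 'http://127.0.0.1:30001/stop_profile']
```

Workers are contacted sequentially in list order, so in a real run each worker's profiler is started (or stopped) one at a time rather than concurrently.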
def get_model(pretrained_model_name_or_path: str) -> str:
    if os.getenv("SGLANG_USE_MODELSCOPE", "false").lower() == "true":
        import huggingface_hub.constants
...@@ -1675,6 +1717,8 @@ async def benchmark(
    use_trace_timestamps: bool = False,
    mooncake_slowdown_factor=1.0,
    mooncake_num_rounds=1,
    profile_prefill_url: Optional[List[str]] = None,
    profile_decode_url: Optional[List[str]] = None,
):
    if backend in ASYNC_REQUEST_FUNCS:
        request_func = ASYNC_REQUEST_FUNCS[backend]
...@@ -1764,8 +1808,22 @@ async def benchmark(
    time.sleep(1.0)
    # Build profile URLs for PD separated mode (do this once at the beginning)
    pd_profile_urls = []
    if profile and pd_separated:
        pd_profile_urls = _build_profile_urls(profile_prefill_url, profile_decode_url)
        if not pd_profile_urls:
            print(
                "Warning: PD separated mode requires --profile-prefill-url or --profile-decode-url"
            )
            print("Skipping profiler start. Please specify worker URLs for profiling.")

    # Start profiler
    if profile:
        if pd_separated:
            if pd_profile_urls:
                await _call_profile_pd(pd_profile_urls, "start")
        else:
            print("Starting profiler...")
            profile_output = await async_request_profile(
                api_url=base_url + "/start_profile"
            )
...@@ -1820,8 +1878,14 @@ async def benchmark(
    # Stop profiler
    if profile:
        if pd_separated:
            if pd_profile_urls:
                await _call_profile_pd(pd_profile_urls, "stop")
        else:
            print("Stopping profiler...")
            profile_output = await async_request_profile(
                api_url=base_url + "/stop_profile"
            )
            if profile_output.success:
                print("Profiler stopped")
...@@ -2204,6 +2268,8 @@ def run_benchmark(args_: argparse.Namespace):
            use_trace_timestamps=args.use_trace_timestamps,
            mooncake_slowdown_factor=args.mooncake_slowdown_factor,
            mooncake_num_rounds=args.mooncake_num_rounds,
            profile_prefill_url=getattr(args, "profile_prefill_url", None),
            profile_decode_url=getattr(args, "profile_decode_url", None),
        )
    )
...@@ -2429,6 +2495,30 @@ if __name__ == "__main__":
        action="store_true",
        help="Benchmark PD disaggregation server",
    )
    # Create a mutually exclusive group for profiling URLs.
    # In PD separated mode, prefill and decode workers must be profiled separately.
    profile_url_group = parser.add_mutually_exclusive_group()
    profile_url_group.add_argument(
        "--profile-prefill-url",
        type=str,
        nargs="*",
        default=None,
        help="URL(s) of the prefill worker(s) for profiling in PD separated mode. "
        "Can specify multiple URLs: --profile-prefill-url http://localhost:30000 http://localhost:30001. "
        "NOTE: Cannot be used together with --profile-decode-url. "
        "In PD separated mode, prefill and decode workers must be profiled separately.",
    )
    profile_url_group.add_argument(
        "--profile-decode-url",
        type=str,
        nargs="*",
        default=None,
        help="URL(s) of the decode worker(s) for profiling in PD separated mode. "
        "Can specify multiple URLs: --profile-decode-url http://localhost:30010 http://localhost:30011. "
        "NOTE: Cannot be used together with --profile-prefill-url. "
        "In PD separated mode, prefill and decode workers must be profiled separately.",
    )
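The mutually exclusive group behavior can be checked in isolation; this is a minimal standalone parser reproducing just the two flags above (the example URLs are placeholders):

```python
import argparse

# Minimal parser reproducing the mutually exclusive profiling-URL group
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--profile-prefill-url", type=str, nargs="*", default=None)
group.add_argument("--profile-decode-url", type=str, nargs="*", default=None)

# nargs="*" collects every URL following the flag into a single list
args = parser.parse_args(
    ["--profile-prefill-url", "http://127.0.0.1:30000", "http://127.0.0.1:30002"]
)
print(args.profile_prefill_url)
# ['http://127.0.0.1:30000', 'http://127.0.0.1:30002']

# Supplying both flags trips the group: argparse prints an error and exits
try:
    parser.parse_args(["--profile-prefill-url", "a", "--profile-decode-url", "b"])
except SystemExit:
    print("rejected: the two flags are mutually exclusive")
```

Because enforcement happens at parse time, `benchmark()` never has to handle the case where both URL lists are populated; at most one of `profile_prefill_url` / `profile_decode_url` is non-`None`.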
    parser.add_argument(
        "--flush-cache",
        action="store_true",
......