docs: update wideep documentation to precompile DeepGemm kernels beforehand (#4096)

Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Commit 51c103b2, authored Nov 04, 2025 by Tushar Sharma, committed by GitHub Nov 04, 2025
Showing 2 changed files with 40 additions and 0 deletions

docs/backends/sglang/dsr1-wideep-gb200.md +21 -0

docs/backends/sglang/dsr1-wideep-h100.md +19 -0

--- a/docs/backends/sglang/dsr1-wideep-gb200.md
+++ b/docs/backends/sglang/dsr1-wideep-gb200.md
@@ -48,6 +48,8 @@ docker run \
    dynamo-wideep-gb200:latest
 ```

+In each container, you should be in the /sgl-workspace/dynamo/examples/backends/sglang directory.
+
 3. Run the ingress and prefill worker

 ```bash
@@ -104,6 +106,25 @@ python3 -m dynamo.sglang \

 On the other prefill nodes (this example has 2 total prefill nodes), run the same command but change `--node-rank` to 1

+> [!IMPORTANT]
+> If you encounter random CPU recv timeout issues during the warm-up phase in multi-GPU or multi-node setups, they are likely caused by DeepGEMM kernel compilation overhead.
+> To avoid these non-deterministic timeouts, it's strongly recommended to precompile the DeepGEMM kernels before launching the SGLang engine. This ensures all kernels are cached and ready, preventing long initialization delays or distributed timeout errors. To precompile and use cached kernels, please execute the following commands:
+
+```bash
+# 1. Precompile DeepGEMM kernels
+export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
+python3 -m sglang.compile_deep_gemm <ServerArgs>
+
+# 2. Launch the engine with the same cache directory
+export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
+python3 -m dynamo.frontend <ServerArgs>
+```
+
+> [!NOTE]
+> There's a known issue where the compile request may fail due to missing bootstrap information, but the kernels are still successfully cached.
+> Using a gradual warm-up phase and enabling caching for FlashInfer (similar to DeepGEMM) can further improve stability and reduce startup time.
+> See https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174 for more details.
+
 4. Run the decode worker on the head decode node

 ```bash

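Review note: the precompile-then-launch flow added in this commit depends on both steps seeing the same `SGLANG_DG_CACHE_DIR`. A minimal sketch of a guard that checks the cache before launch is shown here. The helper name `cache_is_populated` is hypothetical, and the idea that a populated cache directory indicates a successful precompile is an assumption; the docs only state that kernels are cached even when the compile request fails.

```shell
# Sketch only, not part of the commit.
# The cache path is the one used in the docs; override it via the environment if needed.
export SGLANG_DG_CACHE_DIR="${SGLANG_DG_CACHE_DIR:-/configs/dgcache/3p1dcache}"

# Hypothetical helper: succeeds only if the directory exists and has at least one entry.
# Assumption: precompiled DeepGEMM kernels are written somewhere under this directory.
cache_is_populated() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Intended use (placeholders kept exactly as in the docs):
#   python3 -m sglang.compile_deep_gemm <ServerArgs> || true   # known issue: may fail, cache still written
#   cache_is_populated "$SGLANG_DG_CACHE_DIR" || { echo "DeepGEMM cache is empty" >&2; exit 1; }
#   python3 -m dynamo.frontend <ServerArgs>
```

Exporting the variable once at the top keeps the two steps from drifting apart, and checking the directory instead of the compile exit code sidesteps the known bootstrap failure mentioned in the note.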
--- a/docs/backends/sglang/dsr1-wideep-h100.md
+++ b/docs/backends/sglang/dsr1-wideep-h100.md
@@ -86,6 +86,25 @@ python3 -m dynamo.sglang \

 On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3

+> [!IMPORTANT]
+> If you encounter random CPU recv timeout issues during the warm-up phase in multi-GPU or multi-node setups, they are likely caused by DeepGEMM kernel compilation overhead.
+> To avoid these non-deterministic timeouts, it's strongly recommended to precompile the DeepGEMM kernels before launching the SGLang engine. This ensures all kernels are cached and ready, preventing long initialization delays or distributed timeout errors. To precompile and use cached kernels, please execute the following commands:
+
+```bash
+# 1. Precompile DeepGEMM kernels
+export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
+python3 -m sglang.compile_deep_gemm <ServerArgs>
+
+# 2. Launch the engine with the same cache directory
+export SGLANG_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
+python3 -m dynamo.frontend <ServerArgs>
+```
+
+> [!NOTE]
+> There's a known issue where the compile request may fail due to missing bootstrap information, but the kernels are still successfully cached.
+> Using a gradual warm-up phase and enabling caching for FlashInfer (similar to DeepGEMM) can further improve stability and reduce startup time.
+> See https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174 for more details.
+
 4. Run the decode worker on the head decode node

 ```bash