docs: add TRTLLM variable sliding window attention example for gemma3 model (#2134)

Unverified Commit c8f6d4d9 authored Aug 05, 2025 by Richard Huo Committed by GitHub Aug 05, 2025
5 changed files
--- a/components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
+++ b/components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+tensor_parallel_size: 1
+backend: pytorch
+
+kv_cache_config:
+  max_attention_window:
+    - 512
+    - 512
+    - 512
+    - 512
+    - 512
+    - 32768
+  enable_block_reuse: false
--- a/components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml
+++ b/components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+tensor_parallel_size: 1
+backend: pytorch
+
+kv_cache_config:
+  max_attention_window:
+    - 512
+    - 512
+    - 512
+    - 512
+    - 512
+    - 32768
+  enable_block_reuse: false
+
+cache_transceiver_config:
+  backend: default
--- a/components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml
+++ b/components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+tensor_parallel_size: 1
+backend: pytorch
+disable_overlap_scheduler: True
+
+kv_cache_config:
+  max_attention_window:
+    - 512
+    - 512
+    - 512
+    - 512
+    - 512
+    - 32768
+  enable_block_reuse: false
+
+cache_transceiver_config:
+  backend: default
--- a/components/backends/trtllm/gemma3_sliding_window_attention.md
+++ b/components/backends/trtllm/gemma3_sliding_window_attention.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Gemma 3 with Variable Sliding Window Attention
+
+This guide shows how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Because google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker requires only one H100 GPU or one GB200 GPU.
+VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. Gemma 3 is one example: it interleaves sliding window attention layers with global attention layers.
+
+## Notes
+* To run Gemma 3 with VSWA, ensure that the container has TensorRT-LLM v1.0.0rc4 installed.
+
+## Limitation
+* KV routing based on KV events does not currently work well with VSWA. The Dynamo team is actively working on support for distinguishing events that come from different layer groups.
+
+### Aggregated Serving
+```bash
+cd $DYNAMO_HOME/components/backends/trtllm
+export MODEL_PATH=google/gemma-3-1b-it
+export SERVED_MODEL_NAME=$MODEL_PATH
+export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml
+./launch/agg.sh
+```
+
+### Disaggregated Serving
+```bash
+cd $DYNAMO_HOME/components/backends/trtllm
+export MODEL_PATH=google/gemma-3-1b-it
+export SERVED_MODEL_NAME=$MODEL_PATH
+export PREFILL_ENGINE_ARGS=engine_configs/gemma3/vswa_prefill.yaml
+export DECODE_ENGINE_ARGS=engine_configs/gemma3/vswa_decode.yaml
+./launch/disagg.sh
+```
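Reviewer note (not part of the commit): the three engine configs above share the same six-entry `max_attention_window` list. The sketch below illustrates one way to read that list, assuming the pattern is repeated cyclically across the model's layers and assuming a 26-layer model for google/gemma-3-1b-it. Both the repetition rule and the layer count are assumptions for illustration and are not stated in this diff; `expand_window_pattern` is a hypothetical helper, not a TensorRT-LLM or Dynamo API.

```python
from itertools import cycle, islice


def expand_window_pattern(pattern, num_layers):
    """Repeat `pattern` cyclically until it covers `num_layers` layers.

    Hypothetical helper for illustration only.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    return list(islice(cycle(pattern), num_layers))


# Values copied from kv_cache_config.max_attention_window in the YAML files.
PATTERN = [512, 512, 512, 512, 512, 32768]

# Assumption: layer count used only to make the example concrete.
NUM_LAYERS = 26

windows = expand_window_pattern(PATTERN, NUM_LAYERS)

# Layers that receive the largest window act as the global attention layers.
global_layers = [i for i, w in enumerate(windows) if w == max(PATTERN)]
sliding_layers = [i for i, w in enumerate(windows) if w != max(PATTERN)]

print("global attention layers:", global_layers)
print("sliding window layers:  ", len(sliding_layers))
```

Under these assumptions, every sixth layer gets the 32768-token window and the rest get the 512-token window, which matches the guide's description of a model that mixes global and sliding window layers.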
--- a/components/backends/trtllm/launch/disagg_router.sh
+++ b/components/backends/trtllm/launch/disagg_router.sh
@@ -53,4 +53,4 @@ CUDA_VISIBLE_DEVICES=$DECODE_CUDA_VISIBLE_DEVICES python3 -m dynamo.trtllm \
  --extra-engine-args "$DECODE_ENGINE_ARGS" \
  --disaggregation-mode decode \
  --disaggregation-strategy "$DISAGGREGATION_STRATEGY" \
-  "${EXTRA_DECODE_ARGS[@]}"
\ No newline at end of file
+  "${EXTRA_DECODE_ARGS[@]}"