Unverified commit 0a2a820b, authored by Anant Sharma, committed by GitHub

docs: move all md files from components to docs (#3440)


Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anish <80174047+athreesh@users.noreply.github.com>
parent b640f283
@@ -30,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
 ## Latest News
-- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
+- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./docs/backends/trtllm/gpt-oss.md)
 ## The Era of Multi-GPU, Multi-Node
@@ -65,9 +65,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
 To learn more about each framework and their capabilities, check out each framework's README!
-- **[vLLM](components/backends/vllm/README.md)**
-- **[SGLang](components/backends/sglang/README.md)**
-- **[TensorRT-LLM](components/backends/trtllm/README.md)**
+- **[vLLM](docs/backends/vllm/README.md)**
+- **[SGLang](docs/backends/sglang/README.md)**
+- **[TensorRT-LLM](docs/backends/trtllm/README.md)**
 Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.
......
@@ -23,9 +23,9 @@ This directory contains the core components that make up the Dynamo inference fr
 Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and TensorRT-LLM), each with their own deployment configurations and capabilities:
-- **[vLLM](backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
-- **[SGLang](backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
-- **[TensorRT-LLM](backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration
+- **[vLLM](/docs/backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
+- **[SGLang](/docs/backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
+- **[TensorRT-LLM](/docs/backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration
 Each engine provides launch scripts for different deployment patterns in their respective `/launch` & `/deploy` directories.
......
@@ -17,7 +17,7 @@ For this example, we will make some assumptions about your SLURM cluster:
 If your cluster supports similar container based plugins, you may be able to
 modify the template to use that instead.
 3. We assume you have already built a recent Dynamo+SGLang container image as
-described [here](../docs/dsr1-wideep-gb200.md#instructions).
+described [here](../../../../docs/backends/sglang/dsr1-wideep-gb200.md#instructions).
 This is the image that can be passed to the `--container-image` argument in later steps.
 ## Scripts Overview
......
@@ -232,7 +232,7 @@ envs:
 ## Testing the Deployment
-Send a test request to verify your deployment. See the [client section](../../../../components/backends/vllm/README.md#client) for detailed instructions.
+Send a test request to verify your deployment. See the [client section](../../../../docs/backends/vllm/README.md#client) for detailed instructions.
 **Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
@@ -254,7 +254,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving
 - **UCX** (default): Standard method for KV cache transfer
 - **NIXL** (experimental): Alternative transfer method
-For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-transfer.md).
+For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/kv-cache-transfer.md).
 ## Request Migration
@@ -282,8 +282,8 @@ Configure the `model` name and `host` based on your deployment.
 - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
 - **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
-- **Multinode Deployment**: [Multinode Examples](../multinode/multinode-examples.md)
-- **Speculative Decoding**: [Llama 4 + Eagle Guide](../llama4_plus_eagle.md)
+- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
+- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
 - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
 ## Troubleshooting
......
@@ -41,7 +41,7 @@ Please note that:
 3. `post_process.py` - Scan the genai-perf results to produce a json with entries to each config point.
 4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
-For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
+For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
 ## Usage
......
@@ -11,20 +11,20 @@ The examples below assume you build the latest image yourself from source. If us
 Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph
-.. grid-item-card:: :doc:`vLLM <../components/backends/vllm/README>`
-:link: ../components/backends/vllm/README
+.. grid-item-card:: :doc:`vLLM <../backends/vllm/README>`
+:link: ../backends/vllm/README
 :link-type: doc
 Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with VLLM.
-.. grid-item-card:: :doc:`SGLang <../components/backends/sglang/README>`
-:link: ../components/backends/sglang/README
+.. grid-item-card:: :doc:`SGLang <../backends/sglang/README>`
+:link: ../backends/sglang/README
 :link-type: doc
 Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang.
-.. grid-item-card:: :doc:`TensorRT-LLM <../components/backends/trtllm/README>`
-:link: ../components/backends/trtllm/README
+.. grid-item-card:: :doc:`TensorRT-LLM <../backends/trtllm/README>`
+:link: ../backends/trtllm/README
 :link-type: doc
 Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM.
......
@@ -37,6 +37,6 @@ Dynamo currently supports the following high-performance inference backends:
 .. toctree::
 :maxdepth: 1
-vLLM <../components/backends/vllm/README>
-SGLang <../components/backends/sglang/README>
-TensorRT-LLM <../components/backends/trtllm/README>
+vLLM <../backends/vllm/README>
+SGLang <../backends/sglang/README>
+TensorRT-LLM <../backends/trtllm/README>
@@ -63,4 +63,4 @@ The Dynamo KV Block Manager serves as a reference implementation that emphasizes
 KVBM Architecture <kvbm_architecture.md>
 Understanding KVBM components <kvbm_components.md>
 KVBM Further Reading <kvbm_reading>
-LMCache Integration <../components/backends/vllm/LMCache_Integration.md>
+LMCache Integration <../backends/vllm/LMCache_Integration>
@@ -35,13 +35,13 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | Feature | SGLang | Notes |
 |---------|--------|-------|
-| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
-| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
-| [**Multimodal EPD Disaggregation**](docs/multimodal_epd.md) | ✅ | |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |
+| [**Disaggregated Serving**](../../architecture/disagg_serving.md) | ✅ | |
+| [**Conditional Disaggregation**](../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
+| [**KV-Aware Routing**](../../architecture/kv_cache_routing.md) | ✅ | |
+| [**SLA-Based Planner**](../../architecture/sla_planner.md) | ✅ | |
+| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ | |
+| [**Load Based Planner**](../../architecture/load_planner.md) | ❌ | Planned |
+| [**KVBM**](../../architecture/kvbm_architecture.md) | ❌ | Planned |
 ### Large Scale P/D and WideEP Features
@@ -229,7 +229,7 @@ cd $DYNAMO_HOME/components/backends/sglang
 ./launch/disagg_dp_attn.sh
 ```
-When using MoE models, you can also use the our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets up the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md).
+When using MoE models, you can also use the our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets up the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](expert-distribution-eplb.md).
 ### Testing the Deployment
@@ -266,24 +266,24 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
 ### Run a multi-node sized model
-- **[Run a multi-node model](docs/multinode-examples.md)**
+- **[Run a multi-node model](multinode-examples.md)**
 ### Large scale P/D disaggregation with WideEP
-- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
-- **[Run DeepSeek-R1-FP8 on GB200s](docs/dsr1-wideep-gb200.md)**
+- **[Run DeepSeek-R1 on 104+ H100s](dsr1-wideep-h100.md)**
+- **[Run DeepSeek-R1-FP8 on GB200s](dsr1-wideep-gb200.md)**
 ### Hierarchical Cache (HiCache)
-- **[Enable SGLang Hierarchical Cache (HiCache)](docs/sgl-hicache-example.md)**
+- **[Enable SGLang Hierarchical Cache (HiCache)](sgl-hicache-example.md)**
 ### Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
-- **[Run a multimodal model with EPD Disaggregation](docs/multimodal_epd.md)**
+- **[Run a multimodal model with EPD Disaggregation](multimodal_epd.md)**
 ## Deployment
 We currently provide deployment examples for Kubernetes and SLURM.
 ## Kubernetes
-- **[Deploying Dynamo with SGLang on Kubernetes](deploy/README.md)**
+- **[Deploying Dynamo with SGLang on Kubernetes](../../../components/backends/sglang/deploy/README.md)**
 ## SLURM
-- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
+- **[Deploying Dynamo with SGLang on SLURM](../../../components/backends/sglang/slurm_jobs/README.md)**
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0
 # Running gpt-oss-120b Disaggregated with SGLang
-The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/components/backends/vllm/gpt-oss.md),
+The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/docs/backends/vllm/gpt-oss.md),
 please ues the vLLM guide as a reference with the different deployment steps as highlighted below:
 # Launch the Deployment
......
@@ -31,7 +31,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 The MultimodalEncodeWorker is responsible for encoding the image and passing the embeddings to the MultimodalWorker via a combination of NATS and RDMA.
 The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
-Its MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../README.md) example.
+Its MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
 By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
 MultimodalEncodeWorker independently from the prefill and decode workers if needed.
@@ -116,7 +116,7 @@ You should see a response similar to this:
 For the Qwen2.5-VL model, embeddings are only required during the prefill stage. As such, the image embeddings are transferred using a NIXL descriptor from the encode worker to the worker and then passed to the prefill worker for processing.
 The prefill worker performs the prefilling step and forwards the KV cache to the worker for decoding.
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../README.md) example.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
 This figure illustrates the workflow:
 ```mermaid
......
@@ -186,11 +186,11 @@ For comprehensive instructions on multinode serving, see the [multinode-examples
 ### Kubernetes Deployment
-For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md).
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](../../../components/backends/trtllm/deploy/README.md).
 ### Client
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
+See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -230,7 +230,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 ## Client
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
+See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -302,7 +302,7 @@ sampling_params.logits_processor = create_trtllm_adapters(processors)
 ## Performance Sweep
-For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](../../../components/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
 ## Dynamo KV Block Manager Integration
......
@@ -23,9 +23,9 @@ VSWA is a mechanism in which a model’s layers alternate between multiple slidi
 > [!Note]
 > - Ensure that required services such as `nats` and `etcd` are running before starting.
 > - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
-> - Its recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
-### Aggregated Serving
+> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
+## Aggregated Serving
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 export MODEL_PATH=google/gemma-3-1b-it
@@ -34,7 +34,7 @@ export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml
 ./launch/agg.sh
 ```
-### Aggregated Serving with KV Routing
+## Aggregated Serving with KV Routing
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 export MODEL_PATH=google/gemma-3-1b-it
@@ -43,7 +43,7 @@ export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml
 ./launch/agg_router.sh
 ```
-#### Disaggregated Serving
+## Disaggregated Serving
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 export MODEL_PATH=google/gemma-3-1b-it
@@ -53,7 +53,7 @@ export DECODE_ENGINE_ARGS=engine_configs/gemma3/vswa_decode.yaml
 ./launch/disagg.sh
 ```
-#### Disaggregated Serving with KV Routing
+## Disaggregated Serving with KV Routing
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 export MODEL_PATH=google/gemma-3-1b-it
......