Unverified commit b22a9d76, authored by Biswa Panda, committed by GitHub

docs: add experimental recipes details for Kimi-K2.5 recipe (#7412)

parent 183100b1
@@ -42,7 +42,6 @@ These recipes demonstrate aggregated or disaggregated serving:
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
| **[DeepSeek-R1](deepseek-r1/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |
| **[Kimi-K2.5](kimi-k2.5/)** 🚧 | TensorRT-LLM | Aggregated | 8x B200 | ✅ | ❌ | Experimental — MoE model, TP8×EP8, reasoning + tool calling | ❌ |
**Legend:**
- **Deployment**: ✅ = Complete `deploy.yaml` manifest available
@@ -58,6 +57,15 @@ These recipes demonstrate functional deployments with Dynamo features, but have
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/agg/)** | SGLang | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing, 1.0+ |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, 1.0+ |
| **[Kimi-K2.5 (Baseten)](kimi-k2.5/trtllm/agg/baseten/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling |
### Experimental Recipes
These recipes are under active development and may require additional setup steps (e.g., container patching). They are functional but not yet fully validated for production use.
| Model | Framework | Mode | GPUs | Deployment | Notes |
|-------|-----------|------|------|------------|-------|
| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Requires [container patch](kimi-k2.5/trtllm/agg/nvidia/patch/). Vision input not yet functional with the patch. |
## Recipe Structure
......
# Kimi-K2.5 Recipes
> 🚧 **Work-in-Progress — Experimental Recipe** Deployment recipes for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
>
> The TensorRT-LLM Python package used for Dynamo's TRT-LLM integration does not yet include
> native Kimi K2.5 support. This recipe is an **experimental** effort to bring
> Kimi K2.5 to Dynamo ahead of upstream availability. It needs to patch the container image on top of released dynamo image.
Deployment recipe for **Kimi-K2.5** using TensorRT-LLM with Dynamo's KV-aware routing.
## Available Configurations
There are two model weight variants, each with its own model download and deploy manifests:
| Variant | Model | Deploy Configs | Notes | | Variant | Model | Status | Modality | Deploy Configs | Notes |
|---------|-------|---------------|-------| |---------|-------|--------|----------|---------------|-------|
| **nvidia** 🚧 | `nvidia/Kimi-K2.5-NVFP4` | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/) | | **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | Functional | Text only | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image, not yet performance-optimized |
| **baseten** | `baseten-admin/Kimi-2.5-text-nvfp4-v3` | [`deploy.yaml`](trtllm/agg/baseten/deploy.yaml) | Works with the stock image | | **nvidia** | `nvidia/Kimi-K2.5-NVFP4` | Experimental | Text only | [`deploy.yaml`](trtllm/agg/nvidia/deploy.yaml), [`deploy-kvbm.yaml`](trtllm/agg/nvidia/deploy-kvbm.yaml) | Requires a [patched image](trtllm/agg/nvidia/patch/). Vision input is not yet functional — the patch loads the text backbone only. |
All configurations use TP8, EP8, aggregated mode with KV-aware routing.
The **nvidia** variant also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
## Prerequisites
1. **Dynamo Platform installed** — See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GPU cluster** with B200 GPUs (8x per worker)
3. **HuggingFace token** with access to the model
## Quick Start (nvidia variant) ## Hardware Requirements
| Configuration | GPUs |
|--------------|------|
| Aggregated | 8x B200 |
---
## baseten-admin/Kimi-2.5-text-nvfp4-v3
**Status:** Functional (not yet performance-optimized) | **Modality:** Text only
The baseten variant uses a text-only backend built on the underlying DeepSeek-V3 architecture, which means it works out of the box with the stock TensorRT-LLM container image -- no patching or custom builds required. This recipe is functional for text-based inference with reasoning and tool calling, but has not yet been performance-tuned or benchmarked.
### Quick Start
The baseten deploy manifest ships with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
Update the `image:` fields in [`trtllm/agg/baseten/deploy.yaml`](trtllm/agg/baseten/deploy.yaml) to your actual Dynamo release tag before deploying.
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download model (update storageClassName in model-cache/model-cache.yaml first!)
kubectl apply -f model-cache/baseten/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

# Update the image tag in trtllm/agg/baseten/deploy.yaml to your Dynamo release tag

# Deploy
kubectl apply -f trtllm/agg/baseten/deploy.yaml -n ${NAMESPACE}
```
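The image-tag update mentioned in the Quick Start can be scripted. The following is a minimal sketch, not part of the recipe: `set_image_tag` is a hypothetical helper, and `PLACEHOLDER` is the placeholder value quoted in this README (adjust it if your copy of the manifest carries a different tag). It assumes the replacement image string contains no `|` or `&` characters, which `sed` would treat specially.

```bash
#!/usr/bin/env bash
# Sketch: swap the placeholder image in a deploy manifest for a real release image.
# PLACEHOLDER is the value quoted in this README; set_image_tag is a hypothetical helper.
PLACEHOLDER="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag"

set_image_tag() {
  local manifest="$1" image="$2" tmp
  tmp="$(mktemp)"
  # Rewrite only lines that end in exactly "image: <placeholder>";
  # lines such as "...:my-tag-patched" are left untouched by the $ anchor.
  sed "s|image: ${PLACEHOLDER}\$|image: ${image}|" "$manifest" > "$tmp"
  # Write back in place (keeps the manifest's original file permissions).
  cat "$tmp" > "$manifest"
  rm -f "$tmp"
}

# Example (the tag below is a stand-in for your actual release tag):
# set_image_tag trtllm/agg/baseten/deploy.yaml \
#   nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:<your-release-tag>
```

Writing through a temporary file avoids relying on `sed -i`, whose syntax differs between GNU and BSD `sed`.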
### Test the Deployment
```bash
# Port-forward the frontend (this command blocks; run it in a separate terminal)
kubectl port-forward svc/kimi-k25-agg-frontend 8000:8000 -n ${NAMESPACE}

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baseten-admin/Kimi-2.5-text-nvfp4-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
---
## nvidia/Kimi-K2.5-NVFP4
**Status:** Experimental | **Modality:** Text only (multimodal input awaits upstream TRT-LLM support)
> **Experimental:** Upstream TensorRT-LLM does not yet include native support for Kimi K2.5.
> This recipe works around that limitation by directly patching the container image with an
> append-only patch that registers `KimiK25ForConditionalGeneration` on the DeepSeek-V3 code path.
> See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for the patch script and full instructions.
> **Text only:** The patch loads the DeepSeek-V3 text backbone from the Kimi K2.5 config
> (`text_config`). The vision encoder is not loaded, so image inputs are not processed.
> Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
The nvidia variant supports text inference with reasoning parsing (`--dyn-reasoning-parser kimi_k25`) and tool calling (`--dyn-tool-call-parser kimi_k2`). It also has a KVBM (KV Block Manager) deploy that enables CPU-offloaded KV cache via `deploy-kvbm.yaml`.
### Quick Start
The nvidia deploy manifests (`deploy.yaml`, `deploy-kvbm.yaml`) ship with a placeholder image `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag`.
Before deploying, you must:
1. Run the [patch script](trtllm/agg/nvidia/patch/) to build a patched image (appends `-patched` to the tag).
2. Update the `image:` fields in the deploy YAML to reference the patched image.
See [`trtllm/agg/nvidia/patch/`](trtllm/agg/nvidia/patch/) for full details on what the patch does.
```bash
# Set namespace
@@ -43,18 +115,19 @@ kubectl create secret generic hf-token-secret \
kubectl apply -f model-cache/nvidia/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
# Patch the container image (required for nvidia weights) # Patch the container image (required — upstream support not yet available)
# This produces: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
cd trtllm/agg/nvidia/patch
./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
cd -
# Update the image in the deploy manifest to use the patched tag
# Deploy
kubectl apply -f trtllm/agg/nvidia/deploy.yaml -n ${NAMESPACE}
```
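Because the nvidia manifests only work with the patched image, it may be worth checking the `image:` fields before running `kubectl apply`. The sketch below is an illustration under stated assumptions: `patched_image` and `manifest_uses_only` are hypothetical helpers, and the only behavior taken from this README is that the patch script appends `-patched` to the tag.

```bash
#!/usr/bin/env bash
# Sketch: pre-deploy check that a manifest references only the patched image.
# Both helpers are hypothetical and are not shipped with the recipe.

# Derive the patched image name (the README states "-patched" is appended to the tag).
patched_image() {
  printf '%s-patched\n' "$1"
}

# Succeed only if the manifest has at least one "image:" key and every
# such key has exactly the expected value. YAML comment lines are ignored
# because their first field is "#", not "image:".
manifest_uses_only() {
  local manifest="$1" image="$2"
  awk -v want="$image" '
    $1 == "image:" { seen = 1; if ($2 != want) bad = 1 }
    END { if (seen && !bad) exit 0; exit 1 }
  ' "$manifest"
}

# Example:
# base="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag"
# manifest_uses_only trtllm/agg/nvidia/deploy.yaml "$(patched_image "$base")" \
#   || echo "deploy.yaml still references an unpatched image" >&2
```

The check compares whole field values rather than substrings, so an unpatched tag that happens to be a prefix of the patched one is still reported.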
For baseten weights, use `model-cache/baseten/` and `trtllm/agg/baseten/deploy.yaml` instead — no image patch needed. ### Test the Deployment
## Test the Deployment
```bash
# Port-forward the frontend
@@ -70,19 +143,14 @@ curl http://localhost:8000/v1/chat/completions \
}'
```
---
## Model Details
- **Model**: `nvidia/Kimi-K2.5-NVFP4` (NV FP4 quantized, text-only)
- **Architecture**: MoE (Mixture-of-Experts), based on DeepSeek-V3 architecture
- **Backend**: TensorRT-LLM (PyTorch backend)
- **Parallelism**: TP8, EP8 (Expert Parallel)
- **Quantization**: NV FP4
## Hardware Requirements
| Configuration | GPUs |
|--------------|------|
| Aggregated | 8x B200 |
## Verifying Reasoning
@@ -185,4 +253,4 @@ If `tool_calls` is missing with raw `<|tool_calls_section_begin|>` tokens in `co
## Notes
- Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
- The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream - The nvidia variant requires a [patched TensorRT-LLM image](trtllm/agg/nvidia/patch/) until Kimi K2.5 support lands upstream in TensorRT-LLM
@@ -51,7 +51,7 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
replicas: 1
TrtllmWorker:
componentType: worker
@@ -84,7 +84,7 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
env:
- name: TRTLLM_ENABLE_PDL
value: "1"
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  config.yaml: |
    max_batch_size: 128
    max_num_tokens: 8448
    max_seq_len: 8212
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    enable_attention_dp: true
    pipeline_parallel_size: 1
    print_iter_log: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.75
      dtype: fp8
    cache_transceiver_config:
      backend: UCX
      max_tokens_in_buffer: 8448
    trust_remote_code: true
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: kimi-k25-agg
spec:
  backendFramework: trtllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      extraPodSpec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: nvidia.com/dynamo-graph-deployment-name
                      operator: In
                      values:
                        - kimi-k25-agg-frontend
                topologyKey: kubernetes.io/hostname
        mainContainer:
          args:
            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
      replicas: 1
    TrtllmWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_NAME}" \
                --served-model-name "${MODEL_NAME}" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --tensor-parallel-size 8 \
                --dyn-reasoning-parser kimi_k25 \
                --dyn-tool-call-parser kimi_k2
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
          env:
            - name: TRTLLM_ENABLE_PDL
              value: "1"
            - name: MODEL_NAME
              value: baseten-admin/Kimi-2.5-text-nvfp4-v3
            - name: ENGINE_ARGS
              value: /opt/dynamo/configs/config.yaml
            - name: HF_HOME
              value: /opt/models
          volumeMounts:
            - mountPath: /opt/dynamo/configs
              name: llm-config
              readOnly: true
          workingDir: /workspace/examples/backends/trtllm
        volumes:
          - configMap:
              name: llm-config
            name: llm-config
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"
\ No newline at end of file
# Kimi-K2.5 Aggregated Deployment with KVBM on Kubernetes # Kimi-K2.5 nvidia/Kimi-K2.5-NVFP4 — Aggregated Deployments on Kubernetes
> **Note:** The `nvidia/Kimi-K2.5-NVFP4` model requires a patched TensorRT-LLM container image because
> upstream TRT-LLM support for Kimi K2.5 has not yet been released. You must build the patched image before
> deploying either configuration below. See [`patch/`](patch/) for the script and instructions.
> **Text only:** The patch registers `KimiK25ForConditionalGeneration` by loading the DeepSeek-V3
> text backbone (`text_config`) only. The vision encoder is not loaded, so image inputs are not
> processed. Full multimodal support requires native upstream TRT-LLM support for Kimi K2.5.
This directory contains two aggregated deployment configurations for the `nvidia/Kimi-K2.5-NVFP4` model:
| Deployment | Manifest | Description |
|-----------|----------|-------------|
| **Standard Aggregated** | [`deploy.yaml`](deploy.yaml) | Basic aggregated serving with KV-aware routing |
| **Aggregated + KVBM** | [`deploy-kvbm.yaml`](deploy-kvbm.yaml) | Aggregated serving with CPU-offloaded KV cache (KV Block Manager) |
## Prerequisites
- A Kubernetes cluster with the [Dynamo Operator](https://docs.nvidia.com/dynamo/) installed
- 8× GPU nodes (e.g. H100/H200) - 8x B200 GPUs
- A `hf-token-secret` Secret containing your Hugging Face token
- A pre-existing `model-cache` PVC - A pre-existing `model-cache` PVC with the downloaded model
- Replace the placeholder image tag `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag` in `deploy-kvbm.yaml` with your actual image - A **patched container image** -- the deploy manifests ship with a placeholder `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched`. You must build a patched image and update the `image:` fields before deploying. See [patch instructions](patch/) for details.
## Deploy ---
## Standard Aggregated Deployment
Uses [`deploy.yaml`](deploy.yaml). This is the simpler configuration -- aggregated serving with KV-aware routing, no CPU-offloaded KV cache.
```bash
kubectl apply -f deploy-kvbm.yaml # Update the image in deploy.yaml to your patched image, then:
kubectl apply -f deploy.yaml -n ${NAMESPACE}
```
This creates:
- A **ConfigMap** (`llm-config`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache).
- A **DynamoGraphDeployment** (`kimi-k25-agg`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
---
## Aggregated Deployment with KVBM
Uses [`deploy-kvbm.yaml`](deploy-kvbm.yaml). This configuration adds CPU-offloaded KV cache via the KV Block Manager (KVBM), which allows larger effective context by spilling KV cache to host memory.
```bash
# Update the image in deploy-kvbm.yaml to your patched image, then:
kubectl apply -f deploy-kvbm.yaml -n ${NAMESPACE}
```
This creates:
- A **ConfigMap** (`llm-config-kimi-agg-kvbm`) with TRT-LLM engine parameters (TP=8, EP=8, FP8 KV-cache, KVBM connector).
- A **DynamoGraphDeployment** (`kimi-k25-agg-kvbm`) with a Frontend (KV-router mode) and a TrtllmWorker serving `nvidia/Kimi-K2.5-NVFP4`.
### KVBM Configuration
Key environment variables on the worker:
| Variable | Default | Description |
@@ -26,7 +63,7 @@ Key environment variables on the worker:
| `DYN_KVBM_METRICS` | `true` | Enable Prometheus metrics endpoint |
| `DYN_KVBM_METRICS_PORT` | `6880` | Port for the metrics endpoint |
## Enable Prometheus Metrics Scraping ### Enable Prometheus Metrics Scraping
If you have the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) installed, apply the PodMonitor:
......
@@ -55,7 +55,7 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
replicas: 1
TrtllmWorker:
componentType: worker
@@ -95,7 +95,10 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 # REQUIRED: replace with your patched image tag (run patch/patch-container.sh first).
# Upstream TRT-LLM does not support KimiK25ForConditionalGeneration without the patch.
# Example: ./patch/patch-container.sh <your-image> -> produces <your-image>-patched
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
env:
- name: TRTLLM_ENABLE_PDL
value: "1"
......
@@ -51,7 +51,7 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
replicas: 1
TrtllmWorker:
componentType: worker
@@ -84,7 +84,10 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0 # REQUIRED: replace with your patched image tag (run patch/patch-container.sh first).
# Upstream TRT-LLM does not support KimiK25ForConditionalGeneration without the patch.
# Example: ./patch/patch-container.sh <your-image> -> produces <your-image>-patched
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
env:
- name: TRTLLM_ENABLE_PDL
value: "1"
......
@@ -16,7 +16,7 @@ For example:
```bash
./patch-container.sh nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
# produces image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0-patched # produces image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag-patched
```
If `KimiK25ForConditionalGeneration` is already registered, the patch is skipped. The script is idempotent -- re-running it on an already-patched image is a no-op.
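The skip-if-registered behavior described above follows a common pattern for idempotent, append-only patches. The sketch below illustrates that pattern only; it is not the contents of `patch-container.sh`. `apply_append_only_patch` and both file arguments are hypothetical, and the marker is simply the class name quoted in this README.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; this is NOT the real patch-container.sh.
# MARKER is the class name quoted in this README; the function and its
# file arguments are hypothetical.
MARKER="KimiK25ForConditionalGeneration"

apply_append_only_patch() {
  local target="$1" snippet="$2"
  # Idempotency guard: if the marker is already present, change nothing.
  if grep -qF -- "$MARKER" "$target"; then
    echo "already patched, skipping"
    return 0
  fi
  # Append-only: existing lines are never edited; new lines go at the end.
  cat -- "$snippet" >> "$target"
  echo "patched"
}

# Example (paths are placeholders):
# apply_append_only_patch /path/to/model_registry.py /path/to/kimi_snippet.py
```

Guarding on a marker that the appended snippet itself contains is what makes a second run a no-op: the first run introduces the marker, and every later run finds it and exits early.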
......