Commit 284f772b authored by Yuewei Na, committed by GitHub

docs: add known issues section for TRT-LLM backend (#5801)


Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Signed-off-by: Yuewei Na <248773860+nv-yna@users.noreply.github.com>
Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>
parent 454c28ab
@@ -44,6 +44,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
- [Multimodal Support](#multimodal-support)
- [Logits Processing](#logits-processing)
- [Performance Sweep](#performance-sweep)
- [Known Issues and Mitigations](#known-issues-and-mitigations)
## Feature Support Matrix
@@ -297,3 +298,31 @@ For detailed instructions on running comprehensive performance sweeps across bot
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .
## Known Issues and Mitigations
### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)
**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
**Symptoms:**
- Workers function normally initially but hang after heavy load testing
- Inference requests get stuck and eventually time out
- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
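
To check whether a worker log shows these symptoms, the two messages quoted above can be counted with a short script. This is a minimal sketch, not part of Dynamo or TensorRT-LLM: the function name `count_signatures` is hypothetical, and it assumes each message appears on a single log line exactly as quoted.

```python
# Log messages quoted in the symptoms list above.
SIGNATURES = (
    "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, "
    "may not have enough kvCache",
    "asyncio.exceptions.InvalidStateError: invalid state",
)


def count_signatures(lines):
    """Return a dict mapping each known message to the number of lines containing it."""
    counts = {sig: 0 for sig in SIGNATURES}
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
    return counts


# Demonstration on in-memory lines; the log prefixes here are made up.
sample_log = [
    "[W] num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, "
    "may not have enough kvCache",
    "[I] request completed",
    "asyncio.exceptions.InvalidStateError: invalid state",
]
for signature, hits in count_signatures(sample_log).items():
    print(hits, signature)
```

Any iterable of strings works as input, so an open file handle for a saved worker log can be passed in place of `sample_log`.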
**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to time out, leaving workers stuck waiting for phantom transfers and entering an irrecoverable deadlock state.
**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
```yaml
cache_transceiver_config:
backend: DEFAULT
max_tokens_in_buffer: 65536 # Must exceed max ISL
```
For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.
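
The "must exceed max ISL" rule can be checked before deployment. The sketch below is illustrative only: `check_buffer` is a hypothetical helper, the `max_isl` values are made up, and the config is assumed to be already parsed into a dict (for instance with PyYAML's `yaml.safe_load`). Treating a missing `max_tokens_in_buffer` as a failure is also an assumption, since the default value is not stated here.

```python
def check_buffer(config, max_isl):
    """Return True if max_tokens_in_buffer is set and strictly greater than max_isl."""
    transceiver = config.get("cache_transceiver_config") or {}
    buffer_tokens = transceiver.get("max_tokens_in_buffer")
    if buffer_tokens is None:
        # Assumption: an unset value is reported as failing the check.
        return False
    return buffer_tokens > max_isl


# Mirrors the YAML fragment above as a plain dict.
example_config = {
    "cache_transceiver_config": {
        "backend": "DEFAULT",
        "max_tokens_in_buffer": 65536,
    }
}

# Hypothetical maximum input sequence lengths.
print(check_buffer(example_config, max_isl=32768))   # buffer is larger than ISL
print(check_buffer(example_config, max_isl=131072))  # buffer is smaller than ISL
```

Running the same check against both `prefill.yaml` and `decode.yaml` catches the case where only one of the two files was updated.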
**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)
---