Commits · 84e71e27d36e3db7168e673137ac9d6d10537efe · OpenDAS / dynamo

08 Jul, 2025 1 commit

feat: predictive active blocks for routing without load metrics (#1731) · 84e71e27

Yan Ru Pei authored Jul 08, 2025


Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com>

84e71e27

07 Jul, 2025 1 commit

feat: vllm speculative decoding metrics (#1549) · 439e977d

jain-ria authored Jul 07, 2025


Signed-off-by: jain-ria <riajain@NVIDIA.com>
Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com>

439e977d

03 Jul, 2025 1 commit
- feat: Implement frontend tokenization for embedding requests (#1494) · 47e7fde7
  Tom O'Brien authored Jul 03, 2025
  
  47e7fde7
26 Jun, 2025 1 commit
- feat: Add experimental WideEP + EPLB aggregated example for TRTLLM (#1652) · 5fe5a950
  Ryan McCormick authored Jun 27, 2025
  
  5fe5a950
25 Jun, 2025 2 commits
- feat: support batch `/completions` (#1626) · fc16a79b
  ishandhanani authored Jun 25, 2025

  Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

  fc16a79b
- fix: add missing await in vllm-v1 `clear_kv_blocks` endpoint (#1642) · 3e1a5534
  Will Killian authored Jun 25, 2025

  Signed-off-by: Will Killian <wkillian@nvidia.com>

  3e1a5534
17 Jun, 2025 1 commit
- fix: Fix sample disagg config for trtllm standalone (#1566) · 65f2de5f
  Tanmay Verma authored Jun 17, 2025
  
  65f2de5f
12 Jun, 2025 2 commits
- feat: add endpoint to clear all kv blocks in vllm v1 (#1384) · d0d364e3
  jain-ria authored Jun 11, 2025
  
  d0d364e3
- fix: Python respects DYN_LOG too (#1486) · af1f1155
  Alec authored Jun 11, 2025
  
  af1f1155
10 Jun, 2025 1 commit
- chore: Default to pytorch backend in trtllm worker (#1445) · d83633b5
  Ryan McCormick authored Jun 10, 2025
  
  d83633b5
04 Jun, 2025 1 commit
- feat: add implementation for embeddings (#1290) · e83009a6
  Tom O'Brien authored Jun 04, 2025
  
  e83009a6
03 Jun, 2025 1 commit
- feat: Enable disagg support in trtllm standalone script (#1355) · ac53c0bb
  Tanmay Verma authored Jun 03, 2025
  
  ac53c0bb
30 May, 2025 1 commit
- refactor: rename KvMetricsPublisher to WorkerMetricsPublisher (#1284) · 2f8da9ad
  Alec authored May 30, 2025
  
  2f8da9ad
29 May, 2025 4 commits
- feat: Publish events and metrics when using kv routing (#1262) · f9ba6f5c
  Tanmay Verma authored May 29, 2025
  
  f9ba6f5c
- fix: Renamed event publisher classes and configuration (#1273) · f67dc38b
  Alec authored May 29, 2025
  
  f67dc38b
- feat: add KV Event Publishing to vLLM v1 (#1181) · 0df6d462
  Alec authored May 29, 2025
  
  0df6d462
- fix: Import json when using --engine-extra-args (#1261) · 8d324489
  jthomson04 authored May 28, 2025
  
  8d324489
28 May, 2025 2 commits
- feat: Enable dynamo-run out=trtllm (#1223) · 1b1e089a
  Tanmay Verma authored May 28, 2025
  
  1b1e089a
- fix: dynamo-run pass proper args using register-llm (#1230) · cc40af70
  Alec authored May 28, 2025
  
  cc40af70
27 May, 2025 1 commit
- feat: Add metrics and event publishers (#1192) · 9acaa8d1
  Tanmay Verma authored May 27, 2025
  
  9acaa8d1
22 May, 2025 3 commits

feat: Add standalone script for TRTLLM integration into dynamo-run (#1162) · 3d4fe574

Tanmay Verma authored May 22, 2025

3d4fe574

feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system, pass this flag on the worker node; the value is propagated to the ingress node via the model deployment card.

The block size was previously hard-coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
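
To make the effect of `--kv-cache-block-size` concrete, here is a minimal sketch of the block arithmetic. It assumes the block size is counted in tokens per block, as is usual for paged KV caches; the function names are hypothetical and are not part of `dynamo-run`.

```python
import math

# The default named in this commit message (previously hard-coded).
DEFAULT_KV_CACHE_BLOCK_SIZE = 16


def blocks_needed(num_tokens: int, block_size: int = DEFAULT_KV_CACHE_BLOCK_SIZE) -> int:
    """Blocks required to hold num_tokens, counting a partially filled last block."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return math.ceil(num_tokens / block_size)


def full_blocks(num_tokens: int, block_size: int = DEFAULT_KV_CACHE_BLOCK_SIZE) -> int:
    """Completely filled blocks only (illustrative helper, assumed semantics)."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return num_tokens // block_size
```

Under these assumptions a 1000-token sequence needs 63 blocks at the default size of 16 and 16 blocks at a size of 64, of which 15 are completely full.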

183f2b32

feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821

Graham King authored May 22, 2025

Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.

Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.

Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
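
The first todo item could look roughly like the sketch below. It is not the pre-processor's actual code: `clamp_max_tokens` is a hypothetical helper, and it assumes the token budget left for generation is the context length minus the prompt length.

```python
from typing import Optional


def clamp_max_tokens(
    requested: Optional[int], prompt_tokens: int, context_length: int
) -> int:
    """Limit a request's max_tokens so prompt plus output fits in the context.

    Assumption: the budget is context_length - prompt_tokens. A request that
    sets no max_tokens (None) receives the whole remaining budget.
    """
    remaining = context_length - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit within the context length")
    if requested is None:
        return remaining
    return min(requested, remaining)
```

With `--context-length 4096` and a 100-token prompt, a request asking for 8000 tokens would be limited to 3996, while a request asking for 50 would be left unchanged.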

6d5da821

21 May, 2025 1 commit
- fix: register model after engine load (#1145) · 08c01d8c
  Neelay Shah authored May 21, 2025
  
  08c01d8c
14 May, 2025 1 commit

feat(dynamo-run): KV-aware routing (#1064) · 29813508

Graham King authored May 14, 2025

Router:
```
dynamo-run in=http out=dyn://dynamo.endpoint.generate --router-mode kv
```

Worker (run N of these):
```
dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B
```

You need a patched vllm and the C bindings `.so`. Full docs are in the updated guide: `docs/guides/dynamo_run.md`.

This gives us a pure-Rust ingress node: OpenAI-compliant HTTP server + pre-processor + KV-aware router.
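
For readers new to the idea, KV-aware routing can be illustrated with a toy example: send each request to the worker that already caches the longest prefix of its token blocks. This is only a sketch of the general technique, not Dynamo's router. The hashing scheme, the tie-break rule, and every name below are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Sequence, Set


def block_hashes(tokens: Sequence[int], block_size: int = 16) -> List[str]:
    """Chained hash per full block, so each hash identifies the whole prefix."""
    hashes: List[str] = []
    prev = b""
    usable = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, usable, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + ",".join(map(str, block)).encode()).digest()
        hashes.append(digest.hex())
        prev = digest
    return hashes


def prefix_overlap(request_hashes: Sequence[str], cached: Set[str]) -> int:
    """Count leading request blocks that the worker already holds."""
    matched = 0
    for block_hash in request_hashes:
        if block_hash not in cached:
            break
        matched += 1
    return matched


def pick_worker(
    tokens: Sequence[int], workers: Dict[str, Set[str]], block_size: int = 16
) -> str:
    """Return the worker with the largest cached prefix; ties go to the first name."""
    if not workers:
        raise ValueError("no workers available")
    request_hashes = block_hashes(tokens, block_size)
    return max(sorted(workers), key=lambda name: prefix_overlap(request_hashes, workers[name]))
```

A real router would also have to weigh worker load and keep the cache view up to date from worker events; this sketch looks only at prefix overlap.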

29813508

09 May, 2025 2 commits
- fix(bindings): serve_endpoint no longer takes a lease (#1014) · c7bb1e83
  Graham King authored May 09, 2025
  
  c7bb1e83
- feat(sglang): aggregated support (#937) · 5d5235bc
  ishandhanani authored May 08, 2025

  Co-authored-by: ishandhanani <ishandhananai@gmail.com>

  5d5235bc
07 May, 2025 2 commits

fix: Fix vllm/sglang engine model name if using HF repo (#986) · 92bbbc39

Graham King authored May 07, 2025

Signed-off-by: Graham King <graham@gkgk.org>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

92bbbc39

chore: Remove embedded Python vllm and sglang engines (#966) · 42969800

Graham King authored May 07, 2025

vllm and sglang are now the sub-process engines from #954

Also updated the docs on running vllm and sglang multi-GPU (tensor parallel) and multi-node (pipeline parallel).

42969800

06 May, 2025 1 commit

feat(dynamo-run): vllm and sglang subprocess engines (#954) · 28fd481c

Graham King authored May 06, 2025

New vllm and sglang engines that run in a sub-process. These will hopefully replace the existing embedded Python engines.

Why?

- Pure Python: working on it does not require knowing Rust. Much simpler to maintain.
- No embedded Python interpreter, which avoids linking libpython and avoids the macOS virtualenv issues.
- Should have better performance as it's "native" vllm / sglang.
- Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.

28fd481c