Commits · 55c6525f5b40e9dac57a764f9f36a912f4de25cc · OpenDAS / dynamo

09 Jun, 2025 3 commits
- feat: Improved offload queueing and block eviction ordering (#1425) · 55c6525f
  jthomson04 authored Jun 09, 2025
  
  55c6525f
- feat: KVBM prometheus monitoring (#1211) · a1aea900
  jthomson04 authored Jun 09, 2025
  
  a1aea900
- feat: Restructure the KVBM WriteTo trait (#1363) · 312ee8e2
  jthomson04 authored Jun 09, 2025
  
  312ee8e2
06 Jun, 2025 1 commit
- feat: KVBM dynamo runtime + event manager (#1195) · 3216003c
  Olga Andreeva authored Jun 06, 2025
  
  3216003c
04 Jun, 2025 4 commits
- refactor: Rename CompletionRequest to NvCreateCompletionRequest (#1383) · c103d56a
  Paul Hendricks authored Jun 04, 2025
  
  c103d56a
- feat: Support larger Gemma 3 models (#1359) · cfd12d7f
  Graham King authored Jun 04, 2025
```
Publish `generation_config.json` from worker to ingress, as part of Model Deployment Card. That allows ingress to read key fields out of it. Gemma 3 4B+ has some important information that's only in there.
```
  cfd12d7f
- feat: add implementation for embeddings (#1290) · e83009a6
  Tom O'Brien authored Jun 04, 2025
  
  e83009a6
- feat: Integrate KVBM with `CriticalTaskHandle` (#1321) · 25c711f8
  jthomson04 authored Jun 03, 2025
  
  25c711f8
03 Jun, 2025 1 commit

feat: add more metrics to rust frontend (#1315) · 98d4abbb

Hongkuan Zhou authored Jun 03, 2025


Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: jothomson <jwillthomson19@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

98d4abbb

02 Jun, 2025 2 commits
- feat: expose router configurations to dynamo-run (#1259) · d849f7ec
  Hongkuan Zhou authored Jun 02, 2025
  
  d849f7ec
- chore: Remove PreprocessedRequest alias BackendInput (#1307) · 3f6a7472
  Graham King authored Jun 02, 2025
```
It was confusing to have two names for one type.

This tidy-up started in #1064 and is now complete.
```
  3f6a7472
30 May, 2025 3 commits
- feat: all blocks cleared event (#1279) · 1d34af75
  jain-ria authored May 30, 2025
  
  1d34af75
- refactor: rename KvMetricsPublisher to WorkerMetricsPublisher (#1284) · 2f8da9ad
  Alec authored May 30, 2025
  
  2f8da9ad
- refactor: Refactor kv event publishers (#1287) · 9210a26d
  jthomson04 authored May 30, 2025
  
  9210a26d
29 May, 2025 8 commits

feat(dynamo-run): Use llama.cpp as the default engine for GGUF (#1276) · 3e3c3b10

Graham King authored May 29, 2025

Previously `mistral.rs` was the default engine for both safetensors and GGUF models. Now it is only the default for safetensors, `llama.cpp` becomes the default for GGUF.

Why?

- Since #1177 `llama.cpp` is built-in by default, so we can switch.
- `llama.cpp` is very good at running GGUF (but can't run other types of model), so we should switch.

Dynamo's multi-engine support gives us a secret super-power: we can use the best engine for each specific format or model.

We can still run GGUF with mistralrs by doing `out=mistralrs`.
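
A minimal sketch of the selection rule described above, assuming the format is detected from the file extension. The `Engine` enum and `default_engine` helper are hypothetical names for illustration, not taken from the dynamo-run source.

```rust
use std::path::Path;

/// Engines named in this commit message; the enum itself is hypothetical.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Engine {
    MistralRs,
    LlamaCpp,
}

/// Pick a default engine from the model path (hypothetical helper).
/// A `.gguf` file goes to llama.cpp; anything else is assumed to be a
/// safetensors checkpoint and goes to mistral.rs.
fn default_engine(model_path: &Path) -> Engine {
    let is_gguf = model_path
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("gguf"))
        .unwrap_or(false);
    if is_gguf {
        Engine::LlamaCpp
    } else {
        Engine::MistralRs
    }
}

fn main() {
    // Both paths are made-up examples.
    println!("{:?}", default_engine(Path::new("models/example-Q4_K_M.gguf")));
    println!("{:?}", default_engine(Path::new("models/example")));
}
```

An explicit `out=mistralrs` on the command line would bypass a default like this one.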

3e3c3b10

fix: Only check model name on etcd-registered endpoints (#1263) · 4e47903b
jthomson04 authored May 29, 2025

4e47903b
fix: Renamed event publisher classes and configuration (#1273) · f67dc38b
Alec authored May 29, 2025

f67dc38b
feat: Restructure kv manager block registration (#1093) · 3d40a692
jthomson04 authored May 29, 2025

3d40a692

feat: Initial Granite support (#1271) · 7d0c9386

Graham King authored May 29, 2025

- Add Granite to our tokenizer
- Fix pre-processor to load context length correctly
- Add strftime_now Jinja function for prompt templates
- Update llama.cpp
- Handle trtllm errors when not using trtllm

Support depends on the engine:

- `mistral.rs`, our default engine, doesn't support Granite yet.

- `llama.cpp` does and works very well:
```
dynamo-run out=llamacpp ~/llms/granite-3.3-2b-instruct-Q4_K_M.gguf --context-length 16384
```

- `vllm` also works very well:
```
dynamo-run in=http out=vllm ~/llms/granite-3.3-2b-instruct --context-length 16384
```

- `sglang` mostly works, but it doesn't catch the stop token, so the HTTP ingress catches it instead and logs an error. The Text ingress doesn't catch it because I disabled that check to make the raw echo engine work. There is still a bit of work to do here.

Closes: #1245

7d0c9386

chore: update dynamo and nixl versions for 0.3.0 (#1240) · 9d9a1d9b
Anant Sharma authored May 29, 2025

9d9a1d9b

feat: expose estimated kv cache hit in dynamo-run (#1246) · c9eb6a83

Hongkuan Zhou authored May 29, 2025


Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

c9eb6a83

feat: add KV Event Publishing to vLLM v1 (#1181) · 0df6d462
Alec authored May 29, 2025

0df6d462

28 May, 2025 3 commits
- fix: correct calculation of block needed in rust kv router (#1253) · 8cc13610
  Hongkuan Zhou authored May 28, 2025
  
  8cc13610
- fix(dynamo-llm): Use HF_TOKEN env var (#1249) · 471a352f
  Graham King authored May 28, 2025
```
Fixes #286
```
  471a352f
- fix: dynamo-run add warning if block-size different (#1233) · e450c2c7
  Alec authored May 28, 2025
  
  e450c2c7
27 May, 2025 1 commit
- feat(http): add health check endpoint (#1037) · 39d01eac
  ishandhanani authored May 27, 2025
  
  39d01eac
24 May, 2025 1 commit
- feat: kvbm offload fixes and tests (#1191) · 6d9aac77
  jthomson04 authored May 24, 2025
  
  6d9aac77
23 May, 2025 4 commits
- chore: rm duplicate fwd pass metric (#1190) · 9d944c27
  Yan Ru Pei authored May 23, 2025
  
  9d944c27
- chore: Upgrade Rust to 1.87 (#1189) · a4c49fe5
  Graham King authored May 23, 2025
  
  a4c49fe5
- fix: etcd.rs - linear increasing watch with number of requests (#1081) · 3f9c3ffe
  Yan Ru Pei authored May 23, 2025
```
Signed-off-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Ryan Olson <ryanolson@users.noreply.github.com>
```
  3f9c3ffe
- feat: adding arena allocator for storage objects (#1178) · 31ff2370
  Ryan Olson authored May 23, 2025
  
  31ff2370
22 May, 2025 4 commits

feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card.

Previously hard coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
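
To illustrate what the block size controls, here is a rough sketch of how many fixed-size KV cache blocks a sequence occupies. The `blocks_needed` helper and the constant name are hypothetical. Only the default of 16 comes from this commit.

```rust
/// Default block size stated in this commit message.
const DEFAULT_KV_CACHE_BLOCK_SIZE: usize = 16;

/// Number of fixed-size KV cache blocks needed to hold `num_tokens` tokens
/// (hypothetical helper). A partially filled block still counts as a block,
/// so the division rounds up.
fn blocks_needed(num_tokens: usize, block_size: usize) -> usize {
    assert!(block_size > 0, "block size must be non-zero");
    num_tokens.div_ceil(block_size)
}

fn main() {
    let tokens = 100;
    println!(
        "default ({}): {} blocks",
        DEFAULT_KV_CACHE_BLOCK_SIZE,
        blocks_needed(tokens, DEFAULT_KV_CACHE_BLOCK_SIZE)
    );
    println!("block size 64: {} blocks", blocks_needed(tokens, 64));
}
```

Worker and ingress need to agree on this value for block counts to match, which is presumably why it travels in the model deployment card.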

183f2b32

fix: Fix race condition in kv_router unit test (#1174) · 3bde1e45

Graham King authored May 22, 2025

Removed the hard-coded sleeps and explained what we're testing.

Closes https://github.com/ai-dynamo/dynamo/issues/1132

The race condition is that `apply_event` sends a message on a channel; it does not directly apply the event. At some later point the tokio runtime schedules the task running the channel receiver, which applies the event. If that had not happened yet when the test checked the result, the test would fail.
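
The pattern can be sketched with standard-library threads and channels instead of tokio. This is an illustration of the race and of one way to avoid sleeping, not the code from this commit. `Indexer`, `Command`, and `snapshot` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;

/// Messages handled by the background worker (hypothetical).
enum Command {
    Apply(u64),
    /// Ask the worker for its current state; the reply arrives on the given channel.
    Snapshot(mpsc::Sender<Vec<u64>>),
}

/// Hypothetical stand-in for an indexer whose `apply_event` only enqueues work.
struct Indexer {
    tx: mpsc::Sender<Command>,
}

impl Indexer {
    fn start() -> Self {
        let (tx, rx) = mpsc::channel::<Command>();
        thread::spawn(move || {
            let mut applied = Vec::new();
            for cmd in rx {
                match cmd {
                    Command::Apply(id) => applied.push(id),
                    Command::Snapshot(reply) => {
                        let _ = reply.send(applied.clone());
                    }
                }
            }
        });
        Indexer { tx }
    }

    /// Returns immediately; the worker applies the event at some later point.
    fn apply_event(&self, id: u64) {
        self.tx.send(Command::Apply(id)).expect("worker stopped");
    }

    /// The channel is FIFO, so the reply reflects every event sent before this call.
    fn snapshot(&self) -> Vec<u64> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx
            .send(Command::Snapshot(reply_tx))
            .expect("worker stopped");
        reply_rx.recv().expect("worker stopped")
    }
}

fn main() {
    let indexer = Indexer::start();
    indexer.apply_event(1);
    indexer.apply_event(2);
    // No sleep needed: the snapshot request queues behind both events.
    assert_eq!(indexer.snapshot(), vec![1, 2]);
}
```

A test that inspected shared state right after `apply_event` would race with the worker. Sending the query through the same channel orders it after the events.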

3bde1e45

feat: Various KVBM improvements (#1134) · 5d5080ba
jthomson04 authored May 22, 2025

5d5080ba

feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821

Graham King authored May 22, 2025

Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.

Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.

Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
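
One possible shape for the first todo item, as a sketch only. The `clamp_max_tokens` helper is hypothetical, and subtracting the prompt length from the budget is an assumption about how the limit would be applied.

```rust
/// Limit a request's `max_tokens` so prompt plus completion fits in the context
/// window (hypothetical helper, not existing dynamo code).
fn clamp_max_tokens(requested: Option<u32>, prompt_tokens: u32, context_length: u32) -> u32 {
    // Tokens left for generation once the prompt is accounted for.
    let remaining = context_length.saturating_sub(prompt_tokens);
    match requested {
        Some(n) => n.min(remaining),
        // No explicit limit: allow generation up to the end of the context.
        None => remaining,
    }
}

fn main() {
    let context_length = 16384;
    println!("{}", clamp_max_tokens(Some(100), 10, context_length));
    println!("{}", clamp_max_tokens(Some(20000), 384, context_length));
    println!("{}", clamp_max_tokens(None, 384, context_length));
}
```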

6d5da821

21 May, 2025 3 commits
- fix(llmctl): Use ModelWatcher instead of direct etcd operations (#1150) · 3e8e38a9
  Graham King authored May 21, 2025
  
  3e8e38a9
- chore: Fix model removal on instance stop, refactor discovery (#1142) · b520bf44
  Graham King authored May 21, 2025
```
- Stop advertising a model when its last instance stops. Previously this happened when any instance stopped.
- Faster locks on model manager.
- Move discovery code out of http, as it is used by all inputs.
```
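
The first point can be sketched as per-model instance tracking. This is a simplified illustration with hypothetical names (`ModelRegistry`, `add_instance`, `remove_instance`), not the actual model manager.

```rust
use std::collections::{HashMap, HashSet};

/// Tracks which instances serve each model (hypothetical type).
#[derive(Default)]
struct ModelRegistry {
    instances: HashMap<String, HashSet<u64>>,
}

impl ModelRegistry {
    fn add_instance(&mut self, model: &str, instance_id: u64) {
        self.instances
            .entry(model.to_string())
            .or_default()
            .insert(instance_id);
    }

    /// Remove one instance. Returns true only when it was the model's last
    /// instance, meaning the model should no longer be advertised.
    fn remove_instance(&mut self, model: &str, instance_id: u64) -> bool {
        let Some(ids) = self.instances.get_mut(model) else {
            return false;
        };
        ids.remove(&instance_id);
        if ids.is_empty() {
            self.instances.remove(model);
            true
        } else {
            false
        }
    }

    fn is_advertised(&self, model: &str) -> bool {
        self.instances.contains_key(model)
    }
}

fn main() {
    let mut registry = ModelRegistry::default();
    registry.add_instance("example-model", 1);
    registry.add_instance("example-model", 2);
    registry.remove_instance("example-model", 1);
    println!("after first stop: {}", registry.is_advertised("example-model"));
    registry.remove_instance("example-model", 2);
    println!("after last stop: {}", registry.is_advertised("example-model"));
}
```
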
  b520bf44
- feat: vllm mock workers, Rusty skeleton (#1033) · 03c160af
  Yan Ru Pei authored May 21, 2025
```
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
```
  03c160af
20 May, 2025 1 commit
- feat: adding outer dimension to isolate k/v blocks (#1126) · 80256acf
  Ryan Olson authored May 20, 2025
  
  80256acf
19 May, 2025 1 commit

feat: Support multiple models on single ingress node (#1127) · aeb79e62

Graham King authored May 19, 2025

We can now do this:

- Node 1:

```
dynamo-run in=http out=dyn
```

- Node 2 and 3, two instances of component 'backend' in the nemotron_ultra pipeline:

```
dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
```

- Node 4 and 5, two instances of the 'backend' component in nemotron_super pipeline:

```
dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
```

The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now.

As part of this, auto-discovery is now always `out=dyn`, with no extra URL parts. Previously it could only route to a single pipeline.
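
For readers unfamiliar with the `dyn://` addresses used above, here is a small sketch that splits one into its three parts. The three-part layout is inferred from the examples in this message. `EndpointPath` and `parse_endpoint` are hypothetical names, not the dynamo API.

```rust
/// Parsed form of `dyn://<pipeline>.<component>.<endpoint>` (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct EndpointPath {
    pipeline: String,
    component: String,
    endpoint: String,
}

/// Split a `dyn://` address into its parts. Returns `None` for any other shape.
fn parse_endpoint(url: &str) -> Option<EndpointPath> {
    let rest = url.strip_prefix("dyn://")?;
    let mut parts = rest.split('.');
    let pipeline = parts.next()?;
    let component = parts.next()?;
    let endpoint = parts.next()?;
    // Reject extra segments and empty segments such as `dyn://a..c`.
    let any_empty = [pipeline, component, endpoint].iter().any(|p| p.is_empty());
    if parts.next().is_some() || any_empty {
        return None;
    }
    Some(EndpointPath {
        pipeline: pipeline.to_string(),
        component: component.to_string(),
        endpoint: endpoint.to_string(),
    })
}

fn main() {
    for url in [
        "dyn://nemotron_ultra.backend.generate",
        "dyn://nemotron_super.backend.generate",
    ] {
        println!("{:?}", parse_endpoint(url));
    }
}
```

With addresses parsed this way, an ingress could group discovered instances by pipeline and route each request to the matching group.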

Also:
- Refactor endpoint / instance naming now that I understand them
- Fix removing models when their instance stops.

aeb79e62