Commits · 4eae238f2f2c75278707961ee995dc1e3a765775 · OpenDAS / dynamo

23 May, 2025 3 commits
- feat: add dynamo operator overview doc (#688) · 4eae238f
  julienmancuso authored May 23, 2025
  
- feat: support k8s target in dynamo deploy command (#1104) · 33e72720
  hhzhang16 authored May 23, 2025
  
- feat: adding arena allocator for storage objects (#1178) · 31ff2370
  Ryan Olson authored May 23, 2025
  
22 May, 2025 11 commits

fix: add blocking mode for k8s connector in planner (#1176) · 14e1d446
julienmancuso authored May 22, 2025

feat: Add standalone script for TRTLLM integration into dynamo-run (#1162) · 3d4fe574
Tanmay Verma authored May 22, 2025


feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system this flag goes on the worker node, and the value is propagated to the ingress via the Model Deployment Card.

Previously this was hard-coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
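To make the effect of the block size concrete, here is a minimal sketch, assuming a paged KV cache in which a sequence always occupies whole blocks: the block count is the ceiling of the token count divided by the block size. `blocks_needed` and `DEFAULT_KV_CACHE_BLOCK_SIZE` are hypothetical names for illustration, not part of dynamo-run.

```rust
/// Hypothetical constant mirroring the default of 16 tokens per block.
pub const DEFAULT_KV_CACHE_BLOCK_SIZE: usize = 16;

/// Hypothetical helper: how many KV cache blocks a sequence of
/// `num_tokens` tokens occupies when each block holds `block_size` tokens.
/// A partially filled final block still counts as a whole block.
pub fn blocks_needed(num_tokens: usize, block_size: usize) -> usize {
    assert!(block_size > 0, "block size must be non-zero");
    // Integer ceiling division, no floating point involved.
    (num_tokens + block_size - 1) / block_size
}
```

For example, a 1000-token sequence needs 63 blocks at the default size of 16, but only 16 blocks with `--kv-cache-block-size 64`.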


feat: Add TTFT and ITL Interpolation to Profiling Script (#1159) · 7860861f
Hongkuan Zhou authored May 22, 2025
```
Co-authored-by: root <root@kkranen-dt.nvidia.com>
```

fix: Fix race condition in kv_router unit test (#1174) · 3bde1e45

Graham King authored May 22, 2025

Removed the hard-coded sleeps and explained what we're testing.

Closes https://github.com/ai-dynamo/dynamo/issues/1132

The race condition: `apply_event` sends a message on a channel; it does not apply the event directly. At some later point the tokio runtime schedules the task running the channel receiver, which applies the event. If that had not happened yet when the test checked the result, the test failed.
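The race, and one sleep-free way around it, can be shown with a minimal sketch that uses `std::sync::mpsc` and a plain thread in place of the tokio task. `Indexer`, `Msg`, and `block_count` are hypothetical stand-ins, not the kv_router API, and the commit does not say which mechanism it used. The idea: because a channel delivers messages in order, a query sent on the same channel after the event, with the test waiting for its reply, is only answered once the event has been applied.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Messages handled by the background worker (hypothetical).
enum Msg {
    /// Record a block hash. Like `apply_event`, sending this only enqueues work.
    Apply(u64),
    /// Ask how many blocks are stored; the answer goes back on the given sender.
    Count(Sender<usize>),
}

/// Hypothetical indexer whose state is owned by a separate worker thread.
pub struct Indexer {
    tx: Sender<Msg>,
}

impl Indexer {
    pub fn new() -> Self {
        let (tx, rx) = channel::<Msg>();
        // The worker loop ends once every `Sender<Msg>` has been dropped.
        thread::spawn(move || {
            let mut blocks = HashSet::new();
            // Messages are received in the order they were sent.
            for msg in rx {
                match msg {
                    Msg::Apply(hash) => {
                        blocks.insert(hash);
                    }
                    Msg::Count(reply) => {
                        // Ignore the error if the caller stopped waiting.
                        let _ = reply.send(blocks.len());
                    }
                }
            }
        });
        Indexer { tx }
    }

    /// Returns as soon as the event is queued, not once it is applied.
    /// Inspecting state right after this call is the race.
    pub fn apply_event(&self, hash: u64) {
        self.tx.send(Msg::Apply(hash)).expect("worker thread is running");
    }

    /// Round-trips through the worker: any `apply_event` that completed
    /// before this call has been processed before the count is produced.
    pub fn block_count(&self) -> usize {
        let (reply_tx, reply_rx) = channel();
        self.tx.send(Msg::Count(reply_tx)).expect("worker thread is running");
        reply_rx.recv().expect("worker thread replied")
    }
}
```

In a test, calling `apply_event` and then asserting on `block_count()` is deterministic, because the count request is queued behind the event; no sleep is needed.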


feat: Various KVBM improvements (#1134) · 5d5080ba
jthomson04 authored May 22, 2025

chore: vLLM arm build has verbose logging turned on to see progress (#1160) · d3b0cae1
Kyle McGill authored May 22, 2025

fix: typo in planner doc and log (#1165) · 3d697d4d
Hongkuan Zhou authored May 22, 2025


feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821

Graham King authored May 22, 2025

Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vLLM won't start unless it can allocate enough KV cache for the entire context.

Allow passing `--context-length <N>` to `dynamo-run` to cap the context length so long-context models will fit.

Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting `stop_conditions.max_tokens`. The mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
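The first future todo above amounts to clamping each request's token budget. A minimal sketch, assuming the budget is whatever remains of the context after the prompt; `clamp_max_tokens` is a hypothetical helper, not the pre-processor's actual code.

```rust
/// Hypothetical helper: limit a request's `max_tokens` so that the prompt
/// plus the generated tokens never exceed the model's context length.
/// Returns `None` if the prompt alone already fills (or overflows) the context.
pub fn clamp_max_tokens(
    requested: Option<u32>,
    prompt_len: u32,
    context_length: u32,
) -> Option<u32> {
    // Tokens left for generation after the prompt; zero means no room.
    let remaining = context_length
        .checked_sub(prompt_len)
        .filter(|&r| r > 0)?;
    Some(match requested {
        // An explicit limit is kept only if it fits.
        Some(n) => n.min(remaining),
        // No explicit limit: generate up to the end of the context.
        None => remaining,
    })
}
```

With an 8192-token context and a 100-token prompt, a request for 10,000 tokens would be cut to 8092, while a request for 500 is left unchanged.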


fix: Enable Dynamo HTTP servers to run on IPv6-only hosts (#1166) · 27e92701
jmswen authored May 21, 2025

docs: Fix broken link in python bindings documentation (#1163) · f992a6a2
Suman Tatiraju authored May 22, 2025
```
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
```

21 May, 2025 10 commits
- fix(llmctl): Add back the model_type in remove (#1158) · f1896c49
  Graham King authored May 21, 2025
  
- fix(dynamo-run): Don't exit interactive chat on error (#1155) · b226b7b0
  Graham King authored May 21, 2025
```
Previously any error would cause us to halt. Most of them are recoverable. So now we print the error and return to the prompt.
```
- fix(llmctl): Use ModelWatcher instead of direct etcd operations (#1150) · 3e8e38a9
  Graham King authored May 21, 2025
  
- fix: make component type a simple string (#1144) · dcad8ac7
  mohammedabdulwahhab authored May 21, 2025
```
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>
```
- fix: register model after engine load (#1145) · 08c01d8c
  Neelay Shah authored May 21, 2025
  
- docs: Add sphinx-theme based userguides (#528) · 8d636ebd
  Suman Tatiraju authored May 21, 2025
```
Signed-off-by: Suman Tatiraju <167138127+statiraju@users.noreply.github.com>
Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>
```
- feat: rename dynamo decorator (#1133) · 6d46288c
  Biswa Panda authored May 21, 2025
  
- chore: Fix model removal on instance stop, refactor discovery (#1142) · b520bf44
  Graham King authored May 21, 2025
```
- Stop advertising a model when its last instance stops. Previously this happened when any instance stopped.
- Faster locks on model manager.
- Move discovery code out of http, as it is used by all inputs.
```
- feat: vllm mock workers, Rusty skeleton (#1033) · 03c160af
  Yan Ru Pei authored May 21, 2025
```
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
```
- fix: Fix the protocol in the example (#1146) · 84377e5d
  Tanmay Verma authored May 21, 2025
  
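The first point of b520bf44 (stop advertising a model only when its last instance stops) comes down to tracking the live instances of each model. A minimal sketch, assuming instances are identified by a `u64` ID; `ModelRegistry` and its methods are hypothetical, not the model manager's actual API.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical registry mapping each model name to its live instance IDs.
#[derive(Default)]
pub struct ModelRegistry {
    instances: HashMap<String, HashSet<u64>>,
}

impl ModelRegistry {
    /// Record a new instance serving `model`.
    pub fn add_instance(&mut self, model: &str, instance_id: u64) {
        self.instances
            .entry(model.to_string())
            .or_default()
            .insert(instance_id);
    }

    /// Remove one instance. Returns `true` only when it was the model's
    /// last instance, i.e. when the model should stop being advertised.
    pub fn remove_instance(&mut self, model: &str, instance_id: u64) -> bool {
        let ids = match self.instances.get_mut(model) {
            Some(ids) => ids,
            None => return false,
        };
        ids.remove(&instance_id);
        if ids.is_empty() {
            // No instances left: forget the model entirely.
            self.instances.remove(model);
            true
        } else {
            false
        }
    }

    /// Whether `model` still has at least one live instance.
    pub fn is_advertised(&self, model: &str) -> bool {
        self.instances.contains_key(model)
    }
}
```

The old behavior corresponds to un-advertising the model on every `remove_instance` call; keeping a set per model is what lets the remaining instances keep it visible.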
20 May, 2025 5 commits
- fix: set gpus as strings in config files (#1123) · 35229c74
  julienmancuso authored May 20, 2025
  
- fix: Incrementally decode token to reduce the overhead from Processor (#1129) · b3da9427
  Tanmay Verma authored May 20, 2025
  
- feat: SLA Profiling and Recommending Parallelization Mapping (#1114) · 93702e44
  Hongkuan Zhou authored May 20, 2025
  
- chore: Remove unused RouterType and ModelMetaData (#1138) · eb821bee
  Faradawn Yang authored May 20, 2025
```
Remove RouterType and ModelMetaData from `lib/runtime/src/protocols.rs`; they are unused (no outside references). Routing has moved to its own module, `pipeline/network/egress/push_router.rs`, so the legacy RouterType definition in `protocols.rs` is no longer needed.
```
- feat: adding outer dimension to isolate k/v blocks (#1126) · 80256acf
  Ryan Olson authored May 20, 2025
  
19 May, 2025 9 commits

fix: Disable block manager by default in Python bindings (#1128) · 7e452a2e
Jacky authored May 19, 2025


feat: Support multiple models on single ingress node (#1127) · aeb79e62

Graham King authored May 19, 2025

We can now do this:

- Node 1:

```
dynamo-run in=http out=dyn
```

- Nodes 2 and 3, two instances of component 'backend' in the nemotron_ultra pipeline:

```
dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
```

- Nodes 4 and 5, two instances of the 'backend' component in the nemotron_super pipeline:

```
dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
```

The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now.

As part of this, auto-discovery is now always `out=dyn`, with no extra URL parts. Previously the ingress could only route to a single pipeline.

Also:
- Refactor endpoint / instance naming now that I understand them
- Fix removing models when their instance stops.
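Serving several models from one ingress can be pictured as a table from model name to the instances discovery has found. A minimal sketch, assuming each request names its model and that instances of a model are chosen round-robin; `Ingress`, `Pool`, and the instance labels are hypothetical, and the commit does not describe the real router's selection policy.

```rust
use std::collections::HashMap;

/// Instances discovered for one model, plus a round-robin cursor.
#[derive(Default)]
struct Pool {
    instances: Vec<String>,
    next: usize,
}

/// Hypothetical ingress routing table keyed by model name.
#[derive(Default)]
pub struct Ingress {
    pools: HashMap<String, Pool>,
}

impl Ingress {
    /// Called when discovery reports a new instance serving `model`.
    pub fn discover(&mut self, model: &str, instance: &str) {
        self.pools
            .entry(model.to_string())
            .or_default()
            .instances
            .push(instance.to_string());
    }

    /// Choose an instance for a request that names `model`, cycling
    /// through that model's instances. Returns `None` for unknown models.
    pub fn route(&mut self, model: &str) -> Option<&str> {
        let pool = self.pools.get_mut(model)?;
        if pool.instances.is_empty() {
            return None;
        }
        let i = pool.next % pool.instances.len();
        pool.next = (i + 1) % pool.instances.len();
        Some(pool.instances[i].as_str())
    }
}
```

With the four workers from the example, requests for the NemotronUltra model alternate between its two instances and never reach the NemotronSuper ones. Removing instances when they stop is omitted here.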


feat: Add support for SSD offloading in block manager (#1115) · 74221fd7
jthomson04 authored May 19, 2025


feat: Add LWS to Dynamo Operator (#998) · 024422b9

Rohan Varma authored May 19, 2025

Co-authored-by: Rohan Varma <rohanv@rohanv-mlt.client.nvidia.com>
Co-authored-by: Julien Mancuso <jmancuso@nvidia.com>
Co-authored-by: julienmancuso <161955438+julienmancuso@users.noreply.github.com>


fix(sglang): allow for `disaggregation_bootstrap_port` for multinode deployment (#1119) · eb133e3f
ishandhanani authored May 19, 2025

feat: KV Block Manager Python bindings (#1022) · 437cae0a
Jacky authored May 19, 2025

feat: add update deployment to dynamo deploy API and CLI (#1048) · a6899da9
hhzhang16 authored May 19, 2025


feat: Add OpenAI Embeddings interface in rust lib (#1110) · 73fdfb8a

Tom O'Brien authored May 19, 2025

Implements OpenAI embeddings (interface only).

- Adds ModelType::Embedding
- Adds OpenAI embedding request/response structs
- Adds support for embedding model discovery


fix: remove lib.real from LD (#1117) · ac82bcf3
Alec authored May 19, 2025


17 May, 2025 1 commit
- fix: add planner path in devcontainer (#1113) · c22315fc
  Biswa Panda authored May 16, 2025
  
16 May, 2025 1 commit
- test: Add doc tests to Rust CI (#1102) · 34f3fc6d
  Ryan McCormick authored May 16, 2025
  