Commits · 439e977d9c751ef80f1ed72f03078dc408137a74 · OpenDAS / dynamo

07 Jul, 2025 1 commit

feat: vllm speculative decoding metrics (#1549) · 439e977d

jain-ria authored Jul 07, 2025


Signed-off-by: jain-ria <riajain@NVIDIA.com>
Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com>

439e977d

02 Jul, 2025 1 commit
- chore: fix typo for dynamo-run docs (#1720) · 7fd379a7
  Zhongdongming Dai authored Jul 02, 2025
  
  7fd379a7
30 Jun, 2025 1 commit
- docs: Update dynamo_run.md with the information how to resolve ModuleNotFou… (#1691) · 8f485b18
  tzulingk authored Jun 30, 2025
  
  8f485b18
27 Jun, 2025 1 commit
- fix: add steps to install using published helm charts (#1623) · 8b1f2ded
  julienmancuso authored Jun 26, 2025
  
  8b1f2ded
18 Jun, 2025 1 commit
- docs: Fix missing logging import in basic worker example (#1580) · 316dffc0
  Shriyash.Patil authored Jun 18, 2025
```
Signed-off-by: Shriyash.Patil <shriyash81@gmail.com>
```
  316dffc0
13 Jun, 2025 3 commits
- fix: enable GCP deployments (#1474) · 648740e8
  julienmancuso authored Jun 13, 2025
  
  648740e8
- docs: Cleanup & Standardize Guides (#1357) · 6f8c68c1
  J Wyman authored Jun 13, 2025
  
  6f8c68c1
- merging docs: fix DIS-133 and NvB 5322259 (#1518) to main · 1da05309
  Kristen Kelleher authored Jun 13, 2025
  
  1da05309
12 Jun, 2025 1 commit

docs: DIS-133 and DIS-134 plus copyediting (#1439) · 0e7d4d82

Kristen Kelleher authored Jun 12, 2025


Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

0e7d4d82

10 Jun, 2025 1 commit
- fix: remove unused bentoml references (#1412) · 75d7c3b9
  Biswa Panda authored Jun 09, 2025
  
  75d7c3b9
05 Jun, 2025 1 commit
- feat: data synthesizer based on prefix statistics (#1087) · 9cdba76d
  Yan Ru Pei authored Jun 04, 2025
```
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
```
  9cdba76d
04 Jun, 2025 3 commits
- feat: add result of fluid experiment (#1379) · c6d66bc3
  julienmancuso authored Jun 04, 2025
  
  c6d66bc3
- fix: prefillqueue stream name in load-planner (#1377) · c675fd1b
  Hongkuan Zhou authored Jun 04, 2025
  
  c675fd1b
- docs: fix sphinx errors admonitions adobe config (#1179) · 5e9370d3
  Kristen Kelleher authored Jun 04, 2025
```
Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>
- Content, format, and structural changes to the Dynamo docs for 0.3.0. 
- Includes copyediting and the first batch of changes from the DMO review.
```
  5e9370d3
03 Jun, 2025 1 commit
- docs: Add documentation for verbosity flag in `dynamo-run` (#1353) · 9bf79b67
  Paul Hendricks authored Jun 03, 2025
  
  9bf79b67
02 Jun, 2025 3 commits
- feat: set env variables in Dynamo deployments from secrets (#1325) · ba16ed52
  hhzhang16 authored Jun 02, 2025
```
Signed-off-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
```
  ba16ed52
- feat: Make llama.cpp Gnu OpenMP dependency optional (#1331) · d3ca7661
  Graham King authored Jun 02, 2025
```
Do not include by default as it needs libgomp1 at runtime. Add a feature to enable it at build time.
```
  d3ca7661
- feat: expose router configurations to dynamo-run (#1259) · d849f7ec
  Hongkuan Zhou authored Jun 02, 2025
  
  d849f7ec
30 May, 2025 3 commits
- chore: Fix typos in docs/guides (#1270) · 8df6e882
  Ryan McCormick authored May 31, 2025
  
  8df6e882
- refactor: rename KvMetricsPublisher to WorkerMetricsPublisher (#1284) · 2f8da9ad
  Alec authored May 30, 2025
  
  2f8da9ad
- feat: flatten out dynamo cloud helm chart (#1258) · 39dcdf1f
  julienmancuso authored May 30, 2025
  
  39dcdf1f
29 May, 2025 1 commit
- chore: Make llama.cpp a default engine (#1177) · b889948c
  Graham King authored May 29, 2025
  
  b889948c
28 May, 2025 4 commits
- feat: Enable dynamo-run out=trtllm (#1223) · 1b1e089a
  Tanmay Verma authored May 28, 2025
  
  1b1e089a
- fix: update kv-router usage (#1238) · 761f67e0
  Hongkuan Zhou authored May 28, 2025
  
  761f67e0
- feat: fluxcd guide to managing custom resources (#1220) · c12f61a6
  mohammedabdulwahhab authored May 27, 2025
```
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
```
  c12f61a6
- feat: document model caching using Fluid (#1218) · 0594235b
  julienmancuso authored May 27, 2025
```
Signed-off-by: julienmancuso <161955438+julienmancuso@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```
  0594235b
23 May, 2025 1 commit
- feat: add dynamo operator overview doc (#688) · 4eae238f
  julienmancuso authored May 23, 2025
  
  4eae238f
22 May, 2025 4 commits

feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card.

Previously hard-coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
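
For intuition about what the block size changes, here is a minimal sketch, not Dynamo code. It assumes the block size counts tokens per KV cache block, so a sequence occupies its token count divided by the block size, rounded up. `kv_blocks_needed` is a hypothetical helper introduced only for illustration.

```python
def kv_blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Return how many fixed-size KV cache blocks hold num_tokens tokens.

    Assumption: block_size is the number of tokens per block, and a
    partially filled last block still occupies a whole block.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (num_tokens + block_size - 1) // block_size


if __name__ == "__main__":
    print(kv_blocks_needed(1000))      # default block size of 16
    print(kv_blocks_needed(1000, 64))  # as with --kv-cache-block-size 64
```

Under this assumption a larger block size means fewer, coarser blocks for the same sequence.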

183f2b32

feat: Add TTFT and ITL Interpolation to Profiling Script (#1159) · 7860861f
Hongkuan Zhou authored May 22, 2025
```
Co-authored-by: root <root@kkranen-dt.nvidia.com>
```
7860861f
fix: typo in planner doc and log (#1165) · 3d697d4d
Hongkuan Zhou authored May 22, 2025

3d697d4d

feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821

Graham King authored May 22, 2025

Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.

Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.

Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
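
The first todo item could look roughly like the sketch below. It is an illustration, not the pre-processor's code: `clamp_max_tokens` is a hypothetical helper, and subtracting the prompt length from the context length is an assumption about how the limit would be applied.

```python
from typing import Optional


def clamp_max_tokens(
    prompt_tokens: int,
    requested_max_tokens: Optional[int],
    context_length: int,
) -> int:
    """Limit a request's max_tokens so prompt plus output fits the context.

    Assumption: the context length bounds prompt tokens and generated
    tokens together, so the output budget is whatever the prompt leaves.
    """
    remaining = context_length - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit within the context length")
    if requested_max_tokens is None:
        # No explicit limit requested: allow everything that still fits.
        return remaining
    return min(requested_max_tokens, remaining)


if __name__ == "__main__":
    # A request asking for more than fits is cut down to the remaining room.
    print(clamp_max_tokens(100, 5000, context_length=4096))
```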

6d5da821

21 May, 2025 2 commits

fix(llmctl): Use ModelWatcher instead of direct etcd operations (#1150) · 3e8e38a9
Graham King authored May 21, 2025

3e8e38a9

docs: Add sphinx-theme based userguides (#528) · 8d636ebd

Suman Tatiraju authored May 21, 2025


Signed-off-by: Suman Tatiraju <167138127+statiraju@users.noreply.github.com>
Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>

8d636ebd

19 May, 2025 2 commits

feat: Support multiple models on single ingress node (#1127) · aeb79e62

Graham King authored May 19, 2025

We can now do this:

- Node 1:

```
dynamo-run in=http out=dyn
```

- Node 2 and 3, two instances of component 'backend' in the nemotron_ultra pipeline:

```
dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
```

- Node 4 and 5, two instances of the 'backend' component in nemotron_super pipeline:

```
dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
```

The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now.

As part of this, auto-discovery is now always `out=dyn`, with no extra URL parts. Previously it could only route to a single pipeline.

Also:
- Refactor endpoint / instance naming now that I understand them
- Fix removing models when their instance stops.
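
To illustrate the addressing used in the commands above, here is a small sketch that splits a `dyn://` URL into its three dot-separated parts and groups worker nodes by pipeline. It is not Dynamo's discovery code: the field names follow the wording of this message (pipeline, component, endpoint), and `parse_dyn_url`, `group_by_pipeline`, and the node names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class DynEndpoint(NamedTuple):
    pipeline: str
    component: str
    endpoint: str


def parse_dyn_url(url: str) -> DynEndpoint:
    """Split 'dyn://<pipeline>.<component>.<endpoint>' into its parts."""
    prefix = "dyn://"
    if not url.startswith(prefix):
        raise ValueError(f"not a dyn:// URL: {url!r}")
    parts = url[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected three dot-separated parts: {url!r}")
    return DynEndpoint(*parts)


def group_by_pipeline(instances: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each pipeline name to the nodes that serve it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node, url in instances.items():
        groups[parse_dyn_url(url).pipeline].append(node)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical node names mirroring the four workers described above.
    workers = {
        "node2": "dyn://nemotron_ultra.backend.generate",
        "node3": "dyn://nemotron_ultra.backend.generate",
        "node4": "dyn://nemotron_super.backend.generate",
        "node5": "dyn://nemotron_super.backend.generate",
    }
    print(group_by_pipeline(workers))
```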

aeb79e62

feat: add update deployment to dynamo deploy API and CLI (#1048) · a6899da9
hhzhang16 authored May 19, 2025

a6899da9

15 May, 2025 2 commits
- chore: Update default router mode from random to round-robin (#1097) · 770c230c
  Ryan McCormick authored May 15, 2025
  
  770c230c
- fix: planner fixes (#1055) · 1a163f6d
  mohammedabdulwahhab authored May 15, 2025
  
  1a163f6d
14 May, 2025 2 commits

feat(dynamo-run): KV-aware routing (#1064) · 29813508

Graham King authored May 14, 2025

Router:
```
dynamo-run in=http out=dyn://dynamo.endpoint.generate --router-mode kv
```

Worker (* N):
```
dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B
```

You need patched vllm and the C bindings `.so`. Full docs in the updated guide: `docs/guides/dynamo_run.md`.

This gives us a pure-Rust ingress node: OpenAI compliant HTTP server + Pre-processor + KV-aware router.
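
As a rough illustration of the idea behind KV-aware routing, the sketch below sends a request to the worker that already holds the longest matching prefix of token blocks. This is a simplified stand-in written for this note, not the router's actual algorithm, which may weigh other signals such as worker load. All names here are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Block = Tuple[int, ...]


def split_blocks(tokens: Sequence[int], block_size: int) -> List[Block]:
    """Split tokens into full blocks; a trailing partial block is dropped."""
    full = len(tokens) - len(tokens) % block_size
    return [tuple(tokens[i:i + block_size]) for i in range(0, full, block_size)]


def cached_prefix_blocks(request: List[Block], cached: List[Block]) -> int:
    """Count leading blocks of the request that match the worker's cache."""
    count = 0
    for want, have in zip(request, cached):
        if want != have:
            break
        count += 1
    return count


def pick_worker(
    tokens: Sequence[int],
    workers: Dict[str, List[Block]],
    block_size: int = 16,
) -> str:
    """Return the worker with the longest cached prefix (first wins on ties)."""
    if not workers:
        raise ValueError("no workers available")
    request = split_blocks(tokens, block_size)
    return max(workers, key=lambda name: cached_prefix_blocks(request, workers[name]))


if __name__ == "__main__":
    workers = {
        "w0": split_blocks(list(range(32)), 16),
        "w1": split_blocks(list(range(16)) + [99] * 16, 16),
    }
    print(pick_worker(list(range(40)), workers))
```

Routing on prefix overlap lets a worker reuse KV cache it has already computed instead of recomputing the shared prefix.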

29813508

docs: kv routing perf docs (#1078) · 20c470be
Yan Ru Pei authored May 14, 2025

20c470be

09 May, 2025 1 commit

docs: Example Chat sglang engine (#1015) · 24e2cbf5

Graham King authored May 09, 2025

Example of how to connect a Python sglang engine to the message bus (NATS/etc).

In this example sglang does the pre/post processing. There is already an example where Dynamo does it.

The examples teach this:

- Be a chat completions engine, do your own pre-processing:

```
await register_llm(ModelType.Chat, endpoint, config.model)
```

- Have Dynamo do pre-processing. It will register us under both Chat and Completions endpoints, because that's handled before a Backend engine gets the request:

```
await register_llm(ModelType.Backend, endpoint, config.model)
```

24e2cbf5