Commits · c61e0dd3ccafefcfad7d1153c53df8386779fc53 · OpenDAS / dynamo


21 Nov, 2025 1 commit
- chore: merge KvIndexer and ApproxKvIndexer (#4500) · c61e0dd3
  Yan Ru Pei authored Nov 21, 2025
```
Signed-off-by: PeaBrane <yanrpei@gmail.com>
```
19 Nov, 2025 1 commit
- feat: Only monitor NATS metrics if using NATS request plane (#4442) · 69797b5a
  Graham King authored Nov 19, 2025
```
Signed-off-by: Graham King <grahamk@nvidia.com>
```
11 Nov, 2025 1 commit
- chore: Remove static mode (#4235) · e1af3af6
  Graham King authored Nov 11, 2025
```
Signed-off-by: Graham King <grahamk@nvidia.com>
```
31 Oct, 2025 1 commit
- refactor: move backend deploy, launch and slurm files from components to examples (#3849) · 8bd37c96
  Anant Sharma authored Oct 31, 2025
```
Signed-off-by: Anant Sharma <anants@nvidia.com>
```
22 Oct, 2025 1 commit

docs: address Harry/VDR feedback + fixing broken links across repository (#3802) · c6b59045

Anish authored Oct 22, 2025


Signed-off-by: Harry Kim <harry_kim@live.com>
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Co-authored-by: Harry Kim <harry_kim@live.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Co-authored-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

16 Oct, 2025 1 commit
- docs: reorganizing documentation to make things clearer (#3658) · 598cbbb7
  Anish authored Oct 16, 2025
```
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Co-authored-by: Claude <noreply@anthropic.com>
```
08 Oct, 2025 2 commits
- chore: Remove llama.cpp engine (#3499) · 0aa0768f
  Graham King authored Oct 08, 2025
```
Signed-off-by: Graham King <grahamk@nvidia.com>
```
- chore: Remove GGUF support (#3488) · 1b1265e6
  Graham King authored Oct 08, 2025
```
Signed-off-by: Graham King <grahamk@nvidia.com>
```
16 Sep, 2025 1 commit
- fix: Interactive inputs actually stops, does not ignore stop token (#3057) · 87e6e052
  Graham King authored Sep 16, 2025
```
Signed-off-by: Graham King <grahamk@nvidia.com>
```
03 Sep, 2025 1 commit

refactor: Split ModelType to ModelInput for request and response type; ModelType for the supported workloads (#2714) · 27fad26f

Olga Andreeva authored Sep 03, 2025

Signed-off-by: Guan Luo <gluo@nvidia.com>
Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: Guan Luo <gluo@nvidia.com>
Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>

02 Sep, 2025 1 commit
- feat: FT Request Cancellation feature and test for 0.5.0 (#2500) · 6c539fbd
  Jacky authored Sep 02, 2025
  
06 Aug, 2025 2 commits
- docs(dynamo-run): Remove vllm/sglang/trtllm engines from dynamo-run docs (#2332) · 6be5c196
  Graham King authored Aug 06, 2025
  
- feat: Support static workers, run without etcd. (#2281) · 6a1a801c
  Graham King authored Aug 06, 2025
  
05 Aug, 2025 1 commit
- feat: Pass user_data to register_llm for LoRA support (#2286) · 433f6012
  Chi authored Aug 05, 2025
  
01 Aug, 2025 1 commit

test: Request Migration Docs and E2E vLLM Tests (#2177) · ae51b3f4

Jacky authored Aug 01, 2025


Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>

28 Jul, 2025 1 commit
- chore: Add Request Migration docs and minor enhancements (#2038) · fdcf611f
  Jacky authored Jul 28, 2025
  
22 Jul, 2025 1 commit
- docs: Cleanup index.rst (#2007) · c49a13eb
  atchernych authored Jul 22, 2025
  
18 Jul, 2025 1 commit

feat: enable / disable chunked prefill for mockers (#2015) · e330d969

Yan Ru Pei authored Jul 18, 2025


Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

17 Jul, 2025 1 commit
- feat(runtime): Support tokio-console (#1986) · 1eadc013
  Graham King authored Jul 17, 2025
  
16 Jul, 2025 1 commit
- feat: integrate mocker with dynamo-run and python cli (#1927) · f31732a2
  Yan Ru Pei authored Jul 16, 2025
  
14 Jul, 2025 1 commit
- feat: prefill aware routing (#1895) · df91fce2
  Yan Ru Pei authored Jul 14, 2025
  
10 Jul, 2025 1 commit
- feat: allow using ApproxKvIndexer for routing via use_kv_events flag (#1869) · 13640e15
  Yan Ru Pei authored Jul 10, 2025
```
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>
```
08 Jul, 2025 1 commit

feat: predictive active blocks for routing without load metrics (#1731) · 84e71e27

Yan Ru Pei authored Jul 08, 2025


Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com>

02 Jul, 2025 1 commit
- chore: fix typo for dynamo-run docs (#1720) · 7fd379a7
  Zhongdongming Dai authored Jul 02, 2025
  
30 Jun, 2025 1 commit
- docs: Update dynamo_run.md with the information how to resolve ModuleNotFou… (#1691) · 8f485b18
  tzulingk authored Jun 30, 2025
  
12 Jun, 2025 1 commit

docs: DIS-133 and DIS-134 plus copyediting (#1439) · 0e7d4d82

Kristen Kelleher authored Jun 12, 2025


Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

04 Jun, 2025 1 commit

docs: fix sphinx errors admonitions adobe config (#1179) · 5e9370d3

Kristen Kelleher authored Jun 04, 2025


Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>
- Content, format, and structural changes to the Dynamo docs for 0.3.0. 
- Includes copyediting and the first batch of changes from the DMO review.

03 Jun, 2025 1 commit
- docs: Add documentation for verbosity flag in `dynamo-run` (#1353) · 9bf79b67
  Paul Hendricks authored Jun 03, 2025
  
02 Jun, 2025 2 commits
- feat: Make llama.cpp Gnu OpenMP dependency optional (#1331) · d3ca7661
  Graham King authored Jun 02, 2025
```
Do not include by default as it needs libgomp1 at runtime. Add a feature to enable it at build time.
```
- feat: expose router configurations to dynamo-run (#1259) · d849f7ec
  Hongkuan Zhou authored Jun 02, 2025
  
29 May, 2025 1 commit
- chore: Make llama.cpp a default engine (#1177) · b889948c
  Graham King authored May 29, 2025
  
28 May, 2025 1 commit
- feat: Enable dynamo-run out=trtllm (#1223) · 1b1e089a
  Tanmay Verma authored May 28, 2025
  
22 May, 2025 2 commits

feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system, this flag is set on the worker node and propagated to the ingress via the Model Deployment Card.

Previously this was hard-coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
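For illustration, here is a minimal Python sketch of what the block size controls, assuming a paged KV cache in which every sequence occupies whole fixed-size blocks. `blocks_needed` is a hypothetical helper, not part of `dynamo-run`.

```python
import math


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Number of fixed-size KV cache blocks needed to hold num_tokens tokens.

    Hypothetical helper for illustration; 16 mirrors the old hard-coded default.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # A partially filled last block still occupies a whole block.
    return math.ceil(num_tokens / block_size)


# A 1000-token sequence with the default block size vs. --kv-cache-block-size 64
print(blocks_needed(1000))      # 63
print(blocks_needed(1000, 64))  # 16
```

With a block size of 64, the same sequence occupies far fewer, larger blocks.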

feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821

Graham King authored May 22, 2025

Llama 4 has a very large context length (also known as `n_ctx`, `model_max_length`, or `max_model_len`), and vllm won't start unless it can allocate enough KV cache for the entire context.

Allow passing `--context-length <N>` to `dynamo-run` to limit the context length so that long-context models fit.

Future TODOs:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting `stop_conditions.max_tokens`. The mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently use a hard-coded maximum context length if one is not provided on the command line. Change them to use the model's built-in maximum, read from the GGUF file or `tokenizer_config.json`.
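A minimal Python sketch of the first TODO, assuming "below the context length" means that prompt tokens plus generated tokens must fit in the context window. `StopConditions` and `clamp_max_tokens` are hypothetical stand-ins, not the pre-processor's actual types.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class StopConditions:
    # Hypothetical stand-in for the pre-processor's stop conditions.
    max_tokens: Optional[int] = None


def clamp_max_tokens(stop: StopConditions, prompt_len: int, context_length: int) -> StopConditions:
    """Limit generation so that prompt + output never exceeds the context length."""
    remaining = max(context_length - prompt_len, 0)
    if stop.max_tokens is None or stop.max_tokens > remaining:
        # Unset or too large: cap at whatever room the prompt leaves.
        return replace(stop, max_tokens=remaining)
    return stop


print(clamp_max_tokens(StopConditions(), prompt_len=1000, context_length=8192).max_tokens)  # 7192
print(clamp_max_tokens(StopConditions(max_tokens=256), 1000, 8192).max_tokens)            # 256
```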

21 May, 2025 2 commits

fix(llmctl): Use ModelWatcher instead of direct etcd operations (#1150) · 3e8e38a9
Graham King authored May 21, 2025

docs: Add sphinx-theme based userguides (#528) · 8d636ebd

Suman Tatiraju authored May 21, 2025


Signed-off-by: Suman Tatiraju <167138127+statiraju@users.noreply.github.com>
Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>

19 May, 2025 1 commit

feat: Support multiple models on single ingress node (#1127) · aeb79e62

Graham King authored May 19, 2025

We can now do this:

- Node 1:

```
dynamo-run in=http out=dyn
```

- Nodes 2 and 3: two instances of the 'backend' component in the nemotron_ultra pipeline:

```
dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
```

- Nodes 4 and 5: two instances of the 'backend' component in the nemotron_super pipeline:

```
dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
```

The ingress node discovers all four instances and routes each request to the instances serving the requested model. We have been planning for this for a long time now.

As part of this change, auto-discovery is now always `out=dyn`, with no extra URL parts. Previously, the ingress could only route to a single pipeline.
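A toy Python model of the ingress-side bookkeeping described above: discovered instances are grouped by model, each request goes only to an instance of the requested model, and a model disappears once its last instance stops. `ModelRegistry`, the round-robin choice, and the model/node names are assumptions for illustration, not Dynamo's implementation.

```python
class ModelRegistry:
    """Toy ingress-side registry: model name -> discovered worker instances (hypothetical)."""

    def __init__(self) -> None:
        self._instances: dict[str, list[str]] = {}
        self._next: dict[str, int] = {}

    def add(self, model: str, instance: str) -> None:
        # Called when discovery sees a new instance serving `model`.
        self._instances.setdefault(model, []).append(instance)

    def remove(self, model: str, instance: str) -> None:
        # Called when an instance stops; forget the model once nothing serves it.
        instances = self._instances.get(model)
        if instances is None or instance not in instances:
            return
        instances.remove(instance)
        if not instances:
            del self._instances[model]
            self._next.pop(model, None)

    def models(self) -> list[str]:
        return sorted(self._instances)

    def route(self, model: str) -> str:
        instances = self._instances.get(model)
        if not instances:
            raise KeyError(f"no instances serving model {model!r}")
        # Round-robin, but only among instances of the requested model.
        idx = self._next.get(model, 0) % len(instances)
        self._next[model] = idx + 1
        return instances[idx]


# Hypothetical model names and node IDs mirroring the example above.
reg = ModelRegistry()
for node in ("node2", "node3"):
    reg.add("nemotron_ultra", node)
for node in ("node4", "node5"):
    reg.add("nemotron_super", node)
print(reg.models())                                     # ['nemotron_super', 'nemotron_ultra']
print([reg.route("nemotron_ultra") for _ in range(3)])  # ['node2', 'node3', 'node2']
```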

Also:
- Refactor endpoint / instance naming now that I understand them
- Fix removing models when their instance stops.

15 May, 2025 1 commit
- chore: Update default router mode from random to round-robin (#1097) · 770c230c
  Ryan McCormick authored May 15, 2025
  
  770c230c
14 May, 2025 1 commit

feat(dynamo-run): KV-aware routing (#1064) · 29813508

Graham King authored May 14, 2025

Router:
```
dynamo-run in=http out=dyn://dynamo.endpoint.generate --router-mode kv
```

Worker (* N):
```
dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B
```

This requires a patched vllm and the C bindings shared library (`.so`). Full docs are in the updated guide: `docs/guides/dynamo_run.md`.

This gives us a pure-Rust ingress node: an OpenAI-compliant HTTP server, pre-processor, and KV-aware router.
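A toy Python model of the idea behind KV-aware routing, assuming each worker exposes the hashes of the KV blocks it has cached and that block hashes are chained so each one identifies a whole prefix. The router sends the request to the worker that can reuse the longest cached prefix. All names and the block size are hypothetical, and a real router would also need to account for worker load, which this sketch ignores.

```python
from typing import Sequence

BLOCK_SIZE = 16  # assumed block size in tokens


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> list[int]:
    """Chained hashes of full blocks, so each hash identifies the whole prefix up to that block."""
    hashes: list[int] = []
    parent = 0
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        parent = hash((parent, tuple(tokens[start:start + block_size])))
        hashes.append(parent)
    return hashes


def prefix_overlap(request: list[int], cached: set[int]) -> int:
    """Number of leading request blocks already cached on a worker."""
    n = 0
    for h in request:
        if h not in cached:
            break
        n += 1
    return n


def pick_worker(tokens: Sequence[int], caches: dict[str, set[int]]) -> str:
    req = block_hashes(tokens)
    # Longest reusable prefix wins; max() keeps the first worker on ties.
    return max(caches, key=lambda w: prefix_overlap(req, caches[w]))


prompt = list(range(64))  # 4 blocks of 16 tokens
caches = {
    "worker-a": set(block_hashes(list(range(16)))),  # shares 1 block with the prompt
    "worker-b": set(block_hashes(list(range(48)))),  # shares 3 blocks with the prompt
}
print(pick_worker(prompt, caches))  # worker-b
```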

09 May, 2025 1 commit

docs: Example Chat sglang engine (#1015) · 24e2cbf5

Graham King authored May 09, 2025

An example of how to connect a Python sglang engine to the message bus (NATS, etc.).

In this example, sglang does the pre- and post-processing. There is already an example where Dynamo does it.

The examples teach this:

- Be a chat completions engine and do your own pre-processing:

```
await register_llm(ModelType.Chat, endpoint, config.model)
```

- Have Dynamo do the pre-processing. It will register the model under both the Chat and Completions endpoints, because pre-processing happens before a Backend engine gets the request:

```
await register_llm(ModelType.Backend, endpoint, config.model)
```
