Commits · c920cbd9dca095001ff83bcebc9f7b4d1f12f34f · OpenDAS / dynamo

03 Sep, 2025 2 commits
- feat: Add --custom-jinja-template argument to pass a custom chat template for vLLM (#2829) · c920cbd9
  KrishnanPrash authored Sep 03, 2025
```
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
```
  c920cbd9
- feat: dynamo namespace isolation (#2394) · c6becbc8
  Biswa Panda authored Sep 03, 2025
```
Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
```
  c6becbc8
29 Aug, 2025 2 commits
- chore: added include_stop_str_in_output (#2782) · 1f6b83be
  Ayush Agarwal authored Aug 29, 2025
  
  1f6b83be
- chore: deprecate nvext.top_k and nvext.repetition_penalty and make available top level (#2767) · 63f5bbc0
  ryan-lempka authored Aug 28, 2025
```
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
```
  63f5bbc0
28 Aug, 2025 2 commits
- feat: Prevent double-tokenization when EPP picks worker (#2559) · 7d13b6e3
  atchernych authored Aug 28, 2025
  
  7d13b6e3
- refactor: centralize Prometheus metrics naming and sanitization DIS-554 (#2733) · 84c9890b
  Keiven C authored Aug 28, 2025
```
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
```
  84c9890b
27 Aug, 2025 1 commit
- feat: KServe gRPC support (#2638) · 91a459c0
  GuanLuo authored Aug 26, 2025
  
  91a459c0
26 Aug, 2025 1 commit
- feat: align OpenAI response IDs with distributed trace IDs (#2496) · a485ab78
  Chi McIsaac authored Aug 26, 2025
  
  a485ab78
25 Aug, 2025 2 commits
- feat: enable --dyn-reasoning-parser flag to set reasoning parser for vllm deployments (#2700) · f5a41004
  nachiketb-nvidia authored Aug 25, 2025
  
  f5a41004
- feat: add gpt oss reasoning parser through harmony (#2656) · 3036e60b
  nachiketb-nvidia authored Aug 25, 2025
```
- couple of refactors
- added a new dependency, openai-harmony
- implemented the gpt oss parser
```
  3036e60b
22 Aug, 2025 2 commits
- chore: Rust to 1.89 and edition 2024 (#2659) · bce74588
  Graham King authored Aug 22, 2025
  
  bce74588
- feat: [vLLM] implement cli args for tool and reasoning parsers (#2619) · cbe854fc
  Ayush Agarwal authored Aug 22, 2025
  
  cbe854fc
21 Aug, 2025 1 commit
- feat: enable basic reasoning parsing of <think> </think> tokens (#2555) · 8e8152a1
  nachiketb-nvidia authored Aug 21, 2025
  
  8e8152a1
20 Aug, 2025 1 commit

chore: remove flatten for chat response types, add reasoning_content (#2543) · c12fe501

nachiketb-nvidia authored Aug 19, 2025

Changing the chat completions response objects from structs to types of dynamo_async_openai

Implement aggregator traits for the chat completion structs

add reasoning_content under message and delta message in lib/async-openai

c12fe501

19 Aug, 2025 2 commits

chore: Bring async-openai into repo as request starter (#2520) · 199b9a30
nachiketb-nvidia authored Aug 19, 2025
```
Co-authored-by: Graham King <grahamk@nvidia.com>
```
199b9a30

feat: kvbm + connector (#2258) · 07cfc3a1

Ryan Olson authored Aug 19, 2025


Signed-off-by: Ryan Olson <rolson@nvidia.com>
Co-authored-by: Olga Andreeva <oandreeva@nvidia.com>
Co-authored-by: Ziqi Fan <ziqif@nvidia.com>
Co-authored-by: John Thompson <jothomson@nvidia.com>
Co-authored-by: Richard Huo <rihuo@nvidia.com>
Co-authored-by: Zicheng Ma <zichengm@nvidia.com>

07cfc3a1

18 Aug, 2025 1 commit
- fix: use random port assignment for http tests (#2472) · e63d728f
  ryan-lempka authored Aug 18, 2025
  
  e63d728f
15 Aug, 2025 1 commit
- feat: Dynamic Endpoint Exposure Based on Model Type (#1447) · 537759f1
  Abrar Shivani authored Aug 15, 2025
  
  537759f1
13 Aug, 2025 2 commits
- feat: enable custom metrics prefix (#2432) · 3411bda8
  ryan-lempka authored Aug 13, 2025
  
  3411bda8
- fix: Add detokenize stream (#2413) · 8c40bbb0
  jthomson04 authored Aug 13, 2025
```
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
```
  8c40bbb0
12 Aug, 2025 1 commit

feat: Add frontend support for `min_tokens` and `ignore_eos` (outside of... · 18bb779e

KrishnanPrash authored Aug 12, 2025

feat: Add frontend support for `min_tokens` and `ignore_eos` (outside of `nvext`) and Structured Output / Guided Decoding (#2380)
Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Ayush Agarwal <ayushag@nvidia.com>

18bb779e

07 Aug, 2025 2 commits
- chore: Remove service_name from ModelDeploymentCard (#2349) · 1954fcfa
  Graham King authored Aug 07, 2025
  
  1954fcfa
- fix: improve HF token handling in preprocessor tests (#2321) · ccc8815b
  Keiven C authored Aug 06, 2025
```
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
```
  ccc8815b
01 Aug, 2025 1 commit
- fix: frontend metrics to be renamed from nv_llm_http_service_* => dynamo_frontend_* (#2176) · 8c75ed79
  Keiven C authored Jul 31, 2025
```
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
```
  8c75ed79
18 Jul, 2025 1 commit
- feat: http disconnects (#2014) · 343a4814
  Ryan Olson authored Jul 18, 2025
  
  343a4814
17 Jul, 2025 1 commit
- feat: record + analyze logprobs (#1957) · 49b7a0d9
  Ryan Olson authored Jul 17, 2025
  
  49b7a0d9
15 Jul, 2025 1 commit
- feat: adding http clients and recorded response stream (#1919) · a9e0891c
  Ryan Olson authored Jul 15, 2025
  
  a9e0891c
10 Jul, 2025 1 commit
- perf(tokenizer): Make de-tokenize ~50% faster (#1868) · 61a1f4ff
  Graham King authored Jul 10, 2025
  
  61a1f4ff
01 Jul, 2025 2 commits
- feat: Validation engine for validating OpenAI api request data (#1674) · ee86bad3
  Nathan Barry authored Jul 01, 2025
  
  ee86bad3
- feat: Support for Responses API (#1694) · dfbd741d
  Paul Hendricks authored Jul 01, 2025
  
  dfbd741d
26 Jun, 2025 1 commit
- refactor: refactored using CompletionResponse (#1658) · e3f1bd5d
  Paul Hendricks authored Jun 26, 2025
  
  e3f1bd5d
06 Jun, 2025 1 commit
- feat: KVBM dynamo runtime + event manger (#1195) · 3216003c
  Olga Andreeva authored Jun 06, 2025
  
  3216003c
04 Jun, 2025 2 commits
- refactor: Rename CompletionRequest to NvCreateCompletionRequest (#1383) · c103d56a
  Paul Hendricks authored Jun 04, 2025
  
  c103d56a
- feat: Support larger Gemma 3 models (#1359) · cfd12d7f
  Graham King authored Jun 04, 2025
```
Publish `generation_config.json` from worker to ingress, as part of Model Deployment Card. That allows ingress to read key fields out of it. Gemma 3 4B+ has some important information that's only in there.
```
  cfd12d7f
22 May, 2025 2 commits

feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card.

Previously hard coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
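
As a rough illustration of the propagation described above, the sketch below has a worker serialize a card carrying the block size and ingress read it back. `DeploymentCardSketch`, its field names, `publish`, `load`, and `blocks_needed` are hypothetical names chosen for this example, not Dynamo's actual types; only the default of 16 and the example value of 64 come from the commit message.

```python
import json
import math
from dataclasses import asdict, dataclass

DEFAULT_KV_CACHE_BLOCK_SIZE = 16  # default value named in the commit message


@dataclass
class DeploymentCardSketch:
    """Hypothetical stand-in for the model deployment card (field names assumed)."""
    model: str
    context_length: int
    kv_cache_block_size: int = DEFAULT_KV_CACHE_BLOCK_SIZE


def publish(card: DeploymentCardSketch) -> str:
    # Worker side: serialize the card so it can be shared with ingress.
    return json.dumps(asdict(card))


def load(payload: str) -> DeploymentCardSketch:
    # Ingress side: rebuild the card from the published payload.
    return DeploymentCardSketch(**json.loads(payload))


def blocks_needed(num_tokens: int, block_size: int) -> int:
    # A sequence occupies whole blocks, so round up.
    return math.ceil(num_tokens / block_size)


# Worker started with a block size of 64; ingress sees the same value.
card = load(publish(DeploymentCardSketch("example-model", 4096, kv_cache_block_size=64)))
```

With this sketch, a 100-token sequence needs `blocks_needed(100, card.kv_cache_block_size)` blocks on ingress, which is why both sides have to agree on the block size.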

183f2b32

fix: Fix race condition in kv_router unit test (#1174) · 3bde1e45

Graham King authored May 22, 2025

Removed the hard coded sleeps, explained what we're testing.

Closes https://github.com/ai-dynamo/dynamo/issues/1132

The race condition is that `apply_event` sends a message on a channel, it does not directly apply the event. At some later point the tokio runtime schedules the task running the channel receiver, which applies the event. If that had not happened yet the test would fail.
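
The same pattern can be reproduced in a few lines of Python with asyncio. This is only a sketch of the race, not the Rust test itself: `IndexerSketch` and its methods are hypothetical, and waiting on `queue.join()` is one possible way to synchronize without a sleep, not necessarily what the fix does.

```python
import asyncio


class IndexerSketch:
    """Hypothetical stand-in: apply_event only enqueues; a background task applies."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()
        self.applied: list = []

    def apply_event(self, event: str) -> None:
        # Returns immediately; the event has NOT been applied yet.
        self.queue.put_nowait(event)

    async def run(self) -> None:
        # Receiver task: applies events whenever the runtime schedules it.
        while True:
            event = await self.queue.get()
            self.applied.append(event)
            self.queue.task_done()


async def demo() -> tuple:
    indexer = IndexerSketch()
    receiver = asyncio.create_task(indexer.run())
    indexer.apply_event("stored-block")
    seen_immediately = list(indexer.applied)  # receiver has not run yet
    await indexer.queue.join()  # wait until the event is processed, no sleep
    seen_after_wait = list(indexer.applied)
    receiver.cancel()
    return seen_immediately, seen_after_wait


before, after = asyncio.run(demo())
```

An assertion placed right after `apply_event` would see an empty state, which is the failure mode described above; asserting only after an explicit wait removes the dependence on scheduling order.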

3bde1e45

21 May, 2025 1 commit

chore: Fix model removal on instance stop, refactor discovery (#1142) · b520bf44

Graham King authored May 21, 2025

- Stop advertising a model when its last instance stops. Previously this happened when any instance stopped.
- Faster locks on model manager.
- Move discovery code out of http, as it is used by all inputs.
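
The first point amounts to tracking instances per model and withdrawing the model only when that set becomes empty. A minimal sketch of the idea follows; `ModelRegistrySketch` and its methods are hypothetical names for illustration and do not mirror the actual model manager.

```python
from collections import defaultdict


class ModelRegistrySketch:
    """Hypothetical registry that tracks which instances serve each model."""

    def __init__(self) -> None:
        self._instances = defaultdict(set)

    def add_instance(self, model: str, instance_id: int) -> None:
        self._instances[model].add(instance_id)

    def remove_instance(self, model: str, instance_id: int) -> bool:
        """Return True only when the model should stop being advertised."""
        instances = self._instances.get(model)
        if instances is None:
            return False
        instances.discard(instance_id)
        if instances:
            # Other instances still serve this model; keep advertising it.
            return False
        del self._instances[model]
        return True

    def advertised(self) -> list:
        return sorted(self._instances)


registry = ModelRegistrySketch()
registry.add_instance("example-model", 1)
registry.add_instance("example-model", 2)
first_removal = registry.remove_instance("example-model", 1)
still_advertised = registry.advertised()
second_removal = registry.remove_instance("example-model", 2)
```

Here `first_removal` is `False` and the model stays listed; only the second removal returns `True` and empties the list.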

b520bf44

19 May, 2025 1 commit

feat: Add OpenAI Embeddings interface in rust lib (#1110) · 73fdfb8a

Tom O'Brien authored May 19, 2025

Implements OpenAI embeddings (interface only).

- Adds ModelType::Embedding
- Adds OpenAI embedding request/response structs
- Adds support for embedding model discovery

73fdfb8a

08 May, 2025 1 commit

feat: Qwen3, Gemma3 and Llama4 support (#1002) · ceaeba3e

Graham King authored May 08, 2025

- New mistralrs and llamacpp version
- mistralrs: Handle Gemma 3 and Llama 4 as vision models
- Update the dynamo-run docs to use Qwen 3
- Our pre-processor now supports Llama 4's newer multi-modal `config.json`
- Upgrade minijinja to handle Qwen 3's prompt template

For Llama 4 we'll need to limit the max seq len. vllm says:
> To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed,...

I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.

ceaeba3e

06 May, 2025 1 commit

feat: dynamo-run <-> python interop (#934) · 99cd9d85

Graham King authored May 05, 2025

Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```

Full vllm example, with pre-processing in dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus.

The `register_llm` call does this:

- Download the model from HF if necessary
- Load the model deployment card from the HF folder or extract from GGUF
- Push the tokenizer config etc into NATS object store so ingress can access it from a different machine
- Publish the model deployment card to ETCD

99cd9d85