Commits · 3e3c3b10e6a660c30901fd355da2533dfe8ef02c · OpenDAS / dynamo

29 May, 2025 6 commits
- feat(dynamo-run): Use llama.cpp as the default engine for GGUF (#1276) · 3e3c3b10
  Graham King authored May 29, 2025
```
Previously `mistral.rs` was the default engine for both safetensors and GGUF models. Now it is only the default for safetensors, `llama.cpp` becomes the default for GGUF.

Why?

- Since #1177 `llama.cpp` is built-in by default, so we can switch.
- `llama.cpp` is very very good at running GGUF (but can't run other types of model), so we should switch.

Dynamo's multi-engine support gives us a secret super-power: we can use the best engine for this specific format or model.

We can still run GGUF with mistralrs by doing `out=mistralrs`.
```
  3e3c3b10
- feat: Publish events and metrics when using kv routing (#1262) · f9ba6f5c
  Tanmay Verma authored May 29, 2025
  
  f9ba6f5c
- fix: Renamed event publisher classes and configuration (#1273) · f67dc38b
  Alec authored May 29, 2025
  
  f67dc38b
- chore: Make llama.cpp a default engine (#1177) · b889948c
  Graham King authored May 29, 2025
  
  b889948c
- feat: add KV Event Publishing to vLLM v1 (#1181) · 0df6d462
  Alec authored May 29, 2025
  
  0df6d462
- fix: Import json when using --engine-extra-args (#1261) · 8d324489
  jthomson04 authored May 28, 2025
  
  8d324489
28 May, 2025 3 commits

feat(dynamo-llm): Remove bring-your-own-engine (#1216) · 0a1d1fbe

Graham King authored May 28, 2025

It was removed from the docs in 0.2.1 and replaced with writing a [standalone Python engine](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_run.md#writing-your-own-engine-in-python).

Also remove the associated `dynamo-run` feature `python`.

Releasing this in 0.3.0 will resolve #784 and #1109.

0a1d1fbe

feat: Enable dynamo-run out=trtllm (#1223) · 1b1e089a
Tanmay Verma authored May 28, 2025

1b1e089a
fix: dynamo-run pass proper args using register-llm (#1230) · cc40af70
Alec authored May 28, 2025

cc40af70

27 May, 2025 1 commit
- feat: Add metrics and event publishers (#1192) · 9acaa8d1
  Tanmay Verma authored May 27, 2025
  
  9acaa8d1
22 May, 2025 3 commits

feat: Add standalone script for TRTLLM integration into dynamo-run (#1162) · 3d4fe574
Tanmay Verma authored May 22, 2025

3d4fe574

feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card.

Previously hard coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170

183f2b32

feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821

Graham King authored May 22, 2025

Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.

Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.

Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.

6d5da821

21 May, 2025 4 commits
- fix(dynamo-run): Don't exit interactive chat on error (#1155) · b226b7b0
  Graham King authored May 21, 2025
```
Previously any error would cause us to halt. Most of them are recoverable. So now we print the error and return to the prompt.
```
  b226b7b0
- fix(llmctl): Use ModelWatcher instead of direct etcd operations (#1150) · 3e8e38a9
  Graham King authored May 21, 2025
  
  3e8e38a9
- fix: register model after engine load (#1145) · 08c01d8c
  Neelay Shah authored May 21, 2025
  
  08c01d8c
- chore: Fix model removal on instance stop, refactor discovery (#1142) · b520bf44
  Graham King authored May 21, 2025
```
- Stop advertising a model when it's last instance stops. Previously was when any instance stops.
- Faster locks on model manager.
- Move discovery code out of http, as it is used by all inputs.
```
  b520bf44
19 May, 2025 2 commits

feat: Support multiple models on single ingress node (#1127) · aeb79e62

Graham King authored May 19, 2025

We can now do this:

- Node 1:

```
dynamo-run in=http out=dyn
```

- Node 2 and 3, two instances of component 'backend' in the nemotron_ultra pipeline:

```
dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
```

- Node 4 and 5, two instances of the 'backend' component in nemotron_super pipeline:

```
dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
```

The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now.

As part of this auto-discovery is now always `out=dyn`, with no extra URL parts. Previously it could only route to a single pipeline.

Also:
- Refactor endpoint / instance naming now that I understand them
- Fix removing models when their instance stops.

aeb79e62

feat: Add OpenAI Embeddings interface in rust lib (#1110) · 73fdfb8a

Tom O'Brien authored May 19, 2025

Implements OpenAI embeddings (interface only).

- Adds ModelType::Embedding
- Adds OpenAI embedding request/response structs
- Adds support for embedding model discovery

73fdfb8a

16 May, 2025 1 commit
- test: Add doc tests to Rust CI (#1102) · 34f3fc6d
  Ryan McCormick authored May 16, 2025
  
  34f3fc6d
15 May, 2025 3 commits

chore: Prevent duplicate components with different models. (#1103) · 641234cd

Graham King authored May 15, 2025

Each namespace is for a single pipeline, so a component must be model-unique. The means we can have several components with the same name running the same model (data parallel), their traffic will be routed according to `--router-mode`, but we cannot have several components with the same name running different models.

Add an `ensure_unique` check to prevent that happening.

641234cd

chore: Update default router mode from random to round-robin (#1097) · 770c230c
Ryan McCormick authored May 15, 2025

770c230c

fix: Fix default RouterMode value (#1092) · 889ab67e

Graham King authored May 15, 2025

The Python bindings use the default value for RouterMode. Previously that was Random (good), but now it became None (bad).

Remove the option and clean up the duplicate RouterMode. I was trying to avoid putting the `KV` enum in dynamo-runtime. Turns out adding those two characters gives us a healthy simplification, and restores the old default router value.

Also clean up two noisy log messages when waiting for KV routing metrics to start in worker.

889ab67e

14 May, 2025 2 commits

feat(dynamo-run): KV-aware routing (#1064) · 29813508

Graham King authored May 14, 2025

Router:
```
dynamo-run in=http out=dyn://dynamo.endpoint.generate --router-mode kv
```

Worker (* N):
```
dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B
```

You need patched vllm and the C bindings `.so`. Full docs in the updated guide: `docs/guides/dynamo_run.md`.

This gives us a pure-Rust ingress node: OpenAI compliant HTTP server + Pre-processor + KV-aware router.

29813508

feat(dynamo-run): Print HTTP routes on startup (#1010) · ed290f0a

Graham King authored May 14, 2025

For #1006

Prints this on startup:
```
2025-05-09T13:06:34.529Z DEBUG dynamo_run::input::http: Supported routes: ["GET /metrics", "GET /dynamo/alpha/list-models", "GET /v1/models", "POST /v1/chat/completions", "POST /v1/completions"]
```

ed290f0a

09 May, 2025 2 commits
- fix(bindings): serve_endpoint no longer takes a lease (#1014) · c7bb1e83
  Graham King authored May 09, 2025
  
  c7bb1e83
- feat(sglang): aggregated support (#937) · 5d5235bc
  ishandhanani authored May 08, 2025
```
Co-authored-by: ishandhanani <ishandhananai@gmail.com>
```
  5d5235bc
08 May, 2025 1 commit

feat: Qwen3, Gemma3 and Llama4 support (#1002) · ceaeba3e

Graham King authored May 08, 2025

. New mistralrs and llamacpp version
. mistralrs: Handle Gemma 3 and Llama 4 as vision models
. Update the dynamo-run docs to use Qwen 3
. Our pre-processor now supports Llama 4's newer multi-modal `config.json`
. Upgrade minijinja to handle Qwen 3's prompt template

For Llama 4 we'll need to limit the max seq len. vllm says:
> To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed,...

I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.

ceaeba3e

07 May, 2025 2 commits

fix: Fix vllm/sglang engine model name if using HF repo (#986) · 92bbbc39
Graham King authored May 07, 2025
```
Signed-off-by: Graham King <graham@gkgk.org>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
```
92bbbc39

chore: Remove embedded Python vllm and sglang engines (#966) · 42969800

Graham King authored May 07, 2025

vllm and sglang are now the sub-process engines from #954

Also updated docs on doing vllm and sglang multi-gpu (tensor parallel) and multi-node (pipeline parallel).

42969800

06 May, 2025 2 commits

feat(dynamo-run): vllm and sglang subprocess engines (#954) · 28fd481c

Graham King authored May 06, 2025

New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded python engines.
    
Why?
    
  - Pure Python, does not require knowing Rust to work on it. Much simpler to maintain.
  - No embedded Python interpreter which avoids linking libpython and avoids the MacOS virtualenv issues.
  - Should have better performance as it's "native" vllm / sglang.
  - Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.

28fd481c

feat: dynamo-run <-> python interop (#934) · 99cd9d85

Graham King authored May 05, 2025

Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```

Full vllm example, with pre-processing in dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus.

The `register_llm` call does this:

- Download the model from HF if necessary
- Load the model deployment card from the HF folder or extract from GGUF
- Push the tokenizer config etc into NATS object store so ingress can access it from a different machine
- Publish the model deployment card to ETCD

99cd9d85

01 May, 2025 2 commits
- chore(dynamo-llm): Move the pre-processor to ingress side (#903) · 2d2a1027
  Graham King authored May 01, 2025
```
Part of https://github.com/ai-dynamo/dynamo/issues/743
```
  2d2a1027
- feat: Support hf:// URLs in dynamo run (#917) · 877b2ec3
  Abrar Shivani authored May 01, 2025
```
Allow `hf://` prefix on command line. 

Closes GitHub issue: https://github.com/ai-dynamo/dynamo/issues/829
```
  877b2ec3
29 Apr, 2025 2 commits

feat: Add request template support for default inference parameters (#841) · adad2ecd

Abrar Shivani authored Apr 30, 2025

Adds support for specifying default request parameters through a json template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides.

Changes:
- Add --request-template CLI flag to specify template file path
- Integrate template support in HTTP, batch and text input modes
- Template values can be overridden by individual request parameters
- Example template.json:
```
{
    "model": "Qwen2.5-3B-Instruct",
    "temperature": 0.7,
    "max_completion_tokens": 4096
}
```

adad2ecd

chore: Split PushRouter from Client (#817) · a1a10365

Graham King authored Apr 29, 2025

In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side.

As part of moving pre-processing back to ingress-side we need to split this into two steps:
- Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card.
- PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters.

Part of #743

a1a10365

28 Apr, 2025 1 commit
- feat: Adding completions endpoint support to `dynamo run in=http` (#777) · b495cd83
  Olga Andreeva authored Apr 28, 2025
```
Signed-off-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
```
  b495cd83
25 Apr, 2025 2 commits

fix: Change default vLLM router to round-robin (#597) · 0e4fffbc
Piotr Marcinkiewicz authored Apr 25, 2025

0e4fffbc

chore: Publish Model Deployment Card to NATS (#799) · d346782c

Graham King authored Apr 25, 2025

This will allow an ingress-side pre-processor to see it without needing a model checkout.

Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files.

To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store.

The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided.

Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit.

Part of #743

d346782c

24 Apr, 2025 1 commit
- feat: Add linux aarch64 support to dynamo-run build (#802) · d757604c
  Ryan McCormick authored Apr 23, 2025
  
  d757604c