Commits · cfd12d7f264e486d454358aeb43e4155168ff25e · OpenDAS / dynamo

04 Jun, 2025 1 commit

feat: Support larger Gemma 3 models (#1359) · cfd12d7f

Graham King authored Jun 04, 2025

Publish `generation_config.json` from worker to ingress, as part of Model Deployment Card. That allows ingress to read key fields out of it. Gemma 3 4B+ has some important information that's only in there.

cfd12d7f

22 May, 2025 1 commit

feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card.

Previously hard coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170

183f2b32

21 May, 2025 3 commits
- fix(llmctl): Use ModelWatcher instead of direct etcd operations (#1150) · 3e8e38a9
  Graham King authored May 21, 2025
  
  3e8e38a9
- chore: Fix model removal on instance stop, refactor discovery (#1142) · b520bf44
  Graham King authored May 21, 2025
```
- Stop advertising a model when it's last instance stops. Previously was when any instance stops.
- Faster locks on model manager.
- Move discovery code out of http, as it is used by all inputs.
```
  b520bf44
- feat: vllm mock workers, Rusty skeleton (#1033) · 03c160af
  Yan Ru Pei authored May 21, 2025
```
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
```
  03c160af
14 May, 2025 1 commit

feat(dynamo-run): KV-aware routing (#1064) · 29813508

Graham King authored May 14, 2025

Router:
```
dynamo-run in=http out=dyn://dynamo.endpoint.generate --router-mode kv
```

Worker (* N):
```
dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B
```

You need patched vllm and the C bindings `.so`. Full docs in the updated guide: `docs/guides/dynamo_run.md`.

This gives us a pure-Rust ingress node: OpenAI compliant HTTP server + Pre-processor + KV-aware router.

29813508

09 May, 2025 1 commit
- feat: kv block manager (#965) · 4564a387
  Ryan Olson authored May 09, 2025
  
  4564a387
06 May, 2025 1 commit

feat: dynamo-run <-> python interop (#934) · 99cd9d85

Graham King authored May 05, 2025

Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```

Full vllm example, with pre-processing in dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus.

The `register_llm` call does this:

- Download the model from HF if necessary
- Load the model deployment card from the HF folder or extract from GGUF
- Push the tokenizer config etc into NATS object store so ingress can access it from a different machine
- Publish the model deployment card to ETCD

99cd9d85

29 Apr, 2025 1 commit

feat: Add request template support for default inference parameters (#841) · adad2ecd

Abrar Shivani authored Apr 30, 2025

Adds support for specifying default request parameters through a json template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides.

Changes:
- Add --request-template CLI flag to specify template file path
- Integrate template support in HTTP, batch and text input modes
- Template values can be overridden by individual request parameters
- Example template.json:
```
{
    "model": "Qwen2.5-3B-Instruct",
    "temperature": 0.7,
    "max_completion_tokens": 4096
}
```

adad2ecd

25 Apr, 2025 1 commit

chore: Publish Model Deployment Card to NATS (#799) · d346782c

Graham King authored Apr 25, 2025

This will allow an ingress-side pre-processor to see it without needing a model checkout.

Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files.

To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store.

The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided.

Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit.

Part of #743

d346782c

04 Apr, 2025 1 commit
- feat: KV recorder for dumping router events into a jsonl (#505) · 4b6cfc1b
  Yan Ru Pei authored Apr 04, 2025
  
  4b6cfc1b
24 Mar, 2025 1 commit

feat: Build pre-processor from GGUF (#344) · c7067fc2

Graham King authored Mar 24, 2025

This lets us do:
```
dynamo-run out=llamacpp <gguf_file>
```

Previously a `--model-config <hf-repo>` was also required, to configure our tokenizer.

c7067fc2

14 Mar, 2025 1 commit
- feat: global kv block manager (#45) · f04359cf
  Ryan Olson authored Mar 13, 2025
  
  f04359cf
09 Mar, 2025 1 commit

feat: kv aware router + disagg router + prefill queue (#11) · 19844fc0

Hongkuan Zhou authored Mar 08, 2025


Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: hongkuan <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>

19844fc0

08 Mar, 2025 1 commit
- chore: rename dynamo (#44) · 602352ce
  Neelay Shah authored Mar 08, 2025
```
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
```
  602352ce
05 Mar, 2025 1 commit
- refactor: rename triton_distributed to dynemo (#22) · 1af7433b
  Neelay Shah authored Mar 05, 2025
```
Co-authored-by: Graham King <grahamk@nvidia.com>
```
  1af7433b
25 Feb, 2025 2 commits
- feat: Add completion endpoint to http server and llmctl (#230) · b760c569
  Alec authored Feb 25, 2025
```
Co-authored-by: aflowers <aflowers@nvidia.com>
```
  b760c569
- refactor: move libs to lib dir · 08fcd7e9
  Neelay Shah authored Feb 24, 2025
```
Signed-off-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
```
  08fcd7e9
24 Feb, 2025 1 commit
- feat: add rust based tokenizer · 4f6f63cd
  Biswa Panda authored Feb 24, 2025
  
  4f6f63cd
21 Feb, 2025 1 commit

feat(tio): Distributed inference! (#235) · 32a748e4

Graham King authored Feb 21, 2025

Add support in tio for distributed components and discovery.

Node 1:
```
tio in=http out=tdr://ns/backend/mistralrs
```

Node 2:
```
tio in=tdr://ns/backend/mistralrs out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct
```

This will use etcd to auto-discover the model and NATS to talk to it. You can run multiple workers on the same endpoint and it will pick one at random each time.

The `ns/backend/mistralrs` are purely symbolic, pick anything as long as it has three parts, and it matches the other node.

32a748e4

20 Feb, 2025 1 commit
- feat: add local model card (#216) · 65a2dfab
  Biswa Panda authored Feb 20, 2025
  
  65a2dfab
18 Feb, 2025 1 commit

feat: Add KV publisher and receiver. Add KV aware routing example. · 8588e33a

GuanLuo authored Feb 18, 2025


Signed-off-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: aflowers <aflowers@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>

8588e33a

14 Feb, 2025 1 commit

feat: Add a mistralrs engine to tio (#178) · 2f700421

Graham King authored Feb 14, 2025

This allows us to run a real model.

Build:
```
cargo build --release --features mistralrs,cuda
```

Run:
```
./target/release/tio in=text out=mistralrs --model-path Llama-3.2-1B-Instruct-Q4_K_M.gguf
```

Why [mistral.rs](https://github.com/EricLBuehler/mistral.rs)?

- It has no dependencies. You don't need a container or a virtual env to get started.
- It supports CUDA, Metal (MacOS) and CPU-only. Everyone can join the AI revolution.
- It starts fast and serves fast (with CUDA). That makes it fun to experiment with.
- It runs many models, not just Mistral, that's just it's name.

2f700421

10 Feb, 2025 1 commit

feat: OpenAI compatible http service (#123) · ffc6dde1

Ryan Olson authored Feb 10, 2025


Signed-off-by: Ryan Olson <ryanolson@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>

ffc6dde1

05 Feb, 2025 1 commit
- ci: Add Copyright Verification Scripts w/ Automation (#110) · c9130f8f
  J Wyman authored Feb 05, 2025
  
  c9130f8f
04 Feb, 2025 1 commit
- feat: rust - initial commit · 5ed8c1c0
  Ryan Olson authored Feb 03, 2025
```
the journey begins
```
  5ed8c1c0