- 08 May, 2025 8 commits
-
-
julienmancuso authored
Co-authored-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
-
hhzhang16 authored
-
Graham King authored
- New mistralrs and llamacpp versions
- mistralrs: handle Gemma 3 and Llama 4 as vision models
- Update the dynamo-run docs to use Qwen 3
- Our pre-processor now supports Llama 4's newer multi-modal `config.json`
- Upgrade minijinja to handle Qwen 3's prompt template

For Llama 4 we'll need to limit the max seq len. vllm says:
> To serve at least one request with the model's max seq len (10485760), 240.00 GiB KV cache is needed,...
I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
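That 240.00 GiB figure works out to exactly 24 KiB of KV cache per token (240 GiB / 10,485,760 tokens). The sketch below estimates that per-token cost with the standard KV-cache formula; the 60 GiB budget in the usage comment is an illustrative assumption, not a measured figure:
```
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Standard per-token KV cache cost: a K and a V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_seq_len_for_budget(kv_budget_bytes: int, per_token_bytes: int) -> int:
    """Longest sequence a given KV-cache memory budget can hold."""
    return kv_budget_bytes // per_token_bytes

# vllm's numbers above imply 24 KiB per token. If, say, 60 GiB of GPU
# memory were left over for KV cache (an assumed budget), the longest
# servable sequence would be:
print(max_seq_len_for_budget(60 * 2**30, 24 * 1024))  # 2621440 tokens
```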
-
Ryan McCormick authored
-
Anthony Casagrande authored
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
-
Yan Ru Pei authored
-
Anant Sharma authored
-
hhzhang16 authored
-
- 07 May, 2025 12 commits
-
-
Hongkuan Zhou authored
-
Kris Hung authored
-
Graham King authored
Signed-off-by: Graham King <graham@gkgk.org>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
Ryan McCormick authored
-
Biswa Panda authored
-
Tanmay Verma authored
Signed-off-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
祝健聪 authored
Signed-off-by: Chasing1020 <chasing1020@gmail.com>
-
Anthony Casagrande authored
-
Graham King authored
vllm and sglang are now the sub-process engines from #954. Also updated the docs on running vllm and sglang multi-GPU (tensor parallel) and multi-node (pipeline parallel).
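For reference on what those two parallelism axes mean at the engine level, this is roughly how vLLM's offline Python API expresses them. A sketch only, with a placeholder model name; it does not reproduce the dynamo-run flags documented in this commit:
```
from vllm import LLM, SamplingParams

# Tensor parallel: shard each layer's weights across GPUs on one node.
# Pipeline parallel: split the layer stack across nodes / GPU groups.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder; TP/PP matters for larger models
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```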
-
ptarasiewiczNV authored
-
ptarasiewiczNV authored
-
julienmancuso authored
-
- 06 May, 2025 8 commits
-
-
jthomson04 authored
-
Hongkuan Zhou authored
-
Graham King authored
New vllm and sglang engines that run in a sub-process. They will hopefully replace the existing embedded Python engines. Why?
- Pure Python: no Rust knowledge is needed to work on them, so they are much simpler to maintain.
- No embedded Python interpreter, which avoids linking libpython and avoids the macOS virtualenv issues.
- Should have better performance, as it's "native" vllm / sglang.
- Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
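As a minimal sketch of the sub-process idea, the parent can talk line-delimited JSON to a pure-Python engine process over pipes. This is a generic illustration of the pattern; `engine_worker.py` and the message shape are hypothetical, not Dynamo's actual worker script or wire protocol:
```
import json
import subprocess
import sys

# "engine_worker.py" is a hypothetical pure-Python engine entrypoint;
# Dynamo's real worker script and protocol are not shown in this log.
proc = subprocess.Popen(
    [sys.executable, "engine_worker.py"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
proc.stdin.write(json.dumps({"prompt": "Hello"}) + "\n")
proc.stdin.flush()
print(json.loads(proc.stdout.readline()))  # the engine's reply
```
-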
jthomson04 authored
-
Graham King authored
Approved by OSRB in Slack. Note we don't check for the closing delimiter, so the longer copyright format is still allowed. The motivation is that it reduces context usage by 12 lines for every file in the project, which helps tools like Cursor and Claude Code fit more, go faster, and cost less.
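A hypothetical checker in this spirit (the project's real check is not shown in this log): it requires only the opening license marker near the top of a file and deliberately ignores any closing delimiter, so a longer copyright block also passes.
```
import sys

def has_license_header(path: str, max_lines: int = 10) -> bool:
    # Require only the opening marker; no closing delimiter is checked,
    # so files that keep the longer copyright block still pass.
    with open(path, encoding="utf-8") as f:
        head = [next(f, "") for _ in range(max_lines)]
    return any("SPDX-License-Identifier" in line for line in head)

if __name__ == "__main__":
    missing = [p for p in sys.argv[1:] if not has_license_header(p)]
    for p in missing:
        print(f"missing license header: {p}")
    sys.exit(1 if missing else 0)
```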
-
hhzhang16 authored
-
hhzhang16 authored
-
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```
Full vllm example, with pre-processing in Dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move the pre-processor to the ingress side. It means we can decouple Rust and Python, using NATS as the bus.

The `register_llm` call does this:
- Download the model from HF if necessary
- Load the model deployment card from the HF folder, or extract it from the GGUF
- Push the tokenizer config etc. into the NATS object store so the ingress can access it from a different machine
- Publish the model deployment card to ETCD
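For context, here is roughly how that snippet sits inside a worker script, modeled on the hello_world example above. Treat the runtime API names (`dynamo_worker`, `create_service`, `serve_endpoint`, and the `dynamo.runtime` import path) as assumptions rather than confirmed signatures:
```
from dynamo.llm import register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker  # assumed import path

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"

async def generate(request):
    # Engine-specific work (e.g. calling vllm) would go here.
    yield {"text": "..."}

@dynamo_worker()
async def worker(runtime: DistributedRuntime):
    # Matches the dyn://dynamo.backend.generate address used above.
    component = runtime.namespace("dynamo").component("backend")
    await component.create_service()
    endpoint = component.endpoint("generate")
    await register_llm(endpoint, MODEL, 3)
    await endpoint.serve_endpoint(generate)
```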
-
- 05 May, 2025 6 commits
-
-
julienmancuso authored
-
Hongkuan Zhou authored
-
richardhuo-nv authored
-
julienmancuso authored
-
Harrison Saturley-Hall authored
Signed-off-by: Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
-
Hongkuan Zhou authored
-
- 02 May, 2025 3 commits
-
-
Tanmay Verma authored
-
Ryan McCormick authored
-
Kris Hung authored
-
- 01 May, 2025 3 commits
-
-
hhzhang16 authored
-
Graham King authored
Part of https://github.com/ai-dynamo/dynamo/issues/743
-
Biswa Panda authored
-