Commits · 7df6bb18e6d9dc68d1a3cd349bcd2838734031fa · OpenDAS / dynamo

14 Mar, 2025 1 commit
- feat: global kv block manager (#45) · f04359cf
  Ryan Olson authored Mar 13, 2025
  
  f04359cf
13 Mar, 2025 4 commits

build: add top level rust workspace (#137) · 3d292851
Anant Sharma authored Mar 13, 2025

3d292851

feat(mistralrs): Let the engine enforce max tokens (#134) · 404a78e9

Graham King authored Mar 13, 2025

Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it which saves the extra tokenization step.

Also dynamo-run prints which engines are compiled in in help message, and some minor lint fixes.

404a78e9

fix(dynamo-run): Network interface detection is Linux only (#133) · b0d3eba1

Graham King authored Mar 13, 2025

"netlink" doesn't exist on Mac. We print the primary network interface to help multi-node setup, which is also unlikely on Mac.

b0d3eba1

feat(dynamo-run): Download models from HF, smart model defaults (#126) · 089f8e1b

Graham King authored Mar 12, 2025



- Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.

- The default engine (previously always mistralrs) depends on what is compiled in.

- Text can be piped in and will result in a single run of the model.

All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit:
```
echo "What is the capital of Costa Rica?"  | dynamo-run Qwen/Qwen2.5-3B-Instruct
```
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

089f8e1b

12 Mar, 2025 1 commit

feat(pystr): Pass command line arguments (#123) · 995f71cc

Graham King authored Mar 12, 2025

Command line arguments are passed to the python engine like this:
```
dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
```

The python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.

This input:
```
dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
```

is read like this:
```
async def generate(request):
    .. as before ..

if __name__ == "__main__":
    print(f"MAIN: {sys.argv}")
```

and produces this output:
```
MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
```

This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.

995f71cc

11 Mar, 2025 6 commits
- fix(pystr): Output python errors (#99) · 9c7b1ead
  Graham King authored Mar 11, 2025
```
If the python file raises an exception we print it like Python would.

```
  $ ./target/debug/dynamo-run in=http out=pystr:~/Temp/cn47/1_e.py --model-name test
  
  Traceback (most recent call last):
    File "/home/graham/Temp/cn47/1_e.py", line 17, in generate
      raise MyException("The message")
  1_e.MyException: The message
```
```
  9c7b1ead
- feat(dynamo-run): Upgrade mistral.rs (#97) · d99b188d
  Graham King authored Mar 11, 2025
```
- Latest from repo, many improvements
- Support most of the OpenAI request features (temperature, top_p, etc)
- Download models from Hugging Face if necessary
```
  d99b188d
- refactor: Move rust binaries out of examples, update nixl dockerfile (#89) · e5db9e86
  Neelay Shah authored Mar 11, 2025
```
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
```
  e5db9e86
- fix: Support vllm_nixl (custom vllm patch) from dynamo-run (#84) · e1a95dab
  Ryan McCormick authored Mar 11, 2025
  
  e1a95dab
- feat: add new metrics and simple router cost fn (#88) · 3f84cdad
  Alec authored Mar 11, 2025
  
  3f84cdad
- feat: add openai http service (#82) · dd620825
  Biswa Panda authored Mar 10, 2025
  
  dd620825
10 Mar, 2025 1 commit
- chore: update wheel name and reset versions (#73) · fc4da345
  Anant Sharma authored Mar 10, 2025
  
  fc4da345
09 Mar, 2025 5 commits

feat: make block_size input for indexer, router, publisher (#66) · 989bb3d5
Alec authored Mar 09, 2025

989bb3d5
chore: stragglers rename (#69) · dd31a322
Neelay Shah authored Mar 09, 2025
```
Co-authored-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
```
dd31a322

chore: left over renaming (#67) · 678cffb4

Neelay Shah authored Mar 09, 2025


Co-authored-by: Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
Co-authored-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

678cffb4

chore: address comments for #35 (#53) · 6ba39b09
GuanLuo authored Mar 09, 2025

6ba39b09

feat: kv aware router + disagg router + prefill queue (#11) · 19844fc0

Hongkuan Zhou authored Mar 08, 2025


Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: hongkuan <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>

19844fc0

08 Mar, 2025 3 commits
- chore: Renamed Triton Distributed to Dynamo (#56) · b4d56a57
  Dmitry Tokarev authored Mar 08, 2025
  
  b4d56a57
- chore: rename dynamo (#44) · 602352ce
  Neelay Shah authored Mar 08, 2025
```
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
```
  602352ce
- test: add tests for kv bindings (#35) · dcecc47d
  GuanLuo authored Mar 07, 2025
  
  dcecc47d
07 Mar, 2025 4 commits

fix: dynemo-run model discovery working again (#52) · 9f53922a

Graham King authored Mar 07, 2025

There are two etcd keys:
- The service
- The model

The second one is the interesting one for us. Previously we confused the two.

9f53922a

refactor: Use library constant for kv-hit-rate subject (#48) · 2ee29443
Ryan McCormick authored Mar 07, 2025
```
Replaces hard-coded "kv-hit-rate" string in multiple places with KV_HIT_RATE_SUBJECT constant in lib/llm.
```
2ee29443

feat: Python bring-your-own-engine with our tokenizer (#47) · 12714d90

Graham King authored Mar 07, 2025

Instead of using `out=pystr:<my.py>` we can now do this:
```
dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
```

That engine will receive and respond with tokens. Here's an example engine file:
```
import asyncio

async def generate(request):
    yield {"token_ids":[791]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[6864]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[315]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[9822]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[374]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[12366]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[13]}
```

Also reduce duplication by making the bindings engine use the llm lib engine.

12714d90

feat: Bring-your-own engine for dynemo-run (#43) · 1b96c2c4

Graham King authored Mar 06, 2025

1. Create `my_engine.py`

```
import asyncio

async def generate(request):
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
```

2. Build

```
cargo build --release --feature python
```

3. Run

```
dynemo-run out=pystr:my_engine.py --name test
```

And here's a distributed system, with your engine:

- Node 1: `dynemo-run in=http out=dyn://test`
- Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`

1b96c2c4

06 Mar, 2025 3 commits
- feat: Add estimated kv cache hit metric events (#30) · 09656f6c
  Ryan McCormick authored Mar 06, 2025
  
  09656f6c
- refactor: Simplify codespell configuration, allow contractions, add custom dictionary (#28) · e1ae9aa0
  Ryan McCormick authored Mar 05, 2025
  
  e1ae9aa0
- feat: expose KV routing components for easier router customization (#15) · e159e53f
  GuanLuo authored Mar 05, 2025
  
  e159e53f
05 Mar, 2025 2 commits
- fix: mistralrs use auto device map (#31) · 46ed649c
  Graham King authored Mar 05, 2025
```
Fixes a panic.
```
  46ed649c
- refactor: rename triton_distributed to dynemo (#22) · 1af7433b
  Neelay Shah authored Mar 05, 2025
```
Co-authored-by: Graham King <grahamk@nvidia.com>
```
  1af7433b
04 Mar, 2025 3 commits

feat: vllm engine tensor parallel and pipeline parallel (#16) · a657ec61
Graham King authored Mar 04, 2025
```
Needs more testing but good enough for now. I get the same results with this as with `vllm serve`.
```
a657ec61
feat: add python binding for rust llm modules (#13) · a32cdad6
Biswa Panda authored Mar 04, 2025

a32cdad6

feat: nixl metadata store and retrieved from etcd (#6) · 3a5fe17d

Neelay Shah authored Mar 04, 2025

Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: Neelay Shah <neelays@ipp2-0493.ipp2u1.colossus.nvidia.com>
Co-authored-by: Neelay Shah <neelays@ipp1-1941.ipp1a1.colossus.nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: Neelay Shah <neelays@4u8g-gen-0078.ipp3a2.colossus.nvidia.com>
Co-authored-by: ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>

3a5fe17d

03 Mar, 2025 1 commit

fix: Install specific toolchain (#329) · 2d906fb4

Graham King authored Mar 03, 2025

`cargo build --locked` won't let you use "1.85.0" if you only have "stable" installed, even if those are the same thing right now.

2d906fb4

02 Mar, 2025 1 commit
- [fix] OpenAI API object: completion to text_completion (#318) · a48ffc52
  Alec authored Mar 01, 2025
  
  a48ffc52
28 Feb, 2025 5 commits
- refactor: use async-openai CompletionRequest (#310) · 9162f3ad
  Paul Hendricks authored Feb 28, 2025
  
  9162f3ad
- feat: TensorRT-LLM engine (#317) · 057f8f47
  Graham King authored Feb 28, 2025
```
Engine, `tio` support and docs.

Proof of concept / experimental.
```
  057f8f47
- [fix] KV Router Example fixes (#314) · 11a36651
  Alec authored Feb 28, 2025
```
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
```
  11a36651
- feat: Add initial prometheus/grafana support for count (#303) · d38325c2
  Ryan McCormick authored Feb 28, 2025
  
  d38325c2
- feat: vllm engine (#308) · 6e0cfbd9
  Graham King authored Feb 28, 2025
```
triton-distributed-llm component and support in tio
```
  6e0cfbd9