Commits · cc086811b312194099826a93454f7f108ee67269 · OpenDAS / dynamo

11 Mar, 2025 4 commits
- feat(sdk): pass in CLI args when running `serve` (#78) · cc086811
  ishandhanani authored Mar 11, 2025
  
  cc086811
- feat: unified entry point for vllm-nixl (#83) · 30c5a79f
  Hongkuan Zhou authored Mar 10, 2025
```
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
```
  30c5a79f
- style: fix go linting errors (#86) · 2340751b
  Anant Sharma authored Mar 10, 2025
  
  2340751b
- feat: add openai http service (#82) · dd620825
  Biswa Panda authored Mar 10, 2025
  
  dd620825
10 Mar, 2025 8 commits
- chore: update wheel name and reset versions (#73) · fc4da345
  Anant Sharma authored Mar 10, 2025
  
  fc4da345
- feat: Add configurable DYN_TOKEN_ECHO_DELAY_MS for echo engine testing (#81) · 0a3f2c69
  Ryan McCormick authored Mar 10, 2025
  
  0a3f2c69
- feat: LLM API integration with smart routing bits (#55) · 11e3e188
  Tanmay Verma authored Mar 10, 2025
```
Co-authored-by: Shreyas Misra <shreyasm@nvidia.com>
```
  11e3e188
- fix(dynamo-run): Text input doesn't need a name (#80) · ec46ed52
  Graham King authored Mar 10, 2025
```
For the `echo` and `pystr` engines we previously required the user to pass `--model-name <x>` so we would have a name for the model. If the input is HTTP we do need this to match on the user's JSON request.

If the input is Text we don't need a name. So if the input is Text and we don't already have a name for the model, give it one.
```
  ec46ed52
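The naming rule described in this commit message can be sketched in a few lines of Python. This is only an illustration of the logic, not the dynamo-run implementation: `resolve_model_name`, the `input_kind` strings, and the `"default"` fallback are all hypothetical, since the commit message does not say which name is assigned.

```python
from typing import Optional

# Hypothetical fallback; the name dynamo-run actually assigns is not
# stated in the commit message.
DEFAULT_MODEL_NAME = "default"


def resolve_model_name(input_kind: str, model_name: Optional[str]) -> str:
    """Return the model name to use for the given input kind."""
    if model_name:
        # An explicitly supplied name always wins.
        return model_name
    if input_kind == "text":
        # Text input never matches on a name, so a placeholder is enough.
        return DEFAULT_MODEL_NAME
    # Other inputs (HTTP in the commit message) match the name against the
    # user's JSON request, so a missing name is treated as an error.
    raise ValueError("a model name is required for this input")


print(resolve_model_name("text", None))
print(resolve_model_name("http", "my-model"))
```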
- ci: start using ECR for container caches (#77) · c8b70289
  Harrison Saturley-Hall authored Mar 10, 2025
  
  c8b70289
- style: fix formatting for .go file (#62) · 07afe3c9
  Anant Sharma authored Mar 10, 2025
  
  07afe3c9
- chore: Add dynamo-run to workspace file (#76) · 090c825f
  Ryan McCormick authored Mar 10, 2025
  
  090c825f
- build(deps): bump transformers from 4.45.2 to 4.48.3 (#58) · 7591a5cc
  Dmitry Tokarev authored Mar 10, 2025
  
  7591a5cc
09 Mar, 2025 8 commits
- feat: make block_size input for indexer, router, publisher (#66) · 989bb3d5
  Alec authored Mar 09, 2025
  
  989bb3d5
- chore: stragglers rename (#69) · dd31a322
  Neelay Shah authored Mar 09, 2025
```
Co-authored-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
```
  dd31a322
- feat: make vllm baseline support both chat and completions (#70) · efe82b86
  Alec authored Mar 09, 2025
  
  efe82b86
- ci: remove caching of docker layers on PR builds (#61) · 5944dbed
  Harrison Saturley-Hall authored Mar 09, 2025
  
  5944dbed
- chore: left over renaming (#67) · 678cffb4
  Neelay Shah authored Mar 09, 2025
```
Co-authored-by: Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
Co-authored-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
```
  678cffb4
- chore: address comments for #35 (#53) · 6ba39b09
  GuanLuo authored Mar 09, 2025
  
  6ba39b09
- feat: kv aware router + disagg router + prefill queue (#11) · 19844fc0
  Hongkuan Zhou authored Mar 08, 2025
```
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: hongkuan <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
```
  19844fc0
- fix: vLLM disagg fix incorrect block ids order (#63) · 7567620f
  ptarasiewiczNV authored Mar 09, 2025
```
Co-authored-by: ptarasiewicz@nvidia.com <Piotr Tarasiewicz>
```
  7567620f
08 Mar, 2025 7 commits
- Update README.md · cbd20c30
  Meenakshi Sharma authored Mar 08, 2025
  
  cbd20c30
- ci: rename project to dynamo (#60) · 61abae51
  Harrison Saturley-Hall authored Mar 08, 2025
  
  61abae51
- chore: Renamed Triton Distributed to Dynamo (#56) · b4d56a57
  Dmitry Tokarev authored Mar 08, 2025
  
  b4d56a57
- chore: remove debug statements (#57) · dd7646ef
  Neelay Shah authored Mar 08, 2025
  
  dd7646ef
- chore: rename dynamo (#44) · 602352ce
  Neelay Shah authored Mar 08, 2025
```
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
```
  602352ce
- ci: skip test redundancy in Gitlab CI (#36) · ecf53ce2
  Pavithra Vijayakrishnan authored Mar 07, 2025
  
  ecf53ce2
- test: add tests for kv bindings (#35) · dcecc47d
  GuanLuo authored Mar 07, 2025
  
  dcecc47d
07 Mar, 2025 10 commits
- test: add gpu sanity test for ci job (#49) · 6705d483
  Anant Sharma authored Mar 07, 2025
  
  6705d483
- feat: Enhance mock worker with mock KvHitRate events (#50) · 1ce7ba03
  Ryan McCormick authored Mar 07, 2025
  
  1ce7ba03
- fix: dynemo-run model discovery working again (#52) · 9f53922a
  Graham King authored Mar 07, 2025

There are two etcd keys:
- The service
- The model

The second one is the interesting one for us. Previously we confused the two.

  9f53922a
- feat: onboard dynamo-sdk basic and kv-router examples (#20) · aacc5d76
  Biswa Panda authored Mar 07, 2025
```
Co-authored-by: Neelay Shah <neelays@nvidia.com>
```
  aacc5d76
- refactor: Use library constant for kv-hit-rate subject (#48) · 2ee29443
  Ryan McCormick authored Mar 07, 2025
```
Replaces hard-coded "kv-hit-rate" string in multiple places with KV_HIT_RATE_SUBJECT constant in lib/llm.
```
  2ee29443
- chore: remove ucx-py from requirements and fix UCX env variable (#46) · 44bde250
  ptarasiewiczNV authored Mar 07, 2025
```
Co-authored-by: ptarasiewicz@nvidia.com <Piotr Tarasiewicz>
```
  44bde250
- feat: Python bring-your-own-engine with our tokenizer (#47) · 12714d90
  Graham King authored Mar 07, 2025

Instead of using `out=pystr:<my.py>` we can now do this:
```
dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
```

That engine will receive and respond with tokens. Here's an example engine file:
```
import asyncio

async def generate(request):
    yield {"token_ids":[791]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[6864]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[315]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[9822]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[374]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[12366]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[13]}
```

Also reduce duplication by making the bindings engine use the llm lib engine.
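
A token engine like the one above can be checked locally, without dynemo-run, by draining its async generator. The following is a minimal sketch: `collect_token_ids` is a hypothetical helper, the stand-in `generate` is shortened to three tokens, and the empty dict passed as the request is a placeholder because the commit message does not show the request shape.

```python
import asyncio


# Stand-in engine with the same shape as the example above: an async
# generator that yields dicts carrying a "token_ids" list.
async def generate(request):
    for token_id in (791, 6864, 315):
        yield {"token_ids": [token_id]}
        await asyncio.sleep(0)  # hand control back to the event loop


async def collect_token_ids(engine, request):
    """Drain an engine's async generator and flatten the token ids."""
    token_ids = []
    async for chunk in engine(request):
        token_ids.extend(chunk["token_ids"])
    return token_ids


# The real request shape is not shown in the commit message, so an empty
# dict is used here as a placeholder.
collected = asyncio.run(collect_token_ids(generate, {}))
print(collected)
```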

  12714d90
- docs: Add VLLM_NIXL in main readme (#23) · d752a1a2
  Piotr Marcinkiewicz authored Mar 07, 2025
  
  d752a1a2
- refactor: rename count to metrics and move location (#21) · ac13ed06
  Neelay Shah authored Mar 06, 2025
  
  ac13ed06
- feat: Bring-your-own engine for dynemo-run (#43) · 1b96c2c4
  Graham King authored Mar 06, 2025

1. Create `my_engine.py`

```
import asyncio

async def generate(request):
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
```

2. Build

```
cargo build --release --features python
```

3. Run

```
dynemo-run out=pystr:my_engine.py --name test
```

And here's a distributed system, with your engine:

- Node 1: `dynemo-run in=http out=dyn://test`
- Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
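
The chunks in step 1 differ only in their `content` and the final `finish_reason`, so an engine file could build them with a small helper. The sketch below is one possible way to write that, not the form used in the commit: `make_chunk` and `collect_text` are hypothetical names, and the empty dict passed as the request is a placeholder because the request shape is not shown.

```python
import asyncio


def make_chunk(content, finish_reason=None):
    """Build one chat-completion chunk with the fields used in step 1."""
    choice = {"index": 0, "delta": {"content": content, "role": "assistant"}}
    if finish_reason is not None:
        choice["finish_reason"] = finish_reason
    return {
        "id": "1",
        "choices": [choice],
        "created": 1841762283,
        "model": "Llama-3.2-1B-Instruct",
        "system_fingerprint": "local",
        "object": "chat.completion.chunk",
    }


async def generate(request):
    # Same sequence of chunks as the hand-written engine in step 1.
    for piece in ["The", " capital", " of", " France", " is", " Paris", "."]:
        yield make_chunk(piece)
        await asyncio.sleep(0.1)
    yield make_chunk("", finish_reason="stop")


async def collect_text(engine, request):
    """Drain the engine and join the streamed delta contents."""
    parts = []
    async for chunk in engine(request):
        parts.append(chunk["choices"][0]["delta"]["content"])
    return "".join(parts)


# Placeholder request; the real request shape is not shown in the commit.
text = asyncio.run(collect_text(generate, {}))
print(text)
```

Running the file directly is a quick way to confirm the generator streams the expected text before pointing `out=pystr:` at it.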

  1b96c2c4
06 Mar, 2025 3 commits
- feat: Use round robin for disagg routing (#40) · 3c60fe2a
  ptarasiewiczNV authored Mar 07, 2025
```
Co-authored-by: ptarasiewicz@nvidia.com <Piotr Tarasiewicz>
```
  3c60fe2a
- feat: Enable make_xfer NIXL kv transfer (#39) · bc42616e
  ptarasiewiczNV authored Mar 07, 2025
```
Co-authored-by: ptarasiewicz@nvidia.com <Piotr Tarasiewicz>
```
  bc42616e
- feat: Add estimated kv cache hit metric events (#30) · 09656f6c
  Ryan McCormick authored Mar 06, 2025
  
  09656f6c