  1. 17 Mar, 2025 1 commit
    • fix(vllm,sglang): Let the engine enforce max tokens (#216) · 05765cd4
      Graham King authored
      Previously several parts of the stack ensured max tokens (for this single request) was set.
      
      Now only text input sets it (to 8k). Everything else leaves it as-is, potentially blank. The engines themselves have very small defaults: 16 for vllm and 128 for sglang.
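      For HTTP clients the practical takeaway is to set max tokens explicitly rather than lean on those tiny engine defaults. A minimal sketch, assuming the `in=http` front end exposes an OpenAI-compatible `/v1/chat/completions` endpoint (URL, port and model name are assumptions, not from this commit):
      
      ```
      import json
      import urllib.request
      
      payload = {
          "model": "Llama-3.2-3B-Instruct",
          "messages": [{"role": "user", "content": "What is the capital of France?"}],
          # Omit max_tokens and the engine default applies (16 for vllm, 128 for sglang).
          "max_tokens": 256,
      }
      req = urllib.request.Request(
          "http://localhost:8080/v1/chat/completions",  # assumed endpoint
          data=json.dumps(payload).encode(),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          print(json.loads(resp.read())["choices"][0]["message"]["content"])
      ```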
      
      Also fix the dynamo-run CUDA startup message so it only prints if we're using an engine that would benefit from it (mistralrs, llamacpp).
  2. 15 Mar, 2025 1 commit
    • feat(dynamo-run): Batch mode (#142) · 2cca070c
      Graham King authored
      ```
      dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
      ```
      
      The file is in genai format, one entry per line:
      ```
      {"text": "the prompt"}
      {"text": ..etc
      ```
      
      The prompt is evaluated and the output written to `output.jsonl` in the
      same folder as the input.
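      A minimal sketch (not part of this commit) of preparing `prompts.jsonl` in that format and reading the results back afterwards:
      
      ```
      import json
      
      # Write prompts.jsonl: one {"text": ...} entry per line, as described above.
      prompts = ["What is the capital of France?", "Write a haiku about GPUs."]
      with open("prompts.jsonl", "w") as f:
          for p in prompts:
              f.write(json.dumps({"text": p}) + "\n")
      
      # After the dynamo-run batch invocation finishes, read the responses back.
      with open("output.jsonl") as f:
          for line in f:
              print(json.loads(line))
      ```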
      
      At the end of the run various statistics are printed:
      > Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)
      
      This is also helpful for pushing load into the system and stressing the
      various components. It's not intended for performance measurement; it's a
      batch inference tool.
  3. 14 Mar, 2025 4 commits
    • feat(dynamo-run): Various UX improvements (#168) · 1fb31d6a
      Graham King authored
      The mistralrs, sglang and vllm engines are included by default. They can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`.
      
      Added a `--features vulkan` option, for llamacpp.
      
      Build-time message if CUDA or Metal would help but is missing. That's the best we can do:
      > warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`
      
      Runtime message if CUDA, Metal or Vulkan is enabled:
      > 2025-03-14T21:59:26.501937Z  INFO dynamo_run: CUDA on
      
      Runtime message if they are missing:
      > 2025-03-14T22:02:37.439404Z  INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance
      
      Default engine message includes the available engines:
      > 2025-03-14T21:59:26.503612Z  INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok
      
      The really important outcome is that this should now "just work":
      ```
      cargo install dynamo-run
      dynamo-run Qwen/Qwen2.5-3B-Instruct
      ```
      
      Sadly you still need `--features cuda|metal` for performance; I couldn't automate that.
    • fix: Various for MacOS (#155) · 76b79149
      Graham King authored
      - Mac doesn't have the `pipe2` syscall, so use plain `pipe`.
      - rtnetlink isn't a dependency on Mac, so don't use the type.
  4. 13 Mar, 2025 5 commits
  5. 12 Mar, 2025 1 commit
    • feat(pystr): Pass command line arguments (#123) · 995f71cc
      Graham King authored
      Command-line arguments are passed to the Python engine like this:
      ```
      dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
      ```
      
      The Python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.
      
      This input:
      ```
      dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
      ```
      
      is read like this:
      ```
      import sys
      
      async def generate(request):
          ...  # as before
      
      if __name__ == "__main__":
          print(f"MAIN: {sys.argv}")
      ```
      
      and produces this output:
      ```
      MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama_3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
      ```
      
      This allows quick iteration on the engine setup. Note how the `-n 1` is included. The `--leader-addr` and `--model-config` flags will also be added if they are provided to `dynamo-run`.
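      For illustration, here is a sketch of how `my_engine.py` might consume these arguments. The argparse usage is an assumption, not something dynamo-run requires; `parse_known_args` tolerates the standard flags that get injected:
      
      ```
      import argparse
      import sys
      
      def parse_args(argv):
          parser = argparse.ArgumentParser()
          parser.add_argument("--model-path")
          parser.add_argument("--model-name")
          parser.add_argument("--tensor-parallel-size", type=int, default=1)
          parser.add_argument("-n", type=int, default=1)  # custom flag passed after `--`
          # Ignore any injected flags we did not declare (e.g. --http-port, --node-rank).
          args, _unknown = parser.parse_known_args(argv)
          return args
      
      if __name__ == "__main__":
          print(parse_args(sys.argv[1:]))
      ```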
  6. 11 Mar, 2025 5 commits
  7. 10 Mar, 2025 3 commits
  8. 09 Mar, 2025 1 commit
  9. 08 Mar, 2025 1 commit
  10. 07 Mar, 2025 3 commits
    • fix: dynemo-run model discovery working again (#52) · 9f53922a
      Graham King authored
      There are two etcd keys:
      - The service
      - The model
      
      The second one is the interesting one for us. Previously we confused the two.
    • feat: Python bring-your-own-engine with our tokenizer (#47) · 12714d90
      Graham King authored
      Instead of using `out=pystr:<my.py>` we can now do this:
      ```
      dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
      ```
      
      That engine will receive and respond with tokens. Here's an example engine file:
      ```
      import asyncio
      
      async def generate(request):
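          # Stream a fixed, hard-coded sequence of token ids, one per yield, pausing briefly between them.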
          yield {"token_ids":[791]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[6864]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[315]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[9822]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[374]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[12366]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[13]}
      ```
      
      Also reduce duplication by making the bindings engine use the llm lib engine.
    • feat: Bring-your-own engine for dynemo-run (#43) · 1b96c2c4
      Graham King authored
      1. Create `my_engine.py` (a helper that trims the repeated chunk boilerplate is sketched at the end of this message)
      
      ```
      import asyncio
      
      async def generate(request):
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
      ```
      
      2. Build
      
      ```
      cargo build --release --features python
      ```
      
      3. Run
      
      ```
      dynemo-run out=pystr:my_engine.py --name test
      ```
      
      And here's a distributed system, with your engine:
      
      - Node 1: `dynemo-run in=http out=dyn://test`
      - Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
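      
      The step 1 example repeats the full chunk envelope on every yield. A small helper keeps the engine readable; this is a sketch assuming the same chat.completion.chunk shape as above, not part of dynamo:
      
      ```
      import asyncio
      import time
      
      def chunk(content, finish_reason=None):
          # Wrap a piece of text in the same streaming chunk envelope used in step 1.
          choice = {"index": 0, "delta": {"content": content, "role": "assistant"}}
          if finish_reason is not None:
              choice["finish_reason"] = finish_reason
          return {
              "id": "1",
              "choices": [choice],
              "created": int(time.time()),
              "model": "Llama-3.2-1B-Instruct",
              "system_fingerprint": "local",
              "object": "chat.completion.chunk",
          }
      
      async def generate(request):
          for word in ["The", " capital", " of", " France", " is", " Paris", "."]:
              yield chunk(word)
              await asyncio.sleep(0.1)
          yield chunk("", finish_reason="stop")
      ```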
  11. 05 Mar, 2025 3 commits
  12. 04 Mar, 2025 1 commit
  13. 03 Mar, 2025 1 commit
  14. 28 Feb, 2025 2 commits
  15. 27 Feb, 2025 4 commits
  16. 26 Feb, 2025 2 commits
  17. 25 Feb, 2025 2 commits
    • feat: sglang backend for tio (#271) · e97493eb
      Graham King authored
      - Set up venv
      
      ```
      uv venv
      source .venv/bin/activate
      uv pip install pip
      uv pip install sgl-kernel --force-reinstall --no-deps
      uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
      ```
      
      - Build: `cargo build --release --features sglang`
      
      - Run single node (make sure you're in the venv): `./tio out=sglang ~/llm_models/my_model`
      
      - Run Deepseek multi-gpu / multi-node:
      
      Node 1:
      ```
      tio in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --dist-init-addr 10.217.98.122:9876
      ```
      
      Node 2:
      ```
      tio in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --dist-init-addr 10.217.98.122:9876
      ```