Commits · 0e4fffbc6b28b65f894ebcc520b13cea59db369d · OpenDAS / dynamo

25 Apr, 2025 2 commits

fix: Change default vLLM router to round-robin (#597) · 0e4fffbc
Piotr Marcinkiewicz authored Apr 25, 2025

0e4fffbc

chore: Publish Model Deployment Card to NATS (#799) · d346782c

Graham King authored Apr 25, 2025

This will allow an ingress-side pre-processor to see it without needing a model checkout.

Currently pre-processing is done in the worker, which has local access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`). We want to move the pre-processor to the ingress side to support KV routing. That requires the ingress side (i.e. the HTTP server), on a different machine than the worker, to be able to see those three files.

To support that, this PR makes the worker upload the contents of those files to the NATS object store and publish the MDC, with those NATS URLs, to the key-value store.

The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided.
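A minimal sketch of the pluggable-store idea, written in Python for brevity. The names `KeyValueStore`, `MemoryStore`, `publish_card` and the `mdc/<model>` key layout are hypothetical and are not taken from this PR; the real interface may look quite different.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueStore(ABC):
    """Hypothetical minimal store interface; not the actual dynamo interface."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class MemoryStore(KeyValueStore):
    """In-memory backend, handy for tests and single-process runs."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


def publish_card(store: KeyValueStore, model_name: str, card_json: bytes) -> str:
    """Store a serialized card under an assumed `mdc/<model>` key."""
    key = f"mdc/{model_name}"
    store.put(key, card_json)
    return key
```

Because callers only depend on `KeyValueStore`, a NATS, etcd or redis backend could be swapped in without touching `publish_card`.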

Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit.

Part of #743

d346782c

24 Apr, 2025 1 commit
- feat: Add linux aarch64 support to dynamo-run build (#802) · d757604c
  Ryan McCormick authored Apr 23, 2025
  
  d757604c
23 Apr, 2025 1 commit

feat: Add log verbosity level flag to dynamo-run cli (#780) · a03fd307

Abrar Shivani authored Apr 24, 2025

#### Overview:

This PR adds a command-line verbosity flag (-v, -vv) to dynamo-run to control log levels.
- Added new verbosity flag to Flags struct:
  - -v: Sets log level to debug
  - -vv: Sets log level to trace
  - No flag (default): Keeps log level at info
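
The mapping above can be sketched as a counting flag. This is an illustration in Python, not the Rust code from the PR; the `TRACE` value and both function names are assumptions, since Python's `logging` has no built-in trace level.

```python
import argparse
import logging

# Python has no TRACE level; 5 (below DEBUG=10) is an assumed stand-in.
TRACE = 5


def parse_verbosity(argv):
    """Count how many times -v was given (-v -> 1, -vv -> 2)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", dest="verbosity", action="count", default=0)
    return parser.parse_args(argv).verbosity


def level_for(verbosity: int) -> int:
    """Map the -v count to a log level: none=info, -v=debug, -vv=trace."""
    if verbosity >= 2:
        return TRACE
    if verbosity == 1:
        return logging.DEBUG
    return logging.INFO
```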

#### Details:
- closes GitHub issue: https://github.com/ai-dynamo/dynamo/issues/567

a03fd307

21 Apr, 2025 3 commits
- feat: add custom lease to worker components (#748) · c392c341
  ishandhanani authored Apr 21, 2025
  
  c392c341
- chore(dynamo-run): Fix echo_core for EOS tokens (#759) · 4e75b04b
  Graham King authored Apr 21, 2025
  
  "echo_core" is an engine that echoes the post-processed request back to you so you can see the template. Good for testing. It needed an extra flag set to work correctly.
  
  4e75b04b
- feat(dynamo-run): make the model name to be the same as the HF repo name (#749) · f2e0d6c2
  Zhongdongming Dai authored Apr 21, 2025
  
  f2e0d6c2
18 Apr, 2025 2 commits

chore: Remove TRT-LLM C++ engine in favor of Python one (#747) · 675a9bf5
Graham King authored Apr 18, 2025

675a9bf5

feat(dynamo-engine-vllm): vllm 0.8.X support (#728) · a745a980

Graham King authored Apr 18, 2025

It's different enough that I made a new engine vllm0_8 and renamed the previous engine to vllm0_7.

`dynamo-run out=vllm` now expects 0.8. This matches the container change in #690.

For older use `dynamo-run out=vllm0_7`.

a745a980

16 Apr, 2025 1 commit
- chore: Replace TRD->Dynamo in llmctl help output (#710) · d374e89f
  Ryan McCormick authored Apr 16, 2025
  
  d374e89f
14 Apr, 2025 1 commit
- feat(dynamo-run): improve available engines list in --help (#664) · cb0ceb81
  Xiaochuan Ye authored Apr 15, 2025
  
  Signed-off-by: yexiaochuan <tap91624@gmail.com>
  
  cb0ceb81
07 Apr, 2025 1 commit

feat(dynamo-run): Basic routing choice (#524) · ec2e7307

Graham King authored Apr 07, 2025

As a first step towards KV routing:
- introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet.
- Make the vllm engine publish the KV events received from our patched vllm.

Now we "just" need to connect the two. Easy right?
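
For illustration, the two modes can be sketched as below. This is a Python sketch only; `make_router` and the exact mode strings are assumptions, not the dynamo-run implementation.

```python
import itertools
import random
from typing import Callable, Optional, Sequence


def make_router(
    mode: str, workers: Sequence[str], rng: Optional[random.Random] = None
) -> Callable[[], str]:
    """Return a function that picks the next worker for a request."""
    if not workers:
        raise ValueError("at least one worker is required")
    if mode == "round-robin":
        # Cycle through the workers in order, wrapping around at the end.
        cycle = itertools.cycle(workers)
        return lambda: next(cycle)
    if mode == "random":
        # Pick uniformly at random; pass a seeded rng for reproducible tests.
        chooser = rng or random.Random()
        return lambda: chooser.choice(workers)
    raise ValueError(f"unknown router mode: {mode}")
```

A KV-aware mode would need per-worker cache state as an extra input, which is the part that is not connected yet.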

ec2e7307

04 Apr, 2025 3 commits

docs: dynamo-run clarify engine list (#522) · 75360111
Graham King authored Apr 04, 2025

75360111

chore: Upgrade Rust to 1.86 (#518) · e99aa1e1

Graham King authored Apr 04, 2025

Also upgrade the cargo resolver to v3, the default.

New clippy lints:
- `next_back()` instead of `last()` for a double-ended iterator. That avoids walking the whole list.
- `repeat_n` instead of `repeat().take()`. That avoids cloning.
- Doc indenting

e99aa1e1

feat: Python decorator dynamo_worker takes optional `static` parameter without etcd (#494) · 88ad3425

Graham King authored Apr 04, 2025

Adds `@dynamo_worker(static = True)` to create a static worker which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio.

This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint), and are discovered by ingress components using etcd.

Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere.

For NIM.

88ad3425

03 Apr, 2025 1 commit

refactor: migrate engines to standalone crates (#453) · 84985d3f

Ryan Olson authored Apr 03, 2025

Moved all of `lib/llm/src/engines` to their own crates as e.g. `lib/engines/mistralrs`. This will allow publishing of the `dynamo-llm` crate as it won't have any github dependencies.

The only engines in dynamo-llm will be the demo `echo` ones.
Co-authored-by: Graham King <grahamk@nvidia.com>

84985d3f

25 Mar, 2025 1 commit

feat: Allow passing any arguments to vllm and sglang engines (#368) · 670661f6

Graham King authored Mar 25, 2025

Put the arguments in a JSON file:
```
{
    "dtype": "half",
    "trust_remote_code": true
}
```

Pass it like this:
```
dynamo-run out=sglang ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json
```

Requested here https://github.com/ai-dynamo/dynamo/issues/290 (`dtype`) and here https://github.com/ai-dynamo/dynamo/issues/360 (`trust_remote_code`).
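
A rough sketch of how such a file could be consumed on the Python side. The helper names and the "file overrides defaults" merge rule are assumptions for illustration; the PR does not state how the arguments are merged.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_extra_engine_args(path: str) -> Dict[str, Any]:
    """Read the JSON file and insist on a top-level object."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("extra engine args must be a JSON object")
    return data


def merge_engine_args(defaults: Dict[str, Any], extra: Dict[str, Any]) -> Dict[str, Any]:
    """Assumed rule: values from the file override the defaults."""
    merged = dict(defaults)
    merged.update(extra)
    return merged
```

The merged dictionary could then be passed to an engine constructor as keyword arguments.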

670661f6

24 Mar, 2025 2 commits

feat: Build pre-processor from GGUF (#344) · c7067fc2

Graham King authored Mar 24, 2025

This lets us do:
```
dynamo-run out=llamacpp <gguf_file>
```

Previously a `--model-config <hf-repo>` was also required, to configure our tokenizer.

c7067fc2

fix: Attach lease to etcd key (#364) · d7165149
Graham King authored Mar 24, 2025

That ensures it gets removed when the process stops.

d7165149

21 Mar, 2025 1 commit
- chore: Clarified docs, added more informative error prints (#342) · 1831c9cc
  Olga Andreeva authored Mar 21, 2025
  
  Co-authored-by: Olga Andreeva <oandreeva@oandreeva-mlt.client.nvidia.com>
  
  1831c9cc
19 Mar, 2025 4 commits

fix: update crates metadata (#264) · 68d953f7
Anant Sharma authored Mar 19, 2025

Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>

68d953f7

chore: Don't depend on openssl (#292) · 7c3fd5c9

Graham King authored Mar 19, 2025

This makes all the Rust parts use the ring / rustls libraries instead of a local install of openssl. It's a step on the journey to being statically linked.

Pieces:
- `tokenizers` and `mistralrs` now support rustls (mistralrs by default, tokenizers with feature flag).
- Move shared dependencies up into workspace
- New `rand` crate has some renames for future rust
- Ensure the dependency doesn't creep back in by enforcing it with cargo deny.

7c3fd5c9

fix(mistralrs): Disable paged attention (#234) · fd95f37b

Graham King authored Mar 19, 2025

Under load it sometimes drops a request. The request gets added to the batch (sequence) and immediately gets a FinishReason Stop. Not sure why. It doesn't happen with the default scheduler (non-paged attention), so switch to that for now.

fd95f37b

fix(dynamo-run): Fix build if llamacpp and mistralrs are disabled (#262) · 3ac95a90
Graham King authored Mar 19, 2025

3ac95a90

18 Mar, 2025 2 commits
- docs(dynamo-run): Move README into docs/guides/ , add Quickstart (#265) · 40c55a24
  Graham King authored Mar 18, 2025
  
  40c55a24
- docs: fix links in docs (#256) · 548578f4
  Dmitry Tokarev authored Mar 18, 2025
  
  Co-authored-by: Anant Sharma <anants@nvidia.com>
  
  548578f4
17 Mar, 2025 2 commits

fix(runtime): Shutdown message from eprintln to tracing debug (#219) · f46f6d0e
Graham King authored Mar 17, 2025

f46f6d0e

fix(vllm,sglang): Let the engine enforce max tokens (#216) · 05765cd4

Graham King authored Mar 17, 2025

Previously several parts of the stack ensured max tokens (for this single request) was set.

Now only text input sets it (to 8k). Everything else leaves it as is, potentially blank. The engines themselves have very small defaults: 16 for vllm and 128 for sglang.

Also fix dynamo-run CUDA startup message to only print if we're using an engine that would benefit from it (mistralrs, llamacpp).

05765cd4

15 Mar, 2025 1 commit

feat(dynamo-run): Batch mode (#142) · 2cca070c

Graham King authored Mar 14, 2025

```
dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
```

The file has genai format, one entry per line:
```
{"text": "the prompt"}
{"text": ..etc
```

The prompt is evaluated and the output written to `output.jsonl` in the
same folder as the input.

At the end of the run various statistics are printed:
> Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)

This is also helpful for pushing load into the system and stressing the
various components. It is not intended for performance measurement; it's a
batch inference tool.
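
The flow described above can be sketched in a few lines of Python. This is not the dynamo-run implementation: `run_batch`, the `generate` callback and the shape of each output record are assumptions; only the `{"text": ...}` input format and the `output.jsonl` location come from this commit message.

```python
import json
import time
from pathlib import Path
from typing import Callable


def run_batch(input_path: str, generate: Callable[[str], str]) -> dict:
    """Run every prompt in a JSONL file and write results next to it."""
    src = Path(input_path)
    dst = src.with_name("output.jsonl")  # same folder as the input
    start = time.monotonic()
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # tolerate blank lines
            prompt = json.loads(line)["text"]
            # Output record layout is an assumption for this sketch.
            record = {"text": prompt, "response": generate(prompt)}
            fout.write(json.dumps(record) + "\n")
            count += 1
    elapsed = time.monotonic() - start
    return {"entries": count, "elapsed_s": elapsed, "output": str(dst)}
```

Passing a stub such as `str.upper` for `generate` is enough to exercise the file handling without a model.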

2cca070c

14 Mar, 2025 4 commits

feat(dynamo-run): Various UX improvements (#168) · 1fb31d6a

Graham King authored Mar 14, 2025

Engines mistralrs, sglang and vllm included by default. Can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`.

Added `--features vulkan` option, for llamacpp.

Build time message if CUDA or Metal would help and are missing. That's the best we can do:
> warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`

Runtime message if CUDA, Metal or Vulkan are enabled:
> 2025-03-14T21:59:26.501937Z  INFO dynamo_run: CUDA on

Runtime message if they are missing:
> 2025-03-14T22:02:37.439404Z  INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance

Default engine message includes available engines:
> 2025-03-14T21:59:26.503612Z  INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok

The really important outcome is that this should now "just work":
```
cargo install dynamo-run
dynamo-run Qwen/Qwen2.5-3B-Instruct
```

Sadly you still need `--features cuda|metal` for performance; I couldn't automate that.

1fb31d6a

fix: Improve error handling for failed HF download (#160) · 0f4529e9
Ryan McCormick authored Mar 14, 2025

0f4529e9

refactor: Update default log level to INFO and promote/demote a few log messages (#159) · 6a93d2c7
Ryan McCormick authored Mar 14, 2025

6a93d2c7

fix: Various for MacOS (#155) · 76b79149

Graham King authored Mar 14, 2025

- Mac doesn't have `pipe2` syscall so use plain `pipe`.
- rtnetlink isn't a dependency on mac so don't use the type

76b79149

13 Mar, 2025 5 commits

build: add top level rust workspace (#137) · 3d292851
Anant Sharma authored Mar 13, 2025

3d292851

feat(mistralrs): Let the engine enforce max tokens (#134) · 404a78e9

Graham King authored Mar 13, 2025

Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it which saves the extra tokenization step.

Also, dynamo-run now prints which engines are compiled in as part of its help message, plus some minor lint fixes.

404a78e9

fix(dynamo-run): Network interface detection is Linux only (#133) · b0d3eba1

Graham King authored Mar 13, 2025

"netlink" doesn't exist on Mac. We print the primary network interface to help multi-node setup, which is also unlikely on Mac.

b0d3eba1

docs: Updated macOS build instructions for dynamo-run. (#131) · 05465f78
Dmitry Tokarev authored Mar 13, 2025

05465f78

feat(dynamo-run): Download models from HF, smart model defaults (#126) · 089f8e1b

Graham King authored Mar 12, 2025



- Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.

- The default engine (previously always mistralrs) depends on what is compiled in.

- Text can be piped in and will result in a single run of the model.

All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit:
```
echo "What is the capital of Costa Rica?"  | dynamo-run Qwen/Qwen2.5-3B-Instruct
```
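
As an illustration of the piped-input case, a program can check whether stdin is a terminal before reading it. This is a Python sketch of the general technique; `read_piped_prompt` is a hypothetical helper, not code from dynamo-run.

```python
import sys
from typing import Optional, TextIO


def read_piped_prompt(stream: Optional[TextIO] = None) -> Optional[str]:
    """Return piped text, or None when attached to a terminal or empty."""
    stream = sys.stdin if stream is None else stream
    if stream.isatty():
        return None  # interactive session, nothing was piped in
    text = stream.read().strip()
    return text or None
```

When this returns a prompt, the caller would run the model once on it and exit; otherwise it would fall back to interactive mode.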
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

089f8e1b

12 Mar, 2025 1 commit

feat(pystr): Pass command line arguments (#123) · 995f71cc

Graham King authored Mar 12, 2025

Command line arguments are passed to the python engine like this:
```
dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
```

The python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.

This input:
```
dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
```

is read like this:
```
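import sys

async def generate(request):
    ...  # engine body, unchanged from before

if __name__ == "__main__":
    # sys.argv holds the standard flags plus everything after `--`.
    print(f"MAIN: {sys.argv}")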
```

and produces this output:
```
MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama_3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
```

This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.
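
One way an engine script might pick out the flags it cares about is `argparse.parse_known_args`, which leaves unrecognized flags alone. This is a sketch: `parse_engine_args` and the choice of which flags to declare are assumptions, not part of dynamo-run.

```python
import argparse
from typing import List, Tuple


def parse_engine_args(argv: List[str]) -> Tuple[argparse.Namespace, List[str]]:
    """Parse the flags this engine uses; return the rest untouched."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("-n", type=int, default=0)  # custom flag passed after `--`
    parser.add_argument("--model-path")
    parser.add_argument("--tensor-parallel-size", type=int, default=1)
    # argv[0] is the script name, so skip it.
    return parser.parse_known_args(argv[1:])
```

Inside `if __name__ == "__main__":` this would be called as `args, rest = parse_engine_args(sys.argv)` (with `import sys`), after which `args.n` holds the custom value and `rest` holds the flags the engine ignores.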

995f71cc

11 Mar, 2025 1 commit
- fix: Add missing arg to echo_full example and source cargo env in setup steps (#101) · a7c35dcf
  Ryan McCormick authored Mar 11, 2025
  
  a7c35dcf