- 06 May, 2025 2 commits
-
-
Graham King authored
New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded Python engines. Why?
- Pure Python; does not require knowing Rust to work on it. Much simpler to maintain.
- No embedded Python interpreter, which avoids linking libpython and avoids the macOS virtualenv issues.
- Should have better performance, as it's "native" vllm / sglang.
- Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```
Full vllm example, with pre-processing in dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move the pre-processor to the ingress side. It means we can decouple Rust and Python using NATS as the bus.

The `register_llm` call does this:
- Download the model from HF if necessary
- Load the model deployment card from the HF folder, or extract it from GGUF
- Push the tokenizer config etc. into the NATS object store so ingress can access it from a different machine
- Publish the model deployment card to etcd
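The registration steps above can be sketched with plain dicts standing in for the NATS object store and etcd. Everything here (function body, store layout, key names, placeholder file contents) is illustrative, not the real `dynamo.llm` implementation:

```python
# Illustrative sketch of what register_llm does, with in-memory dicts
# standing in for the NATS object store and etcd.
object_store = {}  # stand-in for the NATS object store
kv_store = {}      # stand-in for etcd

def register_llm(endpoint, model_name):
    # 1. Download the model from HF if necessary (skipped in this sketch).
    # 2. Build the model deployment card from the HF folder (or GGUF metadata).
    card = {"name": model_name, "endpoint": endpoint, "files": {}}
    # 3. Push tokenizer config etc. into the object store so an ingress on a
    #    different machine can fetch them without a model checkout.
    for fname in ("config.json", "tokenizer.json", "tokenizer_config.json"):
        key = f"{model_name}/{fname}"
        object_store[key] = b"{}"  # placeholder file contents
        card["files"][fname] = f"nats://{key}"
    # 4. Publish the card so dynamo-run can discover this worker.
    kv_store[f"models/{model_name}"] = card
    return card

card = register_llm("dyn://dynamo.backend.generate", "Qwen/Qwen2.5-0.5B-Instruct")
```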
-
- 01 May, 2025 2 commits
-
-
Graham King authored
Part of https://github.com/ai-dynamo/dynamo/issues/743
-
Abrar Shivani authored
Allow `hf://` prefix on command line. Closes GitHub issue: https://github.com/ai-dynamo/dynamo/issues/829
-
- 29 Apr, 2025 2 commits
-
-
Abrar Shivani authored
Adds support for specifying default request parameters through a JSON template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides. Changes:
- Add `--request-template` CLI flag to specify the template file path
- Integrate template support in HTTP, batch and text input modes
- Template values can be overridden by individual request parameters
- Example template.json:
```
{
  "model": "Qwen2.5-3B-Instruct",
  "temperature": 0.7,
  "max_completion_tokens": 4096
}
```
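The override behaviour amounts to a simple merge: start from the template's defaults and let any value the request actually sets win. A minimal Python illustration (the real handling lives in dynamo-run's Rust code):

```python
import json

def apply_template(template, request):
    # Template supplies defaults; any non-None value in the request wins.
    merged = dict(template)
    merged.update({k: v for k, v in request.items() if v is not None})
    return merged

template = json.loads(
    '{"model": "Qwen2.5-3B-Instruct", "temperature": 0.7, "max_completion_tokens": 4096}'
)
# This request overrides temperature but leaves max_completion_tokens unset.
request = {"temperature": 0.2, "prompt": "hi", "max_completion_tokens": None}
merged = apply_template(template, request)
```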
Graham King authored
In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously, Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side. As part of moving pre-processing back to the ingress side we need to split this into two steps:
- Client discovers the endpoints, and (in a later PR) will fetch their Model Deployment Card.
- PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters.

Part of #743
-
- 28 Apr, 2025 1 commit
-
-
Olga Andreeva authored
Signed-off-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
-
- 25 Apr, 2025 2 commits
-
-
Piotr Marcinkiewicz authored
-
Graham King authored
This will allow an ingress-side pre-processor to see it without needing a model checkout.

Currently pre-processing is done in the worker, which has local access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`). We want to move the pre-processor to the ingress side to support KV routing. That requires the ingress side (i.e. the HTTP server), possibly on a different machine than the worker, to be able to see those three files.

To support that, this PR makes the worker upload the contents of those files to the NATS object store, and publish the MDC with those NATS URLs to the key-value store. The key-value store is behind an interface, so any store (NATS, etcd, redis, etc.) can be supported. Implementations for memory and NATS are provided.

Fetching the MDC from the store, doing pre-processing ingress-side, and publishing a card backed by a GGUF are all for a later commit.

Part of #743
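The store interface described above might look something like this (a Python sketch of the idea; the actual interface is Rust, and the names here are made up):

```python
from abc import ABC, abstractmethod

class KeyValueStore(ABC):
    """Minimal store interface; NATS, etcd, redis, etc. would each implement it."""

    @abstractmethod
    def put(self, key, value):
        ...

    @abstractmethod
    def get(self, key):
        ...

class MemoryStore(KeyValueStore):
    """In-memory implementation, handy for tests."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

store = MemoryStore()
store.put("mdc/tokenizer.json", b'{"version": "1.0"}')
```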
-
- 24 Apr, 2025 1 commit
-
-
Ryan McCormick authored
-
- 23 Apr, 2025 1 commit
-
-
Abrar Shivani authored
#### Overview:
This PR adds a command-line verbosity flag (`-v`, `-vv`) to dynamo-run to control log levels.
- Added new verbosity flag to the `Flags` struct:
  - `-v`: Sets log level to debug
  - `-vv`: Sets log level to trace
  - No flag (default): Keeps log level at info
#### Details:
- Closes GitHub issue: https://github.com/ai-dynamo/dynamo/issues/567
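In Python terms the mapping is roughly the following (Python's `logging` module has no built-in trace level, so a custom constant stands in for the trace level of Rust's `tracing`):

```python
import logging

TRACE = 5  # below DEBUG; stands in for Rust tracing's trace level

def level_from_verbosity(count):
    """Map the number of repeated -v flags to a log level."""
    return {0: logging.INFO, 1: logging.DEBUG}.get(count, TRACE)
```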
-
- 21 Apr, 2025 3 commits
-
-
ishandhanani authored
-
Graham King authored
"echo_core" is an engine that echoes the post-processed request back to you so you can see the template. Good for testing. It needed an extra flag set to work correctly.
-
Zhongdongming Dai authored
-
- 18 Apr, 2025 2 commits
-
-
Graham King authored
-
Graham King authored
It's different enough that I made a new engine vllm0_8 and renamed the previous engine to vllm0_7. `dynamo-run out=vllm` now expects 0.8. This matches the container change in #690. For older versions, use `dynamo-run out=vllm0_7`.
-
- 16 Apr, 2025 1 commit
-
-
Ryan McCormick authored
-
- 14 Apr, 2025 1 commit
-
-
Xiaochuan Ye authored
Signed-off-by: yexiaochuan <tap91624@gmail.com>
-
- 07 Apr, 2025 1 commit
-
-
Graham King authored
As a first step towards KV routing:
- Introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet.
- Make the vllm engine publish the KV events received from our patched vllm.

Now we "just" need to connect the two. Easy right?
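The two initial router modes can be sketched like this (hypothetical Python for illustration; the real router is Rust, and the class name here is made up):

```python
import itertools
import random

class Router:
    """Pick a worker endpoint per request, per the --router-mode flag."""

    def __init__(self, endpoints, mode="round-robin"):
        self.endpoints = list(endpoints)
        self.mode = mode
        self._cycle = itertools.cycle(self.endpoints)

    def next_endpoint(self):
        if self.mode == "random":
            return random.choice(self.endpoints)
        return next(self._cycle)  # round-robin

router = Router(["worker-0", "worker-1", "worker-2"])
picks = [router.next_endpoint() for _ in range(4)]
```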
-
- 04 Apr, 2025 3 commits
-
-
Graham King authored
-
Graham King authored
Also upgrade the cargo resolver to v3, the default. New clippy lints:
- `next_back()` instead of `last()` for a double-ended iterator. That avoids walking the whole list.
- `repeat_n` instead of `repeat().take()`. That avoids cloning.
- Doc indenting
-
Graham King authored
Adds `@dynamo_worker(static = True)` to create a static worker, which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio.

This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint) and are discovered by ingress components using etcd.

Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere. For NIM.
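The naming difference can be illustrated like so (a hypothetical naming scheme, purely for illustration; dynamo's actual name format is not shown in this log):

```python
import uuid

def worker_name(namespace, component, endpoint, static=False):
    # A static worker's name is fully determined by the trio, so clients can
    # address it without etcd discovery. Dynamic workers get a random suffix
    # and must be discovered.
    base = f"{namespace}/{component}/{endpoint}"
    if static:
        return base
    return f"{base}-{uuid.uuid4().hex[:8]}"

static_name = worker_name("hello", "world", "generate", static=True)
dyn_a = worker_name("hello", "world", "generate")
dyn_b = worker_name("hello", "world", "generate")
```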
-
- 03 Apr, 2025 1 commit
-
-
Ryan Olson authored
Moved all of `lib/llm/src/engines` to their own crates, e.g. `lib/engines/mistralrs`. This will allow publishing of the `dynamo-llm` crate as it won't have any GitHub dependencies. The only engines in dynamo-llm will be the demo `echo` ones.

Co-authored-by: Graham King <grahamk@nvidia.com>
-
- 25 Mar, 2025 1 commit
-
-
Graham King authored
Put the arguments in a JSON file:
```
{
  "dtype": "half",
  "trust_remote_code": true
}
```
Pass it like this:
```
dynamo-run out=sglang ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json
```
Requested here https://github.com/ai-dynamo/dynamo/issues/290 (`dtype`) and here https://github.com/ai-dynamo/dynamo/issues/360 (`trust_remote_code`).
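Conceptually the flag just merges the JSON file over the engine's default arguments, something like this (illustrative Python; the function and the default keys are made up):

```python
import json
import tempfile

def load_extra_engine_args(path, defaults):
    """Values from the JSON file override the engine's default arguments."""
    with open(path) as f:
        extra = json.load(f)
    return {**defaults, **extra}

# Write an example file in the style of sglang_extra.json above.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"dtype": "half", "trust_remote_code": True}, f)

merged = load_extra_engine_args(f.name, {"dtype": "auto", "tp_size": 1})
```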
-
- 24 Mar, 2025 2 commits
-
-
Graham King authored
This lets us do:
```
dynamo-run out=llamacpp <gguf_file>
```
Previously a `--model-config <hf-repo>` was also required, to configure our tokenizer.
-
Graham King authored
That ensures it gets removed when the process stops.
-
- 21 Mar, 2025 1 commit
-
-
Olga Andreeva authored
Co-authored-by: Olga Andreeva <oandreeva@oandreeva-mlt.client.nvidia.com>
-
- 19 Mar, 2025 4 commits
-
-
Anant Sharma authored
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
-
Graham King authored
This makes the Rust parts all use the ring / rustls libraries instead of a local install of openssl. It's a step on the journey to being statically linked. Pieces:
- `tokenizers` and `mistralrs` now support rustls (mistralrs by default, tokenizers with a feature flag).
- Move shared dependencies up into the workspace.
- The new `rand` crate has some renames for future Rust.
- Ensure the dependency doesn't creep back in by enforcing it with cargo deny.
-
Graham King authored
Under load it sometimes drops a request. The request gets added to the batch (sequence) and immediately gets a FinishReason Stop. Not sure why. It doesn't happen with the default scheduler (non-paged attention), so switch to that for now.
-
Graham King authored
-
- 18 Mar, 2025 2 commits
-
-
Graham King authored
-
Dmitry Tokarev authored
Co-authored-by: Anant Sharma <anants@nvidia.com>
-
- 17 Mar, 2025 2 commits
-
-
Graham King authored
-
Graham King authored
Previously, several parts of the stack ensured that max tokens (for this single request) was set. Now only text input sets it (to 8k). Everything else leaves it as-is, potentially blank. The engines themselves have very small defaults: 16 for vllm and 128 for sglang. Also fix the dynamo-run CUDA startup message to only print if we're using an engine that would benefit from it (mistralrs, llamacpp).
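The resulting policy is roughly the following (a sketch, not dynamo-run's actual code):

```python
def effective_max_tokens(input_mode, requested):
    # Only interactive text input forces a default (8k). Every other input
    # mode passes the request through untouched, possibly with no value set,
    # in which case the engine's own small default applies (16 for vllm,
    # 128 for sglang).
    TEXT_DEFAULT = 8192
    if input_mode == "text" and requested is None:
        return TEXT_DEFAULT
    return requested
```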
-
- 15 Mar, 2025 1 commit
-
-
Graham King authored
```
dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
```
The file has genai format, one entry per line:
```
{"text": "the prompt"}
{"text": ..etc
```
The prompt is evaluated and the output written to `output.jsonl` in the same folder as the input. At the end of the run various statistics are printed:

> Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)

This is also helpful for pushing load into the system and stressing the various components. It is not intended for performance measurement; it's a batch inference tool.
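The batch loop's bookkeeping amounts to something like this (illustrative only; a whitespace split stands in for real tokenization, and the function is not dynamo-run's actual code):

```python
import io
import json

def run_batch(lines, generate):
    """Evaluate one {"text": ...} entry per line and tally token counts."""
    tokens_in = tokens_out = 0
    outputs = []
    for line in lines:
        prompt = json.loads(line)["text"]
        reply = generate(prompt)
        tokens_in += len(prompt.split())   # crude stand-in for tokenizing
        tokens_out += len(reply.split())
        outputs.append({"text": reply})
    return outputs, tokens_in, tokens_out

prompts = io.StringIO('{"text": "the prompt"}\n{"text": "another prompt here"}\n')
outputs, tokens_in, tokens_out = run_batch(prompts, lambda p: p.upper())
```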
-
- 14 Mar, 2025 4 commits
-
-
Graham King authored
Engines mistralrs, sglang and vllm are included by default. They can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`. Added `--feature vulkan` option, for llamacpp.

Build time message if CUDA or Metal would help and are missing. That's the best we can do:

> warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`

Runtime message if CUDA, Metal or Vulkan are enabled:

> 2025-03-14T21:59:26.501937Z INFO dynamo_run: CUDA on

Runtime message if they are missing:

> 2025-03-14T22:02:37.439404Z INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance

Default engine message includes available engines:

> 2025-03-14T21:59:26.503612Z INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok

The really important outcome is that this should now "just work":
```
cargo install dynamo-run
dynamo-run Qwen/Qwen2.5-3B-Instruct
```
Sadly you still need `--features cuda|metal` for performance; I couldn't automate that.
-
Ryan McCormick authored
-
Ryan McCormick authored
-
Graham King authored
- Mac doesn't have `pipe2` syscall so use plain `pipe`. - rtnetlink isn't a dependency on mac so don't use the type
-