Commits · 92f06b0e7ff03bd02cc6a56f9ba9258917dc9dae · OpenDAS / dynamo

30 Jun, 2025 2 commits

chore(dynamo-run): Refactor to library (#1687) · 92f06b0e

Graham King authored Jun 30, 2025

Move much of what was in the `dynamo-run` crate into `dynamo-llm` so that everyone can use it.

Example usage:

1. Create a `LocalModel`:

```
    let local_model = LocalModelBuilder::default()
        .model_path("Qwen/Qwen3-0.6B")
        .http_port(8080)
        .build().await?;
```

2. Make an engine:

```
    let engine_config = EngineConfig::StaticFull {
        engine: dynamo_engine_mistralrs::make_engine(&local_model).await?,
        model: Box::new(local_model),
    };
```

3. Connect it to an input and run it:

```
    dynamo_llm::entrypoint::input::run_input(Input::Http, runtime, engine_config).await?;
```

For https://github.com/ai-dynamo/dynamo/issues/1647

Code Rabbit summary, thanks:
  * Introduced a flexible builder pattern for local model configuration, allowing advanced customization and easier initialization.
  * Added new input modes and unified input handling, supporting interactive chat, HTTP server, batch file, and distributed endpoint modes.
  * Centralized engine configuration and routing, enabling more extensible and maintainable engine management.
  * Simplified and modularized the codebase by moving input and engine logic into dedicated modules.
  * Replaced direct model construction with an asynchronous builder for improved clarity and extensibility.
  * Streamlined configuration and validation for flags and router settings.
  * Added validation to prevent incompatible input and output combinations in endpoint and dynamic modes.

92f06b0e

refactor: Upgrade async-openai (#1693) · 82eae1fd
Paul Hendricks authored Jun 30, 2025

82eae1fd

26 Jun, 2025 2 commits
- feat: Add experimental WideEP + EPLB aggregated example for TRTLLM (#1652) · 5fe5a950
  Ryan McCormick authored Jun 27, 2025
  
  5fe5a950
- refactor: refactored using CompletionResponse (#1658) · e3f1bd5d
  Paul Hendricks authored Jun 26, 2025
  
  e3f1bd5d
25 Jun, 2025 4 commits
- feat: support batch `/completions` (#1626) · fc16a79b
  ishandhanani authored Jun 25, 2025
```
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
```
  fc16a79b
- fix: add missing await in vllm-v1 `clear_kv_blocks` endpoint (#1642) · 3e1a5534
  Will Killian authored Jun 25, 2025
```
Signed-off-by: Will Killian <wkillian@nvidia.com>
```
  3e1a5534
- fix: remove http endpoint for clearing kv blocks (#1629) · 2d3fb39f
  jain-ria authored Jun 25, 2025
  
  2d3fb39f
- feat: Add --version flag to dynamo-run (#1596) · bed8b335
  Nathan Barry authored Jun 25, 2025
  
  bed8b335
17 Jun, 2025 2 commits
- fix: Fix sample disagg config for trtllm standalone (#1566) · 65f2de5f
  Tanmay Verma authored Jun 17, 2025
  
  65f2de5f
- refactor: Log subprocess stderr as WARN (#1563) · ac4fd87b
  Ryan McCormick authored Jun 18, 2025
  
  ac4fd87b
12 Jun, 2025 3 commits
- feat: add endpoint to clear all kv blocks in vllm v1 (#1384) · d0d364e3
  jain-ria authored Jun 11, 2025
  
  d0d364e3
- fix: dynamo-run change python subprocess from debug to info (#1484) · a4600ba1
  Alec authored Jun 11, 2025
  
  a4600ba1
- fix: Python respects DYN_LOG too (#1486) · af1f1155
  Alec authored Jun 11, 2025
  
  af1f1155
10 Jun, 2025 1 commit
- chore: Default to pytorch backend in trtllm worker (#1445) · d83633b5
  Ryan McCormick authored Jun 10, 2025
  
  d83633b5
04 Jun, 2025 2 commits
- refactor: Rename CompletionRequest to NvCreateCompletionRequest (#1383) · c103d56a
  Paul Hendricks authored Jun 04, 2025
  
  c103d56a
- feat: add implementation for embeddings (#1290) · e83009a6
  Tom O'Brien authored Jun 04, 2025
  
  e83009a6
03 Jun, 2025 3 commits
- feat: Enable disagg support in trtllm standalone script (#1355) · ac53c0bb
  Tanmay Verma authored Jun 03, 2025
  
  ac53c0bb
- fix(dynamo-run): For internal comms use a random endpoint instead of hard coded (#1335) · 43991e76
  Graham King authored Jun 03, 2025
```
To talk to the vllm/sglang/trtllm engine we previously hardcoded an endpoint. The user never sees it so it doesn't matter which one.

However if you try to run _two_ instances of Dynamo on one machine they will conflict.

Use a UUID as the component name to resolve that.

Part of the solution for:
https://github.com/ai-dynamo/dynamo/issues/1073
```
  43991e76
- docs: Add documentation for verbosity flag in `dynamo-run` (#1353) · 9bf79b67
  Paul Hendricks authored Jun 03, 2025
  
  9bf79b67
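
The random-endpoint fix (#1335) in this group replaces a hard-coded internal component name with a UUID so that two instances on one machine do not collide. A minimal sketch of that idea follows. It uses only the standard library, so a random 64-bit hex suffix stands in for a real UUID, and the names `random_u64` and `internal_component_name` are hypothetical, not taken from the dynamo code base.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Returns a random-looking 64-bit value using only the standard library.
/// Assumption: this stands in for a UUID; real code would likely use a UUID crate.
fn random_u64() -> u64 {
    // Every `RandomState` is created with different hashing keys, so the
    // hash of "nothing" differs from one call to the next.
    RandomState::new().build_hasher().finish()
}

/// Builds a per-instance name for the internal engine component.
/// `prefix` is illustrative only; the user never sees this name.
fn internal_component_name(prefix: &str) -> String {
    format!("{}-{:016x}", prefix, random_u64())
}
```

Because each process (and each call) gets its own suffix, two instances started on the same machine would register under different names instead of conflicting on a shared one.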
02 Jun, 2025 4 commits

feat: Make llama.cpp Gnu OpenMP dependency optional (#1331) · d3ca7661
Graham King authored Jun 02, 2025
```
Do not include it by default as it needs libgomp1 at runtime. Add a feature to enable it at build time.
```
d3ca7661

fix: Allow building only llamacpp or only mistralrs engine. (#1328) · 9907d104

Graham King authored Jun 02, 2025

This allows building:
- only the `mistral.rs` engine: `--no-default-features --features mistralrs`
- or only the `llama.cpp` engine: `--no-default-features --features llamacpp`.

Since llama.cpp became a default engine, we had only tested building both engines at once. The docs already said single-engine builds were supported, but some combinations of Rust features did not build. This is the fix.

9907d104

feat: expose router configurations to dynamo-run (#1259) · d849f7ec
Hongkuan Zhou authored Jun 02, 2025

d849f7ec
chore: Remove PreprocessedRequest alias BackendInput (#1307) · 3f6a7472
Graham King authored Jun 02, 2025
```
It was confusing to have two names for one type.

This tidy-up started in #1064 and is now complete.
```
3f6a7472

30 May, 2025 1 commit
- refactor: rename KvMetricsPublisher to WorkerMetricsPublisher (#1284) · 2f8da9ad
  Alec authored May 30, 2025
  
  2f8da9ad
29 May, 2025 6 commits
- feat(dynamo-run): Use llama.cpp as the default engine for GGUF (#1276) · 3e3c3b10
  Graham King authored May 29, 2025
```
Previously `mistral.rs` was the default engine for both safetensors and GGUF models. Now it is only the default for safetensors, `llama.cpp` becomes the default for GGUF.

Why?

- Since #1177 `llama.cpp` is built-in by default, so we can switch.
- `llama.cpp` is very very good at running GGUF (but can't run other types of model), so we should switch.

Dynamo's multi-engine support gives us a secret super-power: we can use the best engine for this specific format or model.

We can still run GGUF with mistralrs by doing `out=mistralrs`.
```
  3e3c3b10
- feat: Publish events and metrics when using kv routing (#1262) · f9ba6f5c
  Tanmay Verma authored May 29, 2025
  
  f9ba6f5c
- fix: Renamed event publisher classes and configuration (#1273) · f67dc38b
  Alec authored May 29, 2025
  
  f67dc38b
- chore: Make llama.cpp a default engine (#1177) · b889948c
  Graham King authored May 29, 2025
  
  b889948c
- feat: add KV Event Publishing to vLLM v1 (#1181) · 0df6d462
  Alec authored May 29, 2025
  
  0df6d462
- fix: Import json when using --engine-extra-args (#1261) · 8d324489
  jthomson04 authored May 28, 2025
  
  8d324489
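
The default-engine change (#1276) in this group can be pictured as a small selection rule: GGUF models go to `llama.cpp`, everything else to `mistral.rs`, and an explicit `out=` choice overrides the default. The sketch below is illustrative only. The `Engine` enum and both functions are hypothetical names, and detecting GGUF by file extension is an assumption, not necessarily how `dynamo-run` identifies the format.

```rust
use std::path::Path;

/// Hypothetical stand-in for the engines named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Engine {
    MistralRs,
    LlamaCpp,
}

/// Picks the default engine for a model path.
/// Assumption: a `.gguf` extension (any case) marks a GGUF model.
fn default_engine(model_path: &Path) -> Engine {
    let is_gguf = model_path
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("gguf"))
        .unwrap_or(false);
    if is_gguf {
        Engine::LlamaCpp
    } else {
        Engine::MistralRs
    }
}

/// An explicit request (the `out=mistralrs` case) wins over the default.
fn choose_engine(model_path: &Path, requested: Option<Engine>) -> Engine {
    requested.unwrap_or_else(|| default_engine(model_path))
}
```

Under these assumptions, `choose_engine(Path::new("model.gguf"), None)` selects `llama.cpp`, while passing `Some(Engine::MistralRs)` keeps GGUF on `mistral.rs`, matching the override described in the commit message.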
28 May, 2025 3 commits

feat(dynamo-llm): Remove bring-your-own-engine (#1216) · 0a1d1fbe

Graham King authored May 28, 2025

It was removed from the docs in 0.2.1 and replaced with writing a [standalone Python engine](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_run.md#writing-your-own-engine-in-python).

Also remove the associated `dynamo-run` feature `python`.

Releasing this in 0.3.0 will resolve #784 and #1109.

0a1d1fbe

feat: Enable dynamo-run out=trtllm (#1223) · 1b1e089a
Tanmay Verma authored May 28, 2025

1b1e089a
fix: dynamo-run pass proper args using register-llm (#1230) · cc40af70
Alec authored May 28, 2025

cc40af70

27 May, 2025 1 commit
- feat: Add metrics and event publishers (#1192) · 9acaa8d1
  Tanmay Verma authored May 27, 2025
  
  9acaa8d1
22 May, 2025 3 commits

feat: Add standalone script for TRTLLM integration into dynamo-run (#1162) · 3d4fe574
Tanmay Verma authored May 22, 2025

3d4fe574

feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32

Graham King authored May 22, 2025

Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```

In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card.

Previously hard coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170

183f2b32
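
To make the effect of `--kv-cache-block-size` concrete, here is a small arithmetic sketch. It assumes the block size counts tokens per KV cache block, so a sequence occupies the ceiling of `tokens / block_size` blocks. The constant and function names are hypothetical and are not taken from the dynamo code base.

```rust
/// Default from the commit message: previously hard coded to 16.
const DEFAULT_KV_CACHE_BLOCK_SIZE: usize = 16;

/// Number of KV cache blocks needed to hold `num_tokens` tokens.
/// Returns `None` for a block size of zero, which is not meaningful.
fn blocks_for_tokens(num_tokens: usize, block_size: usize) -> Option<usize> {
    if block_size == 0 {
        return None;
    }
    // Ceiling division written to avoid overflow for large token counts.
    Some(num_tokens / block_size + usize::from(num_tokens % block_size != 0))
}
```

With the default of 16, a 100-token sequence would need 7 blocks (the last one partly filled). With `--kv-cache-block-size 64` the same sequence would need 2.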

feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821

Graham King authored May 22, 2025

Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.

Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.

Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.

6d5da821
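
The first future todo above (keeping each request's `max_tokens` within the context length) could look roughly like the sketch below. This is one interpretation, not the planned implementation: it assumes the limit applies to prompt tokens plus generated tokens, and `clamp_max_tokens` is a hypothetical helper name.

```rust
/// Limits the number of tokens to generate so that prompt plus output
/// fits in `context_length`.
///
/// Returns `None` when the prompt alone already fills or exceeds the
/// context, meaning there is no room left to generate anything.
fn clamp_max_tokens(
    requested: Option<u32>,
    prompt_tokens: u32,
    context_length: u32,
) -> Option<u32> {
    // `checked_sub` yields `None` if the prompt is longer than the context.
    let remaining = context_length.checked_sub(prompt_tokens)?;
    if remaining == 0 {
        return None;
    }
    // With no explicit request, allow everything that still fits.
    Some(requested.map_or(remaining, |r| r.min(remaining)))
}
```

For example, with `--context-length 2048` and a 1000-token prompt, a request for 4096 tokens would be reduced to 1048, while a request for 10 tokens would pass through unchanged.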

21 May, 2025 3 commits
- fix(dynamo-run): Don't exit interactive chat on error (#1155) · b226b7b0
  Graham King authored May 21, 2025
```
Previously any error would cause us to halt. Most of them are recoverable. So now we print the error and return to the prompt.
```
  b226b7b0
- fix(llmctl): Use ModelWatcher instead of direct etcd operations (#1150) · 3e8e38a9
  Graham King authored May 21, 2025
  
  3e8e38a9
- fix: register model after engine load (#1145) · 08c01d8c
  Neelay Shah authored May 21, 2025
  
  08c01d8c
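
The interactive-chat fix (#1155) in this group changes the loop from "halt on any error" to "report the error and return to the prompt". A minimal sketch of that control flow is below. It is not the `dynamo-run` code: prompts come from a slice rather than a terminal, and `run_chat` and its `respond` callback are hypothetical names.

```rust
/// Runs every prompt through `respond`, collecting one transcript line
/// per prompt. An error is printed and recorded, then the loop continues
/// with the next prompt instead of stopping.
fn run_chat<F>(prompts: &[&str], mut respond: F) -> Vec<String>
where
    F: FnMut(&str) -> Result<String, String>,
{
    let mut transcript = Vec::new();
    for &prompt in prompts {
        match respond(prompt) {
            Ok(reply) => transcript.push(reply),
            Err(err) => {
                // Recoverable: show the error and go back to the prompt.
                eprintln!("error: {}", err);
                transcript.push(format!("error: {}", err));
            }
        }
    }
    transcript
}
```

The key design point is that the `Err` arm does not return or propagate with `?`, so one failed request does not end the session.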