- 08 May, 2025 1 commit
Graham King authored

- New mistralrs and llamacpp version
- mistralrs: Handle Gemma 3 and Llama 4 as vision models
- Update the dynamo-run docs to use Qwen 3
- Our pre-processor now supports Llama 4's newer multi-modal `config.json`
- Upgrade minijinja to handle Qwen 3's prompt template

For Llama 4 we'll need to limit the max seq len. vllm says:

> To serve at least one request with the model's max seq len (10485760), 240.00 GiB KV cache is needed,...

I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
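The vllm numbers quoted above imply a fixed KV-cache cost per token, which is easy to sanity-check (a quick sketch; the function name is mine, not from vllm):

```python
# Sanity-check the vllm warning: 240.00 GiB of KV cache spread over a
# 10,485,760-token max seq len is a fixed per-token cost.
def kv_bytes_per_token(total_gib: float, max_seq_len: int) -> float:
    return total_gib * 2**30 / max_seq_len

per_token = kv_bytes_per_token(240.0, 10_485_760)
print(per_token)  # 24576.0 bytes, i.e. 24 KiB of KV cache per token
```

At 24 KiB per token, capping the max seq len is what brings the KV-cache reservation down to something a real GPU can hold.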
- 28 Apr, 2025 1 commit
Olga Andreeva authored
Signed-off-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
- 24 Apr, 2025 1 commit
Abrar Shivani authored
Send a warm-up request to the mistralrs engine so that subsequent requests are faster.
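The warm-up pattern can be sketched as follows (illustrative names only; the real mistralrs engine API differs):

```python
# Sketch of the warm-up pattern: issue one small dummy request at startup
# so that one-time initialization (kernel compilation, cache allocation,
# etc.) happens before the first real user request arrives.
class Engine:
    def __init__(self) -> None:
        self.warmed_up = False

    def generate(self, prompt: str, max_tokens: int = 16) -> str:
        # The first call pays the one-time initialization cost.
        self.warmed_up = True
        return f"response to: {prompt}"

def build_engine() -> Engine:
    engine = Engine()
    engine.generate("Hi", max_tokens=1)  # warm-up request; output discarded
    return engine

engine = build_engine()
assert engine.warmed_up  # later requests skip the cold-start cost
```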
- 03 Apr, 2025 1 commit
Ryan Olson authored
Moved all of `lib/llm/src/engines` to their own crates, e.g. `lib/engines/mistralrs`. This will allow publishing the `dynamo-llm` crate, as it won't have any GitHub dependencies. The only engines left in dynamo-llm are the demo `echo` ones.

Co-authored-by: Graham King <grahamk@nvidia.com>
- 19 Mar, 2025 1 commit
Graham King authored
Under load it sometimes drops a request: the request gets added to the batch (sequence) and immediately gets a FinishReason of Stop. Not sure why. It doesn't happen with the default scheduler (non-paged attention), so switch to that for now.
- 13 Mar, 2025 1 commit
Graham King authored
Previously we tokenized the output ourselves and counted tokens, stopping when max tokens was reached. Now we let the mistral.rs engine do it, which saves the extra tokenization step. Also, dynamo-run's help message now lists which engines are compiled in, plus some minor lint fixes.
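The old client-side stopping logic can be sketched like this (illustrative pseudocode, not the actual Dynamo code):

```python
# Sketch of the old approach: re-tokenize each generated chunk and stop
# once the running count would exceed max_tokens. The per-chunk tokenize
# call is the extra work the engine-side limit now avoids.
def generate_with_limit(stream, tokenize, max_tokens: int) -> str:
    emitted = 0
    out = []
    for chunk in stream:
        n = len(tokenize(chunk))  # extra tokenization step per chunk
        if emitted + n > max_tokens:
            break
        emitted += n
        out.append(chunk)
    return "".join(out)

# Toy usage: a fake "stream" of text chunks and a whitespace tokenizer.
text = generate_with_limit(["a b ", "c d ", "e f "], str.split, max_tokens=4)
print(text)  # "a b c d " -- stops before exceeding 4 tokens
```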
- 11 Mar, 2025 1 commit
Graham King authored
- Latest from repo, many improvements
- Support most of the OpenAI request features (temperature, top_p, etc.)
- Download models from Hugging Face if necessary
- 08 Mar, 2025 1 commit
Neelay Shah authored
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
- 05 Mar, 2025 2 commits
Graham King authored
Fixes a panic.
Neelay Shah authored
Co-authored-by: Graham King <grahamk@nvidia.com>
- 27 Feb, 2025 2 commits
Graham King authored
Docs in README
Paul Hendricks authored
- 26 Feb, 2025 1 commit
Paul Hendricks authored
Co-authored-by: Graham King <grahamk@nvidia.com>
- 25 Feb, 2025 1 commit
Neelay Shah authored
Signed-off-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
- 20 Feb, 2025 1 commit
Graham King authored
You can now run an HF repo directly:

```
tio ~/llm_models/Llama-3.2-1B-Instruct/
```

or a GGUF:

```
tio ~/llm_models/Llama-3.2-1B-Instruct-Q4_K_M.gguf
```

Also cleaned up kv_router so I can merge.
- 14 Feb, 2025 2 commits
Graham King authored
Upgrade mistralrs to latest.
Graham King authored
This allows us to run a real model.

Build:

```
cargo build --release --features mistralrs,cuda
```

Run:

```
./target/release/tio in=text out=mistralrs --model-path Llama-3.2-1B-Instruct-Q4_K_M.gguf
```

Why [mistral.rs](https://github.com/EricLBuehler/mistral.rs)?

- It has no dependencies. You don't need a container or a virtual env to get started.
- It supports CUDA, Metal (macOS) and CPU-only. Everyone can join the AI revolution.
- It starts fast and serves fast (with CUDA). That makes it fun to experiment with.
- It runs many models, not just Mistral; that's just its name.