- 04 Jun, 2025 2 commits
-
-
Kristen Kelleher authored
Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>
- Content, format, and structural changes to the Dynamo docs for 0.3.0.
- Includes copyediting and the first batch of changes from the DMO review.
-
jthomson04 authored
-
- 03 Jun, 2025 2 commits
-
-
Abrar Shivani authored
This PR modifies the mistralrs engine to ensure that the maximum output token length never exceeds the context length provided.
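The cap described above can be sketched as follows. This is a minimal illustration only, not the engine's actual code: the function name `clamp_max_tokens` and the optional `prompt_len` parameter are hypothetical, and subtracting the prompt length is an assumption that goes beyond what the commit message states.

```python
from typing import Optional


def clamp_max_tokens(
    requested: Optional[int],
    context_length: int,
    prompt_len: int = 0,  # assumption: optionally reserve room for the prompt
) -> int:
    """Return an output-token limit that never exceeds the context length."""
    # Tokens still available in the context window (never negative).
    available = max(context_length - prompt_len, 0)
    if requested is None:
        # No explicit limit requested: fall back to whatever fits.
        return available
    return min(requested, available)
```

With `prompt_len` left at 0 this reduces to `min(requested, context_length)`, which is the behaviour the message describes.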
-
Hongkuan Zhou authored
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: jothomson <jwillthomson19@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
- 02 Jun, 2025 3 commits
-
-
Graham King authored
Do not include by default as it needs libgomp1 at runtime. Add a feature to enable it at build time.
-
Hongkuan Zhou authored
-
Graham King authored
It was confusing to have two names for one type. This tidy-up, started in #1064, is now complete.
-
- 30 May, 2025 4 commits
-
-
jain-ria authored
-
Graham King authored
Unify them with all our other logs, so we can filter with DYN_LOG, and so they will eventually go to the log aggregation, etc.
-
Alec authored
-
jthomson04 authored
-
- 29 May, 2025 10 commits
-
-
Graham King authored
Previously `mistral.rs` was the default engine for both safetensors and GGUF models. Now it is the default only for safetensors; `llama.cpp` becomes the default for GGUF.

Why?
- Since #1177 `llama.cpp` is built in by default, so we can switch.
- `llama.cpp` is very good at running GGUF (but can't run other types of model), so we should switch.

Dynamo's multi-engine support gives us a secret super-power: we can use the best engine for each specific format or model. We can still run GGUF with mistralrs by passing `out=mistralrs`.
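The selection rule above can be sketched roughly as below. The engine names `llamacpp` and `mistralrs` are the `out=` values used elsewhere in this log; the function names and the detection of GGUF by file extension are assumptions for illustration, not how `dynamo-run` necessarily decides.

```python
from pathlib import Path
from typing import Optional


def default_engine(model_path: str) -> str:
    """Pick a default engine from the model format (hypothetical helper)."""
    # Assumption: a GGUF model is recognised by its ".gguf" file extension.
    if Path(model_path).suffix.lower() == ".gguf":
        return "llamacpp"
    # Anything else (for example a safetensors directory) keeps mistralrs.
    return "mistralrs"


def pick_engine(model_path: str, out: Optional[str] = None) -> str:
    """An explicit out=<engine> choice always overrides the default."""
    return out if out is not None else default_engine(model_path)
```

Passing `out="mistralrs"` for a GGUF file still selects mistralrs, matching the override described in the message.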
-
jthomson04 authored
-
Alec authored
-
jthomson04 authored
-
Graham King authored
- Add Granite to our tokenizer
- Fix pre-processor to load context length correctly
- Add strftime_now Jinja function for prompt templates
- Update llama.cpp
- Handle trtllm errors when not using trtllm

Support depends on the engine:
- `mistral.rs`, our default engine, doesn't support Granite yet.
- `llama.cpp` does and works very well:
```
dynamo-run out=llamacpp ~/llms/granite-3.3-2b-instruct-Q4_K_M.gguf --context-length 16384
```
- `vllm` also works very well:
```
dynamo-run in=http out=vllm ~/llms/granite-3.3-2b-instruct --context-length 16384
```
- `sglang` mostly works, but it doesn't catch the stop token, so we catch it in the HTTP ingress and log an error. The Text ingress doesn't catch it because I disabled it to make the raw echo engine work. A bit of work to do here.

Closes: #1245
-
Ryan Olson authored
-
Jacky authored
-
Anant Sharma authored
-
Hongkuan Zhou authored
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
-
Alec authored
-
- 28 May, 2025 6 commits
-
-
Hongkuan Zhou authored
-
Graham King authored
Fixes #286
-
Graham King authored
It was removed from the docs in 0.2.1 and replaced with writing a [standalone Python engine](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_run.md#writing-your-own-engine-in-python). Also remove the associated `dynamo-run` feature `python`. Releasing this in 0.3.0 will resolve #784 and #1109.
-
Tanmay Verma authored
-
Alec authored
-
Alec authored
-
- 27 May, 2025 1 commit
-
-
ishandhanani authored
-
- 24 May, 2025 1 commit
-
-
jthomson04 authored
-
- 23 May, 2025 4 commits
-
-
Yan Ru Pei authored
-
Graham King authored
-
Yan Ru Pei authored
Signed-off-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Ryan Olson <ryanolson@users.noreply.github.com>
-
Ryan Olson authored
-
- 22 May, 2025 6 commits
-
-
Graham King authored
Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```
In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card. Previously hard coded to 16, which is now the default.

- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
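As a rough picture of how the worker's settings could travel to ingress, here is a sketch of a card carrying both values. The class, its field names, and the JSON encoding are all hypothetical and are not the real Model Deployment Card schema; only the default of 16 comes from the message above.

```python
import json
from dataclasses import asdict, dataclass

# From the commit message: 16 was previously hard coded and is now the default.
DEFAULT_KV_CACHE_BLOCK_SIZE = 16


@dataclass
class ModelDeploymentCard:
    """Hypothetical stand-in for the card; field names are assumptions."""

    context_length: int
    kv_cache_block_size: int = DEFAULT_KV_CACHE_BLOCK_SIZE

    def to_json(self) -> str:
        # Worker side: serialise the card so it can be published.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "ModelDeploymentCard":
        # Ingress side: rebuild the card and read the worker's settings.
        return cls(**json.loads(text))
```

A worker started with `--kv-cache-block-size 64` would publish a card with that value, and ingress would read 64 back instead of assuming 16.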
-
Graham King authored
Removed the hard coded sleeps, explained what we're testing. Closes https://github.com/ai-dynamo/dynamo/issues/1132

The race condition is that `apply_event` sends a message on a channel; it does not directly apply the event. At some later point the tokio runtime schedules the task running the channel receiver, which applies the event. If that had not happened yet, the test would fail.
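The same shape of race can be reproduced outside tokio. The sketch below is an analogy in Python's `asyncio`, not the actual test: the `Indexer` class and its `flush` method are hypothetical, and waiting on the queue is just one possible way to replace a sleep with an explicit synchronisation point.

```python
import asyncio


class Indexer:
    """Hypothetical indexer: apply_event only enqueues, a task applies later."""

    def __init__(self) -> None:
        self.applied = []
        self._queue = asyncio.Queue()

    def apply_event(self, event: str) -> None:
        # Sends the event on the channel; does not apply it directly.
        self._queue.put_nowait(event)

    async def run(self) -> None:
        # Receiver task: applies events whenever the runtime schedules it.
        while True:
            event = await self._queue.get()
            self.applied.append(event)
            self._queue.task_done()

    async def flush(self) -> None:
        # Wait until every queued event has been applied (no sleep needed).
        await self._queue.join()


async def demo():
    indexer = Indexer()
    worker = asyncio.create_task(indexer.run())
    indexer.apply_event("stored-block-1")
    before = list(indexer.applied)  # receiver has not been scheduled yet
    await indexer.flush()
    after = list(indexer.applied)
    worker.cancel()
    return before, after
```

Asserting right after `apply_event` is the failing case described above; asserting after `flush` is deterministic.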
-
jthomson04 authored
-
Graham King authored
Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context. Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.

Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
-
jmswen authored
-
Suman Tatiraju authored
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
-
- 21 May, 2025 1 commit
-
-
Graham King authored
-