- 27 May, 2025 10 commits
-
-
ishandhanani authored
Signed-off-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
-
mohammedabdulwahhab authored
Co-authored-by: Anna Tchernych <atchernych@nvidia.com>
-
kYLe authored
-
Shuaiyi Zhang authored
Signed-off-by: Shuaiyi Zhang <zhangsy28@lenovo.com>
Co-authored-by: Shuaiyi Zhang <zhangsy28@lenovo.com>
Co-authored-by: Yan Ru Pei <yanrpei@gmail.com>
-
Akash authored
Signed-off-by: Akash <akpaul@nvidia.com>
-
ishandhanani authored
-
mohammedabdulwahhab authored
-
J Wyman authored
-
Tanmay Verma authored
-
Hyogeun Oh (오효근) authored
Signed-off-by: Hyogeun Oh <ohg3417@gmail.com>
-
- 24 May, 2025 1 commit
-
-
jthomson04 authored
-
- 23 May, 2025 8 commits
-
-
Kris Hung authored
-
Hongkuan Zhou authored
-
Yan Ru Pei authored
-
Graham King authored
-
Yan Ru Pei authored
Signed-off-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Ryan Olson <ryanolson@users.noreply.github.com>
-
julienmancuso authored
-
hhzhang16 authored
-
Ryan Olson authored
-
- 22 May, 2025 11 commits
-
-
julienmancuso authored
-
Tanmay Verma authored
-
Graham King authored
Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```
In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card.
Previously hard coded to 16, which is now the default.
- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
-
Hongkuan Zhou authored
Co-authored-by: root <root@kkranen-dt.nvidia.com>
-
Graham King authored
Removed the hard-coded sleeps and explained what we're testing. Closes https://github.com/ai-dynamo/dynamo/issues/1132
The race condition is that `apply_event` sends a message on a channel; it does not apply the event directly. At some later point the tokio runtime schedules the task running the channel receiver, which applies the event. If that had not happened by the time the test checked the result, the test would fail.
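The race described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the project's test or its fix: it uses a standard-library thread and `std::sync::mpsc` in place of a tokio task, and `Indexer`, `applied_count`, and `wait_until` are hypothetical names introduced for illustration. It shows why a check made immediately after `apply_event` can observe stale state, and one way a test can wait on the observable effect with a bounded deadline rather than a fixed sleep.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a component whose `apply_event` only enqueues work.
pub struct Indexer {
    tx: mpsc::Sender<u64>,
    applied: Arc<Mutex<Vec<u64>>>,
}

impl Indexer {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::channel::<u64>();
        let applied = Arc::new(Mutex::new(Vec::new()));
        let sink = Arc::clone(&applied);
        // Background receiver: this is where events are actually applied.
        // The loop ends once every sender has been dropped.
        thread::spawn(move || {
            for event in rx {
                sink.lock().unwrap().push(event);
            }
        });
        Indexer { tx, applied }
    }

    /// Sends the event on the channel and returns immediately.
    /// The event has NOT necessarily been applied when this returns.
    pub fn apply_event(&self, event: u64) {
        self.tx.send(event).expect("receiver thread is alive");
    }

    /// Number of events the receiver has applied so far.
    pub fn applied_count(&self) -> usize {
        self.applied.lock().unwrap().len()
    }
}

/// Polls `cond` until it returns true or `timeout` elapses.
/// Returns whether the condition was eventually observed.
pub fn wait_until(timeout: Duration, mut cond: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if cond() {
            return true;
        }
        thread::sleep(Duration::from_millis(1));
    }
    cond()
}
```

A test built on this would call `apply_event` and then assert on `wait_until(timeout, || indexer.applied_count() == 1)`, so it passes as soon as the receiver has run and fails only if the deadline expires.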
-
jthomson04 authored
-
Kyle McGill authored
-
Hongkuan Zhou authored
-
Graham King authored
Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.
Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.
Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
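The first todo item could look roughly like the sketch below. It is an illustration only: `clamp_max_tokens` is a hypothetical helper, not a function in the pre-processor, and subtracting the prompt length from the context length is an assumption about how the limit would be enforced (the commit message only says `max_tokens` should stay below the context length).

```rust
/// Hypothetical helper: cap a request's `max_tokens` so that prompt plus
/// generated tokens fit inside `context_length`.
///
/// Returns `None` when the prompt alone already fills or exceeds the context,
/// meaning there is no room left to generate anything.
pub fn clamp_max_tokens(
    context_length: u32,
    prompt_tokens: u32,
    requested: Option<u32>,
) -> Option<u32> {
    // Tokens still available for generation; None if the prompt is too long.
    let budget = context_length.checked_sub(prompt_tokens)?;
    if budget == 0 {
        return None;
    }
    // With no explicit request, allow the whole remaining budget.
    Some(requested.map_or(budget, |r| r.min(budget)))
}
```

A caller would treat `None` as a request to reject, and otherwise store the returned value as the request's effective limit.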
-
jmswen authored
-
Suman Tatiraju authored
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
-
- 21 May, 2025 10 commits
-
-
Graham King authored
-
Graham King authored
Previously any error would cause us to halt. Most of them are recoverable. So now we print the error and return to the prompt.
-
Graham King authored
-
mohammedabdulwahhab authored
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>
-
Neelay Shah authored
-
Suman Tatiraju authored
Signed-off-by: Suman Tatiraju <167138127+statiraju@users.noreply.github.com>
Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>
-
Biswa Panda authored
-
Graham King authored
- Stop advertising a model when its last instance stops. Previously this happened when any instance stopped.
- Faster locks on model manager.
- Move discovery code out of http, as it is used by all inputs.
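The first change above amounts to counting instances per model. A minimal sketch of that idea follows, assuming a simple in-memory map; `ModelRegistry` and its methods are hypothetical names for illustration and do not reflect the actual model manager or its locking.

```rust
use std::collections::HashMap;

/// Hypothetical registry tracking how many instances serve each model.
#[derive(Default)]
pub struct ModelRegistry {
    instances: HashMap<String, usize>,
}

impl ModelRegistry {
    /// Record that one more instance of `model` has started.
    pub fn add_instance(&mut self, model: &str) {
        *self.instances.entry(model.to_string()).or_insert(0) += 1;
    }

    /// Record that one instance of `model` has stopped.
    /// Returns true only when this was the last instance, i.e. the model
    /// should no longer be advertised.
    pub fn remove_instance(&mut self, model: &str) -> bool {
        let remaining = match self.instances.get_mut(model) {
            Some(count) => {
                // Entries are removed at zero, so count is at least 1 here.
                *count -= 1;
                *count
            }
            None => return false, // unknown model: nothing to stop advertising
        };
        if remaining == 0 {
            self.instances.remove(model);
            true
        } else {
            false
        }
    }

    /// A model is advertised while at least one instance remains.
    pub fn is_advertised(&self, model: &str) -> bool {
        self.instances.contains_key(model)
    }
}
```

With two instances registered, the first `remove_instance` call leaves the model advertised and only the second one withdraws it, which is the behaviour the commit describes.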
-
Yan Ru Pei authored
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
-
Tanmay Verma authored
-