- 14 May, 2025 7 commits
-
-
Harry Kim authored
Signed-off-by: Harry Kim <harry_kim@live.com>
-
Yan Ru Pei authored
-
Graham King authored
For #1006. Prints this on startup:
```
2025-05-09T13:06:34.529Z DEBUG dynamo_run::input::http: Supported routes: ["GET /metrics", "GET /dynamo/alpha/list-models", "GET /v1/models", "POST /v1/chat/completions", "POST /v1/completions"]
```
-
wxsm authored
Add max_age to the NATS stream when it is created; 10 minutes should be more than enough for prefill workers to consume the messages. This prevents a system crash when NATS JetStream hits its disk limit because of endlessly accumulating messages.
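The effect of an age limit can be sketched with a small in-memory queue. This is a toy simulation, not the NATS API: the class `AgeLimitedQueue`, its methods, and the message names are all hypothetical, and only the 10-minute limit comes from the commit message.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_AGE_SECONDS = 10 * 60  # the 10-minute limit from the commit message


@dataclass
class AgeLimitedQueue:
    """Toy stand-in for a stream with an age limit (hypothetical, not NATS)."""

    max_age: float
    # (timestamp, payload) pairs, oldest first
    _items: deque = field(default_factory=deque)

    def publish(self, payload: str, now: float) -> None:
        self._expire(now)
        self._items.append((now, payload))

    def pending(self, now: float) -> list[str]:
        self._expire(now)
        return [payload for _, payload in self._items]

    def _expire(self, now: float) -> None:
        # Drop messages from the front while they are older than max_age,
        # so unconsumed messages cannot accumulate without bound.
        while self._items and now - self._items[0][0] > self.max_age:
            self._items.popleft()


queue = AgeLimitedQueue(max_age=MAX_AGE_SECONDS)
queue.publish("prefill-1", now=0.0)
queue.publish("prefill-2", now=300.0)
remaining = queue.pending(now=700.0)  # "prefill-1" is now older than 600 s
```

Without the age limit, a message that no worker ever consumes would stay on disk indefinitely; with it, storage is bounded by the publish rate times the retention window.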
-
julienmancuso authored
-
GuanLuo authored
-
ishandhanani authored
Co-authored-by: ishandhanani <ishandhananai@gmail.com>
-
- 13 May, 2025 4 commits
-
-
Tanmay Verma authored
-
Anant Sharma authored
-
Tanmay Verma authored
-
Anant Sharma authored
-
- 12 May, 2025 3 commits
-
-
Hongkuan Zhou authored
-
Anant Sharma authored
-
Hongkuan Zhou authored
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
-
- 09 May, 2025 11 commits
-
-
ishandhanani authored
Signed-off-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
Ryan Olson authored
-
Graham King authored
Example of how to connect a Python sglang engine to the message bus (NATS/etc). In this example sglang does the pre/post processing. There is already an example where Dynamo does it.

The examples teach this:
- Be a chat completions engine, do your own pre-processing:
```
await register_llm(ModelType.Chat, endpoint, config.model)
```
- Have Dynamo do pre-processing. It will register us under both Chat and Completions endpoints, because that's handled before a Backend engine gets the request:
```
await register_llm(ModelType.Backend, endpoint, config.model)
```
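The difference between the two registrations can be illustrated with a small stand-in. This is only a sketch of the behaviour described above, not Dynamo's implementation: the `ModelType` enum here is a local placeholder, `routes_for` is a hypothetical helper, and the route strings are borrowed from the startup log quoted elsewhere in this history.

```python
from enum import Enum


class ModelType(Enum):
    # Local placeholder for illustration only, not Dynamo's own enum.
    Chat = "chat"
    Backend = "backend"


def routes_for(model_type: ModelType) -> list[str]:
    """Hypothetical helper: which HTTP routes an engine ends up serving."""
    if model_type is ModelType.Chat:
        # The engine does its own pre-processing and serves chat only.
        return ["POST /v1/chat/completions"]
    # Pre-processing happens before the Backend engine sees the request,
    # so the same engine can sit behind both endpoints.
    return ["POST /v1/chat/completions", "POST /v1/completions"]


chat_routes = routes_for(ModelType.Chat)
backend_routes = routes_for(ModelType.Backend)
```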
-
Graham King authored
-
Graham King authored
-
Graham King authored
That avoids passing the `--model-config` param to dynamo-run when using llamacpp.
-
Harrison Saturley-Hall authored
-
wxsm authored
Allow either password or TLS auth; if neither is provided, fall back to no auth. Closes #657
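The fallback order can be sketched as a small selection function. The names `AuthOptions` and `select_auth` and all field names are hypothetical. Preferring password over TLS when both are supplied is an assumption, since the commit message does not say which takes precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthOptions:
    # All field names are hypothetical, chosen for illustration.
    username: Optional[str] = None
    password: Optional[str] = None
    tls_cert: Optional[str] = None
    tls_key: Optional[str] = None


def select_auth(opts: AuthOptions) -> str:
    """Return the auth mode to use: "password", "tls", or "none"."""
    if opts.username and opts.password:
        return "password"  # assumed precedence when both are configured
    if opts.tls_cert and opts.tls_key:
        return "tls"
    return "none"  # nothing configured: fall back to no auth


mode_default = select_auth(AuthOptions())
mode_tls = select_auth(AuthOptions(tls_cert="client.crt", tls_key="client.key"))
```

Returning a single mode keeps the connection code free of partially configured states, for example a username without a password.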
-
Biswa Panda authored
-
ishandhanani authored
Co-authored-by: ishandhanani <ishandhananai@gmail.com>
-
Adit Ranadive authored
NIXL uses UCX, which will support EFA starting with 1.19. Explicitly use the 1.19 branch of UCX with Dynamo.
Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
-
- 08 May, 2025 9 commits
-
-
Hongkuan Zhou authored
-
julienmancuso authored
Co-authored-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
-
hhzhang16 authored
-
Graham King authored
. New mistralrs and llamacpp version
. mistralrs: Handle Gemma 3 and Llama 4 as vision models
. Update the dynamo-run docs to use Qwen 3
. Our pre-processor now supports Llama 4's newer multi-modal `config.json`
. Upgrade minijinja to handle Qwen 3's prompt template

For Llama 4 we'll need to limit the max seq len. vllm says:
> To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed,...

I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
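As a rough guide to how far the max seq len has to come down, the two figures in the vllm message give a per-token KV cache cost. This is a back-of-the-envelope sketch that assumes KV cache size scales linearly with sequence length and ignores model weights and activations. The helper `max_seq_len_for_budget` and the 24 GiB budget are illustrative, not taken from the commit.

```python
GIB = 1024 ** 3

# Figures quoted in the vllm message above.
quoted_max_seq_len = 10_485_760
quoted_kv_cache_bytes = 240 * GIB

# Implied KV cache cost per token, assuming linear scaling.
bytes_per_token = quoted_kv_cache_bytes // quoted_max_seq_len


def max_seq_len_for_budget(budget_gib: float) -> int:
    """Longest single sequence whose KV cache fits in budget_gib (hypothetical helper)."""
    return int(budget_gib * GIB) // bytes_per_token


# Example: a 24 GiB KV cache budget (an illustrative number).
limit = max_seq_len_for_budget(24)
```

Under these assumptions each token costs 24 KiB of KV cache, so a budget one tenth of the quoted 240 GiB supports roughly one tenth of the quoted sequence length.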
-
Ryan McCormick authored
-
Anthony Casagrande authored
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
-
Yan Ru Pei authored
-
Anant Sharma authored
-
hhzhang16 authored
-
- 07 May, 2025 6 commits
-
-
Hongkuan Zhou authored
-
Kris Hung authored
-
Graham King authored
Signed-off-by: Graham King <graham@gkgk.org>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
Ryan McCormick authored
-
Biswa Panda authored
-
Tanmay Verma authored
Signed-off-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-