- 14 May, 2025 2 commits
-
-
GuanLuo authored
-
ishandhanani authored
Co-authored-by: ishandhanani <ishandhananai@gmail.com>
-
- 13 May, 2025 4 commits
-
-
Tanmay Verma authored
-
Anant Sharma authored
-
Tanmay Verma authored
-
Anant Sharma authored
-
- 12 May, 2025 3 commits
-
-
Hongkuan Zhou authored
-
Anant Sharma authored
-
Hongkuan Zhou authored
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
-
- 09 May, 2025 11 commits
-
-
ishandhanani authored
Signed-off-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
Ryan Olson authored
-
Graham King authored
Example of how to connect a Python sglang engine to the message bus (NATS/etc). In this example sglang does the pre/post processing. There is already an example where Dynamo does it.

The examples teach this:

- Be a chat completions engine, do your own pre-processing:
  ```
  await register_llm(ModelType.Chat, endpoint, config.model)
  ```
- Have Dynamo do pre-processing. It will register us under both Chat and Completions endpoints, because that's handled before a Backend engine gets the request:
  ```
  await register_llm(ModelType.Backend, endpoint, config.model)
  ```
-
Graham King authored
-
Graham King authored
-
Graham King authored
That avoids passing the `--model-config` param to dynamo-run when using llamacpp.
-
Harrison Saturley-Hall authored
-
wxsm authored
Allow either password or TLS auth. If neither is provided, fall back to no auth.

Closes #657
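The fallback described in this message can be sketched as a small selection function. This is only an illustration under stated assumptions: the `AuthOptions` fields, the `select_auth` name, and the choice to check password before TLS are all hypothetical and are not taken from the commit itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthOptions:
    # Hypothetical field names, chosen for illustration only.
    username: Optional[str] = None
    password: Optional[str] = None
    tls_cert: Optional[str] = None
    tls_key: Optional[str] = None


def select_auth(opts: AuthOptions) -> str:
    """Return which auth mode to use: 'password', 'tls', or 'none'."""
    # Assumption: password auth wins when both kinds of credentials are set.
    if opts.username and opts.password:
        return "password"
    if opts.tls_cert and opts.tls_key:
        return "tls"
    # Neither set of credentials is complete: fall back to no auth.
    return "none"
```

A caller would build `AuthOptions` from its environment or flags and branch on the returned mode when opening the connection.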
-
Biswa Panda authored
-
ishandhanani authored
Co-authored-by: ishandhanani <ishandhananai@gmail.com>
-
Adit Ranadive authored
NIXL uses UCX, which will support EFA starting with 1.19. Explicitly use the 1.19 branch of UCX with Dynamo.

Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
-
- 08 May, 2025 9 commits
-
-
Hongkuan Zhou authored
-
julienmancuso authored
Co-authored-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
-
hhzhang16 authored
-
Graham King authored
- New mistralrs and llamacpp version
- mistralrs: Handle Gemma 3 and Llama 4 as vision models
- Update the dynamo-run docs to use Qwen 3
- Our pre-processor now supports Llama 4's newer multi-modal `config.json`
- Upgrade minijinja to handle Qwen 3's prompt template

For Llama 4 we'll need to limit the max seq len. vllm says:

> To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed,...

I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
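The two figures in the vllm message make it possible to estimate how far the max seq len has to come down. The sketch below assumes the KV cache grows linearly with sequence length for a single request, which may not hold exactly for every attention layout. The 16 GiB budget is a made-up example value, and both function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(total_kv_gib: float, max_seq_len: int) -> float:
    """KV cache bytes per token, assuming linear growth with sequence length."""
    return total_kv_gib * GIB / max_seq_len


def max_seq_len_for_budget(budget_gib: float, bytes_per_token: float) -> int:
    """Longest sequence whose KV cache fits in the given budget."""
    return int(budget_gib * GIB // bytes_per_token)


# Figures quoted in the vllm message: 240.00 GiB for 10485760 tokens.
per_token = kv_bytes_per_token(240.0, 10_485_760)
print(per_token)  # 24576.0 bytes, i.e. 24 KiB per token

# Hypothetical budget of 16 GiB left over for the KV cache.
print(max_seq_len_for_budget(16.0, per_token))  # 699050
```

Under that linear assumption, each token costs about 24 KiB of KV cache, so the limit to pass to the engine follows directly from the memory left after loading the weights.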
-
Ryan McCormick authored
-
Anthony Casagrande authored
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
-
Yan Ru Pei authored
-
Anant Sharma authored
-
hhzhang16 authored
-
- 07 May, 2025 11 commits
-
-
Hongkuan Zhou authored
-
Kris Hung authored
-
Graham King authored
Signed-off-by: Graham King <graham@gkgk.org>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
Ryan McCormick authored
-
Biswa Panda authored
-
Tanmay Verma authored
Signed-off-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
祝健聪 authored
Signed-off-by: Chasing1020 <chasing1020@gmail.com>
-
Anthony Casagrande authored
-
Graham King authored
vllm and sglang are now the sub-process engines from #954.

Also updated the docs on running vllm and sglang multi-GPU (tensor parallel) and multi-node (pipeline parallel).
-
ptarasiewiczNV authored
-
ptarasiewiczNV authored
-