- 09 May, 2025 4 commits
-
Graham King authored
This avoids passing the `--model-config` param to `dynamo-run` when using llamacpp.
-
Harrison Saturley-Hall authored
-
wxsm authored
Allow both password and TLS auth; if neither is provided, fall back to no auth. Closes #657
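A minimal sketch of the fallback behavior described above (the option names and helper are invented for illustration, not the actual API):
```
# Illustrative only: collect whichever auth mechanisms are configured;
# an empty result means fall back to no auth.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuthOptions:
    # Field names are illustrative, not the real config fields.
    username: Optional[str] = None
    password: Optional[str] = None
    tls_cert: Optional[str] = None
    tls_key: Optional[str] = None

def enabled_auth(opts: AuthOptions) -> list:
    """Both password and TLS auth may be enabled; empty list means no auth."""
    mechanisms = []
    if opts.username and opts.password:
        mechanisms.append("password")
    if opts.tls_cert and opts.tls_key:
        mechanisms.append("tls")
    return mechanisms  # [] -> no auth

assert enabled_auth(AuthOptions()) == []
assert enabled_auth(AuthOptions(username="u", password="p")) == ["password"]
assert enabled_auth(AuthOptions(tls_cert="c.pem", tls_key="k.pem")) == ["tls"]
```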
-
ishandhanani authored
Co-authored-by: ishandhanani <ishandhananai@gmail.com>
-
- 08 May, 2025 3 commits
-
Hongkuan Zhou authored
-
Graham King authored
- New mistralrs and llamacpp versions
- mistralrs: Handle Gemma 3 and Llama 4 as vision models
- Update the dynamo-run docs to use Qwen 3
- Our pre-processor now supports Llama 4's newer multi-modal `config.json`
- Upgrade minijinja to handle Qwen 3's prompt template

For Llama 4 we'll need to limit the max seq len. vllm says:
> To serve at least one request with the model's max seq len (10485760), 240.00 GiB KV cache is needed,...
I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
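The quoted numbers work out to roughly 24 KiB of KV cache per token, which is why capping the max seq len matters. A quick sanity check (the 128k cap below is just an example figure, not a recommendation from the commit):
```
# Sanity-check the vllm quote above: 240 GiB of KV cache for a single
# request at the model's full max seq len of 10,485,760 tokens.
max_seq_len = 10_485_760            # tokens
kv_cache_bytes = 240 * 2**30        # 240 GiB, per vllm's message

per_token = kv_cache_bytes / max_seq_len
print(per_token)                    # 24576.0 bytes, i.e. 24 KiB per token

# Capping the sequence length shrinks the requirement proportionally:
print(128_000 * per_token / 2**30)  # ~2.9 GiB at an (example) 128k cap
```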
-
Yan Ru Pei authored
-
- 07 May, 2025 3 commits
-
Hongkuan Zhou authored
-
Graham King authored
Signed-off-by: Graham King <graham@gkgk.org>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
Graham King authored
vllm and sglang are now the sub-process engines from #954. Also updated the docs on running vllm and sglang multi-GPU (tensor parallel) and multi-node (pipeline parallel).
-
- 06 May, 2025 4 commits
-
jthomson04 authored
-
Graham King authored
New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded Python engines. Why?
- Pure Python; does not require knowing Rust to work on it. Much simpler to maintain.
- No embedded Python interpreter, which avoids linking libpython and avoids the macOS virtualenv issues.
- Should have better performance as it's "native" vllm / sglang.
- Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
-
hhzhang16 authored
-
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```
Full vllm example, with pre-processing in Dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move the pre-processor to the ingress side. It means we can decouple Rust and Python using NATS as the bus.

The `register_llm` call does this:
- Download the model from HF if necessary
- Load the model deployment card from the HF folder or extract it from the GGUF
- Push the tokenizer config etc. into the NATS object store so ingress can access it from a different machine
- Publish the model deployment card to ETCD
-
- 05 May, 2025 1 commit
-
Hongkuan Zhou authored
-
- 01 May, 2025 1 commit
-
Graham King authored
Part of https://github.com/ai-dynamo/dynamo/issues/743
-
- 29 Apr, 2025 3 commits
-
Abrar Shivani authored
Adds support for specifying default request parameters through a JSON template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides.

Changes:
- Add `--request-template` CLI flag to specify the template file path
- Integrate template support in HTTP, batch and text input modes
- Template values can be overridden by individual request parameters

Example template.json:
```
{
  "model": "Qwen2.5-3B-Instruct",
  "temperature": 0.7,
  "max_completion_tokens": 4096
}
```
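A sketch of the override semantics, assuming a shallow merge where per-request fields win (illustrative only, not the actual implementation):
```
import json

# Defaults as loaded from the --request-template file (shown above).
template = json.loads(
    '{"model": "Qwen2.5-3B-Instruct", "temperature": 0.7, '
    '"max_completion_tokens": 4096}'
)

def apply_template(template: dict, request: dict) -> dict:
    """Template values act as defaults; per-request values override them."""
    return {**template, **request}

req = {"temperature": 0.2, "messages": [{"role": "user", "content": "hi"}]}
merged = apply_template(template, req)
print(merged["temperature"])            # 0.2  (request overrides template)
print(merged["max_completion_tokens"])  # 4096 (inherited from template)
```
-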
Graham King authored
-
Graham King authored
In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously `Client` required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side. As part of moving pre-processing back to ingress-side we need to split this into two steps:
- `Client` discovers the endpoints, and (later PR) will fetch their Model Deployment Card.
- `PushRouter` will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters.

Part of #743
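A toy rendering of that two-step split (all names and types below are invented for illustration; the real code is Rust, where the MDC drives the generic parameters):
```
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str

@dataclass
class ModelDeploymentCard:
    needs_preprocessing: bool

def discover_endpoints() -> list:
    # Step 1: Client only discovers endpoints; no decision is made yet.
    return [Endpoint("dynamo.backend.generate")]

def fetch_mdc(ep: Endpoint) -> ModelDeploymentCard:
    # (later PR) fetch the endpoint's Model Deployment Card
    return ModelDeploymentCard(needs_preprocessing=True)

# Step 2: PushRouter consults the MDC to pick the request type.
for ep in discover_endpoints():
    if fetch_mdc(ep).needs_preprocessing:
        print(f"{ep.name}: pre-process ingress-side, send tokenized requests")
    else:
        print(f"{ep.name}: worker pre-processes, send raw requests")
```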
-
- 28 Apr, 2025 2 commits
-
ishandhanani authored
-
Olga Andreeva authored
Signed-off-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
-
- 26 Apr, 2025 1 commit
-
Hongkuan Zhou authored
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: Ubuntu <ubuntu@dev-inst-2w1vokvyuts83rzn4n1k7mnzew9.us-central1-a.c.brevdevprod.internal>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
-
- 25 Apr, 2025 3 commits
-
Harrison Saturley-Hall authored
Signed-off-by: Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
-
Anant Sharma authored
-
Graham King authored
This will allow an ingress-side pre-processor to see it without needing a model checkout.

Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires the ingress side (i.e. the HTTP server), on a different machine than the worker, to be able to see those three files.

To support that, this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS URLs to the key-value store. The key-value store has an interface so any store (NATS, etcd, redis, etc.) can be supported. Implementations for memory and NATS are provided.

Fetching the MDC from the store, doing pre-processing ingress-side, and publishing a card backed by a GGUF are all for a later commit.

Part of #743
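A minimal sketch of what such a store interface could look like, with the in-memory implementation (Python for illustration; the real interface is Rust, and the key layout below is made up):
```
import asyncio
from abc import ABC, abstractmethod
from typing import Optional

class KeyValueStore(ABC):
    """Interface so any backing store (NATS, etcd, redis, ...) can be used."""

    @abstractmethod
    async def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    async def get(self, key: str) -> Optional[bytes]: ...

class MemoryStore(KeyValueStore):
    """In-memory implementation, useful for tests."""

    def __init__(self) -> None:
        self._data = {}

    async def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    async def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

async def main() -> None:
    store = MemoryStore()
    # The worker would publish the MDC (pointing at NATS object-store URLs
    # for config.json, tokenizer.json, tokenizer_config.json) under a key
    # the ingress side knows how to find. Key and value here are invented.
    await store.put("mdc/Qwen/Qwen2.5-0.5B-Instruct", b'{"tokenizer": "nats://..."}')
    print(await store.get("mdc/Qwen/Qwen2.5-0.5B-Instruct"))

asyncio.run(main())
```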
-
- 24 Apr, 2025 1 commit
-
Abrar Shivani authored
Send a warm-up request to the mistralrs engine so that subsequent requests are faster.
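The idea, as a toy sketch (the engine class below is a stand-in invented for illustration, not the mistralrs API):
```
import time

class ToyEngine:
    """Stand-in for the mistralrs engine; the slow first call mimics the
    one-time lazy initialization that a warm-up request absorbs up front."""

    def __init__(self) -> None:
        self._warm = False

    def generate(self, prompt: str, max_tokens: int = 1) -> str:
        time.sleep(0.5 if not self._warm else 0.01)
        self._warm = True
        return "ok"

engine = ToyEngine()
engine.generate("warmup")  # pay the one-time cost at startup
# Real requests arriving after this point skip the slow first-call path.
```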
-
- 21 Apr, 2025 4 commits
-
Pankaj Gupta authored
-
ishandhanani authored
-
Graham King authored
"echo_core" is an engine that echoes the post-processed request back to you so you can see the template. Good for testing. It needed an extra flag set to work correctly.
-
Abrar Shivani authored
-
- 18 Apr, 2025 3 commits
-
Graham King authored
-
Hongkuan Zhou authored
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
-
Graham King authored
It's different enough that I made a new engine `vllm0_8` and renamed the previous engine to `vllm0_7`. `dynamo-run out=vllm` now expects 0.8. This matches the container change in #690. For the older version use `dynamo-run out=vllm0_7`.
-
- 17 Apr, 2025 3 commits
-
tlipoca9 authored
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
-
Ryan Olson authored
-
Ryan McCormick authored
-
- 12 Apr, 2025 1 commit
-
Hongkuan Zhou authored
feat: ETCD prefix watcher + python binding + runtime reconfiguration for router and disagg router (#581)

Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
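For a feel of prefix watching (illustration only: this uses the third-party `etcd3` package rather than dynamo's Rust implementation or its Python binding, and the key prefix is made up; assumes etcd running locally):
```
import etcd3

client = etcd3.client(host="127.0.0.1", port=2379)

# watch_prefix returns an event iterator plus a cancel callback; every
# change under the prefix arrives as an event, so a router could
# reconfigure itself at runtime without restarting.
events, cancel = client.watch_prefix("/dynamo/router/")

for event in events:
    print(event.key, getattr(event, "value", None))  # react to each change
    break  # illustration: stop after the first event

cancel()
```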
-
- 11 Apr, 2025 1 commit
-
Cole authored
-
- 09 Apr, 2025 2 commits
-
jon-chuang authored
feat: Extract Common Configs + Log Configs on Init + Add `test_` to `sdk/tests` filenames required for pytest (#434)

Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
-
Anant Sharma authored
-