- 14 May, 2025 2 commits
-
-
Graham King authored
Router: ``` dynamo-run in=http out=dyn://dynamo.endpoint.generate --router-mode kv ``` Worker (* N): ``` dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B ``` You need patched vllm and the C bindings `.so`. Full docs in the updated guide: `docs/guides/dynamo_run.md`. This gives us a pure-Rust ingress node: OpenAI compliant HTTP server + Pre-processor + KV-aware router.
-
Graham King authored
For #1006 Prints this on startup: ``` 2025-05-09T13:06:34.529Z DEBUG dynamo_run::input::http: Supported routes: ["GET /metrics", "GET /dynamo/alpha/list-models", "GET /v1/models", "POST /v1/chat/completions", "POST /v1/completions"] ```
-
- 01 May, 2025 1 commit
-
-
Graham King authored
Part of https://github.com/ai-dynamo/dynamo/issues/743
-
- 29 Apr, 2025 3 commits
-
-
Abrar Shivani authored
Adds support for specifying default request parameters through a json template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides. Changes: - Add --request-template CLI flag to specify template file path - Integrate template support in HTTP, batch and text input modes - Template values can be overridden by individual request parameters - Example template.json: ``` { "model": "Qwen2.5-3B-Instruct", "temperature": 0.7, "max_completion_tokens": 4096 } ``` -
Graham King authored
-
Graham King authored
In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side. As part of moving pre-processing back to ingress-side we need to split this into two steps: - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card. - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters. Part of #743
-
- 21 Apr, 2025 1 commit
-
-
Graham King authored
"echo_core" is an engine that echoes the post-processed request back to you so you can see the template. Good for testing. It needed an extra flag set to work correctly.
-
- 07 Apr, 2025 1 commit
-
-
Graham King authored
As a first step towards KV routing: - introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet. - Make the vllm engine publish the KV events received from our patched vllm. Now we "just" need to connect the two. Easy right?
-
- 26 Mar, 2025 1 commit
-
-
Ryan Olson authored
-
- 14 Mar, 2025 1 commit
-
-
Ryan McCormick authored
-
- 08 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by:Biswa Panda <biswa.panda@gmail.com>
-
- 07 Mar, 2025 1 commit
-
-
Graham King authored
There are two etcd keys: - The service - The model The second one is the interesting one for us. Previously we confused the two.
-
- 05 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by:Graham King <grahamk@nvidia.com>
-
- 28 Feb, 2025 1 commit
-
-
Paul Hendricks authored
-
- 27 Feb, 2025 3 commits
-
-
Paul Hendricks authored
-
Paul Hendricks authored
-
Paul Hendricks authored
-
- 26 Feb, 2025 1 commit
-
-
Paul Hendricks authored
Co-authored-by:Graham King <grahamk@nvidia.com>
-
- 25 Feb, 2025 2 commits
-
-
Alec authored
Co-authored-by:aflowers <aflowers@nvidia.com>
-
Neelay Shah authored
Signed-off-by:
Neelay Shah <neelays@nvidia.com> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com>
-