- 22 May, 2025 6 commits
-
-
jthomson04 authored
-
Kyle McGill authored
-
Hongkuan Zhou authored
-
Graham King authored
Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context. Allow passing `--context-length <N>` to `dynamo-run` to limit it so that long-context models will fit.

Future todo:
- Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting `stop_conditions.max_tokens`. The mistralrs engine wrapper must do it itself because it does not use the pre-processor.
- mistralrs and llamacpp currently use a hard-coded max context length if one is not provided on the command line. Change those to use the model's built-in max, read from the GGUF or tokenizer_config.json.
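The first todo item could look roughly like the sketch below. It is not the pre-processor's actual code: `clamp_max_tokens` and its parameters are hypothetical names, and subtracting the prompt length from the budget is an assumption beyond what the commit message states.

```python
from typing import Optional


def clamp_max_tokens(requested: Optional[int], prompt_tokens: int, context_length: int) -> int:
    """Return a max_tokens value that keeps prompt + completion inside the context.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if prompt_tokens >= context_length:
        raise ValueError("prompt does not fit in the configured context length")
    # Assumption: the completion budget is whatever the prompt leaves free.
    budget = context_length - prompt_tokens
    if requested is None:
        return budget
    return min(requested, budget)
```

With a context length of 4096 and a 100-token prompt, a request for 10000 tokens would be cut down to 3996, while a request for 50 would pass through unchanged.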
-
jmswen authored
-
Suman Tatiraju authored
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
-
- 21 May, 2025 10 commits
-
-
Graham King authored
-
Graham King authored
Previously any error would cause us to halt. Most of them are recoverable. So now we print the error and return to the prompt.
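A minimal sketch of that behaviour, assuming a simple loop over prompts. `run_prompt_loop`, `handle`, and `report` are hypothetical names, not taken from the code base, and treating every exception as recoverable is a simplification.

```python
def run_prompt_loop(prompts, handle, report=print):
    """Feed each prompt to handle(); report failures and keep going.

    Returns the number of prompts that completed without error.
    """
    completed = 0
    for prompt in prompts:
        try:
            handle(prompt)
            completed += 1
        except Exception as err:
            # Assumed recoverable: print the error and return to the prompt.
            report(f"Error: {err}")
    return completed
```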
-
Graham King authored
-
mohammedabdulwahhab authored
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>
-
Neelay Shah authored
-
Suman Tatiraju authored
Signed-off-by: Suman Tatiraju <167138127+statiraju@users.noreply.github.com>
Signed-off-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Suman Tatiraju <statiraju@statiraju-mlt.client.nvidia.com>
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>
-
Biswa Panda authored
-
Graham King authored
- Stop advertising a model when its last instance stops. Previously this happened when any instance stopped.
- Faster locks on model manager.
- Move discovery code out of http, as it is used by all inputs.
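The first item amounts to tracking instances per model and only withdrawing the model when that set becomes empty. A rough sketch, with a hypothetical `ModelRegistry` class that is not the actual model manager:

```python
from collections import defaultdict


class ModelRegistry:
    """Hypothetical registry: a model stays advertised while any instance remains."""

    def __init__(self):
        self._instances = defaultdict(set)

    def add_instance(self, model, instance_id):
        self._instances[model].add(instance_id)

    def remove_instance(self, model, instance_id):
        """Return True only when the last instance of the model has stopped."""
        instances = self._instances.get(model)
        if instances is None:
            return False
        instances.discard(instance_id)
        if instances:
            return False  # other instances still serve this model
        del self._instances[model]
        return True

    def advertised(self):
        return sorted(self._instances)
```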
-
Yan Ru Pei authored
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
-
Tanmay Verma authored
-
- 20 May, 2025 5 commits
-
-
julienmancuso authored
-
Tanmay Verma authored
-
Hongkuan Zhou authored
-
Faradawn Yang authored
Remove RouterType and ModelMetaData from `lib/runtime/src/protocols.rs`; both are unused (no outside references). Routing has moved to its own module, `pipeline/network/egress/push_router.rs`, so the legacy definition of RouterType in `protocols.rs` is no longer needed.
-
Ryan Olson authored
-
- 19 May, 2025 9 commits
-
-
Jacky authored
-
Graham King authored
We can now do this:

- Node 1:
```
dynamo-run in=http out=dyn
```
- Node 2 and 3, two instances of component 'backend' in the nemotron_ultra pipeline:
```
dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
```
- Node 4 and 5, two instances of the 'backend' component in the nemotron_super pipeline:
```
dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
```

The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now.

As part of this, auto-discovery is now always `out=dyn`, with no extra URL parts. Previously it could only route to a single pipeline.

Also:
- Refactor endpoint / instance naming now that I understand them.
- Fix removing models when their instance stops.
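The `dyn://` addresses above appear to follow a `namespace.component.endpoint` layout. The sketch below splits one apart under that assumption; `EndpointPath` and `parse_dyn_url` are hypothetical names, not the project's API.

```python
from typing import NamedTuple


class EndpointPath(NamedTuple):
    namespace: str
    component: str
    endpoint: str


def parse_dyn_url(url: str) -> EndpointPath:
    """Split 'dyn://namespace.component.endpoint' into its three parts.

    Assumes exactly three dot-separated, non-empty parts.
    """
    prefix = "dyn://"
    if not url.startswith(prefix):
        raise ValueError(f"not a dyn:// address: {url!r}")
    parts = url[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {url!r}")
    return EndpointPath(*parts)
```

For `dyn://nemotron_ultra.backend.generate` this yields namespace `nemotron_ultra`, component `backend`, and endpoint `generate`.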
-
jthomson04 authored
-
Rohan Varma authored
Co-authored-by: Rohan Varma <rohanv@rohanv-mlt.client.nvidia.com>
Co-authored-by: Julien Mancuso <jmancuso@nvidia.com>
Co-authored-by: julienmancuso <161955438+julienmancuso@users.noreply.github.com>
-
ishandhanani authored
-
Jacky authored
-
hhzhang16 authored
-
Tom O'Brien authored
Implements OpenAI embeddings (interface only).
- Adds ModelType::Embedding
- Adds OpenAI embedding request/response structs
- Adds support for embedding model discovery
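For orientation, the wire shape of an OpenAI embeddings exchange can be sketched as below. The field names follow the public OpenAI API; the class names are hypothetical and do not mirror the Rust structs added in this commit.

```python
from dataclasses import asdict, dataclass
from typing import List, Union


@dataclass
class EmbeddingRequest:
    model: str
    input: Union[str, List[str]]  # a single string or a batch of strings


@dataclass
class EmbeddingData:
    index: int
    embedding: List[float]
    object: str = "embedding"


@dataclass
class EmbeddingUsage:
    prompt_tokens: int
    total_tokens: int


@dataclass
class EmbeddingResponse:
    model: str
    data: List[EmbeddingData]
    usage: EmbeddingUsage
    object: str = "list"


def to_json_dict(response: EmbeddingResponse) -> dict:
    """Convert a response into a plain dict ready for JSON serialisation."""
    return asdict(response)
```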
-
Alec authored
-
- 17 May, 2025 1 commit
-
-
Biswa Panda authored
-
- 16 May, 2025 5 commits
-
-
Ryan McCormick authored
-
ptarasiewiczNV authored
-
Tanmay Verma authored
-
Ryan McCormick authored
-
Biswa Panda authored
-
- 15 May, 2025 4 commits
-
-
Graham King authored
Each namespace is for a single pipeline, so a component must be model-unique. This means we can have several components with the same name running the same model (data parallel), with their traffic routed according to `--router-mode`, but we cannot have several components with the same name running different models. Add an `ensure_unique` check to prevent that from happening.
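The rule can be sketched as follows. Only the name `ensure_unique` comes from the commit message; the signature, the dictionary keyed by `(namespace, component)`, and `ModelConflictError` are assumptions made for illustration.

```python
class ModelConflictError(Exception):
    """Raised when a component name is reused for a different model."""


def ensure_unique(registered: dict, namespace: str, component: str, model: str) -> None:
    """Register (namespace, component) -> model, rejecting a conflicting model.

    Registering the same model again is allowed (data parallel instances).
    """
    key = (namespace, component)
    existing = registered.get(key)
    if existing is not None and existing != model:
        raise ModelConflictError(
            f"{namespace}.{component} already serves {existing!r}, cannot add {model!r}"
        )
    registered[key] = model
```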
-
Ryan McCormick authored
-
mohammedabdulwahhab authored
-
Graham King authored
The Python bindings use the default value for RouterMode. Previously that was Random (good), but now it became None (bad). Remove the option and clean up the duplicate RouterMode. I was trying to avoid putting the `KV` enum in dynamo-runtime. Turns out adding those two characters gives us a healthy simplification, and restores the old default router value. Also clean up two noisy log messages when waiting for KV routing metrics to start in worker.
-