".buildkite/vscode:/vscode.git/clone" did not exist on "e757a629e7336c4e436dac62daee20fd81776aeb"
- 13 Jan, 2026 1 commit
-
-
Qi Wang authored
-
- 02 Jan, 2026 1 commit
-
-
Tushar Sharma authored
Signed-off-by:Tushar Sharma <tusharma@nvidia.com>
-
- 13 Dec, 2025 1 commit
-
-
KrishnanPrash authored
Signed-off-by:Krishnan Prashanth <kprashanth@nvidia.com>
-
- 10 Nov, 2025 1 commit
-
-
Graham King authored
Signed-off-by:Graham King <grahamk@nvidia.com>
-
- 07 Nov, 2025 1 commit
-
-
Yan Ru Pei authored
Signed-off-by:PeaBrane <yanrpei@gmail.com>
-
- 29 Sep, 2025 1 commit
-
-
akshaver authored
-
- 19 Sep, 2025 1 commit
-
-
Jacky authored
Signed-off-by:Jacky <18255193+kthui@users.noreply.github.com>
-
- 16 Sep, 2025 1 commit
-
-
Graham King authored
Signed-off-by:Graham King <grahamk@nvidia.com>
-
- 22 Aug, 2025 1 commit
-
-
Graham King authored
-
- 24 Jun, 2025 1 commit
-
-
zxyy-bys authored
-
- 22 May, 2025 2 commits
-
-
Graham King authored
Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context. Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit. Future todo: - Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor. - mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
-
jmswen authored
-
- 29 Apr, 2025 1 commit
-
-
Graham King authored
In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side. As part of moving pre-processing back to ingress-side we need to split this into two steps: - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card. - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters. Part of #743
-
- 18 Apr, 2025 1 commit
-
-
Hongkuan Zhou authored
Co-authored-by:ishandhanani <82981111+ishandhanani@users.noreply.github.com>
-
- 25 Feb, 2025 1 commit
-
-
Neelay Shah authored
Signed-off-by:
Neelay Shah <neelays@nvidia.com> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com>
-