- 29 Apr, 2025 12 commits
-
-
julienmancuso authored
-
wxsm authored
Signed-off-by:
wxsm <wxsms@foxmail.com> Co-authored-by:
ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>
-
Abrar Shivani authored
Adds support for specifying default request parameters through a json template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides. Changes: - Add --request-template CLI flag to specify template file path - Integrate template support in HTTP, batch and text input modes - Template values can be overridden by individual request parameters - Example template.json: ``` { "model": "Qwen2.5-3B-Instruct", "temperature": 0.7, "max_completion_tokens": 4096 } ``` -
Graham King authored
-
Hongkuan Zhou authored
-
Biswa Panda authored
-
Graham King authored
In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side. As part of moving pre-processing back to ingress-side we need to split this into two steps: - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card. - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters. Part of #743
-
Anant Sharma authored
-
Neelay Shah authored
-
nnshah1 authored
-
Ziqi Fan authored
refactor: change trtllm example kv routing use python bindings | deal with trtllm partial blocks | trtllm event change (#866)
-
- 28 Apr, 2025 11 commits
-
-
richardhuo-nv authored
We were observing a 40% performance drop compared with trtllm serve when benchmarking with isl=1000 and osl=200 at a concurrency level > 128. The number of the tokenization worker is the bottleneck. After bumping the tokenization processors number to 5, dynamo's benchmarking perf could match the trtllm serve's perf.
-
Graham King authored
-
Biswa Panda authored
-
ishandhanani authored
-
Ryan McCormick authored
-
Zhongdongming Dai authored
Co-authored-by:ishandhanani <82981111+ishandhanani@users.noreply.github.com>
-
Biswa Panda authored
-
ishandhanani authored
-
Anant Sharma authored
-
Olga Andreeva authored
Signed-off-by:Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
-
Hongkuan Zhou authored
Signed-off-by:
Hongkuan Zhou <tedzhouhk@gmail.com> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com>
-
- 26 Apr, 2025 2 commits
-
-
mohammedabdulwahhab authored
-
Hongkuan Zhou authored
Signed-off-by:
Hongkuan Zhou <tedzhouhk@gmail.com> Co-authored-by:
ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by:
ishandhanani <ishandhanani@gmail.com> Co-authored-by:
Ubuntu <ubuntu@dev-inst-2w1vokvyuts83rzn4n1k7mnzew9.us-central1-a.c.brevdevprod.internal> Co-authored-by:
Biswa Panda <biswa.panda@gmail.com> Co-authored-by:
Anant Sharma <anants@nvidia.com>
-
- 25 Apr, 2025 12 commits
-
-
Harrison Saturley-Hall authored
Signed-off-by:Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
-
Alec authored
-
hhzhang16 authored
-
Anant Sharma authored
-
Ziqi Fan authored
-
julienmancuso authored
-
Anant Sharma authored
-
Piotr Marcinkiewicz authored
-
mohammedabdulwahhab authored
-
Graham King authored
This will allow an ingress-side pre-processor to see it without needing a model checkout. Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files. To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store. The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided. Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit. Part of #743 -
Biswa Panda authored
Co-authored-by:ishandhanani <ishandhanani@gmail.com>
-
julienmancuso authored
-
- 24 Apr, 2025 3 commits
-
-
Alec authored
Signed-off-by:Alec <35311602+alec-flowers@users.noreply.github.com>
-
ishandhanani authored
Co-authored-by:mohammedabdulwahhab <furkhan324@berkeley.edu>
-
julienmancuso authored
-