- 06 May, 2025 4 commits
-
-
Graham King authored
Approved by OSRB in Slack. Note we don't check for the closing delimiter to allow the longer copyright format. Motivation is that it reduces the context usage by 12 lines for every file in the project. That helps things like Cursor and Claude Code fit more, go faster, and cost less.
-
hhzhang16 authored
-
hhzhang16 authored
-
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests: ``` from dynamo.llm import register_llm MODEL = "Qwen/Qwen2.5-0.5B-Instruct" await register_llm(endpoint, MODEL, 3) ``` Full vllm example, with pre-processing in dynamo: - `dynamo-run in=text out=dyn://dynamo.backend.generate` - `cd lib/bindings/python/examples/hello_world` - `python server_vllm.py` This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus. The `register_llm` call does this: - Download the model from HF if necessary - Load the model deployment card from the HF folder or extract from GGUF - Push the tokenizer config etc into NATS object store so ingress can access it from a different machine - Publish the model deployment card to ETCD
-
- 05 May, 2025 6 commits
-
-
julienmancuso authored
-
Hongkuan Zhou authored
-
richardhuo-nv authored
-
julienmancuso authored
-
Harrison Saturley-Hall authored
Signed-off-by:
Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com> Co-authored-by:
Anant Sharma <anants@nvidia.com>
-
Hongkuan Zhou authored
-
- 02 May, 2025 3 commits
-
-
Tanmay Verma authored
-
Ryan McCormick authored
-
Kris Hung authored
-
- 01 May, 2025 7 commits
-
-
hhzhang16 authored
-
Graham King authored
Part of https://github.com/ai-dynamo/dynamo/issues/743
-
Biswa Panda authored
-
Abrar Shivani authored
The build script currently fails on macOS due to an incompatible Bash version. This PR adds a version check to ensure the correct Bash version is being used before proceeding. Closes GitHub issue: https://github.com/ai-dynamo/dynamo/issues/318
-
Abrar Shivani authored
Allow `hf://` prefix on command line. Closes GitHub issue: https://github.com/ai-dynamo/dynamo/issues/829
-
Yan Ru Pei authored
-
Ziqi Fan authored
-
- 30 Apr, 2025 5 commits
-
-
Biswa Panda authored
-
ishandhanani authored
-
Yan Ru Pei authored
-
hhzhang16 authored
Signed-off-by:
hhzhang16 <54051230+hhzhang16@users.noreply.github.com> Co-authored-by:
mohammedabdulwahhab <furkhan324@berkeley.edu>
-
julienmancuso authored
-
- 29 Apr, 2025 13 commits
-
-
mohammedabdulwahhab authored
Signed-off-by:
mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by:
hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
-
julienmancuso authored
-
wxsm authored
Signed-off-by:
wxsm <wxsms@foxmail.com> Co-authored-by:
ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>
-
Abrar Shivani authored
Adds support for specifying default request parameters through a json template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides. Changes: - Add --request-template CLI flag to specify template file path - Integrate template support in HTTP, batch and text input modes - Template values can be overridden by individual request parameters - Example template.json: ``` { "model": "Qwen2.5-3B-Instruct", "temperature": 0.7, "max_completion_tokens": 4096 } ``` -
Graham King authored
-
Hongkuan Zhou authored
-
Biswa Panda authored
-
Graham King authored
In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side. As part of moving pre-processing back to ingress-side we need to split this into two steps: - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card. - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters. Part of #743
-
Anant Sharma authored
-
Neelay Shah authored
-
nnshah1 authored
-
Ziqi Fan authored
refactor: change trtllm example kv routing use python bindings | deal with trtllm partial blocks | trtllm event change (#866)
-
- 28 Apr, 2025 2 commits
-
-
richardhuo-nv authored
We were observing a 40% performance drop compared with trtllm serve when benchmarking with isl=1000 and osl=200 at a concurrency level > 128. The number of the tokenization worker is the bottleneck. After bumping the tokenization processors number to 5, dynamo's benchmarking perf could match the trtllm serve's perf.
-
Graham King authored
-