- 03 Sep, 2025 2 commits
-
-
KrishnanPrash authored
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
-
Biswa Panda authored
Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
-
- 29 Aug, 2025 2 commits
-
-
Ayush Agarwal authored
-
ryan-lempka authored
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
-
- 28 Aug, 2025 2 commits
-
-
atchernych authored
-
Keiven C authored
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
-
- 27 Aug, 2025 1 commit
-
-
GuanLuo authored
-
- 26 Aug, 2025 1 commit
-
-
Chi McIsaac authored
-
- 25 Aug, 2025 2 commits
-
-
nachiketb-nvidia authored
-
nachiketb-nvidia authored
- couple of refactors
- added a new dependency, openai-harmony
- implemented the gpt oss parser
-
- 22 Aug, 2025 2 commits
-
-
Graham King authored
-
Ayush Agarwal authored
-
- 21 Aug, 2025 1 commit
-
-
nachiketb-nvidia authored
-
- 20 Aug, 2025 1 commit
-
-
nachiketb-nvidia authored
- Change the chat completions response objects from structs to types of dynamo_async_openai
- Implement aggregator traits for the chat completion structs
- Add reasoning_content under message and delta message in lib/async-openai
-
- 19 Aug, 2025 2 commits
-
-
nachiketb-nvidia authored
Co-authored-by: Graham King <grahamk@nvidia.com>
-
Ryan Olson authored
Signed-off-by: Ryan Olson <rolson@nvidia.com>
Co-authored-by: Olga Andreeva <oandreeva@nvidia.com>
Co-authored-by: Ziqi Fan <ziqif@nvidia.com>
Co-authored-by: John Thompson <jothomson@nvidia.com>
Co-authored-by: Richard Huo <rihuo@nvidia.com>
Co-authored-by: Zicheng Ma <zichengm@nvidia.com>
-
- 18 Aug, 2025 1 commit
-
-
ryan-lempka authored
-
- 15 Aug, 2025 1 commit
-
-
Abrar Shivani authored
-
- 13 Aug, 2025 2 commits
-
-
ryan-lempka authored
-
jthomson04 authored
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
-
- 12 Aug, 2025 1 commit
-
-
KrishnanPrash authored
feat: Add frontend support for `min_tokens` and `ignore_eos` (outside of `nvext`) and Structured Output / Guided Decoding (#2380)
Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Ayush Agarwal <ayushag@nvidia.com>
-
- 07 Aug, 2025 2 commits
-
-
Graham King authored
-
Keiven C authored
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
-
- 01 Aug, 2025 1 commit
-
-
Keiven C authored
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
-
- 18 Jul, 2025 1 commit
-
-
Ryan Olson authored
-
- 17 Jul, 2025 1 commit
-
-
Ryan Olson authored
-
- 15 Jul, 2025 1 commit
-
-
Ryan Olson authored
-
- 10 Jul, 2025 1 commit
-
-
Graham King authored
-
- 01 Jul, 2025 2 commits
-
-
Nathan Barry authored
-
Paul Hendricks authored
-
- 26 Jun, 2025 1 commit
-
-
Paul Hendricks authored
-
- 06 Jun, 2025 1 commit
-
-
Olga Andreeva authored
-
- 04 Jun, 2025 2 commits
-
-
Paul Hendricks authored
-
Graham King authored
Publish `generation_config.json` from worker to ingress, as part of Model Deployment Card. That allows ingress to read key fields out of it. Gemma 3 4B+ has some important information that's only in there.
-
- 22 May, 2025 2 commits
-
-
Graham King authored
Example:
```
dynamo-run out=<engine> <model> --kv-cache-block-size 64
```
In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card. Previously hard coded to 16, which is now the default.
- Load context_length from model. Closes #1172
- Store context length and KV cache block size in Model Deployment Card #1170
-
Graham King authored
Removed the hard-coded sleeps and explained what we're testing. Closes https://github.com/ai-dynamo/dynamo/issues/1132

The race condition is that `apply_event` sends a message on a channel; it does not directly apply the event. At some later point the tokio runtime schedules the task running the channel receiver, which applies the event. If that had not happened yet, the test would fail.
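The race described above can be reproduced in miniature. The following is a rough sketch in Python's `asyncio`, not the actual Rust code: the `Indexer` class, its `run` and `flush` methods, and the `"stored"` event are all hypothetical names chosen for illustration. It shows that the event is still unapplied right after `apply_event` returns, and that waiting on a completion condition (instead of sleeping) makes the check deterministic.

```python
import asyncio


class Indexer:
    """Hypothetical stand-in: events are queued, then applied by a receiver task."""

    def __init__(self):
        self.applied = []
        self._queue = asyncio.Queue()

    def apply_event(self, event):
        # Only enqueues the event; it has not been applied when this returns.
        self._queue.put_nowait(event)

    async def run(self):
        # Receiver task: applies events whenever the scheduler gets around to it.
        while True:
            event = await self._queue.get()
            self.applied.append(event)
            self._queue.task_done()

    async def flush(self):
        # Deterministic wait: returns once every queued event has been applied.
        await self._queue.join()


async def demo():
    indexer = Indexer()
    receiver = asyncio.create_task(indexer.run())
    indexer.apply_event("stored")
    before = list(indexer.applied)  # receiver has not been scheduled yet
    await indexer.flush()           # wait on the condition, not on a sleep
    after = list(indexer.applied)
    receiver.cancel()
    return before, after
```

Running `asyncio.run(demo())` returns the state before and after the receiver had a chance to run; a test that asserted on `before` would be exercising the race rather than the behaviour under test.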
-
- 21 May, 2025 1 commit
-
-
Graham King authored
- Stop advertising a model when its last instance stops. Previously this happened when any instance stopped.
- Faster locks on model manager.
- Move discovery code out of http, as it is used by all inputs.
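The "last instance" rule in the first item can be sketched as per-model instance tracking. This is an illustrative Python sketch only, assuming a hypothetical `ModelRegistry` class with made-up method names; it is not the model manager from the commit.

```python
from collections import defaultdict


class ModelRegistry:
    """Hypothetical registry: tracks the live instances of each model."""

    def __init__(self):
        self._instances = defaultdict(set)

    def add_instance(self, model, instance_id):
        self._instances[model].add(instance_id)

    def remove_instance(self, model, instance_id):
        """Return True only when the model's last instance has stopped."""
        live = self._instances.get(model)
        if live is None:
            return False
        live.discard(instance_id)
        if live:
            # Other instances are still serving; keep advertising the model.
            return False
        del self._instances[model]
        return True

    def advertised(self):
        # Models that still have at least one live instance.
        return sorted(self._instances)
```

With two instances registered for one model, removing the first leaves the model advertised; only removing the second withdraws it.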
-
- 19 May, 2025 1 commit
-
-
Tom O'Brien authored
Implements OpenAI embeddings (interface only).
- Adds ModelType::Embedding
- Adds OpenAI embedding request/response structs
- Adds support for embedding model discovery
-
- 08 May, 2025 1 commit
-
-
Graham King authored
- New mistralrs and llamacpp version
- mistralrs: Handle Gemma 3 and Llama 4 as vision models
- Update the dynamo-run docs to use Qwen 3
- Our pre-processor now supports Llama 4's newer multi-modal `config.json`
- Upgrade minijinja to handle Qwen 3's prompt template

For Llama 4 we'll need to limit the max seq len. vllm says:
> To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed,...

I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
-
- 06 May, 2025 1 commit
-
-
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```
Full vllm example, with pre-processing in dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus.

The `register_llm` call does this:
- Download the model from HF if necessary
- Load the model deployment card from the HF folder or extract from GGUF
- Push the tokenizer config etc into NATS object store so ingress can access it from a different machine
- Publish the model deployment card to ETCD
-