- 02 Jun, 2025 1 commit
-
-
Graham King authored
This allows building: - only `mistral.rs` engine: `--no-default-features --features mistralrs` - or only `llama.cpp` engine: `--no-default-features --features llamacpp`. Since llama.cpp became a default we'd only tested building both at once. The docs already said we supported that but there was some combo of Rust features that didn't build. This is the fix.
-
- 29 May, 2025 1 commit
-
-
Graham King authored
Previously `mistral.rs` was the default engine for both safetensors and GGUF models. Now it is only the default for safetensors, `llama.cpp` becomes the default for GGUF. Why? - Since #1177 `llama.cpp` is built-in by default, so we can switch. - `llama.cpp` is very very good at running GGUF (but can't run other types of model), so we should switch. Dynamo's multi-engine support gives us a secret super-power: we can use the best engine for this specific format or model. We can still run GGUF with mistralrs by doing `out=mistralrs`.
-
- 28 May, 2025 2 commits
-
-
Graham King authored
It was removed from the docs in 0.2.1 and replaced with writing a [standalone Python engine](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_run.md#writing-your-own-engine-in-python). Also remove the associated `dynamo-run` feature `python`. Releasing this in 0.3.0 will resolve #784 and #1109.
-
Tanmay Verma authored
-
- 22 May, 2025 2 commits
-
-
Graham King authored
Example: ``` dynamo-run out=<engine> <model> --kv-cache-block-size 64 ``` In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card. Previously hard coded to 16, which is now the default. - Load context_length from model. Closes #1172 - Store context length and KV cache block size in Model Deployment Card #1170
-
Graham King authored
Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context. Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit. Future todo: - Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor. - mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
-
- 21 May, 2025 1 commit
-
-
Graham King authored
-
- 19 May, 2025 1 commit
-
-
Graham King authored
We can now do this: - Node 1: ``` dynamo-run in=http out=dyn ``` - Node 2 and 3, two instances of component 'backend' in the nemotron_ultra pipeline: ``` dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra ``` - Node 4 and 5, two instances of the 'backend' component in nemotron_super pipeline: ``` dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper ``` The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now. As part of this auto-discovery is now always `out=dyn`, with no extra URL parts. Previously it could only route to a single pipeline. Also: - Refactor endpoint / instance naming now that I understand them - Fix removing models when their instance stops.
-
- 15 May, 2025 1 commit
-
-
Graham King authored
The Python bindings use the default value for RouterMode. Previously that was Random (good), but now it became None (bad). Remove the option and clean up the duplicate RouterMode. I was trying to avoid putting the `KV` enum in dynamo-runtime. Turns out adding those two characters gives us a healthy simplification, and restores the old default router value. Also clean up two noisy log messages when waiting for KV routing metrics to start in worker.
-
- 14 May, 2025 1 commit
-
-
Graham King authored
Router: ``` dynamo-run in=http out=dyn://dynamo.endpoint.generate --router-mode kv ``` Worker (* N): ``` dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B ``` You need patched vllm and the C bindings `.so`. Full docs in the updated guide: `docs/guides/dynamo_run.md`. This gives us a pure-Rust ingress node: OpenAI compliant HTTP server + Pre-processor + KV-aware router.
-
- 08 May, 2025 1 commit
-
-
Graham King authored
. New mistralrs and llamacpp version . mistralrs: Handle Gemma 3 and Llama 4 as vision models . Update the dynamo-run docs to use Qwen 3 . Our pre-processor now supports Llama 4's newer multi-modal `config.json` . Upgrade minijinja to handle Qwen 3's prompt template For Llama 4 we'll need to limit the max seq len. vllm says: > To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed,... I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
-
- 07 May, 2025 2 commits
-
-
Graham King authored
Signed-off-by:
Graham King <graham@gkgk.org> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com>
-
Graham King authored
vllm and sglang are now the sub-process engines from #954 Also updated docs on doing vllm and sglang multi-gpu (tensor parallel) and multi-node (pipeline parallel).
-
- 06 May, 2025 2 commits
-
-
Graham King authored
New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded python engines. Why? - Pure Python, does not require knowing Rust to work on it. Much simpler to maintain. - No embedded Python interpreter which avoids linking libpython and avoids the MacOS virtualenv issues. - Should have better performance as it's "native" vllm / sglang. - Works with any version of vllm (including v1!) and sglang. Less upgrade struggle. -
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests: ``` from dynamo.llm import register_llm MODEL = "Qwen/Qwen2.5-0.5B-Instruct" await register_llm(endpoint, MODEL, 3) ``` Full vllm example, with pre-processing in dynamo: - `dynamo-run in=text out=dyn://dynamo.backend.generate` - `cd lib/bindings/python/examples/hello_world` - `python server_vllm.py` This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus. The `register_llm` call does this: - Download the model from HF if necessary - Load the model deployment card from the HF folder or extract from GGUF - Push the tokenizer config etc into NATS object store so ingress can access it from a different machine - Publish the model deployment card to ETCD
-
- 01 May, 2025 2 commits
-
-
Graham King authored
Part of https://github.com/ai-dynamo/dynamo/issues/743
-
Abrar Shivani authored
Allow `hf://` prefix on command line. Closes GitHub issue: https://github.com/ai-dynamo/dynamo/issues/829
-
- 29 Apr, 2025 2 commits
-
-
Abrar Shivani authored
Adds support for specifying default request parameters through a json template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides. Changes: - Add --request-template CLI flag to specify template file path - Integrate template support in HTTP, batch and text input modes - Template values can be overridden by individual request parameters - Example template.json: ``` { "model": "Qwen2.5-3B-Instruct", "temperature": 0.7, "max_completion_tokens": 4096 } ``` -
Graham King authored
In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side. As part of moving pre-processing back to ingress-side we need to split this into two steps: - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card. - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters. Part of #743
-
- 28 Apr, 2025 1 commit
-
-
Olga Andreeva authored
Signed-off-by:Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
-
- 25 Apr, 2025 1 commit
-
-
Graham King authored
This will allow an ingress-side pre-processor to see it without needing a model checkout. Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files. To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store. The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided. Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit. Part of #743
-
- 21 Apr, 2025 2 commits
-
-
ishandhanani authored
-
Zhongdongming Dai authored
-
- 18 Apr, 2025 2 commits
-
-
Graham King authored
-
Graham King authored
It's different enough that I made a new engine vllm0_8 and renamed the previous engine to vllm0_7. `dynamo-run out=vllm` now expects 0.8. This matches the container change in #690. For older use `dynamo-run out=vllm0_7`.
-
- 07 Apr, 2025 1 commit
-
-
Graham King authored
As a first step towards KV routing: - introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet. - Make the vllm engine publish the KV events received from our patched vllm. Now we "just" need to connect the two. Easy right?
-
- 04 Apr, 2025 1 commit
-
-
Graham King authored
Also upgrade the cargo resolver to v3, the default. New clippy lints: - `next_back()` instead of `last()` for a double-ended iterator. That avoids walking the whole list. - ` repeat_n` instead of `repeat.take`. That avoids cloning. - Doc indenting
-
- 03 Apr, 2025 1 commit
-
-
Ryan Olson authored
Moved all of `lib/llm/src/engines` to their own crates as e.g. `lib/engines/mistralrs`. This will allow publishing of the `dynamo-llm` crate as it won't have any github dependencies. The only engines in dynamo-llm will be the demo `echo` ones. Co-authored-by:Graham King <grahamk@nvidia.com>
-
- 25 Mar, 2025 1 commit
-
-
Graham King authored
Put the arguments in a JSON file: ``` { "dtype": "half", "trust_remote_code": true } ``` Pass it like this: ``` dynamo-run out=sglang ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json ``` Requested here https://github.com/ai-dynamo/dynamo/issues/290 (`dtype`) and here https://github.com/ai-dynamo/dynamo/issues/360 (`trust_remote_code`).
-
- 24 Mar, 2025 1 commit
-
-
Graham King authored
This lets us do: ``` dynamo-run out=llamacpp <gguf_file> ``` Previously a `--model-config <hf-repo>` was also required, to configure our tokenizer.
-
- 21 Mar, 2025 1 commit
-
-
Olga Andreeva authored
Co-authored-by:Olga Andreeva <oandreeva@oandreeva-mlt.client.nvidia.com>
-
- 15 Mar, 2025 1 commit
-
-
Graham King authored
``` dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/ ``` The file has genai format, one entry per line: ``` {"text": "the prompt"} {"text": ..etc ``` The prompt is evaluated and the output written to `output.jsonl` in the same folder as the input. At the end of the run various statistics are printed: > Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s) This is also helpful for pushing load into the system and stressing the various components. Not intended for performance measurement, it's a batch inference tool.
-
- 13 Mar, 2025 1 commit
-
-
Graham King authored
- Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine. - The default engine (previously always mistralrs) depends on what is compiled in. - Text can be piped in and will result in a single run of the model. All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit: ``` echo "What is the capital of Costa Rica?" | dynamo-run Qwen/Qwen2.5-3B-Instruct ``` Co-authored-by:Ryan McCormick <rmccormick@nvidia.com>
-
- 12 Mar, 2025 1 commit
-
-
Graham King authored
Command line arguments are passed to the python engine like this: ``` dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes ``` The python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`. This input: ``` dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1 ``` is read like this: ``` async def generate(request): .. as before .. if __name__ == "__main__": print(f"MAIN: {sys.argv}") ``` and produces this output: ``` MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1'] ``` This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.
-
- 11 Mar, 2025 1 commit
-
-
Graham King authored
- Latest from repo, many improvements - Support most of the OpenAI request features (temperature, top_p, etc) - Download models from Hugging Face if necessary
-
- 10 Mar, 2025 1 commit
-
-
Graham King authored
For the `echo` and `pystr` engines we previously required the user to pass `--model-name <x>` so we would have a name for the model. If the input is HTTP we do need this to match on the users' JSON request. If the input is Text we don't need a name. So if the input is Text and we don't already have a name for the model, give it one.
-
- 08 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by:Biswa Panda <biswa.panda@gmail.com>
-
- 07 Mar, 2025 3 commits
-
-
Graham King authored
There are two etcd keys: - The service - The model The second one is the interesting one for us. Previously we confused the two.
-
Graham King authored
Instead of using `out=pystr:<my.py>` we can now do this: ``` dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout> ``` That engine will receive and respond with tokens. Here's an example engine file: ``` import asyncio async def generate(request): yield {"token_ids":[791]} await asyncio.sleep(0.1) yield {"token_ids":[6864]} await asyncio.sleep(0.1) yield {"token_ids":[315]} await asyncio.sleep(0.1) yield {"token_ids":[9822]} await asyncio.sleep(0.1) yield {"token_ids":[374]} await asyncio.sleep(0.1) yield {"token_ids":[12366]} await asyncio.sleep(0.1) yield {"token_ids":[13]} ``` Also reduce duplication by making the bindings engine use the llm lib engine. -
Graham King authored
1. Create `my_engine.py` ``` import asyncio async def generate(request): yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} ``` 2. Build ``` cargo build --release --feature python ``` 3. Run ``` dynemo-run out=pystr:my_engine.py --name test ``` And here's a distributed system, with your engine: - Node 1: `dynemo-run in=http out=dyn://test` - Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
-