- 01 Jul, 2025 1 commit
-
-
Paul Hendricks authored
-
- 30 Jun, 2025 2 commits
-
-
Graham King authored
Move much of what was in the `dynamo-run` crate into `dynamo-llm` so that everyone can use it.

Example usage:

1. Create a `LocalModel`:
```
let local_model = LocalModelBuilder::default()
    .model_path("Qwen/Qwen3-0.6B")
    .http_port(8080)
    .build().await?;
```
2. Make an engine:
```
let engine_config = EngineConfig::StaticFull {
    engine: dynamo_engine_mistralrs::make_engine(&local_model).await?,
    model: Box::new(local_model),
};
```
3. Connect it to an input and run it:
```
dynamo_llm::entrypoint::input::run_input(Input::Http, runtime, engine_config).await?;
```

For https://github.com/ai-dynamo/dynamo/issues/1647

Code Rabbit summary, thanks:
* Introduced a flexible builder pattern for local model configuration, allowing advanced customization and easier initialization.
* Added new input modes and unified input handling, supporting interactive chat, HTTP server, batch file, and distributed endpoint modes.
* Centralized engine configuration and routing, enabling more extensible and maintainable engine management.
* Simplified and modularized the codebase by moving input and engine logic into dedicated modules.
* Replaced direct model construction with an asynchronous builder for improved clarity and extensibility.
* Streamlined configuration and validation for flags and router settings.
* Added validation to prevent incompatible input and output combinations in endpoint and dynamic modes.
-
Paul Hendricks authored
-
- 25 Jun, 2025 1 commit
-
-
Nathan Barry authored
-
- 03 Jun, 2025 1 commit
-
-
Graham King authored
To talk to the vllm/sglang/trtllm engine we previously hardcoded an endpoint. The user never sees it so it doesn't matter which one. However if you try to run _two_ instances of Dynamo on one machine they will conflict. Use a UUID as the component name to resolve that. Part of the solution for: https://github.com/ai-dynamo/dynamo/issues/1073
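
The idea above can be sketched in a few lines. This is an illustrative Python sketch, not the Rust code from the commit; the helper name `internal_component_name` and the `backend` prefix are assumptions made up for the example.

```python
import uuid


def internal_component_name(prefix: str = "backend") -> str:
    """Return a component name that is unique per process.

    Hypothetical helper: the prefix and the name format are assumptions
    for illustration, not the names Dynamo actually uses.
    """
    # uuid4 is random, so two instances started on the same machine get
    # different names, unlike a hardcoded endpoint name.
    return f"{prefix}-{uuid.uuid4()}"


first = internal_component_name()
second = internal_component_name()
```

Because the user never sees this name, a random value costs nothing in usability while removing the conflict between instances.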
-
- 02 Jun, 2025 1 commit
-
-
Graham King authored
Do not include by default as it needs libgomp1 at runtime. Add a feature to enable it at build time.
-
- 29 May, 2025 1 commit
-
-
Graham King authored
-
- 28 May, 2025 1 commit
-
-
Graham King authored
It was removed from the docs in 0.2.1 and replaced with writing a [standalone Python engine](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_run.md#writing-your-own-engine-in-python). Also remove the associated `dynamo-run` feature `python`. Releasing this in 0.3.0 will resolve #784 and #1109.
-
- 07 May, 2025 1 commit
-
-
Graham King authored
vllm and sglang are now the sub-process engines from #954. Also updated docs on doing vllm and sglang multi-gpu (tensor parallel) and multi-node (pipeline parallel).
-
- 06 May, 2025 2 commits
-
-
Graham King authored
New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded python engines.

Why?
- Pure Python, does not require knowing Rust to work on it. Much simpler to maintain.
- No embedded Python interpreter which avoids linking libpython and avoids the MacOS virtualenv issues.
- Should have better performance as it's "native" vllm / sglang.
- Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
-
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```

Full vllm example, with pre-processing in dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus.

The `register_llm` call does this:
- Download the model from HF if necessary
- Load the model deployment card from the HF folder or extract from GGUF
- Push the tokenizer config etc into NATS object store so ingress can access it from a different machine
- Publish the model deployment card to ETCD
-
- 01 May, 2025 1 commit
-
-
Graham King authored
Part of https://github.com/ai-dynamo/dynamo/issues/743
-
- 24 Apr, 2025 1 commit
-
-
Ryan McCormick authored
-
- 18 Apr, 2025 2 commits
-
-
Graham King authored
-
Graham King authored
It's different enough that I made a new engine vllm0_8 and renamed the previous engine to vllm0_7. `dynamo-run out=vllm` now expects 0.8. This matches the container change in #690. For older use `dynamo-run out=vllm0_7`.
-
- 03 Apr, 2025 1 commit
-
-
Ryan Olson authored
Moved all of `lib/llm/src/engines` to their own crates as e.g. `lib/engines/mistralrs`. This will allow publishing of the `dynamo-llm` crate as it won't have any github dependencies. The only engines in dynamo-llm will be the demo `echo` ones.

Co-authored-by: Graham King <grahamk@nvidia.com>
-
- 19 Mar, 2025 2 commits
-
-
Anant Sharma authored
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
-
Graham King authored
This makes the Rust parts all use ring / rustls library instead of local install of openssl. It's a step on the journey to being statically linked.

Pieces:
- `tokenizers` and `mistralrs` now support rustls (mistralrs by default, tokenizers with feature flag).
- Move shared dependencies up into workspace
- New `rand` crate has some renames for future rust
- Ensure the dependency doesn't creep back in by enforcing it with cargo deny.
-
- 17 Mar, 2025 1 commit
-
-
Graham King authored
-
- 15 Mar, 2025 1 commit
-
-
Graham King authored
```
dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
```

The file has genai format, one entry per line:
```
{"text": "the prompt"}
{"text": ..etc
```

The prompt is evaluated and the output written to `output.jsonl` in the same folder as the input.

At the end of the run various statistics are printed:
> Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)

This is also helpful for pushing load into the system and stressing the various components. Not intended for performance measurement, it's a batch inference tool.
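
The batch flow described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Rust implementation: `echo_engine` and `run_batch` are hypothetical names, the `response` field in the output is an assumed format, and whitespace splitting stands in for a real tokenizer.

```python
import json
import tempfile
import time
from pathlib import Path


def echo_engine(prompt: str) -> str:
    """Stand-in for a real engine: returns the prompt unchanged."""
    return prompt


def run_batch(input_path: Path, engine=echo_engine) -> dict:
    """Evaluate each JSONL prompt and write `output.jsonl` next to the input."""
    output_path = input_path.parent / "output.jsonl"
    tokens_in = tokens_out = entries = 0
    start = time.monotonic()
    with input_path.open() as src, output_path.open("w") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            prompt = json.loads(line)["text"]
            response = engine(prompt)
            # Whitespace splitting is a crude stand-in for a real tokenizer.
            tokens_in += len(prompt.split())
            tokens_out += len(response.split())
            entries += 1
            # The output record layout is an assumption for this sketch.
            dst.write(json.dumps({"text": prompt, "response": response}) + "\n")
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
    return {
        "entries": entries,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tokens_in_per_s": tokens_in / elapsed,
        "tokens_out_per_s": tokens_out / elapsed,
    }


with tempfile.TemporaryDirectory() as tmp:
    prompts = Path(tmp) / "prompts.jsonl"
    prompts.write_text(
        '{"text": "the prompt"}\n{"text": "another short prompt"}\n'
    )
    stats = run_batch(prompts)
    output_lines = (Path(tmp) / "output.jsonl").read_text().splitlines()
```

The throughput numbers here only reflect the stand-in engine; as the commit notes, this mode is for batch inference and load generation rather than benchmarking.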
-
- 14 Mar, 2025 1 commit
-
-
Graham King authored
Engines mistralrs, sglang and vllm included by default. Can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`.

Added `--features vulkan` option, for llamacpp.

Build time message if CUDA or Metal would help and are missing. That's the best we can do:
> warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`

Runtime message if CUDA, Metal or Vulkan are enabled:
> 2025-03-14T21:59:26.501937Z INFO dynamo_run: CUDA on

Runtime message if they are missing:
> 2025-03-14T22:02:37.439404Z INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance

Default engine message includes available engines:
> 2025-03-14T21:59:26.503612Z INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok

The really important outcome is that this should now "just work":
```
cargo install dynamo-run
dynamo-run Qwen/Qwen2.5-3B-Instruct
```

Sadly you still need `--features cuda|metal` for performance, I couldn't automate that.
-
- 13 Mar, 2025 3 commits
-
-
Anant Sharma authored
-
Graham King authored
"netlink" doesn't exist on Mac. We print the primary network interface to help multi-node setup, which is also unlikely on Mac.
-
Graham King authored
- Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.
- The default engine (previously always mistralrs) depends on what is compiled in.
- Text can be piped in and will result in a single run of the model.

All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit:
```
echo "What is the capital of Costa Rica?" | dynamo-run Qwen/Qwen2.5-3B-Instruct
```

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
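
One common way to tell piped text from an interactive session is to check whether standard input is a terminal. The sketch below shows that check in Python; it is an assumption about the general technique, not a description of how `dynamo-run` does it, and `read_piped_prompt` is a hypothetical helper.

```python
import io
import sys


def read_piped_prompt(stream=None):
    """Return the piped-in prompt, or None for an interactive terminal.

    Hypothetical helper for illustration only.
    """
    stream = sys.stdin if stream is None else stream
    if stream.isatty():
        # A terminal is attached: stay in interactive chat mode.
        return None
    # Input was piped or redirected: read it all as a single prompt.
    text = stream.read().strip()
    return text or None


# io.StringIO is not a TTY, so it behaves like piped input here.
piped = read_piped_prompt(io.StringIO("What is the capital of Costa Rica?\n"))
empty = read_piped_prompt(io.StringIO(""))
```

A caller would run the model once and exit when a prompt is returned, and fall back to the interactive loop otherwise.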
-
- 09 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by: Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
Co-authored-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
-
- 08 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
-
- 07 Mar, 2025 1 commit
-
-
Graham King authored
1. Create `my_engine.py`
```
import asyncio

async def generate(request):
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
```
2. Build
```
cargo build --release --features python
```
3. Run
```
dynemo-run out=pystr:my_engine.py --name test
```

And here's a distributed system, with your engine:
- Node 1: `dynemo-run in=http out=dyn://test`
- Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
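
To make the streaming contract concrete, here is a minimal sketch of how a caller might consume such an async generator and assemble the reply. It is a standalone Python illustration, not the code `dynemo-run` uses; the shortened `generate` and the `collect` helper are assumptions for the example, with chunks trimmed to the fields the sketch reads.

```python
import asyncio


async def generate(request):
    """Tiny stand-in engine that streams chunks shaped like the ones above."""
    for piece in ["The", " capital", "."]:
        yield {"choices": [{"index": 0, "delta": {"content": piece, "role": "assistant"}}]}
    # Final chunk: empty content plus a finish_reason, as in the example engine.
    yield {"choices": [{"index": 0, "delta": {"content": "", "role": "assistant"}, "finish_reason": "stop"}]}


async def collect(request) -> str:
    """Concatenate the streamed delta contents until the engine signals stop."""
    parts = []
    async for chunk in generate(request):
        choice = chunk["choices"][0]
        parts.append(choice["delta"].get("content", ""))
        if choice.get("finish_reason") == "stop":
            break
    return "".join(parts)


text = asyncio.run(collect({}))
```

Each yielded dictionary is one streamed delta, so the full answer is simply the deltas joined in order.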
-
- 05 Mar, 2025 2 commits
-
-
Neelay Shah authored
Co-authored-by: Graham King <grahamk@nvidia.com>
-
Graham King authored
-
- 04 Mar, 2025 1 commit
-
-
Graham King authored
Needs more testing but good enough for now. I get the same results with this as with `vllm serve`.
-
- 28 Feb, 2025 2 commits
-
-
Graham King authored
Engine, `tio` support and docs. Proof of concept / experimental.
-
Graham King authored
triton-distributed-llm component and support in tio
-
- 27 Feb, 2025 1 commit
-
-
Graham King authored
Docs in README
-
- 26 Feb, 2025 2 commits
-
-
Paul Hendricks authored
Co-authored-by: Graham King <grahamk@nvidia.com>
-
Anant Sharma authored
-
- 25 Feb, 2025 3 commits
-
-
Graham King authored
- Setup venv
```
uv venv
source .venv/bin/activate
uv pip install pip
uv pip install sgl-kernel --force-reinstall --no-deps
uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```
- Build: `cargo build --release --features sglang`
- Run single node (make sure you're in the venv): `./tio out=sglang ~/llm_models/my_model`
- Run Deepseek multi-gpu / multi-node:

Node 1:
```
tio in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --dist-init-addr 10.217.98.122:9876
```
Node 2:
```
tio in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --dist-init-addr 10.217.98.122:9876
```
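
The two node commands differ only in the input mode and the node rank. A small Python sketch can generate them for any node count; `tio_sglang_args` is a hypothetical helper, and the rule that only rank 0 uses `in=http` is inferred from the two commands above rather than stated by the commit.

```python
def tio_sglang_args(model_path, node_rank, num_nodes,
                    tensor_parallel_size, dist_init_addr):
    """Build the `tio` argument list for one node of a multi-node sglang run."""
    # Assumption: only the first node serves HTTP, the others take no input.
    input_mode = "in=http" if node_rank == 0 else "in=none"
    return [
        "tio", input_mode, "out=sglang",
        "--model-path", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--num-nodes", str(num_nodes),
        "--node-rank", str(node_rank),
        "--dist-init-addr", dist_init_addr,
    ]


commands = [
    tio_sglang_args(
        "~/llm_models/DeepSeek-R1-Distill-Llama-70B/",
        node_rank=rank,
        num_nodes=2,
        tensor_parallel_size=8,
        dist_init_addr="10.217.98.122:9876",
    )
    for rank in range(2)
]
```

Every node must be given the same `--dist-init-addr`, which is why it is passed in once and reused for each rank.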
-
Neelay Shah authored
-
Neelay Shah authored
Signed-off-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
- 21 Feb, 2025 1 commit
-
-
Ryan Olson authored
Signed-off-by: Ryan Olson <ryanolson@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
- 14 Feb, 2025 1 commit
-
-
Graham King authored
This allows us to run a real model.

Build:
```
cargo build --release --features mistralrs,cuda
```

Run:
```
./target/release/tio in=text out=mistralrs --model-path Llama-3.2-1B-Instruct-Q4_K_M.gguf
```

Why [mistral.rs](https://github.com/EricLBuehler/mistral.rs)?
- It has no dependencies. You don't need a container or a virtual env to get started.
- It supports CUDA, Metal (MacOS) and CPU-only. Everyone can join the AI revolution.
- It starts fast and serves fast (with CUDA). That makes it fun to experiment with.
- It runs many models, not just Mistral, that's just its name.
-