"launch/dynamo-run/src/vscode:/vscode.git/clone" did not exist on "adad2ecdd7485826d7ac926bf0e62caa958784ed"
- 27 Feb, 2025 2 commits

Paul Hendricks authored

Paul Hendricks authored

- 25 Feb, 2025 6 commits

Graham King authored

- Setup venv:

  ```
  uv venv
  source .venv/bin/activate
  uv pip install pip
  uv pip install sgl-kernel --force-reinstall --no-deps
  uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
  ```

- Build: `cargo build --release --features sglang`

- Run single node (make sure you're in the venv): `./tio out=sglang ~/llm_models/my_model`

- Run DeepSeek multi-GPU / multi-node:

  Node 1:

  ```
  tio in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --dist-init-addr 10.217.98.122:9876
  ```

  Node 2:

  ```
  tio in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --dist-init-addr 10.217.98.122:9876
  ```
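
To sanity-check the deployment, you can query the HTTP front end on node 1. A hedged sketch, assuming the server listens on port 8080 (this commit doesn't show the port; the echo example further down sets `--http-port 8080` explicitly) and serves the same `/v1/models` route used elsewhere in this log:

```
curl localhost:8080/v1/models | jq
```
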
Neelay Shah authored

Paul Hendricks authored

Graham King authored
Add backend type `EngineConfig::StaticCore` that wraps the engine in a preprocessor (prompt templating and tokenization). Add example engine `echo_core` (`out=echo_core`), which takes and returns tokens. A nice side effect is that it echoes the full prompt template, including the system prompt, whereas `echo_full` echoes only the user prompt.
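
For example, a hedged sketch of running it (the model path is hypothetical; a real model directory is presumably needed so the preprocessor can load a tokenizer and prompt template):

```
./tio in=text out=echo_core ~/llm_models/Llama-3.2-1B-Instruct/
```
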
Ryan McCormick authored

Signed-off-by: Ryan McCormick <rmccormick@nvidia.com>

Neelay Shah authored

Signed-off-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

- 21 Feb, 2025 2 commits

Graham King authored

Add support in tio for distributed components and discovery.

Node 1:

```
tio in=http out=tdr://ns/backend/mistralrs
```

Node 2:

```
tio in=tdr://ns/backend/mistralrs out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct
```

This uses etcd to auto-discover the model and NATS to talk to it. You can run multiple workers on the same endpoint, and one is picked at random for each request. The `ns/backend/mistralrs` path is purely symbolic: pick anything, as long as it has three parts and matches the other node.
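
For instance, to add capacity, a second worker can be started with the same endpoint name; a sketch, run on another machine pointing at its local copy of the model:

```
tio in=tdr://ns/backend/mistralrs out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct
```
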
Ryan Olson authored

Signed-off-by: Ryan Olson <ryanolson@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

- 20 Feb, 2025 1 commit

Graham King authored

You can now run an HF repo directly:

```
tio ~/llm_models/Llama-3.2-1B-Instruct/
```

or a GGUF:

```
tio ~/llm_models/Llama-3.2-1B-Instruct-Q4_K_M.gguf
```

Also clean up kv_router so I can merge.
- 14 Feb, 2025 1 commit

Graham King authored

This allows us to run a real model.

Build:

```
cargo build --release --features mistralrs,cuda
```

Run:

```
./target/release/tio in=text out=mistralrs --model-path Llama-3.2-1B-Instruct-Q4_K_M.gguf
```

Why [mistral.rs](https://github.com/EricLBuehler/mistral.rs)?

- It has no dependencies. You don't need a container or a virtual env to get started.
- It supports CUDA, Metal (macOS) and CPU-only. Everyone can join the AI revolution.
- It starts fast and serves fast (with CUDA). That makes it fun to experiment with.
- It runs many models, not just Mistral; that's just its name.
- 13 Feb, 2025 1 commit

Graham King authored

This provides a simple example of how to write a triton-llm engine, and how to connect it to the OpenAI HTTP server. This is the tool previously called `nio` and `llmctl`.

- **Inputs**: Text and HTTP.
- **Engines**: Echo, which streams your prompt back with a slight delay.

Build: `cargo build`

Prerequisites: `nats-server` and `etcd` must be running locally, even though they are not yet used by `tio`.

Run with text input:

```
./target/debug/tio in=text out=echo_full --model-name test
```

Run with the triton-llm HTTP server:

```
./target/debug/tio in=http out=echo_full --http-port 8080 --model-name Echo-0B
```

List models:

```
curl localhost:8080/v1/models | jq
```

This will output:

```
{
  "object": "list",
  "data": [
    {
      "id": "Echo-0B",
      "object": "object",
      "created": 1739400430,
      "owned_by": "nvidia"
    }
  ]
}
```

#### What's next

As triton-distributed gains features, `tio` will be able to grow:

- When we get the pre-processor, we can have token-in token-out engines.
- When we get a pull-router, we can have `in=nats` and `out=nats`.
- When we get discovery, we can have dynamic engines.
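
Since the front end is the OpenAI HTTP server, you should also be able to send the echo engine a completion request. A hedged sketch, assuming the server exposes the standard OpenAI `/v1/chat/completions` route (not shown in this log) alongside `/v1/models`:

```
curl localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Echo-0B",
        "messages": [{"role": "user", "content": "Hello, echo!"}],
        "stream": true
      }'
```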