- 25 Apr, 2025 1 commit
-
-
Graham King authored
This will allow an ingress-side pre-processor to see it without needing a model checkout. Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files. To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store. The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided. Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit. Part of #743
-
- 18 Apr, 2025 2 commits
-
-
Graham King authored
-
Graham King authored
It's different enough that I made a new engine vllm0_8 and renamed the previous engine to vllm0_7. `dynamo-run out=vllm` now expects 0.8. This matches the container change in #690. For older use `dynamo-run out=vllm0_7`.
-
- 17 Apr, 2025 1 commit
-
-
Ryan Olson authored
-
- 09 Apr, 2025 1 commit
-
-
Anant Sharma authored
-
- 03 Apr, 2025 1 commit
-
-
Ryan Olson authored
Moved all of `lib/llm/src/engines` to their own crates as e.g. `lib/engines/mistralrs`. This will allow publishing of the `dynamo-llm` crate as it won't have any github dependencies. The only engines in dynamo-llm will be the demo `echo` ones. Co-authored-by:Graham King <grahamk@nvidia.com>
-
- 02 Apr, 2025 1 commit
-
-
Ryan Olson authored
-
- 31 Mar, 2025 1 commit
-
-
Ryan Olson authored
-
- 24 Mar, 2025 1 commit
-
-
Graham King authored
This lets us do: ``` dynamo-run out=llamacpp <gguf_file> ``` Previously a `--model-config <hf-repo>` was also required, to configure our tokenizer.
-
- 19 Mar, 2025 1 commit
-
-
Graham King authored
This makes the Rust parts all use ring / rustls library instead of local install of openssl. It's a step on the journey to being statically linked. Pieces: - `tokenizers` and `mistralrs` now support rustls (mistralrs by default, tokenizers with feature flag). - Move shared dependencies up into workspace - New `rand` crate has some renames for future rust - Ensure the dependency doesn't creep back in by enforcing it with cargo deny.
-
- 17 Mar, 2025 1 commit
-
-
Graham King authored
-
- 15 Mar, 2025 1 commit
-
-
Graham King authored
``` dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/ ``` The file has genai format, one entry per line: ``` {"text": "the prompt"} {"text": ..etc ``` The prompt is evaluated and the output written to `output.jsonl` in the same folder as the input. At the end of the run various statistics are printed: > Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s) This is also helpful for pushing load into the system and stressing the various components. Not intended for performance measurement, it's a batch inference tool.
-
- 14 Mar, 2025 1 commit
-
-
Ryan Olson authored
-
- 13 Mar, 2025 3 commits
-
-
Anant Sharma authored
-
Graham King authored
"netlink" doesn't exist on Mac. We print the primary network interface to help multi-node setup, which is also unlikely on Mac.
-
Graham King authored
- Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine. - The default engine (previously always mistralrs) depends on what is compiled in. - Text can be piped in and will result in a single run of the model. All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit: ``` echo "What is the capital of Costa Rica?" | dynamo-run Qwen/Qwen2.5-3B-Instruct ``` Co-authored-by:Ryan McCormick <rmccormick@nvidia.com>
-
- 11 Mar, 2025 2 commits
-
-
Graham King authored
- Latest from repo, many improvements - Support most of the OpenAI request features (temperature, top_p, etc) - Download models from Hugging Face if necessary
-
Neelay Shah authored
Co-authored-by:Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
-
- 10 Mar, 2025 1 commit
-
-
Anant Sharma authored
-
- 08 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by:Biswa Panda <biswa.panda@gmail.com>
-
- 07 Mar, 2025 2 commits
-
-
Ryan McCormick authored
-
Neelay Shah authored
-
- 06 Mar, 2025 1 commit
-
-
Ryan McCormick authored
-
- 05 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by:Graham King <grahamk@nvidia.com>
-
- 28 Feb, 2025 2 commits
-
-
Graham King authored
Engine, `tio` support and docs. Proof of concept / experimental.
-
Ryan McCormick authored
-
- 27 Feb, 2025 1 commit
-
-
Anant Sharma authored
-
- 26 Feb, 2025 2 commits
-
-
Paul Hendricks authored
Co-authored-by:Graham King <grahamk@nvidia.com>
-
Ryan McCormick authored
Co-authored-by:Ryan Olson <rolson@nvidia.com>
-
- 25 Feb, 2025 3 commits
-
-
Graham King authored
- Setup venv ``` uv venv source .venv/bin/activate uv pip install pip uv pip install sgl-kernel --force-reinstall --no-deps uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/ ``` - Build: `cargo build --release --features sglang` - Run single node (make sure you're in the venv): `./tio out=sglang ~/llm_models/my_model` - Run Deepseek multi-gpu / multi-node: Node 1: ``` tio in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --dist-init-addr 10.217.98.122:9876 ``` Node 2: ``` tio in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --dist-init-addr 10.217.98.122:9876 ```
-
Alec authored
Co-authored-by:aflowers <aflowers@nvidia.com>
-
Neelay Shah authored
Signed-off-by:
Neelay Shah <neelays@nvidia.com> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com>
-
- 24 Feb, 2025 1 commit
-
-
Biswa Panda authored
-
- 21 Feb, 2025 1 commit
-
-
Ryan Olson authored
Signed-off-by:
Ryan Olson <ryanolson@users.noreply.github.com> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com>
-
- 19 Feb, 2025 1 commit
-
-
Thomas Montfort authored
-
- 18 Feb, 2025 2 commits
-
-
Ryan Olson authored
-
GuanLuo authored
Signed-off-by:
Neelay Shah <neelays@nvidia.com> Co-authored-by:
aflowers <aflowers@nvidia.com> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com> Co-authored-by:
hongkuanz <hongkuanz@nvidia.com> Co-authored-by:
Neelay Shah <neelays@nvidia.com>
-
- 13 Feb, 2025 1 commit
-
-
Ryan Olson authored
-
- 12 Feb, 2025 1 commit
-
-
Ryan Olson authored
Signed-off-by:
Ryan Olson <ryanolson@users.noreply.github.com> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com>
-
- 11 Feb, 2025 1 commit
-
-
Anant Sharma authored
Co-authored-by:Ryan McCormick <rmccormick@nvidia.com>
-