- 04 Apr, 2025 3 commits
-
-
Yan Ru Pei authored
-
Graham King authored
Also upgrade the cargo resolver to v3, the default. New clippy lints:
- `next_back()` instead of `last()` for a double-ended iterator. That avoids walking the whole list.
- `repeat_n` instead of `repeat().take()`. That avoids cloning.
- Doc indenting
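As an illustration (not code from this commit), a minimal sketch of the two iterator lints, assuming a plain `Vec` of tokens:
```
use std::iter;

fn main() {
    let tokens = vec![1, 2, 3, 4];

    // `last()` on a double-ended iterator walks the whole sequence;
    // `next_back()` jumps straight to the final element.
    assert_eq!(tokens.iter().next_back(), Some(&4));

    // `iter::repeat(x).take(n)` clones `x` for every item;
    // `iter::repeat_n(x, n)` can reuse the original value for the last one.
    let padding: Vec<String> = iter::repeat_n("<pad>".to_string(), 3).collect();
    assert_eq!(padding.len(), 3);
}
```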
-
Graham King authored
Adds `@dynamo_worker(static = True)` to create a static worker which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio.

This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint) and are discovered by ingress components using etcd.

Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere. For NIM.
-
- 03 Apr, 2025 3 commits
-
-
Ryan Olson authored
Moved all of `lib/llm/src/engines` to their own crates as e.g. `lib/engines/mistralrs`. This will allow publishing of the `dynamo-llm` crate as it won't have any GitHub dependencies. The only engines left in dynamo-llm will be the demo `echo` ones.
Co-authored-by: Graham King <grahamk@nvidia.com>
-
tlipoca9 authored
-
Ryan Olson authored
-
- 02 Apr, 2025 1 commit
-
-
Ryan Olson authored
-
- 01 Apr, 2025 2 commits
-
-
Ryan Olson authored
-
Kiv Chen authored
-
- 31 Mar, 2025 3 commits
-
-
Ryan Olson authored
-
Graham King authored
-
Tianer Zhou authored
Signed-off-by: Tianer Zhou <ezhoureal@gmail.com>
-
- 28 Mar, 2025 1 commit
-
-
Biswa Panda authored
-
- 26 Mar, 2025 1 commit
-
-
Ryan Olson authored
-
- 25 Mar, 2025 1 commit
-
-
Graham King authored
Put the arguments in a JSON file:
```
{
  "dtype": "half",
  "trust_remote_code": true
}
```
Pass it like this:
```
dynamo-run out=sglang ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json
```
Requested here https://github.com/ai-dynamo/dynamo/issues/290 (`dtype`) and here https://github.com/ai-dynamo/dynamo/issues/360 (`trust_remote_code`).
-
- 24 Mar, 2025 1 commit
-
-
Graham King authored
This lets us do:
```
dynamo-run out=llamacpp <gguf_file>
```
Previously a `--model-config <hf-repo>` was also required, to configure our tokenizer.
-
- 21 Mar, 2025 1 commit
-
-
zhaohaidao authored
-
- 20 Mar, 2025 1 commit
-
-
Nora authored
Add `AsMut`, `DerefMut` and `IntoIterator` trait impls for the `Tokens` structure.
Signed-off-by: nora-coder-dot <nora6677@gmail.com>
Co-authored-by: nora-coder-dot <nora6677@gmail.com>
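For context, a minimal sketch of what such impls could look like on a token-wrapper newtype; the field layout and token type of the real `Tokens` struct are assumptions:
```
use std::ops::{Deref, DerefMut};

/// Illustrative stand-in for the `Tokens` newtype.
pub struct Tokens(Vec<u32>);

impl Deref for Tokens {
    type Target = Vec<u32>;
    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

impl DerefMut for Tokens {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.0
    }
}

impl AsMut<Vec<u32>> for Tokens {
    fn as_mut(&mut self) -> &mut Vec<u32> {
        &mut self.0
    }
}

impl IntoIterator for Tokens {
    type Item = u32;
    type IntoIter = std::vec::IntoIter<u32>;
    fn into_iter(self) -> Self::IntoIter {
        self.0.into_iter()
    }
}
```
Together these let callers mutate the inner buffer in place and consume the wrapper directly in `for` loops.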
-
- 19 Mar, 2025 4 commits
-
-
Anant Sharma authored
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
-
Graham King authored
This makes the Rust parts all use the ring / rustls libraries instead of a local install of openssl. It's a step on the journey to being statically linked. Pieces:
- `tokenizers` and `mistralrs` now support rustls (mistralrs by default, tokenizers with a feature flag).
- Move shared dependencies up into the workspace.
- The new `rand` crate has some renames for future Rust.
- Ensure the dependency doesn't creep back in by enforcing it with cargo deny (see the sketch below).
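A minimal sketch of how such a cargo-deny ban could look in `deny.toml`; the exact crate names and config used in this repo are assumptions:
```
[bans]
deny = [
    # keep OpenSSL-based TLS from creeping back into the dependency tree
    { name = "openssl-sys" },
    { name = "native-tls" },
]
```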
-
Graham King authored
Under load it sometimes drops a request. The request gets added to the batch (sequence) and immediately gets a FinishReason Stop. Not sure why. It doesn't happen with the default scheduler (non-paged attention), so switch to that for now.
-
Graham King authored
-
- 18 Mar, 2025 2 commits
-
-
Dmitry Tokarev authored
Co-authored-by: Anant Sharma <anants@nvidia.com>
-
Harrison Saturley-Hall authored
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
-
- 17 Mar, 2025 3 commits
-
-
Graham King authored
-
Graham King authored
Previously several parts of the stack ensured max tokens (for this single request) was set. Now only text input sets it (to 8k). Everything else leaves it as is, potentially blank. The engines themselves have very small defaults: 16 for vllm and 128 for sglang. Also fix the dynamo-run CUDA startup message to only print if we're using an engine that would benefit from it (mistralrs, llamacpp).
-
GuanLuo authored
-
- 16 Mar, 2025 1 commit
-
-
April Yang authored
Co-authored-by: Julien Mancuso <jmancuso@nvidia.com>
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: Maksim Khadkevich <mkhadkevich@nvidia.com>
-
- 15 Mar, 2025 1 commit
-
-
Graham King authored
```
dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
```
The file has genai format, one entry per line:
```
{"text": "the prompt"}
{"text": ..etc
```
The prompt is evaluated and the output written to `output.jsonl` in the same folder as the input. At the end of the run various statistics are printed:
> Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)

This is also helpful for pushing load into the system and stressing the various components. Not intended for performance measurement, it's a batch inference tool.
-
- 14 Mar, 2025 9 commits
-
-
Graham King authored
Engines mistralrs, sglang and vllm are included by default. They can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`. Added `--feature vulkan` option, for llamacpp.

Build time message if CUDA or Metal would help and are missing. That's the best we can do:
> warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`

Runtime message if CUDA, Metal or Vulkan are enabled:
> 2025-03-14T21:59:26.501937Z INFO dynamo_run: CUDA on

Runtime message if they are missing:
> 2025-03-14T22:02:37.439404Z INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance

Default engine message includes available engines:
> 2025-03-14T21:59:26.503612Z INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok

The really important outcome is that this should now "just work":
```
cargo install dynamo-run
dynamo-run Qwen/Qwen2.5-3B-Instruct
```
Sadly you still need `--features cuda|metal` for performance, I couldn't automate that.
-
Graham King authored
On Mac, embedded Python interpreters don't pick up the virtual env. This seems to be a known problem. Fix `sys.path`.
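A minimal sketch of that kind of fix using pyo3, assuming the virtual env is located via the `VIRTUAL_ENV` environment variable; the helper name, hard-coded Python version and exact path here are illustrative only:
```
use pyo3::prelude::*;

/// Prepend the active virtual env's site-packages directory to sys.path
/// so the embedded interpreter can import packages installed there.
fn add_venv_to_sys_path(py: Python<'_>) -> PyResult<()> {
    if let Ok(venv) = std::env::var("VIRTUAL_ENV") {
        let site_packages = format!("{venv}/lib/python3.12/site-packages");
        let sys = py.import("sys")?;
        sys.getattr("path")?.call_method1("insert", (0, site_packages))?;
    }
    Ok(())
}
```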
-
Ryan McCormick authored
-
Anant Sharma authored
-
Tanmay Verma authored
-
Graham King authored
- Mac doesn't have the `pipe2` syscall, so use plain `pipe` (see the sketch below).
- rtnetlink isn't a dependency on Mac, so don't use the type.
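A minimal sketch of the cfg-gated approach, assuming a recent `nix` crate provides the syscall wrappers; the actual helper in the repo may look different:
```
use std::os::fd::OwnedFd;

use nix::fcntl::OFlag;
use nix::unistd;

/// Linux: create the pipe with CLOEXEC set atomically via pipe2.
#[cfg(target_os = "linux")]
fn create_pipe() -> nix::Result<(OwnedFd, OwnedFd)> {
    unistd::pipe2(OFlag::O_CLOEXEC)
}

/// macOS: there is no pipe2 syscall, so fall back to plain pipe.
#[cfg(target_os = "macos")]
fn create_pipe() -> nix::Result<(OwnedFd, OwnedFd)> {
    unistd::pipe()
}
```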
-
Ryan McCormick authored
-
Ryan McCormick authored
-
Ryan Olson authored
-
- 13 Mar, 2025 2 commits
-
-
Anant Sharma authored
-
Graham King authored
Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it, which saves the extra tokenization step. Also, dynamo-run now prints which engines are compiled in as part of its help message, plus some minor lint fixes.
-