- 17 Mar, 2025 17 commits
-
Neelay Shah authored
-
Graham King authored
-
Anant Sharma authored
-
Harrison Saturley-Hall authored
-
Anant Sharma authored
-
Graham King authored
Previously, several parts of the stack ensured that max tokens (for a single request) was set. Now only text input sets it (to 8k); everything else leaves it as is, potentially blank. The engines themselves have very small defaults: 16 for vllm and 128 for sglang. Also fix the dynamo-run CUDA startup message to print only when using an engine that would benefit from it (mistralrs, llamacpp).
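Since the engine defaults are now so small, callers that care about output length should set max tokens per request. A minimal sketch, assuming an OpenAI-compatible HTTP frontend in front of the engine (the URL, port, and model name here are illustrative assumptions, not from this commit):
```
# Hypothetical request; endpoint, port, and model are assumptions.
# Setting max_tokens explicitly avoids falling back to the engines'
# small defaults (16 for vllm, 128 for sglang).
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Qwen/Qwen2.5-3B-Instruct",
        "messages": [{"role": "user", "content": "the prompt"}],
        "max_tokens": 256
      }'
```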
-
nnshah1 authored
-
Anant Sharma authored
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
-
ptarasiewiczNV authored
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
-
Suman Tatiraju authored
-
GuanLuo authored
-
ishandhanani authored
-
Ryan McCormick authored
-
Ryan McCormick authored
-
Neelay Shah authored
-
Anant Sharma authored
-
Anant Sharma authored
-
- 16 Mar, 2025 10 commits
-
Dmitry Tokarev authored
-
Anant Sharma authored
-
David Zier authored
-
Neelay Shah authored
-
ptarasiewiczNV authored
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
-
Harrison Saturley-Hall authored
-
julienmancuso authored
Co-authored-by: Maksim Khadkevich <mkhadkevich@nvidia.com>
-
Maksim Khadkevich authored
-
ishandhanani authored
-
April Yang authored
Co-authored-by: Julien Mancuso <jmancuso@nvidia.com>
Co-authored-by: Hannah Zhang <hannahz@nvidia.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: Maksim Khadkevich <mkhadkevich@nvidia.com>
-
- 15 Mar, 2025 10 commits
-
Biswa Panda authored
-
Neelay Shah authored
-
ptarasiewiczNV authored
-
Matthew Kotila authored
-
Biswa Panda authored
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
-
ptarasiewiczNV authored
-
Harrison Saturley-Hall authored
-
Maksim Khadkevich authored
-
julienmancuso authored
-
Graham King authored
```
dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
```
The file has genai format, one entry per line:
```
{"text": "the prompt"}
{"text": ..etc
```
Each prompt is evaluated and the output written to `output.jsonl` in the same folder as the input. At the end of the run, various statistics are printed:
> Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)
This is also helpful for pushing load into the system and stressing its various components. It is a batch inference tool, not intended for performance measurement.
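A quick way to exercise this mode end to end; the prompt text below is an illustrative assumption, while the command shape and output location come from the commit message:
```
# Generate a small prompts.jsonl, one {"text": ...} entry per line,
# then run it through the batch engine. Model path is illustrative.
for i in $(seq 1 5); do
  echo "{\"text\": \"Write one sentence about topic $i\"}"
done > prompts.jsonl
dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
# Results are written to output.jsonl in the same folder as the input.
```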
-
- 14 Mar, 2025 3 commits
-
Anant Sharma authored
-
Graham King authored
Engines mistralrs, sglang and vllm are included by default. They can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`. Added a `--features vulkan` option for llamacpp.
Build-time message if CUDA or Metal would help and are missing; that's the best we can do:
> warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`
Runtime message if CUDA, Metal or Vulkan are enabled:
> 2025-03-14T21:59:26.501937Z INFO dynamo_run: CUDA on
Runtime message if they are missing:
> 2025-03-14T22:02:37.439404Z INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance
The default engine message includes the available engines:
> 2025-03-14T21:59:26.503612Z INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok
The really important outcome is that this should now "just work":
```
cargo install dynamo-run
dynamo-run Qwen/Qwen2.5-3B-Instruct
```
Sadly you still need `--features cuda|metal` for performance; I couldn't automate that.
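As a concrete instance of the flags above, a sketch of a trimmed build that keeps a single engine plus GPU support (assuming the engine and accelerator features named in the message can be combined in one `--features` list):
```
# Keep only the mistralrs engine and enable CUDA,
# dropping the default sglang and vllm engines.
cargo build --no-default-features --features mistralrs,cuda
```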
-
Pavithra Vijayakrishnan authored
-