- 15 Mar, 2025 5 commits
-
-
ptarasiewiczNV authored
-
Harrison Saturley-Hall authored
-
Maksim Khadkevich authored
-
julienmancuso authored
-
Graham King authored
```
dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
```

The file has genai format, one entry per line:

```
{"text": "the prompt"}
{"text": ..etc
```

Each prompt is evaluated and the output is written to `output.jsonl` in the same folder as the input. At the end of the run various statistics are printed:

> Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)

This is also helpful for pushing load into the system and stressing the various components. It is a batch inference tool and is not intended for performance measurement.
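An input file in this format can be generated with a few lines of Python. This is a minimal sketch: the helper name `write_prompts` is hypothetical, and only the one-object-per-line `{"text": ...}` shape is taken from the description above.

```python
import json
from pathlib import Path


def write_prompts(path, prompts):
    """Write one {"text": ...} JSON object per line (hypothetical helper)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt in prompts:
            # json.dumps escapes quotes and newlines so each entry stays on one line.
            f.write(json.dumps({"text": prompt}) + "\n")
```

Calling `write_prompts("prompts.jsonl", ["What is the capital of Costa Rica?"])` would produce a file that can be passed as `in=batch:prompts.jsonl`.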
-
- 14 Mar, 2025 21 commits
-
-
Anant Sharma authored
-
Graham King authored
Engines mistralrs, sglang and vllm are included by default. They can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`.

Added `--features vulkan` option, for llamacpp.

Build time message if CUDA or Metal would help and are missing. That's the best we can do:

> warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`

Runtime message if CUDA, Metal or Vulkan are enabled:

> 2025-03-14T21:59:26.501937Z INFO dynamo_run: CUDA on

Runtime message if they are missing:

> 2025-03-14T22:02:37.439404Z INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance

Default engine message includes available engines:

> 2025-03-14T21:59:26.503612Z INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok

The really important outcome is that this should now "just work":

```
cargo install dynamo-run
dynamo-run Qwen/Qwen2.5-3B-Instruct
```

Sadly you still need `--features cuda|metal` for performance; I couldn't automate that.
-
Pavithra Vijayakrishnan authored
-
Hongkuan Zhou authored
-
Ryan McCormick authored
-
ishandhanani authored
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
-
Graham King authored
On Mac, embedded Python interpreters don't pick up the virtual env. This seems to be a known problem. Fix the `sys.path`.
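The commit does not say how the path is fixed, so the following is only a sketch of one common approach, written in Python for illustration. It assumes the virtual env is advertised through the `VIRTUAL_ENV` environment variable and uses the POSIX `lib/pythonX.Y/site-packages` layout; both function names are hypothetical.

```python
import os
import sys
from pathlib import Path


def venv_site_packages(venv_dir, version_info=sys.version_info):
    """Return the POSIX-layout site-packages directory inside a virtual env."""
    version = f"python{version_info[0]}.{version_info[1]}"
    return Path(venv_dir) / "lib" / version / "site-packages"


def add_venv_to_sys_path(environ=os.environ, path=sys.path):
    """Prepend the active venv's site-packages to `path` if it is missing.

    Assumption: the venv is identified by the VIRTUAL_ENV variable.
    Returns the directory that should be on the path, or None if no venv is active.
    """
    venv_dir = environ.get("VIRTUAL_ENV")
    if not venv_dir:
        return None
    site_packages = str(venv_site_packages(venv_dir))
    if site_packages not in path:
        path.insert(0, site_packages)
    return site_packages
```

Running `add_venv_to_sys_path()` before importing third-party modules lets an embedded interpreter find packages installed in the active virtual env.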
-
Hongkuan Zhou authored
-
Hongkuan Zhou authored
-
Anant Sharma authored
-
Ryan McCormick authored
-
Ryan McCormick authored
-
Anant Sharma authored
-
Tanmay Verma authored
-
Graham King authored
- Mac doesn't have the `pipe2` syscall, so use plain `pipe`.
- rtnetlink isn't a dependency on Mac, so don't use the type.
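The same `pipe2` portability gap is visible from Python, where `os.pipe2` is only defined on platforms that provide the syscall. The sketch below illustrates the fallback pattern; it is not the code from this commit, and the choice of flags (`O_NONBLOCK` and `O_CLOEXEC`) and the function name are assumptions.

```python
import os


def make_nonblocking_pipe():
    """Return (read_fd, write_fd), both non-blocking and not inherited by children."""
    if hasattr(os, "pipe2"):
        # Platforms with pipe2 (e.g. Linux): set the flags atomically at creation.
        return os.pipe2(os.O_NONBLOCK | os.O_CLOEXEC)
    # No pipe2 (e.g. macOS): create a plain pipe, then set the flags afterwards.
    # Python's os.pipe() already marks both descriptors non-inheritable.
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    os.set_blocking(write_fd, False)
    return read_fd, write_fd
```

The trade-off of the fallback is that the flags are no longer applied atomically with pipe creation.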
-
hhzhang16 authored
Co-authored-by: Julien Mancuso <jmancuso@nvidia.com>
-
Ryan McCormick authored
-
Ryan McCormick authored
-
Ryan McCormick authored
-
Ryan McCormick authored
-
Ryan Olson authored
-
- 13 Mar, 2025 11 commits
-
-
Ryan McCormick authored
-
Ziqi Fan authored
-
Anant Sharma authored
-
Graham King authored
Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it, which saves the extra tokenization step.

Also, dynamo-run now prints which engines are compiled in as part of its help message, plus some minor lint fixes.
-
Anant Sharma authored
-
Graham King authored
"netlink" doesn't exist on Mac. We print the primary network interface to help multi-node setup, which is also unlikely on Mac.
-
Pawel Ziecina authored
-
Dmitry Tokarev authored
-
Biswa Panda authored
-
Tanmay Verma authored
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
Graham King authored
- Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.
- The default engine (previously always mistralrs) depends on what is compiled in.
- Text can be piped in and will result in a single run of the model.

All of those together mean that if you build with `--features vllm` you can do this, and it will download the model, run it with vllm, answer your question, and exit:

```
echo "What is the capital of Costa Rica?" | dynamo-run Qwen/Qwen2.5-3B-Instruct
```

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
- 12 Mar, 2025 3 commits
-
-
hhzhang16 authored
-
Graham King authored
Command line arguments are passed to the python engine like this:

```
dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
```

The python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.

This input:

```
dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
```

is read like this:

```
import sys

async def generate(request):
    .. as before ..

if __name__ == "__main__":
    print(f"MAIN: {sys.argv}")
```

and produces this output:

```
MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
```

This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.
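Inside the engine file, the forwarded list can be turned into typed values with the standard library's `argparse`. This is a minimal sketch rather than part of dynamo-run: the flag names are taken from the example output above, while the function name `parse_engine_args`, the integer types, and the meaning of `-n` are assumptions.

```python
import argparse
import sys


def parse_engine_args(argv):
    """Parse the arguments dynamo-run forwards to a Python engine (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="my_engine.py")
    # Standard flags, as they appear in the example output.
    parser.add_argument("--model-path")
    parser.add_argument("--model-name")
    parser.add_argument("--http-port", type=int)
    parser.add_argument("--tensor-parallel-size", type=int)
    parser.add_argument("--base-gpu-id", type=int)
    parser.add_argument("--num-nodes", type=int)
    parser.add_argument("--node-rank", type=int)
    # Only present when they were given to dynamo-run.
    parser.add_argument("--leader-addr")
    parser.add_argument("--model-config")
    # Engine-specific flag passed after `--`; its meaning is up to the engine.
    parser.add_argument("-n", type=int, default=1)
    # parse_known_args keeps unrecognised custom flags instead of exiting with an error.
    return parser.parse_known_args(argv)


if __name__ == "__main__":
    args, extra = parse_engine_args(sys.argv[1:])
    print(f"MAIN: {args} extra={extra}")
```

Using `parse_known_args` rather than `parse_args` means that new custom flags added after the `--` do not break an engine that has not been updated to declare them.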
Ryan McCormick authored
-