- 25 Mar, 2025 1 commit
-
-
Graham King authored
Put the arguments in a JSON file: ``` { "dtype": "half", "trust_remote_code": true } ``` Pass it like this: ``` dynamo-run out=sglang ~/llm_models/Llama-3.2-3B-Instruct --extra-engine-args sglang_extra.json ``` Requested here https://github.com/ai-dynamo/dynamo/issues/290 (`dtype`) and here https://github.com/ai-dynamo/dynamo/issues/360 (`trust_remote_code`).
-
- 24 Mar, 2025 1 commit
-
-
Graham King authored
This lets us do: ``` dynamo-run out=llamacpp <gguf_file> ``` Previously a `--model-config <hf-repo>` was also required, to configure our tokenizer.
-
- 21 Mar, 2025 1 commit
-
-
zhaohaidao authored
-
- 19 Mar, 2025 2 commits
-
-
Graham King authored
Under load it sometimes drops a request. The request gets added to the batch (sequence) and immediately gets a FinishReason Stop. Not sure why. It doesn't happen with the default scheduler (non-paged attention), so switch to that for now.
-
Graham King authored
-
- 17 Mar, 2025 1 commit
-
-
Graham King authored
Previously several parts of the stack ensured max tokens (for this single request) was set. Now only text input sets it (to 8k). Everything else leaves as is, potentially blank. The engines themselves have very small defaults, 16 for vllm and 128 for sglang. Also fix dynamo-run CUDA startup message to only print if we're using an engine that would benefit from it (mistralrs, llamacpp).
-
- 14 Mar, 2025 3 commits
-
-
Graham King authored
On Mac embedded python interpreters don't pick up the virtual env. This seems to be a known problem. Fix the sys.path.
-
Tanmay Verma authored
-
Graham King authored
- Mac doesn't have `pipe2` syscall so use plain `pipe`. - rtnetlink isn't a dependency on mac so don't use the type
-
- 13 Mar, 2025 3 commits
-
-
Graham King authored
Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it which saves the extra tokenization step. Also dynamo-run prints which engines are compiled in in help message, and some minor lint fixes.
-
Graham King authored
"netlink" doesn't exist on Mac. We print the primary network interface to help multi-node setup, which is also unlikely on Mac.
-
Graham King authored
- Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine. - The default engine (previously always mistralrs) depends on what is compiled in. - Text can be piped in and will result in a single run of the model. All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit: ``` echo "What is the capital of Costa Rica?" | dynamo-run Qwen/Qwen2.5-3B-Instruct ``` Co-authored-by:Ryan McCormick <rmccormick@nvidia.com>
-
- 12 Mar, 2025 1 commit
-
-
Graham King authored
Command line arguments are passed to the python engine like this: ``` dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes ``` The python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`. This input: ``` dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1 ``` is read like this: ``` async def generate(request): .. as before .. if __name__ == "__main__": print(f"MAIN: {sys.argv}") ``` and produces this output: ``` MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1'] ``` This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.
-
- 11 Mar, 2025 3 commits
-
-
Graham King authored
If the python file raises an exception we print it like Python would. ``` $ ./target/debug/dynamo-run in=http out=pystr:~/Temp/cn47/1_e.py --model-name test Traceback (most recent call last): File "/home/graham/Temp/cn47/1_e.py", line 17, in generate raise MyException("The message") 1_e.MyException: The message ``` -
Graham King authored
- Latest from repo, many improvements - Support most of the OpenAI request features (temperature, top_p, etc) - Download models from Hugging Face if necessary
-
Ryan McCormick authored
-
- 08 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by:Biswa Panda <biswa.panda@gmail.com>
-
- 07 Mar, 2025 2 commits
-
-
Graham King authored
Instead of using `out=pystr:<my.py>` we can now do this: ``` dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout> ``` That engine will receive and respond with tokens. Here's an example engine file: ``` import asyncio async def generate(request): yield {"token_ids":[791]} await asyncio.sleep(0.1) yield {"token_ids":[6864]} await asyncio.sleep(0.1) yield {"token_ids":[315]} await asyncio.sleep(0.1) yield {"token_ids":[9822]} await asyncio.sleep(0.1) yield {"token_ids":[374]} await asyncio.sleep(0.1) yield {"token_ids":[12366]} await asyncio.sleep(0.1) yield {"token_ids":[13]} ``` Also reduce duplication by making the bindings engine use the llm lib engine. -
Graham King authored
1. Create `my_engine.py` ``` import asyncio async def generate(request): yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} await asyncio.sleep(0.1) yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"} ``` 2. Build ``` cargo build --release --feature python ``` 3. Run ``` dynemo-run out=pystr:my_engine.py --name test ``` And here's a distributed system, with your engine: - Node 1: `dynemo-run in=http out=dyn://test` - Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
-
- 05 Mar, 2025 2 commits
-
-
Graham King authored
Fixes a panic.
-
Neelay Shah authored
Co-authored-by:Graham King <grahamk@nvidia.com>
-
- 04 Mar, 2025 1 commit
-
-
Graham King authored
Needs more testing but good enough for now. I get the same results with this as with `vllm serve`.
-
- 28 Feb, 2025 2 commits
-
-
Graham King authored
Engine, `tio` support and docs. Proof of concept / experimental.
-
Graham King authored
triton-distributed-llm component and support in tio
-
- 27 Feb, 2025 2 commits
-
-
Graham King authored
Docs in README
-
Paul Hendricks authored
-
- 26 Feb, 2025 1 commit
-
-
Paul Hendricks authored
Co-authored-by:Graham King <grahamk@nvidia.com>
-
- 25 Feb, 2025 2 commits
-
-
Graham King authored
- Setup venv ``` uv venv source .venv/bin/activate uv pip install pip uv pip install sgl-kernel --force-reinstall --no-deps uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/ ``` - Build: `cargo build --release --features sglang` - Run single node (make sure you're in the venv): `./tio out=sglang ~/llm_models/my_model` - Run Deepseek multi-gpu / multi-node: Node 1: ``` tio in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --dist-init-addr 10.217.98.122:9876 ``` Node 2: ``` tio in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --dist-init-addr 10.217.98.122:9876 ```
-
Neelay Shah authored
Signed-off-by:
Neelay Shah <neelays@nvidia.com> Co-authored-by:
Ryan McCormick <rmccormick@nvidia.com>
-