1. 14 Mar, 2025 1 commit
    • Graham King's avatar
      fix: Various for MacOS (#155) · 76b79149
      Graham King authored
      - Mac doesn't have `pipe2` syscall so use plain `pipe`.
      - rtnetlink isn't a dependency on mac so don't use the type
      76b79149
  2. 13 Mar, 2025 3 commits
    • Graham King's avatar
      feat(mistralrs): Let the engine enforce max tokens (#134) · 404a78e9
      Graham King authored
      Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it which saves the extra tokenization step.
      
      Also dynamo-run prints which engines are compiled in in help message, and some minor lint fixes.
      404a78e9
    • Graham King's avatar
      fix(dynamo-run): Network interface detection is Linux only (#133) · b0d3eba1
      Graham King authored
      "netlink" doesn't exist on Mac. We print the primary network interface to help multi-node setup, which is also unlikely on Mac.
      b0d3eba1
    • Graham King's avatar
      feat(dynamo-run): Download models from HF, smart model defaults (#126) · 089f8e1b
      Graham King authored
      
      
      - Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.
      
      - The default engine (previously always mistralrs) depends on what is compiled in.
      
      - Text can be piped in and will result in a single run of the model.
      
      All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit:
      ```
      echo "What is the capital of Costa Rica?"  | dynamo-run Qwen/Qwen2.5-3B-Instruct
      ```
      Co-authored-by: default avatarRyan McCormick <rmccormick@nvidia.com>
      089f8e1b
  3. 12 Mar, 2025 1 commit
    • Graham King's avatar
      feat(pystr): Pass command line arguments (#123) · 995f71cc
      Graham King authored
      Command line arguments are passed to the python engine like this:
      ```
      dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
      ```
      
      The python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.
      
      This input:
      ```
      dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
      ```
      
      is read like this:
      ```
      async def generate(request):
          .. as before ..
      
      if __name__ == "__main__":
          print(f"MAIN: {sys.argv}")
      ```
      
      and produces this output:
      ```
      MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
      ```
      
      This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.
      995f71cc
  4. 11 Mar, 2025 3 commits
  5. 08 Mar, 2025 1 commit
  6. 07 Mar, 2025 2 commits
    • Graham King's avatar
      feat: Python bring-your-own-engine with our tokenizer (#47) · 12714d90
      Graham King authored
      Instead of using `out=pystr:<my.py>` we can now do this:
      ```
      dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
      ```
      
      That engine will receive and respond with tokens. Here's an example engine file:
      ```
      import asyncio
      
      async def generate(request):
          yield {"token_ids":[791]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[6864]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[315]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[9822]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[374]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[12366]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[13]}
      ```
      
      Also reduce duplication by making the bindings engine use the llm lib engine.
      12714d90
    • Graham King's avatar
      feat: Bring-your-own engine for dynemo-run (#43) · 1b96c2c4
      Graham King authored
      1. Create `my_engine.py`
      
      ```
      import asyncio
      
      async def generate(request):
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
      ```
      
      2. Build
      
      ```
      cargo build --release --feature python
      ```
      
      3. Run
      
      ```
      dynemo-run out=pystr:my_engine.py --name test
      ```
      
      And here's a distributed system, with your engine:
      
      - Node 1: `dynemo-run in=http out=dyn://test`
      - Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
      1b96c2c4
  7. 05 Mar, 2025 2 commits
  8. 04 Mar, 2025 1 commit
  9. 28 Feb, 2025 2 commits
  10. 27 Feb, 2025 2 commits
  11. 26 Feb, 2025 1 commit
  12. 25 Feb, 2025 2 commits
    • Graham King's avatar
      feat: sglang backend for tio (#271) · e97493eb
      Graham King authored
      - Setup venv
      
      ```
      uv venv
      source .venv/bin/activate
      uv pip install pip
      uv pip install sgl-kernel --force-reinstall --no-deps
      uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
      ```
      
      - Build: `cargo build --release --features sglang`
      
      - Run single node (make sure you're in the venv): `./tio out=sglang ~/llm_models/my_model`
      
      - Run Deepseek multi-gpu / multi-node:
      
      Node 1:
      ```
      tio in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --dist-init-addr 10.217.98.122:9876
      ```
      
      Node 2:
      ```
      tio in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --dist-init-addr 10.217.98.122:9876
      ```
      e97493eb
    • Neelay Shah's avatar
      refactor: move libs to lib dir · 08fcd7e9
      Neelay Shah authored
      
      Signed-off-by: default avatarNeelay Shah <neelays@nvidia.com>
      Co-authored-by: default avatarRyan McCormick <rmccormick@nvidia.com>
      08fcd7e9