  1. 17 Mar, 2025 1 commit
    • fix(vllm,sglang): Let the engine enforce max tokens (#216) · 05765cd4
      Graham King authored
      Previously several parts of the stack ensured max tokens (for this single request) was set.
      
      Now only text input sets it (to 8k). Everything else leaves it as-is, potentially blank. The engines themselves have very small defaults: 16 for vllm and 128 for sglang.
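      For HTTP clients the practical takeaway is to set max tokens explicitly rather than lean on those tiny engine defaults. A minimal sketch, assuming the `in=http` front end exposes an OpenAI-compatible `/v1/chat/completions` endpoint (URL, port and model name are assumptions, not from this commit):
      
      ```
      import json
      import urllib.request
      
      payload = {
          "model": "Llama-3.2-3B-Instruct",
          "messages": [{"role": "user", "content": "What is the capital of France?"}],
          # Omit max_tokens and the engine default applies (16 for vllm, 128 for sglang).
          "max_tokens": 256,
      }
      req = urllib.request.Request(
          "http://localhost:8080/v1/chat/completions",  # assumed endpoint
          data=json.dumps(payload).encode(),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          print(json.loads(resp.read())["choices"][0]["message"]["content"])
      ```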
      
      Also fix the dynamo-run CUDA startup message so it only prints if we're using an engine that would benefit from it (mistralrs, llamacpp).
  2. 15 Mar, 2025 1 commit
    • feat(dynamo-run): Batch mode (#142) · 2cca070c
      Graham King authored
      ```
      dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
      ```
      
      The file is in genai format, one entry per line:
      ```
      {"text": "the prompt"}
      {"text": ..etc
      ```
      
      The prompt is evaluated and the output written to `output.jsonl` in the
      same folder as the input.
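      A minimal sketch (not part of this commit) of preparing `prompts.jsonl` in that format and reading the results back afterwards:
      
      ```
      import json
      
      # Write prompts.jsonl: one {"text": ...} entry per line, as described above.
      prompts = ["What is the capital of France?", "Write a haiku about GPUs."]
      with open("prompts.jsonl", "w") as f:
          for p in prompts:
              f.write(json.dumps({"text": p}) + "\n")
      
      # After the dynamo-run batch invocation finishes, read the responses back.
      with open("output.jsonl") as f:
          for line in f:
              print(json.loads(line))
      ```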
      
      At the end of the run various statistics are printed:
      > Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)
      
      This is also helpful for pushing load into the system and stressing the
      various components. It's not intended for performance measurement; it's a
      batch inference tool.
  3. 14 Mar, 2025 4 commits
    • feat(dynamo-run): Various UX improvements (#168) · 1fb31d6a
      Graham King authored
      The mistralrs, sglang and vllm engines are included by default. They can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`.
      
      Added a `--features vulkan` option, for llamacpp.
      
      Build-time message if CUDA or Metal would help but is missing. That's the best we can do:
      > warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`
      
      Runtime message if CUDA, Metal or Vulkan is enabled:
      > 2025-03-14T21:59:26.501937Z  INFO dynamo_run: CUDA on
      
      Runtime message if they are missing:
      > 2025-03-14T22:02:37.439404Z  INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance
      
      Default engine message includes the available engines:
      > 2025-03-14T21:59:26.503612Z  INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok
      
      The really important outcome is that this should now "just work":
      ```
      cargo install dynamo-run
      dynamo-run Qwen/Qwen2.5-3B-Instruct
      ```
      
      Sadly you still need `--features cuda|metal` for performance; I couldn't automate that.
    • fix: Various for MacOS (#155) · 76b79149
      Graham King authored
      - Mac doesn't have the `pipe2` syscall, so use plain `pipe`.
      - rtnetlink isn't a dependency on Mac, so don't use the type.
  4. 13 Mar, 2025 5 commits
  5. 12 Mar, 2025 1 commit
    • feat(pystr): Pass command line arguments (#123) · 995f71cc
      Graham King authored
      Command-line arguments are passed to the Python engine like this:
      ```
      dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
      ```
      
      The Python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.
      
      This input:
      ```
      dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
      ```
      
      is read like this:
      ```
      import sys
      
      async def generate(request):
          ...  # as before
      
      if __name__ == "__main__":
          print(f"MAIN: {sys.argv}")
      ```
      
      and produces this output:
      ```
      MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama_3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
      ```
      
      This allows quick iteration on the engine setup. Note how the `-n 1` is included. The `--leader-addr` and `--model-config` flags will also be added if they are provided to `dynamo-run`.
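      For illustration, here is a sketch of how `my_engine.py` might consume these arguments. The argparse usage is an assumption, not something dynamo-run requires; `parse_known_args` tolerates the standard flags that get injected:
      
      ```
      import argparse
      import sys
      
      def parse_args(argv):
          parser = argparse.ArgumentParser()
          parser.add_argument("--model-path")
          parser.add_argument("--model-name")
          parser.add_argument("--tensor-parallel-size", type=int, default=1)
          parser.add_argument("-n", type=int, default=1)  # custom flag passed after `--`
          # Ignore any injected flags we did not declare (e.g. --http-port, --node-rank).
          args, _unknown = parser.parse_known_args(argv)
          return args
      
      if __name__ == "__main__":
          print(parse_args(sys.argv[1:]))
      ```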
  6. 11 Mar, 2025 5 commits
  7. 10 Mar, 2025 3 commits
  8. 09 Mar, 2025 1 commit
  9. 08 Mar, 2025 1 commit
  10. 07 Mar, 2025 3 commits
    • fix: dynemo-run model discovery working again (#52) · 9f53922a
      Graham King authored
      There are two etcd keys:
      - The service
      - The model
      
      The second one is the interesting one for us. Previously we confused the two.
    • feat: Python bring-your-own-engine with our tokenizer (#47) · 12714d90
      Graham King authored
      Instead of using `out=pystr:<my.py>` we can now do this:
      ```
      dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
      ```
      
      That engine will receive and respond with tokens. Here's an example engine file:
      ```
      import asyncio
      
      async def generate(request):
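          # Stream a fixed, hard-coded sequence of token ids, one per yield, pausing briefly between them.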
          yield {"token_ids":[791]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[6864]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[315]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[9822]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[374]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[12366]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[13]}
      ```
      
      Also reduce duplication by making the bindings engine use the llm lib engine.
    • feat: Bring-your-own engine for dynemo-run (#43) · 1b96c2c4
      Graham King authored
      1. Create `my_engine.py` (a helper that trims the repeated chunk boilerplate is sketched at the end of this message)
      
      ```
      import asyncio
      
      async def generate(request):
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
      ```
      
      2. Build
      
      ```
      cargo build --release --features python
      ```
      
      3. Run
      
      ```
      dynemo-run out=pystr:my_engine.py --name test
      ```
      
      And here's a distributed system, with your engine:
      
      - Node 1: `dynemo-run in=http out=dyn://test`
      - Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
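      
      The step 1 example repeats the full chunk envelope on every yield. A small helper keeps the engine readable; this is a sketch assuming the same chat.completion.chunk shape as above, not part of dynamo:
      
      ```
      import asyncio
      import time
      
      def chunk(content, finish_reason=None):
          # Wrap a piece of text in the same streaming chunk envelope used in step 1.
          choice = {"index": 0, "delta": {"content": content, "role": "assistant"}}
          if finish_reason is not None:
              choice["finish_reason"] = finish_reason
          return {
              "id": "1",
              "choices": [choice],
              "created": int(time.time()),
              "model": "Llama-3.2-1B-Instruct",
              "system_fingerprint": "local",
              "object": "chat.completion.chunk",
          }
      
      async def generate(request):
          for word in ["The", " capital", " of", " France", " is", " Paris", "."]:
              yield chunk(word)
              await asyncio.sleep(0.1)
          yield chunk("", finish_reason="stop")
      ```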
  11. 05 Mar, 2025 3 commits
  12. 04 Mar, 2025 1 commit
  13. 03 Mar, 2025 1 commit
  14. 28 Feb, 2025 2 commits
  15. 27 Feb, 2025 4 commits
  16. 26 Feb, 2025 2 commits
  17. 25 Feb, 2025 2 commits
    • feat: sglang backend for tio (#271) · e97493eb
      Graham King authored
      - Set up venv
      
      ```
      uv venv
      source .venv/bin/activate
      uv pip install pip
      uv pip install sgl-kernel --force-reinstall --no-deps
      uv pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
      ```
      
      - Build: `cargo build --release --features sglang`
      
      - Run single node (make sure you're in the venv): `./tio out=sglang ~/llm_models/my_model`
      
      - Run Deepseek multi-gpu / multi-node:
      
      Node 1:
      ```
      tio in=http out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 0 --dist-init-addr 10.217.98.122:9876
      ```
      
      Node 2:
      ```
      tio in=none out=sglang --model-path ~/llm_models/DeepSeek-R1-Distill-Llama-70B/ --tensor-parallel-size 8 --num-nodes 2 --node-rank 1 --dist-init-addr 10.217.98.122:9876
      ```