1. 21 Apr, 2025 1 commit
  2. 09 Apr, 2025 1 commit
  3. 07 Apr, 2025 1 commit
    • feat(dynamo-run): Basic routing choice (#524) · ec2e7307
      Graham King authored
      As a first step towards KV routing:
      - introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet.
      - Make the vllm engine publish the KV events received from our patched vllm.
      
      Now we "just" need to connect the two. Easy right?
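
      The two modes can be sketched as follows. This is an illustration in Python only (dynamo-run itself is written in Rust); the `Router` class, its `mode` strings and the worker names are hypothetical and not dynamo-run's API.

      ```python
      import itertools
      import random


      class Router:
          """Hypothetical router: picks one worker for each request."""

          def __init__(self, workers, mode="round-robin", seed=None):
              if not workers:
                  raise ValueError("need at least one worker")
              if mode not in ("random", "round-robin"):
                  raise ValueError(f"unknown router mode: {mode}")
              self.workers = list(workers)
              self.mode = mode
              # Round-robin walks the worker list in order, forever.
              self._cycle = itertools.cycle(self.workers)
              # Random picks uniformly; a seed makes runs repeatable.
              self._rng = random.Random(seed)

          def pick(self):
              if self.mode == "random":
                  return self._rng.choice(self.workers)
              return next(self._cycle)
      ```

      Neither mode looks at worker state, which is why a KV-aware router would additionally need the published KV events.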
  4. 04 Apr, 2025 3 commits
    • Yan Ru Pei's avatar
    • chore: Upgrade Rust to 1.86 (#518) · e99aa1e1
      Graham King authored
      Also upgrade the cargo resolver to v3, the default.
      
      New clippy lints:
      - `next_back()` instead of `last()` for a double-ended iterator. That avoids walking the whole list.
      - `repeat_n` instead of `repeat().take()`. That avoids an unnecessary clone.
      - Doc indenting
    • feat: Python decorator dynamo_worker takes optional `static` parameter without etcd (#494) · 88ad3425
      Graham King authored
      Adds `@dynamo_worker(static = True)` to create a static worker which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio.
      
      This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint), and are discovered by ingress components using etcd.
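
      The difference between the two kinds of name can be sketched like this. The `endpoint_name` function and the exact name layout are assumptions made for illustration; the real naming scheme is not shown in this commit message.

      ```python
      import uuid


      def endpoint_name(namespace, component, endpoint, static=False):
          """Hypothetical naming scheme, for illustration only."""
          base = f"{namespace}/{component}/{endpoint}"
          if static:
              # Predictable: a client can build this name without discovery.
              return base
          # Unique per worker instance, so clients must discover it (e.g. via etcd).
          return f"{base}-{uuid.uuid4().hex[:8]}"
      ```

      Under this sketch two static workers for the same trio would get the same name, which is consistent with the single-static-worker restriction described above.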
      
      Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere.
      
      For NIM.
  5. 03 Apr, 2025 1 commit
  6. 02 Apr, 2025 1 commit
  7. 01 Apr, 2025 1 commit
  8. 31 Mar, 2025 1 commit
  9. 26 Mar, 2025 1 commit
  10. 25 Mar, 2025 1 commit
  11. 24 Mar, 2025 1 commit
  12. 21 Mar, 2025 1 commit
  13. 20 Mar, 2025 1 commit
  14. 19 Mar, 2025 2 commits
  15. 17 Mar, 2025 2 commits
  16. 15 Mar, 2025 1 commit
    • feat(dynamo-run): Batch mode (#142) · 2cca070c
      Graham King authored
      ```
      dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
      ```
      
      The file has genai format, one entry per line:
      ```
      {"text": "the prompt"}
      {"text": ..etc
      ```
      
      Each prompt is evaluated and the output is written to `output.jsonl` in the
      same folder as the input.
      
      At the end of the run various statistics are printed:
      > Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)
      
      This is also helpful for pushing load into the system and stressing the
      various components. It is not intended for performance measurement; it is a
      batch inference tool.
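
      A small helper for producing an input file in the one-object-per-line format shown above might look like this. The function names are hypothetical, and only the `text` field from the example is assumed.

      ```python
      import json


      def write_prompts(path, prompts):
          """Write one {"text": ...} JSON object per line."""
          with open(path, "w", encoding="utf-8") as f:
              for prompt in prompts:
                  f.write(json.dumps({"text": prompt}) + "\n")


      def read_prompts(path):
          """Read the prompts back, skipping blank lines."""
          with open(path, encoding="utf-8") as f:
              return [json.loads(line)["text"] for line in f if line.strip()]
      ```

      Using `json.dumps` per line keeps prompts that contain quotes or newlines valid, since each one is escaped into a single line.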
  17. 14 Mar, 2025 6 commits
    • feat(dynamo-run): Various UX improvements (#168) · 1fb31d6a
      Graham King authored
      Engines mistralrs, sglang and vllm included by default. Can be disabled like this: `cargo build --no-default-features --features <add-back-what-you-want>`.
      
      Added `--features vulkan` option, for llamacpp.
      
      Build time message if CUDA or Metal would help and are missing. That's the best we can do:
      > warning: dynamo-run@0.1.0: CUDA not enabled, re-run with `--features cuda`
      
      Runtime message if CUDA, Metal or Vulkan are enabled:
      > 2025-03-14T21:59:26.501937Z  INFO dynamo_run: CUDA on
      
      Runtime message if they are missing:
      > 2025-03-14T22:02:37.439404Z  INFO dynamo_run: CPU mode. Rebuild with `--features cuda|metal|vulkan` for better performance
      
      Default engine message includes available engines:
      > 2025-03-14T21:59:26.503612Z  INFO dynamo_run: Using default engine: mistralrs. Use out=<engine> to specify one of echo_core, echo_full, mistralrs, llamacpp, sglang, vllm, pystr, pytok
      
      The really important outcome is that this should now "just work":
      ```
      cargo install dynamo-run
      dynamo-run Qwen/Qwen2.5-3B-Instruct
      ```
      
      Sadly you still need `--features cuda|metal` for performance, I couldn't automate that.
    • fix(mac): Fix for virtual env (#164) · 4f7f4b40
      Graham King authored
      On Mac, embedded Python interpreters don't pick up the virtual env. This seems to be a known problem. Fix the `sys.path`.
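
      One common way to do this kind of fix is sketched below. It is not necessarily what this commit does: it assumes the `VIRTUAL_ENV` environment variable is set and that the virtual env uses the POSIX `lib/pythonX.Y/site-packages` layout, and the function names are hypothetical.

      ```python
      import os
      import sys


      def venv_site_packages(venv_root, version=None):
          """Return the POSIX-layout site-packages directory for a virtual env."""
          major, minor = (version or sys.version_info)[:2]
          return os.path.join(venv_root, "lib", f"python{major}.{minor}", "site-packages")


      def add_venv_to_sys_path(environ=None, path=None):
          """Prepend the active venv's site-packages to the path list if missing."""
          environ = os.environ if environ is None else environ
          path = sys.path if path is None else path
          venv_root = environ.get("VIRTUAL_ENV")
          if not venv_root:
              return None  # no virtual env active, nothing to fix
          site_dir = venv_site_packages(venv_root)
          if site_dir not in path:
              path.insert(0, site_dir)
          return site_dir
      ```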
    • Tanmay Verma's avatar
      e0bb5bd3
    • fix: Various for MacOS (#155) · 76b79149
      Graham King authored
      - Mac doesn't have the `pipe2` syscall, so use plain `pipe`.
      - rtnetlink isn't a dependency on Mac, so don't use the type.
    • Ryan McCormick's avatar
      dac63127
    • feat: global kv block manager (#45) · f04359cf
      Ryan Olson authored
  18. 13 Mar, 2025 3 commits
    • feat(mistralrs): Let the engine enforce max tokens (#134) · 404a78e9
      Graham King authored
      Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it which saves the extra tokenization step.
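
      The previous count-and-stop approach can be sketched as below. The function is hypothetical, and `count_tokens` stands in for the tokenizer call that is no longer needed once the engine enforces the limit itself.

      ```python
      def limit_tokens(chunks, count_tokens, max_tokens):
          """Yield chunks until the running token count reaches max_tokens."""
          total = 0
          for chunk in chunks:
              if total >= max_tokens:
                  break
              yield chunk
              # This is the extra tokenization step: every chunk is re-tokenized
              # just to keep the count up to date.
              total += count_tokens(chunk)
      ```

      Note that this sketch checks the limit between chunks, so a multi-token chunk can overshoot `max_tokens` slightly.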
      
      Also dynamo-run prints which engines are compiled in in the help message, and some minor lint fixes.
    • fix(dynamo-run): Network interface detection is Linux only (#133) · b0d3eba1
      Graham King authored
      "netlink" doesn't exist on Mac. We print the primary network interface to help multi-node setup, which is also unlikely on Mac.
    • feat(dynamo-run): Download models from HF, smart model defaults (#126) · 089f8e1b
      Graham King authored
      
      
      - Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.
      
      - The default engine (previously always mistralrs) depends on what is compiled in.
      
      - Text can be piped in and will result in a single run of the model.
      
      Together these mean that if you build with `--features vllm`, the following command downloads the model, runs it with vllm, answers your question, and exits:
      ```
      echo "What is the capital of Costa Rica?"  | dynamo-run Qwen/Qwen2.5-3B-Instruct
      ```
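
      Detecting piped input generally comes down to checking whether stdin is a terminal. A minimal Python sketch of that check follows; the function name is hypothetical and this is not dynamo-run's own (Rust) code.

      ```python
      import sys


      def read_piped_prompt(stream=None):
          """Return the piped text, or None when input is an interactive terminal."""
          stream = sys.stdin if stream is None else stream
          if stream.isatty():
              return None  # interactive session: nothing was piped in
          text = stream.read().strip()
          return text or None
      ```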
      Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
  19. 12 Mar, 2025 1 commit
    • feat(pystr): Pass command line arguments (#123) · 995f71cc
      Graham King authored
      Command line arguments are passed to the python engine like this:
      ```
      dynamo-run out=pystr:my_python_engine.py -- -n 42 --custom-arg Orange --yes
      ```
      
      The python engine receives the arguments in `sys.argv`. The argument list will include some standard ones as well as anything after the `--`.
      
      This input:
      ```
      dynamo-run out=pystr:my_engine.py /opt/models/Llama-3.2-3B-Instruct/ --model-name llama_3.2 --tensor-parallel-size 4 -- -n 1
      ```
      
      is read like this:
      ```
      import sys

      async def generate(request):
          ...  # engine body as before

      if __name__ == "__main__":
          # Show every argument dynamo-run passed to this engine file.
          print(f"MAIN: {sys.argv}")
      
      and produces this output:
      ```
      MAIN: ['my_engine.py', '--model-path', '/opt/models/Llama-3.2-3B-Instruct/', '--model-name', 'llama_3.2', '--http-port', '8080', '--tensor-parallel-size', '4', '--base-gpu-id', '0', '--num-nodes', '1', '--node-rank', '0', '-n', '1']
      ```
      
      This allows quick iteration on the engine setup. Note how the `-n` `1` is included. Flags `--leader-addr` and `--model-config` will also be added if provided to `dynamo-run`.
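
      An engine file could turn that list into typed values with `argparse`, roughly as sketched here. The flag names are taken from the output above; the function name and the meaning given to `-n` are assumptions, and `parse_known_args` is used so that flags the engine does not know about are returned rather than rejected.

      ```python
      import argparse


      def parse_engine_args(argv):
          """Parse the flags shown above; anything unrecognised is returned as-is."""
          parser = argparse.ArgumentParser(prog=argv[0])
          parser.add_argument("--model-path")
          parser.add_argument("--model-name")
          parser.add_argument("--http-port", type=int)
          parser.add_argument("--tensor-parallel-size", type=int)
          parser.add_argument("--base-gpu-id", type=int)
          parser.add_argument("--num-nodes", type=int)
          parser.add_argument("--node-rank", type=int)
          # Hypothetical custom flag, i.e. something passed after the `--`.
          parser.add_argument("-n", type=int, dest="custom_n")
          return parser.parse_known_args(argv[1:])
      ```

      In an engine file this would be called as `args, extra = parse_engine_args(sys.argv)`.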
  20. 11 Mar, 2025 4 commits
  21. 09 Mar, 2025 2 commits
  22. 08 Mar, 2025 1 commit
  23. 07 Mar, 2025 3 commits
    • fix: dynemo-run model discovery working again (#52) · 9f53922a
      Graham King authored
      There are two etcd keys:
      - The service
      - The model
      
      The second one is the interesting one for us. Previously we confused the two.
    • refactor: Use library constant for kv-hit-rate subject (#48) · 2ee29443
      Ryan McCormick authored
      Replaces hard-coded "kv-hit-rate" string in multiple places with KV_HIT_RATE_SUBJECT constant in lib/llm.
    • feat: Python bring-your-own-engine with our tokenizer (#47) · 12714d90
      Graham King authored
      Instead of using `out=pystr:<my.py>` we can now do this:
      ```
      dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
      ```
      
      That engine will receive and respond with tokens. Here's an example engine file:
      ```
      import asyncio
      
      async def generate(request):
          yield {"token_ids":[791]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[6864]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[315]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[9822]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[374]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[12366]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[13]}
      ```
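
      To try such an engine file without the full pipeline, its async generator can be drained by a small driver like the one below. The `collect_token_ids` helper is hypothetical, and the empty dict passed as the request is a placeholder because the real request shape is not described here.

      ```python
      import asyncio


      async def generate(request):
          # Stand-in engine: yields token ids in the same shape as the example above.
          for token_id in (791, 6864, 315):
              yield {"token_ids": [token_id]}
              await asyncio.sleep(0)


      async def collect_token_ids(engine, request):
          """Drain an engine's async generator into one flat list of token ids."""
          token_ids = []
          async for chunk in engine(request):
              token_ids.extend(chunk["token_ids"])
          return token_ids


      if __name__ == "__main__":
          print(asyncio.run(collect_token_ids(generate, {})))
      ```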
      
      Also reduce duplication by making the bindings engine use the llm lib engine.