1. 09 Jun, 2025 3 commits
  2. 06 Jun, 2025 1 commit
  3. 04 Jun, 2025 4 commits
  4. 03 Jun, 2025 1 commit
  5. 02 Jun, 2025 2 commits
  6. 30 May, 2025 3 commits
  7. 29 May, 2025 8 commits
  8. 28 May, 2025 3 commits
  9. 27 May, 2025 1 commit
  10. 24 May, 2025 1 commit
  11. 23 May, 2025 4 commits
  12. 22 May, 2025 4 commits
    • feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32
      Graham King authored
      Example:
      ```
      dynamo-run out=<engine> <model> --kv-cache-block-size 64
      ```
      
      In a distributed system, set this on the worker node; it is propagated to the ingress node via the Model Deployment Card.
      
      The block size was previously hard-coded to 16, which is now the default.
      
      - Load context_length from model. Closes #1172
      - Store context length and KV cache block size in Model Deployment Card #1170
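
      To illustrate what the block size changes, here is a minimal sketch of how many KV cache blocks a sequence occupies at a given block size. The function name `blocks_needed` and the 8192-token context length are assumptions made for illustration, not dynamo code.

      ```rust
      /// Number of KV cache blocks needed to hold `num_tokens` tokens.
      /// Rounds up, so a partially filled block still counts as one block.
      /// (Hypothetical helper, not part of dynamo.)
      fn blocks_needed(num_tokens: usize, block_size: usize) -> usize {
          assert!(block_size > 0, "block size must be positive");
          (num_tokens + block_size - 1) / block_size
      }

      fn main() {
          // Assumed context length, for illustration only.
          let context_length = 8192;
          // 16 is the default; 64 matches the example command above.
          for block_size in [16, 64] {
              println!(
                  "block size {}: {} blocks",
                  block_size,
                  blocks_needed(context_length, block_size)
              );
          }
      }
      ```

      A larger block size means fewer, coarser blocks for the same number of tokens.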
    • fix: Fix race condition in kv_router unit test (#1174) · 3bde1e45
      Graham King authored
      Removed the hard-coded sleeps and explained what we're testing.
      
      Closes https://github.com/ai-dynamo/dynamo/issues/1132
      
      The race condition is that `apply_event` does not apply the event directly; it only sends a message on a channel. At some later point the tokio runtime schedules the task running the channel receiver, which applies the event. If that had not happened by the time the test checked the result, the test would fail.
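
      The pattern can be sketched as follows. This is not the kv_router code or the fix from this commit; it is a minimal stand-in that uses `std::sync::mpsc` and a thread instead of tokio, with a hypothetical `Indexer` type and a hypothetical `flush` message that lets a test wait until the receiver has caught up instead of sleeping.

      ```rust
      use std::sync::mpsc;
      use std::sync::{Arc, Mutex};
      use std::thread;

      // Messages handled by the background worker (all names are hypothetical).
      enum Msg {
          Event(u64),
          // The worker replies on this channel once all earlier messages are applied.
          Flush(mpsc::Sender<()>),
      }

      struct Indexer {
          tx: mpsc::Sender<Msg>,
          state: Arc<Mutex<Vec<u64>>>,
      }

      impl Indexer {
          fn new() -> Self {
              let (tx, rx) = mpsc::channel::<Msg>();
              let state = Arc::new(Mutex::new(Vec::new()));
              let worker_state = Arc::clone(&state);
              // The receiver runs on its own thread, so events are applied "later".
              thread::spawn(move || {
                  for msg in rx {
                      match msg {
                          Msg::Event(id) => worker_state.lock().unwrap().push(id),
                          Msg::Flush(done) => {
                              let _ = done.send(());
                          }
                      }
                  }
              });
              Indexer { tx, state }
          }

          // Only enqueues the event; it is not yet visible in `state` on return.
          fn apply_event(&self, id: u64) {
              self.tx.send(Msg::Event(id)).unwrap();
          }

          // Blocks until the worker has processed everything sent before this call.
          fn flush(&self) {
              let (done_tx, done_rx) = mpsc::channel();
              self.tx.send(Msg::Flush(done_tx)).unwrap();
              done_rx.recv().unwrap();
          }

          fn applied(&self) -> Vec<u64> {
              self.state.lock().unwrap().clone()
          }
      }

      fn main() {
          let indexer = Indexer::new();
          indexer.apply_event(1);
          indexer.apply_event(2);
          // Without this wait, the check below could run before the worker does.
          indexer.flush();
          assert_eq!(indexer.applied(), vec![1, 2]);
      }
      ```

      Because the channel delivers messages in order, the reply to `flush` guarantees that both events were applied, with no dependence on timing.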
    • feat: Various KVBM improvements (#1134) · 5d5080ba
      jthomson04 authored
    • feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821
      Graham King authored
      Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.
      
      Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.
      
      Future todo:
      - Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting `stop_conditions.max_tokens`. The mistralrs engine wrapper must do it itself because it does not use the pre-processor.
      - mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
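
      The first todo item could look roughly like the sketch below. The function `clamp_max_tokens` is hypothetical, and subtracting the prompt length from the context length is an assumption about how the limit would be applied; the commit only says that `max_tokens` should stay below the context length.

      ```rust
      /// Limit a request's `max_tokens` so that prompt plus completion fits in
      /// the context window. Hypothetical helper, not the dynamo pre-processor.
      fn clamp_max_tokens(requested: Option<u32>, prompt_tokens: u32, context_length: u32) -> u32 {
          // Tokens left for the completion; 0 if the prompt already fills the context.
          let available = context_length.saturating_sub(prompt_tokens);
          match requested {
              Some(n) => n.min(available),
              // No explicit limit in the request: allow whatever still fits.
              None => available,
          }
      }

      fn main() {
          // Assumed values: context limited to 4096 tokens, 96-token prompt.
          let limit = clamp_max_tokens(Some(10_000), 96, 4096);
          println!("effective max_tokens: {}", limit);
      }
      ```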
  13. 21 May, 2025 3 commits
  14. 20 May, 2025 1 commit
  15. 19 May, 2025 1 commit
    • feat: Support multiple models on single ingress node (#1127) · aeb79e62
      Graham King authored
      We can now do this:
      
      - Node 1:
      
      ```
      dynamo-run in=http out=dyn
      ```
      
      - Nodes 2 and 3, two instances of the 'backend' component in the nemotron_ultra pipeline:
      
      ```
      dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
      ```
      
      - Nodes 4 and 5, two instances of the 'backend' component in the nemotron_super pipeline:
      
      ```
      dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
      ```
      
      The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now.
      
      As part of this, auto-discovery is now always `out=dyn`, with no extra URL parts. Previously it could only route to a single pipeline.
      
      Also:
      - Refactor endpoint / instance naming now that I understand them
      - Fix removing models when their instance stops.
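
      For readers unfamiliar with the `dyn://` addresses in the commands above, here is a minimal sketch of splitting one into its three dot-separated parts. The `EndpointId` type, its field names, and `parse_endpoint` are hypothetical; reading the parts as pipeline, component, and endpoint follows the examples above and is not taken from the dynamo source.

      ```rust
      /// The three parts of an address such as
      /// `dyn://nemotron_ultra.backend.generate`. Field names are assumptions.
      #[derive(Debug, PartialEq)]
      struct EndpointId {
          pipeline: String,
          component: String,
          endpoint: String,
      }

      /// Parse `dyn://<pipeline>.<component>.<endpoint>`; returns `None` if the
      /// scheme is missing or there are not exactly three non-empty parts.
      fn parse_endpoint(url: &str) -> Option<EndpointId> {
          let path = url.strip_prefix("dyn://")?;
          let mut parts = path.split('.');
          let pipeline = parts.next()?;
          let component = parts.next()?;
          let endpoint = parts.next()?;
          if parts.next().is_some()
              || pipeline.is_empty()
              || component.is_empty()
              || endpoint.is_empty()
          {
              return None;
          }
          Some(EndpointId {
              pipeline: pipeline.to_string(),
              component: component.to_string(),
              endpoint: endpoint.to_string(),
          })
      }

      fn main() {
          let id = parse_endpoint("dyn://nemotron_ultra.backend.generate");
          println!("{:?}", id);
      }
      ```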