1. 05 May, 2025 1 commit
  2. 01 May, 2025 1 commit
  3. 29 Apr, 2025 3 commits
    • Abrar Shivani's avatar
      feat: Add request template support for default inference parameters (#841) · adad2ecd
      Abrar Shivani authored
      Adds support for specifying default request parameters through a json template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides.
      
      Changes:
      - Add --request-template CLI flag to specify template file path
      - Integrate template support in HTTP, batch and text input modes
      - Template values can be overridden by individual request parameters
      - Example template.json:
      ```
      {
          "model": "Qwen2.5-3B-Instruct",
          "temperature": 0.7,
          "max_completion_tokens": 4096
      }
      ```
      adad2ecd
    • Graham King's avatar
      904730b9
    • Graham King's avatar
      chore: Split PushRouter from Client (#817) · a1a10365
      Graham King authored
      In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side.
      
      As part of moving pre-processing back to ingress-side we need to split this into two steps:
      - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card.
      - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters.
      
      Part of #743
      a1a10365
  4. 28 Apr, 2025 2 commits
  5. 26 Apr, 2025 1 commit
  6. 25 Apr, 2025 3 commits
    • Harrison Saturley-Hall's avatar
    • Anant Sharma's avatar
      448e79a6
    • Graham King's avatar
      chore: Publish Model Deployment Card to NATS (#799) · d346782c
      Graham King authored
      This will allow an ingress-side pre-processor to see it without needing a model checkout.
      
      Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files.
      
      To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store. 
      
      The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided.
      
      Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit.
      
      Part of #743 
      d346782c
  7. 24 Apr, 2025 1 commit
  8. 21 Apr, 2025 4 commits
  9. 18 Apr, 2025 3 commits
  10. 17 Apr, 2025 3 commits
  11. 12 Apr, 2025 1 commit
  12. 11 Apr, 2025 1 commit
  13. 09 Apr, 2025 2 commits
  14. 07 Apr, 2025 1 commit
    • Graham King's avatar
      feat(dynamo-run): Basic routing choice (#524) · ec2e7307
      Graham King authored
      As a first step towards KV routing:
      - introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet.
      - Make the vllm engine publish the KV events received from our patched vllm.
      
      Now we "just" need to connect the two. Easy right?
      ec2e7307
  15. 04 Apr, 2025 3 commits
    • Yan Ru Pei's avatar
    • Graham King's avatar
      chore: Upgrade Rust to 1.86 (#518) · e99aa1e1
      Graham King authored
      Also upgrade the cargo resolver to v3, the default.
      
      New clippy lints:
      - `next_back()` instead of `last()` for a double-ended iterator. That avoids walking the whole list.
      - ` repeat_n` instead of `repeat.take`. That avoids cloning.
      - Doc indenting
      e99aa1e1
    • Graham King's avatar
      feat: Python decorator dynamo_worker takes optional `static` parameter without etcd (#494) · 88ad3425
      Graham King authored
      Adds `@dynamo_worker(static = True)` to create a static worker which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio.
      
      This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint), and are discovered by ingress components using etcd.
      
      Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere.
      
      For NIM.
      88ad3425
  16. 03 Apr, 2025 3 commits
  17. 02 Apr, 2025 1 commit
  18. 01 Apr, 2025 2 commits
  19. 31 Mar, 2025 3 commits
  20. 28 Mar, 2025 1 commit