1. 03 Jun, 2025 1 commit
  2. 29 May, 2025 2 commits
  3. 23 May, 2025 2 commits
  4. 22 May, 2025 2 commits
    • Graham King's avatar
      feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821
      Graham King authored
      Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.
      
      Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.
      
      Future todo:
      - Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
      - mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
      6d5da821
    • jmswen's avatar
  5. 21 May, 2025 2 commits
  6. 20 May, 2025 1 commit
    • Faradawn Yang's avatar
      chore: Remove unused RouterType and ModelMetaData (#1138) · eb821bee
      Faradawn Yang authored
      Remove RouterType and ModelMetaData in `lib/runtime/src/protocols.rs`, which are unused (no outside reference). It is because that the routing has been moved to its own module, `pipeline/network/egress/push_router.rs`. Therefore, the legacy definition of RouterType in `protocols.rs` is no longer used.
      eb821bee
  7. 19 May, 2025 1 commit
    • Graham King's avatar
      feat: Support multiple models on single ingress node (#1127) · aeb79e62
      Graham King authored
      We can now do this:
      
      - Node 1:
      
      ```
      dynamo-run in=http out=dyn
      ```
      
      - Node 2 and 3, two instances of component 'backend' in the nemotron_ultra pipeline:
      
      ```
      dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
      ```
      
      - Node 4 and 5, two instances of the 'backend' component in nemotron_super pipeline:
      
      ```
      dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
      ```
      
      The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now.
      
      As part of this auto-discovery is now always `out=dyn`, with no extra URL parts. Previously it could only route to a single pipeline.
      
      Also:
      - Refactor endpoint / instance naming now that I understand them
      - Fix removing models when their instance stops.
      aeb79e62
  8. 16 May, 2025 1 commit
  9. 15 May, 2025 4 commits
    • Graham King's avatar
      chore: Prevent duplicate components with different models. (#1103) · 641234cd
      Graham King authored
      Each namespace is for a single pipeline, so a component must be model-unique. The means we can have several components with the same name running the same model (data parallel), their traffic will be routed according to `--router-mode`, but we cannot have several components with the same name running different models.
      
      Add an `ensure_unique` check to prevent that happening.
      641234cd
    • Ryan McCormick's avatar
    • Graham King's avatar
      fix: Fix default RouterMode value (#1092) · 889ab67e
      Graham King authored
      The Python bindings use the default value for RouterMode. Previously that was Random (good), but now it became None (bad).
      
      Remove the option and clean up the duplicate RouterMode. I was trying to avoid putting the `KV` enum in dynamo-runtime. Turns out adding those two characters gives us a healthy simplification, and restores the old default router value.
      
      Also clean up two noisy log messages when waiting for KV routing metrics to start in worker.
      889ab67e
    • Abrar Shivani's avatar
      feat: Use existing Tokio runtime in components (#941) · 2a5eb7e7
      Abrar Shivani authored
      The runtime library already provides a from_current method that creates and returns a Runtime object initialized with the current Tokio runtime handle. Since components do not use the runtime library directly but access it through the worker, the worker needs to be updated to create itself using a Runtime instance derived from the current Tokio runtime.
      This PR updates the http component and the worker to use the existing Tokio runtime instead of creating a new one. Other components can be similarly updated to run using the existing runtime.
      2a5eb7e7
  10. 14 May, 2025 2 commits
    • Graham King's avatar
      feat(dynamo-run): KV-aware routing (#1064) · 29813508
      Graham King authored
      Router:
      ```
      dynamo-run in=http out=dyn://dynamo.endpoint.generate --router-mode kv
      ```
      
      Worker (* N):
      ```
      dynamo-run in=dyn://dynamo.endpoint.generate out=vllm /data/llms/Qwen/Qwen3-4B
      ```
      
      You need patched vllm and the C bindings `.so`. Full docs in the updated guide: `docs/guides/dynamo_run.md`.
      
      This gives us a pure-Rust ingress node: OpenAI compliant HTTP server + Pre-processor + KV-aware router.
      29813508
    • wxsm's avatar
      fix: add maxage to nats stream (#1053) · 087d398d
      wxsm authored
      Add max_age to nats stream when create, 10 min should be very enough for prefill workers to consume. this prevent system crash while nats jetstream hits disk limit by endless growing messages.
      087d398d
  11. 09 May, 2025 2 commits
  12. 07 May, 2025 1 commit
  13. 06 May, 2025 4 commits
    • jthomson04's avatar
      c4213899
    • Graham King's avatar
      feat(dynamo-run): vllm and sglang subprocess engines (#954) · 28fd481c
      Graham King authored
      New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded python engines.
          
      Why?
          
        - Pure Python, does not require knowing Rust to work on it. Much simpler to maintain.
        - No embedded Python interpreter which avoids linking libpython and avoids the MacOS virtualenv issues.
        - Should have better performance as it's "native" vllm / sglang.
        - Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
      28fd481c
    • hhzhang16's avatar
      403344e5
    • Graham King's avatar
      feat: dynamo-run <-> python interop (#934) · 99cd9d85
      Graham King authored
      Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
      ```
      from dynamo.llm import register_llm
      
      MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
      await register_llm(endpoint, MODEL, 3)
      ```
      
      Full vllm example, with pre-processing in dynamo:
      - `dynamo-run in=text out=dyn://dynamo.backend.generate`
      - `cd lib/bindings/python/examples/hello_world`
      - `python server_vllm.py`
      
      This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus.
      
      The `register_llm` call does this:
      
      - Download the model from HF if necessary
      - Load the model deployment card from the HF folder or extract from GGUF
      - Push the tokenizer config etc into NATS object store so ingress can access it from a different machine
      - Publish the model deployment card to ETCD
      99cd9d85
  14. 01 May, 2025 1 commit
  15. 29 Apr, 2025 1 commit
    • Graham King's avatar
      chore: Split PushRouter from Client (#817) · a1a10365
      Graham King authored
      In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side.
      
      As part of moving pre-processing back to ingress-side we need to split this into two steps:
      - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card.
      - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters.
      
      Part of #743
      a1a10365
  16. 26 Apr, 2025 1 commit
  17. 25 Apr, 2025 2 commits
    • Harrison Saturley-Hall's avatar
    • Graham King's avatar
      chore: Publish Model Deployment Card to NATS (#799) · d346782c
      Graham King authored
      This will allow an ingress-side pre-processor to see it without needing a model checkout.
      
      Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files.
      
      To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store. 
      
      The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided.
      
      Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit.
      
      Part of #743 
      d346782c
  18. 21 Apr, 2025 1 commit
  19. 18 Apr, 2025 1 commit
  20. 12 Apr, 2025 1 commit
  21. 09 Apr, 2025 1 commit
  22. 07 Apr, 2025 1 commit
    • Graham King's avatar
      feat(dynamo-run): Basic routing choice (#524) · ec2e7307
      Graham King authored
      As a first step towards KV routing:
      - introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet.
      - Make the vllm engine publish the KV events received from our patched vllm.
      
      Now we "just" need to connect the two. Easy right?
      ec2e7307
  23. 04 Apr, 2025 2 commits
    • Graham King's avatar
      chore: Upgrade Rust to 1.86 (#518) · e99aa1e1
      Graham King authored
      Also upgrade the cargo resolver to v3, the default.
      
      New clippy lints:
      - `next_back()` instead of `last()` for a double-ended iterator. That avoids walking the whole list.
      - ` repeat_n` instead of `repeat.take`. That avoids cloning.
      - Doc indenting
      e99aa1e1
    • Graham King's avatar
      feat: Python decorator dynamo_worker takes optional `static` parameter without etcd (#494) · 88ad3425
      Graham King authored
      Adds `@dynamo_worker(static = True)` to create a static worker which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio.
      
      This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint), and are discovered by ingress components using etcd.
      
      Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere.
      
      For NIM.
      88ad3425
  24. 03 Apr, 2025 1 commit
  25. 02 Apr, 2025 1 commit
  26. 01 Apr, 2025 1 commit