1. 16 Jul, 2025 1 commit
  2. 15 Jul, 2025 2 commits
  3. 14 Jul, 2025 1 commit
  4. 11 Jul, 2025 1 commit
  5. 10 Jul, 2025 3 commits
  6. 08 Jul, 2025 2 commits
  7. 07 Jul, 2025 1 commit
  8. 03 Jul, 2025 2 commits
  9. 30 Jun, 2025 2 commits
    • Graham King's avatar
      chore(dynamo-run): Refactor to library (#1687) · 92f06b0e
      Graham King authored
      Move much of what was in the `dynamo-run` crate into `dynamo-llm` so that everyone can use it.
      
      Example usage:
      
      1. Create a `LocalModel`:
      
      ```
          let local_model = LocalModelBuilder::default()
      	.model_path("Qwen/Qwen3-0.6B")
      	.http_port(8080)
      	.build().await?;
      ```
      
      2. Make an engine:
      
      ```
          let engine_config = EngineConfig::StaticFull {
      	engine: dynamo_engine_mistralrs::make_engine(&local_model).await?,
      	model: Box::new(local_model),
          };
      ```
      
      3. Connect it to an input and run it
      
      ```
          dynamo_llm::entrypoint::input::run_input(Input::Http, runtime, engine_config).await?;
      ```
      
      For https://github.com/ai-dynamo/dynamo/issues/1647
      
      Code Rabbit summary, thanks:
        * Introduced a flexible builder pattern for local model configuration, allowing advanced customization and easier initialization.
        * Added new input modes and unified input handling, supporting interactive chat, HTTP server, batch file, and distributed endpoint modes.
        * Centralized engine configuration and routing, enabling more extensible and maintainable engine management.
        * Simplified and modularized the codebase by moving input and engine logic into dedicated modules.
        * Replaced direct model construction with an asynchronous builder for improved clarity and extensibility.
        * Streamlined configuration and validation for flags and router settings.
        * Added validation to prevent incompatible input and output combinations in endpoint and dynamic modes.
      92f06b0e
    • Paul Hendricks's avatar
      refactor: Upgrade async-openai (#1693) · 82eae1fd
      Paul Hendricks authored
      82eae1fd
  10. 25 Jun, 2025 1 commit
  11. 17 Jun, 2025 1 commit
  12. 13 Jun, 2025 1 commit
  13. 03 Jun, 2025 1 commit
  14. 29 May, 2025 3 commits
    • Graham King's avatar
      feat: Initial Granite support (#1271) · 7d0c9386
      Graham King authored
      - Add Granite to our tokenizer
      - Fix pre-processor to load context length correctly
      - Add strftime_now Jinja function for prompt templates
      - Update llama.cpp
      - Handle trtllm errors when not using trtllm
      
      Support depends on the engine:
      
      - `mistral.rs`, our default engine, doesn't support Granite yet.
      
      - `llama.cpp` does and works very well:
      ```
      dynamo-run out=llamacpp ~/llms/granite-3.3-2b-instruct-Q4_K_M.gguf --context-length 16384
      ```
      
      - `vllm` also works very well:
      ```
      dynamo-run in=http out=vllm ~/llms/granite-3.3-2b-instruct --context-length 16384
      ```
      
      - `sglang` mostly works, but it doesn't catch the stop token, so we do in the HTTP ingress, and log an error. The Text ingress doesn't catch it because I disabled it to make the raw echo engine work. A bit of work to do here.
      
      Closes: #1245 
      7d0c9386
    • Anant Sharma's avatar
      9d9a1d9b
    • Alec's avatar
      feat: add KV Event Publishing to vLLM v1 (#1181) · 0df6d462
      Alec authored
      0df6d462
  15. 28 May, 2025 1 commit
  16. 23 May, 2025 1 commit
  17. 21 May, 2025 1 commit
  18. 19 May, 2025 1 commit
  19. 13 May, 2025 1 commit
  20. 09 May, 2025 3 commits
  21. 08 May, 2025 1 commit
    • Graham King's avatar
      feat: Qwen3, Gemma3 and Llama4 support (#1002) · ceaeba3e
      Graham King authored
      . New mistralrs and llamacpp version
      . mistralrs: Handle Gemma 3 and Llama 4 as vision models
      . Update the dynamo-run docs to use Qwen 3
      . Our pre-processor now supports Llama 4's newer multi-modal `config.json`
      . Upgrade minijinja to handle Qwen 3's prompt template
      
      For Llama 4 we'll need to limit the max seq len. vllm says:
      > To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed,...
      
      I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
      ceaeba3e
  22. 07 May, 2025 1 commit
  23. 06 May, 2025 2 commits
    • Graham King's avatar
      feat(dynamo-run): vllm and sglang subprocess engines (#954) · 28fd481c
      Graham King authored
      New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded python engines.
          
      Why?
          
        - Pure Python, does not require knowing Rust to work on it. Much simpler to maintain.
        - No embedded Python interpreter which avoids linking libpython and avoids the MacOS virtualenv issues.
        - Should have better performance as it's "native" vllm / sglang.
        - Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
      28fd481c
    • Graham King's avatar
      feat: dynamo-run <-> python interop (#934) · 99cd9d85
      Graham King authored
      Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
      ```
      from dynamo.llm import register_llm
      
      MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
      await register_llm(endpoint, MODEL, 3)
      ```
      
      Full vllm example, with pre-processing in dynamo:
      - `dynamo-run in=text out=dyn://dynamo.backend.generate`
      - `cd lib/bindings/python/examples/hello_world`
      - `python server_vllm.py`
      
      This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus.
      
      The `register_llm` call does this:
      
      - Download the model from HF if necessary
      - Load the model deployment card from the HF folder or extract from GGUF
      - Push the tokenizer config etc into NATS object store so ingress can access it from a different machine
      - Publish the model deployment card to ETCD
      99cd9d85
  24. 01 May, 2025 1 commit
  25. 29 Apr, 2025 1 commit
    • Graham King's avatar
      chore: Split PushRouter from Client (#817) · a1a10365
      Graham King authored
      In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side.
      
      As part of moving pre-processing back to ingress-side we need to split this into two steps:
      - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card.
      - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters.
      
      Part of #743
      a1a10365
  26. 25 Apr, 2025 3 commits
    • Harrison Saturley-Hall's avatar
    • Anant Sharma's avatar
      448e79a6
    • Graham King's avatar
      chore: Publish Model Deployment Card to NATS (#799) · d346782c
      Graham King authored
      This will allow an ingress-side pre-processor to see it without needing a model checkout.
      
      Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files.
      
      To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store. 
      
      The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided.
      
      Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit.
      
      Part of #743 
      d346782c
  27. 18 Apr, 2025 1 commit