1. 16 Sep, 2025 1 commit
  2. 22 Aug, 2025 1 commit
  3. 20 Aug, 2025 1 commit
  4. 19 Aug, 2025 1 commit
  5. 12 Aug, 2025 1 commit
  6. 30 Jun, 2025 1 commit
    • chore(dynamo-run): Refactor to library (#1687) · 92f06b0e
      Graham King authored
      Move much of what was in the `dynamo-run` crate into `dynamo-llm` so that everyone can use it.
      
      Example usage:
      
      1. Create a `LocalModel`:
      
      ```
          let local_model = LocalModelBuilder::default()
              .model_path("Qwen/Qwen3-0.6B")
              .http_port(8080)
              .build().await?;
      ```
      
      2. Make an engine:
      
      ```
          let engine_config = EngineConfig::StaticFull {
              engine: dynamo_engine_mistralrs::make_engine(&local_model).await?,
              model: Box::new(local_model),
          };
      ```
      
      3. Connect it to an input and run it:
      
      ```
          dynamo_llm::entrypoint::input::run_input(Input::Http, runtime, engine_config).await?;
      ```
      
      For https://github.com/ai-dynamo/dynamo/issues/1647
      
      Code Rabbit summary, thanks:
        * Introduced a flexible builder pattern for local model configuration, allowing advanced customization and easier initialization.
        * Added new input modes and unified input handling, supporting interactive chat, HTTP server, batch file, and distributed endpoint modes.
        * Centralized engine configuration and routing, enabling more extensible and maintainable engine management.
        * Simplified and modularized the codebase by moving input and engine logic into dedicated modules.
        * Replaced direct model construction with an asynchronous builder for improved clarity and extensibility.
        * Streamlined configuration and validation for flags and router settings.
        * Added validation to prevent incompatible input and output combinations in endpoint and dynamic modes.
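
      The input modes listed in the summary map naturally onto an enum. The following is a minimal sketch, not the actual `dynamo-llm` type: the `InputMode` name, its variants, and `parse_input` are hypothetical, and the accepted strings are assumed from the `in=` values used in other commit messages in this history (`http`, `text`, `batch:<file>`, `dyn://<path>`).

      ```rust
      use std::path::PathBuf;

      /// Hypothetical stand-in for the input modes named in the summary.
      #[derive(Debug, PartialEq)]
      enum InputMode {
          Text,             // interactive chat
          Http,             // HTTP server
          Batch(PathBuf),   // batch file, written as `batch:<file>`
          Endpoint(String), // distributed endpoint, written as `dyn://<path>`
      }

      /// Parse the value that follows `in=` on the command line.
      fn parse_input(value: &str) -> Result<InputMode, String> {
          if let Some(file) = value.strip_prefix("batch:") {
              if file.is_empty() {
                  return Err("batch mode needs a file".to_string());
              }
              return Ok(InputMode::Batch(PathBuf::from(file)));
          }
          if let Some(path) = value.strip_prefix("dyn://") {
              return Ok(InputMode::Endpoint(path.to_string()));
          }
          match value {
              "text" => Ok(InputMode::Text),
              "http" => Ok(InputMode::Http),
              other => Err(format!("unknown input mode: {other}")),
          }
      }
      ```

      Under these assumptions, `parse_input("batch:prompts.jsonl")` yields `InputMode::Batch`, and an unrecognized value is reported as an error instead of being silently ignored.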
  7. 21 May, 2025 1 commit
  8. 19 May, 2025 1 commit
    • feat: Support multiple models on single ingress node (#1127) · aeb79e62
      Graham King authored
      We can now do this:
      
      - Node 1:
      
      ```
      dynamo-run in=http out=dyn
      ```
      
      - Node 2 and 3, two instances of component 'backend' in the nemotron_ultra pipeline:
      
      ```
      dynamo-run in=dyn://nemotron_ultra.backend.generate out=vllm /data/models/NemotronUltra
      ```
      
      - Node 4 and 5, two instances of the 'backend' component in nemotron_super pipeline:
      
      ```
      dynamo-run in=dyn://nemotron_super.backend.generate out=vllm /data/models/NemotronSuper
      ```
      
      The ingress node will discover all four instances and route correctly. We have been planning for this for a long time now.
      
      As part of this, auto-discovery is now always `out=dyn`, with no extra URL parts. Previously it could only route to a single pipeline.
      
      Also:
      - Refactor endpoint / instance naming now that I understand them
      - Fix removing models when their instance stops.
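
      As an illustration of the addressing used in the commands above, the sketch below splits a `dyn://` address into three dot-separated parts. The `EndpointId` struct, its field names, and `parse_endpoint` are hypothetical. The commit message only confirms that `nemotron_ultra` is the pipeline and `backend` is the component, so treating the last part as the endpoint name is an assumption.

      ```rust
      /// Hypothetical parsed form of `dyn://<pipeline>.<component>.<endpoint>`.
      #[derive(Debug, PartialEq)]
      struct EndpointId {
          pipeline: String,
          component: String,
          endpoint: String,
      }

      /// Returns `None` unless the address has exactly three non-empty parts.
      fn parse_endpoint(url: &str) -> Option<EndpointId> {
          let path = url.strip_prefix("dyn://")?;
          let mut parts = path.split('.');
          let id = EndpointId {
              pipeline: parts.next()?.to_string(),
              component: parts.next()?.to_string(),
              endpoint: parts.next()?.to_string(),
          };
          let has_empty_part =
              id.pipeline.is_empty() || id.component.is_empty() || id.endpoint.is_empty();
          // A fourth part or an empty part means the address is malformed.
          if parts.next().is_some() || has_empty_part {
              return None;
          }
          Some(id)
      }
      ```

      An ingress node could group discovered instances by the `pipeline` field to keep the two models apart. That grouping is an illustration only, not a description of how the router is implemented.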
  9. 06 May, 2025 1 commit
    • feat(dynamo-run): vllm and sglang subprocess engines (#954) · 28fd481c
      Graham King authored
      New vllm and sglang engines that run in a sub-process. They will hopefully replace the existing embedded Python engines.
          
      Why?
          
        - Pure Python, does not require knowing Rust to work on it. Much simpler to maintain.
        - No embedded Python interpreter, which avoids linking libpython and avoids the macOS virtualenv issues.
        - Should have better performance as it's "native" vllm / sglang.
        - Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
  10. 01 May, 2025 1 commit
  11. 29 Apr, 2025 1 commit
    • feat: Add request template support for default inference parameters (#841) · adad2ecd
      Abrar Shivani authored
      Adds support for specifying default request parameters through a JSON template file that can be applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides.
      
      Changes:
      - Add --request-template CLI flag to specify template file path
      - Integrate template support in HTTP, batch and text input modes
      - Template values can be overridden by individual request parameters
      - Example template.json:
      ```
      {
          "model": "Qwen2.5-3B-Instruct",
          "temperature": 0.7,
          "max_completion_tokens": 4096
      }
      ```
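
      The override rule described above (template values apply only where the request leaves a field unset) can be sketched as follows. The `Params` struct and `apply_template` function are hypothetical and cover only the three fields from the example template. They are not the types used in the actual change.

      ```rust
      /// Hypothetical subset of request fields; `None` means "not set by the caller".
      #[derive(Debug, Clone, PartialEq, Default)]
      struct Params {
          model: Option<String>,
          temperature: Option<f32>,
          max_completion_tokens: Option<u32>,
      }

      /// Fill every field the request left unset with the template's value.
      /// Fields the request did set are kept unchanged.
      fn apply_template(request: Params, template: &Params) -> Params {
          Params {
              model: request.model.or_else(|| template.model.clone()),
              temperature: request.temperature.or(template.temperature),
              max_completion_tokens: request
                  .max_completion_tokens
                  .or(template.max_completion_tokens),
          }
      }
      ```

      With the example template, a request that sets only `temperature` keeps its own temperature and picks up `model` and `max_completion_tokens` from the template.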
  12. 21 Apr, 2025 1 commit
  13. 07 Apr, 2025 1 commit
    • feat(dynamo-run): Basic routing choice (#524) · ec2e7307
      Graham King authored
      As a first step towards KV routing:
      - Introduce a `--router-mode` flag in dynamo-run that only does random and round-robin right now. Not that interesting yet.
      - Make the vllm engine publish the KV events received from our patched vllm.
      
      Now we "just" need to connect the two. Easy right?
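
      For readers unfamiliar with the simpler of the two modes, here is a minimal round-robin sketch. It is not the dynamo-run implementation: the `RoundRobin` type and its `pick` method are hypothetical, and random selection is left out because it would need an external crate.

      ```rust
      use std::sync::atomic::{AtomicUsize, Ordering};

      /// Hypothetical round-robin picker over a list of worker instances.
      struct RoundRobin {
          next: AtomicUsize,
      }

      impl RoundRobin {
          fn new() -> Self {
              Self { next: AtomicUsize::new(0) }
          }

          /// Return the next instance in turn, or `None` if the list is empty.
          fn pick<'a, T>(&self, instances: &'a [T]) -> Option<&'a T> {
              if instances.is_empty() {
                  return None;
              }
              // The counter only ever increases; the modulo wraps it onto the list.
              let i = self.next.fetch_add(1, Ordering::Relaxed) % instances.len();
              instances.get(i)
          }
      }
      ```

      The atomic counter lets several request handlers share one picker without a lock. A KV-aware router would replace the counter with a choice based on the published KV events.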
  14. 17 Mar, 2025 1 commit
    • fix(vllm,sglang): Let the engine enforce max tokens (#216) · 05765cd4
      Graham King authored
      Previously several parts of the stack ensured max tokens (for this single request) was set.
      
      Now only text input sets it (to 8k). Everything else leaves it as is, potentially blank. The engines themselves have very small defaults, 16 for vllm and 128 for sglang.
      
      Also fix dynamo-run CUDA startup message to only print if we're using an engine that would benefit from it (mistralrs, llamacpp).
  15. 15 Mar, 2025 1 commit
    • feat(dynamo-run): Batch mode (#142) · 2cca070c
      Graham King authored
      ```
      dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
      ```
      
      The file has genai format, one entry per line:
      ```
      {"text": "the prompt"}
      {"text": ..etc
      ```
      
      The prompt is evaluated and the output written to `output.jsonl` in the
      same folder as the input.
      
      At the end of the run various statistics are printed:
      > Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)
      
      This is also helpful for pushing load into the system and stressing the
      various components. It is not intended for performance measurement; it's a
      batch inference tool.
  16. 13 Mar, 2025 2 commits
    • feat(mistralrs): Let the engine enforce max tokens (#134) · 404a78e9
      Graham King authored
      Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it which saves the extra tokenization step.
      
      Also, dynamo-run now prints which engines are compiled in as part of its help message, plus some minor lint fixes.
    • feat(dynamo-run): Download models from HF, smart model defaults (#126) · 089f8e1b
      Graham King authored
      
      
      - Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.
      
      - The default engine (previously always mistralrs) depends on what is compiled in.
      
      - Text can be piped in and will result in a single run of the model.
      
      Taken together, this means that if you build with `--features vllm` you can run the following, and it will download the model, run it with vllm, answer your question, and exit:
      ```
      echo "What is the capital of Costa Rica?" | dynamo-run Qwen/Qwen2.5-3B-Instruct
      ```
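
      The piped-input behavior can be sketched with the standard library alone. This is an assumption about the general approach, not the code from this commit: `stdin_is_piped` and `read_single_prompt` are hypothetical helpers, built on the real `std::io::IsTerminal` trait (stable since Rust 1.70).

      ```rust
      use std::io::{self, IsTerminal, Read};

      /// True when stdin is not a terminal, e.g. `echo "..." | program`.
      fn stdin_is_piped() -> bool {
          !io::stdin().is_terminal()
      }

      /// Read the whole input as one prompt. Returns `None` if it is blank.
      fn read_single_prompt<R: Read>(mut input: R) -> io::Result<Option<String>> {
          let mut buf = String::new();
          input.read_to_string(&mut buf)?;
          let prompt = buf.trim();
          Ok(if prompt.is_empty() {
              None
          } else {
              Some(prompt.to_string())
          })
      }
      ```

      A caller would check `stdin_is_piped()` first and, if it is true, pass `io::stdin()` to `read_single_prompt`, run the model once on the result, and exit. Otherwise it would fall back to interactive mode.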
      Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
  17. 11 Mar, 2025 1 commit
    • fix(pystr): Output python errors (#99) · 9c7b1ead
      Graham King authored
      If the Python file raises an exception, we print it as Python would.
      
      ```
      $ ./target/debug/dynamo-run in=http out=pystr:~/Temp/cn47/1_e.py --model-name test
      
      Traceback (most recent call last):
        File "/home/graham/Temp/cn47/1_e.py", line 17, in generate
          raise MyException("The message")
      1_e.MyException: The message
      ```
  18. 08 Mar, 2025 1 commit
  19. 07 Mar, 2025 1 commit
  20. 05 Mar, 2025 2 commits
  21. 04 Mar, 2025 1 commit
  22. 27 Feb, 2025 2 commits
  23. 26 Feb, 2025 1 commit
  24. 25 Feb, 2025 4 commits
  25. 21 Feb, 2025 2 commits
  26. 13 Feb, 2025 1 commit
    • feat: Add `tio` your friendly cmd line uncle to run triton-llm services (#174) · 418ae5e8
      Graham King authored
      This provides a simple example of how to write a triton-llm engine, and how to connect it to the OpenAI HTTP server.
      
      This is the tool previously called `nio` and `llmctl`.
      
      - **Inputs**: Text and HTTP.
      - **Engines**: Echo, which streams your prompt back with a slight delay.
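
      To make the echo idea concrete, here is a minimal sketch of an engine that streams the prompt back with a delay. It is not the `tio` implementation: `echo_stream` is a hypothetical function, it yields characters instead of tokens, and it blocks the thread where a real engine would presumably be async.

      ```rust
      use std::thread;
      use std::time::Duration;

      /// Hypothetical echo "engine": yields the prompt back one character at a
      /// time, sleeping before each one to imitate streamed generation.
      fn echo_stream(prompt: &str, delay: Duration) -> impl Iterator<Item = char> + '_ {
          prompt.chars().inspect(move |_| thread::sleep(delay))
      }
      ```

      Collecting the iterator, for example with `echo_stream("hello", Duration::from_millis(10)).collect::<String>()`, returns the original prompt. Printing each item as it arrives shows the streaming effect.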
      
      Build: `cargo build`
      
      Prerequisites: `nats-server` and `etcd` must be running locally, even though they are not yet used by `tio`.
      
      Run with text input:
      ```
      ./target/debug/tio in=text out=echo_full --model-name test
      ```
      
      Run with the triton-llm HTTP server:
      ```
      ./target/debug/tio in=http out=echo_full --http-port 8080 --model-name Echo-0B
      ```
      
      List models:
      ```
      curl localhost:8080/v1/models | jq
      ```
      
      This will output:
      ```
      {
        "object": "list",
        "data": [
          {
            "id": "Echo-0B",
            "object": "object",
            "created": 1739400430,
            "owned_by": "nvidia"
          }
        ]
      }
      ```
      
      #### What's next
      
      As triton-distributed gains features, `tio` will be able to grow:
      - When we get the pre-processor we can have token-in token-out engines.
      - When we get a pull-router we can have `in=nats` and `out=nats`.
      - When we get discovery we can have dynamic engines.