1. 06 Aug, 2025 1 commit
  2. 03 Jul, 2025 1 commit
  3. 30 Jun, 2025 1 commit
    • Graham King's avatar
      chore(dynamo-run): Refactor to library (#1687) · 92f06b0e
      Graham King authored
      Move much of what was in the `dynamo-run` crate into `dynamo-llm` so that everyone can use it.
      
      Example usage:
      
      1. Create a `LocalModel`:
      
      ```
          let local_model = LocalModelBuilder::default()
      	.model_path("Qwen/Qwen3-0.6B")
      	.http_port(8080)
      	.build().await?;
      ```
      
      2. Make an engine:
      
      ```
          let engine_config = EngineConfig::StaticFull {
      	engine: dynamo_engine_mistralrs::make_engine(&local_model).await?,
      	model: Box::new(local_model),
          };
      ```
      
      3. Connect it to an input and run it
      
      ```
          dynamo_llm::entrypoint::input::run_input(Input::Http, runtime, engine_config).await?;
      ```
      
      For https://github.com/ai-dynamo/dynamo/issues/1647
      
      Code Rabbit summary, thanks:
        * Introduced a flexible builder pattern for local model configuration, allowing advanced customization and easier initialization.
        * Added new input modes and unified input handling, supporting interactive chat, HTTP server, batch file, and distributed endpoint modes.
        * Centralized engine configuration and routing, enabling more extensible and maintainable engine management.
        * Simplified and modularized the codebase by moving input and engine logic into dedicated modules.
        * Replaced direct model construction with an asynchronous builder for improved clarity and extensibility.
        * Streamlined configuration and validation for flags and router settings.
        * Added validation to prevent incompatible input and output combinations in endpoint and dynamic modes.
      92f06b0e
  4. 26 Jun, 2025 1 commit
  5. 11 Jun, 2025 1 commit
  6. 04 Jun, 2025 1 commit
  7. 03 Jun, 2025 2 commits
  8. 22 May, 2025 2 commits
    • Graham King's avatar
      feat(dynamo-run): Allow setting KV cache block size (#1175) · 183f2b32
      Graham King authored
      Example:
      ```
      dynamo-run out=<engine> <model> --kv-cache-block-size 64
      ```
      
      In a distributed system this goes on the worker node and is propagated to ingress via the model deployment card.
      
      Previously hard coded to 16, which is now the default.
      
      - Load context_length from model. Closes #1172
      - Store context length and KV cache block size in Model Deployment Card #1170
      183f2b32
    • Graham King's avatar
      feat(dynamo-run): Allow setting context-length (#1157) · 6d5da821
      Graham King authored
      Llama 4 has a very large context length (aka n_ctx, model_max_length, max_model_len), and vllm won't start unless it can allocate enough KV cache for the entire context.
      
      Allow passing `--context-length <N>` to `dynamo-run` to limit it so long-context models will fit.
      
      Future todo:
      - Restrict every request's `max_tokens` to below the context length. Our pre-processor should do this by setting stop_conditions.max_tokens. mistralrs engine wrapper must do it itself because it does not use the pre-processor.
      - mistralrs and llamacpp currently have a hard-coded max context length if one is not provided on the command line. Change those to be the model's built-in max, read from the GGUF or tokenizer_config.json.
      6d5da821
  9. 21 May, 2025 1 commit
  10. 19 May, 2025 1 commit
  11. 08 May, 2025 1 commit
    • Graham King's avatar
      feat: Qwen3, Gemma3 and Llama4 support (#1002) · ceaeba3e
      Graham King authored
      . New mistralrs and llamacpp version
      . mistralrs: Handle Gemma 3 and Llama 4 as vision models
      . Update the dynamo-run docs to use Qwen 3
      . Our pre-processor now supports Llama 4's newer multi-modal `config.json`
      . Upgrade minijinja to handle Qwen 3's prompt template
      
      For Llama 4 we'll need to limit the max seq len. vllm says:
      > To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed,...
      
      I was able to run Llama 4 with llamacpp and a quantized GGUF, with Dynamo doing the pre-processing.
      ceaeba3e
  12. 28 Apr, 2025 1 commit
  13. 24 Apr, 2025 1 commit
  14. 03 Apr, 2025 1 commit
  15. 19 Mar, 2025 1 commit
    • Graham King's avatar
      fix(mistralrs): Disable paged attention (#234) · fd95f37b
      Graham King authored
      Under load it sometimes drops a request. The request gets added to the batch (sequence) and immediately gets a FinishReason Stop. Not sure why. It doesn't happen with the default scheduler (non-paged attention), so switch to that for now.
      fd95f37b
  16. 13 Mar, 2025 1 commit
    • Graham King's avatar
      feat(mistralrs): Let the engine enforce max tokens (#134) · 404a78e9
      Graham King authored
      Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it which saves the extra tokenization step.
      
      Also dynamo-run prints which engines are compiled in in help message, and some minor lint fixes.
      404a78e9
  17. 11 Mar, 2025 1 commit
  18. 08 Mar, 2025 1 commit
  19. 05 Mar, 2025 2 commits
  20. 27 Feb, 2025 2 commits
  21. 26 Feb, 2025 1 commit
  22. 25 Feb, 2025 1 commit
  23. 20 Feb, 2025 1 commit
  24. 14 Feb, 2025 2 commits
    • Graham King's avatar
      fix: Unique IDs for mistralrs requests (#186) · 45b3505c
      Graham King authored
      Upgrade mistralrs to latest.
      45b3505c
    • Graham King's avatar
      feat: Add a mistralrs engine to tio (#178) · 2f700421
      Graham King authored
      This allows us to run a real model.
      
      Build:
      ```
      cargo build --release --features mistralrs,cuda
      ```
      
      Run:
      ```
      ./target/release/tio in=text out=mistralrs --model-path Llama-3.2-1B-Instruct-Q4_K_M.gguf
      ```
      
      Why [mistral.rs](https://github.com/EricLBuehler/mistral.rs)?
      
      - It has no dependencies. You don't need a container or a virtual env to get started.
      - It supports CUDA, Metal (MacOS) and CPU-only. Everyone can join the AI revolution.
      - It starts fast and serves fast (with CUDA). That makes it fun to experiment with.
      - It runs many models, not just Mistral, that's just it's name.
      2f700421