"vscode:/vscode.git/clone" did not exist on "695754aeb1e541839c38da7658fb8d71dbc27ae9"
  1. 07 May, 2025 1 commit
  2. 06 May, 2025 2 commits
    • feat(dynamo-run): vllm and sglang subprocess engines (#954) · 28fd481c
      Graham King authored
      New vllm and sglang engines that run in a sub-process. These will hopefully replace the existing embedded Python engines.
          
      Why?
          
        - Pure Python, does not require knowing Rust to work on it. Much simpler to maintain.
        - No embedded Python interpreter, which avoids linking libpython and sidesteps the macOS virtualenv issues.
        - Should have better performance as it's "native" vllm / sglang.
        - Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
    • feat: dynamo-run <-> python interop (#934) · 99cd9d85
      Graham King authored
      Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
      ```
      from dynamo.llm import register_llm

      MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
      # `endpoint` is the endpoint handle this worker serves on
      await register_llm(endpoint, MODEL, 3)
      ```
      
      Full vllm example, with pre-processing in dynamo:
      - `dynamo-run in=text out=dyn://dynamo.backend.generate`
      - `cd lib/bindings/python/examples/hello_world`
      - `python server_vllm.py`
      
      This builds on top of the work to move pre-processor to ingress side. It means we can decouple Rust and Python using NATS as the bus.
      
      The `register_llm` call does this (a worker sketch follows the list):
      
      - Download the model from HF if necessary
      - Load the model deployment card from the HF folder or extract from GGUF
      - Push the tokenizer config etc. into the NATS object store so ingress can access it from a different machine
      - Publish the model deployment card to ETCD
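
      Here is that hypothetical worker sketch; the `dynamo.runtime` import and the `namespace`/`component`/`endpoint` chain are assumptions modeled on the hello_world example, not confirmed API:
      ```
      from dynamo.llm import register_llm
      from dynamo.runtime import dynamo_worker  # assumed import path

      MODEL = "Qwen/Qwen2.5-0.5B-Instruct"

      @dynamo_worker()
      async def worker(runtime):
          # Assumption: mirrors the dyn://dynamo.backend.generate address above
          endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate")
          # Downloads the model if needed, pushes the tokenizer config to the
          # NATS object store, and publishes the deployment card to etcd
          await register_llm(endpoint, MODEL, 3)
      ```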
  3. 01 May, 2025 1 commit
  4. 29 Apr, 2025 2 commits
    • feat: Add request template support for default inference parameters (#841) · adad2ecd
      Abrar Shivani authored
      Adds support for specifying default request parameters through a JSON template file that is applied across all inference requests. This enables consistent parameter settings while still allowing per-request overrides (sketched after the list below).
      
      Changes:
      - Add --request-template CLI flag to specify template file path
      - Integrate template support in HTTP, batch and text input modes
      - Template values can be overridden by individual request parameters
      - Example template.json:
      ```
      {
          "model": "Qwen2.5-3B-Instruct",
          "temperature": 0.7,
          "max_completion_tokens": 4096
      }
      ```
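
      A minimal sketch of the merge semantics, assuming both the template and the incoming request reduce to flat JSON objects (`apply_template` is illustrative, not the actual implementation):
      ```
      import json

      with open("template.json") as f:
          template = json.load(f)

      def apply_template(request: dict) -> dict:
          merged = dict(template)   # start from the template defaults
          merged.update(request)    # individual request parameters win
          return merged
      ```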
    • chore: Split PushRouter from Client (#817) · a1a10365
      Graham King authored
      In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side.
      
      As part of moving pre-processing back to ingress-side we need to split this into two steps (sketched below):
      - Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card.
      - PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters.
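
      A toy sketch of that split; the real code is Rust, and these names are purely illustrative:
      ```
      class Client:
          async def discover_endpoints(self):
              # Step 1: find the remote workers; a later PR also fetches
              # each worker's Model Deployment Card (MDC)
              ...

      class PushRouter:
          def __init__(self, endpoints, mdc):
              # Step 2: the MDC decides whether ingress must pre-process,
              # which (in the Rust code) fixes the generic parameter types
              self.preprocess_on_ingress = mdc.requires_preprocessing
      ```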
      
      Part of #743
  5. 28 Apr, 2025 1 commit
  6. 25 Apr, 2025 1 commit
    • chore: Publish Model Deployment Card to NATS (#799) · d346782c
      Graham King authored
      This will allow an ingress-side pre-processor to see it without needing a model checkout.
      
      Currently pre-processing is done in the worker, which has local access to the model deployment card ("MDC") files: `config.json`, `tokenizer.json` and `tokenizer_config.json`. We want to move the pre-processor to the ingress side to support KV routing. That requires the ingress side (i.e. the HTTP server), which may be on a different machine than the worker, to be able to see those three files.
      
      To support that, this PR makes the worker upload the contents of those files to the NATS object store, and publish the MDC, with those NATS URLs, to the key-value store.
      
      The key-value store is behind an interface so that any store (NATS, etcd, Redis, etc.) can be supported. Implementations for memory and NATS are provided.
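
      Conceptually, the upload-and-publish step looks like the following sketch, written against the nats-py client directly; dynamo's actual store interface differs, and the bucket and key names here are invented:
      ```
      import asyncio, json
      import nats

      async def publish_mdc():
          nc = await nats.connect("nats://localhost:4222")
          js = nc.jetstream()
          files = await js.create_object_store("mdc-files")    # invented bucket name
          with open("tokenizer.json", "rb") as f:
              await files.put("tokenizer.json", f.read())
          kv = await js.create_key_value(bucket="mdc")         # invented bucket name
          card = {"tokenizer": "nats://mdc-files/tokenizer.json"}
          await kv.put("my-model", json.dumps(card).encode())
          await nc.close()

      asyncio.run(publish_mdc())
      ```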
      
      Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit.
      
      Part of #743 
  7. 21 Apr, 2025 1 commit
  8. 07 Apr, 2025 1 commit
    • feat(dynamo-run): Basic routing choice (#524) · ec2e7307
      Graham King authored
      As a first step towards KV routing:
      - Introduce a `--router-mode` flag in dynamo-run that only does random and round-robin right now (sketched below). Not that interesting yet.
      - Make the vllm engine publish the KV events received from our patched vllm.
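
      Both initial policies are trivial; a toy sketch with made-up worker ids:
      ```
      import itertools, random

      workers = ["worker-0", "worker-1", "worker-2"]   # hypothetical endpoint ids
      round_robin = itertools.cycle(workers)

      def pick(router_mode: str) -> str:
          if router_mode == "round-robin":
              return next(round_robin)
          return random.choice(workers)                # the "random" mode
      ```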
      
      Now we "just" need to connect the two. Easy, right?
  9. 04 Apr, 2025 1 commit
    • feat: Python decorator dynamo_worker takes optional `static` parameter without etcd (#494) · 88ad3425
      Graham King authored
      Adds `@dynamo_worker(static=True)` to create a static worker, which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace/component/endpoint trio.
      
      This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint), and are discovered by ingress components using etcd.
      
      Also change the hello_world example to use `dynamo_worker(static=True)` so that it is exercised and demonstrated somewhere.
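
      A sketch of the two flavours side by side; the import path is an assumption and the handler bodies are elided:
      ```
      from dynamo.runtime import dynamo_worker   # assumed import path

      # Static: predictable name, no etcd required; at most one per
      # namespace/component/endpoint trio.
      @dynamo_worker(static=True)
      async def static_worker(runtime):
          ...

      # Dynamic (the existing default): unique random name, discovered
      # by ingress components via etcd.
      @dynamo_worker()
      async def dynamic_worker(runtime):
          ...
      ```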
      
      For NIM.
  10. 24 Mar, 2025 2 commits
  11. 19 Mar, 2025 1 commit
    • fix(mistralrs): Disable paged attention (#234) · fd95f37b
      Graham King authored
      Under load it sometimes drops a request: the request gets added to the batch (sequence) and immediately gets a FinishReason of Stop. Not sure why. It doesn't happen with the default scheduler (non-paged attention), so switch to that for now.
  12. 17 Mar, 2025 1 commit
    • fix(vllm,sglang): Let the engine enforce max tokens (#216) · 05765cd4
      Graham King authored
      Previously, several parts of the stack ensured that max tokens (for a single request) was set.

      Now only text input sets it (to 8k). Everything else leaves it as is, potentially blank. The engines themselves have very small defaults: 16 for vllm and 128 for sglang.
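
      The resulting precedence, sketched with the numbers from the text above (the function itself is illustrative, not the actual code):
      ```
      ENGINE_DEFAULTS = {"vllm": 16, "sglang": 128}   # the engines' own defaults

      def effective_max_tokens(request_value, input_mode, engine):
          if request_value is not None:
              return request_value          # an explicit per-request value wins
          if input_mode == "text":
              return 8192                   # only text input sets a default (8k)
          return ENGINE_DEFAULTS[engine]    # otherwise the engine enforces its own
      ```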
      
      Also fix dynamo-run CUDA startup message to only print if we're using an engine that would benefit from it (mistralrs, llamacpp).
  13. 15 Mar, 2025 1 commit
    • feat(dynamo-run): Batch mode (#142) · 2cca070c
      Graham King authored
      ```
      dynamo-run in=batch:prompts.jsonl out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct/
      ```
      
      The file has genai format, one JSON entry per line:
      ```
      {"text": "the prompt"}
      {"text": "another prompt"}
      ```
      
      The prompt is evaluated and the output written to `output.jsonl` in the
      same folder as the input.
      
      At the end of the run various statistics are printed:
      > Ran 5 files in 8s 679ms. Tokens in: 40 (5/s). Tokens out: 346 (43/s)
      
      This is also helpful for pushing load into the system and stressing the various components. It's not intended for performance measurement; it's a batch inference tool.
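
      The batch loop amounts to something like this sketch; the output schema and `run_engine` are assumptions, since only the input format is specified above:
      ```
      import json
      import pathlib

      src = pathlib.Path("prompts.jsonl")
      out = src.with_name("output.jsonl")      # written next to the input

      with src.open() as fin, out.open("w") as fout:
          for line in fin:
              prompt = json.loads(line)["text"]
              answer = run_engine(prompt)      # hypothetical call into the engine
              fout.write(json.dumps({"text": prompt, "response": answer}) + "\n")
      ```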
  14. 13 Mar, 2025 2 commits
    • feat(mistralrs): Let the engine enforce max tokens (#134) · 404a78e9
      Graham King authored
      Previously we tokenized and counted tokens ourselves to stop when max tokens was reached. Now we let the mistral.rs engine do it, which saves the extra tokenization step.
      
      Also, dynamo-run's help message now lists which engines are compiled in, plus some minor lint fixes.
    • feat(dynamo-run): Download models from HF, smart model defaults (#126) · 089f8e1b
      Graham King authored
      - Any engine can take the name of a Hugging Face repository; it will be downloaded before calling the engine.
      
      - The default engine (previously always mistralrs) depends on what is compiled in.
      
      - Text can be piped in and will result in a single run of the model.
      
      Taken together, this means that if you build with `--features vllm` you can do the following, and it will download the model, run it with vllm, answer your question, and exit:
      ```
      echo "What is the capital of Costa Rica?" | dynamo-run Qwen/Qwen2.5-3B-Instruct
      ```
      Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
  15. 11 Mar, 2025 1 commit
    • fix(pystr): Output python errors (#99) · 9c7b1ead
      Graham King authored
      If the Python file raises an exception, we print it like Python would.
      
      ```
      $ ./target/debug/dynamo-run in=http out=pystr:~/Temp/cn47/1_e.py --model-name test
      
      Traceback (most recent call last):
        File "/home/graham/Temp/cn47/1_e.py", line 17, in generate
          raise MyException("The message")
      1_e.MyException: The message
      ```
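
      For reference, the stdlib way to render an exception exactly as Python itself does, which is the behaviour being reproduced here (Python 3.10+ `format_exception` signature):
      ```
      import traceback

      class MyException(Exception):
          pass

      try:
          raise MyException("The message")
      except Exception as exc:
          # format_exception(exc) renders the same traceback Python prints
          print("".join(traceback.format_exception(exc)))
      ```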
  16. 08 Mar, 2025 1 commit