"lib/runtime/src/component/endpoint.rs" did not exist on "6e0ccccbc47292387c3af496809337b7720c9a7f"
  1. 14 Mar, 2025 1 commit
  2. 13 Mar, 2025 2 commits
    • Graham King's avatar
      feat(mistralrs): Let the engine enforce max tokens (#134) · 404a78e9
      Graham King authored
      Previously we tokenized and counted tokens to stop when max tokens was reached. Now we let the mistral.rs engine do it which saves the extra tokenization step.
      
      Also dynamo-run prints which engines are compiled in in help message, and some minor lint fixes.
      404a78e9
    • Graham King's avatar
      feat(dynamo-run): Download models from HF, smart model defaults (#126) · 089f8e1b
      Graham King authored
      
      
      - Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.
      
      - The default engine (previously always mistralrs) depends on what is compiled in.
      
      - Text can be piped in and will result in a single run of the model.
      
      All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit:
      ```
      echo "What is the capital of Costa Rica?"  | dynamo-run Qwen/Qwen2.5-3B-Instruct
      ```
      Co-authored-by: default avatarRyan McCormick <rmccormick@nvidia.com>
      089f8e1b
  3. 08 Mar, 2025 1 commit
  4. 07 Mar, 2025 2 commits
    • Graham King's avatar
      feat: Python bring-your-own-engine with our tokenizer (#47) · 12714d90
      Graham King authored
      Instead of using `out=pystr:<my.py>` we can now do this:
      ```
      dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
      ```
      
      That engine will receive and respond with tokens. Here's an example engine file:
      ```
      import asyncio
      
      async def generate(request):
          yield {"token_ids":[791]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[6864]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[315]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[9822]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[374]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[12366]}
          await asyncio.sleep(0.1)
          yield {"token_ids":[13]}
      ```
      
      Also reduce duplication by making the bindings engine use the llm lib engine.
      12714d90
    • Graham King's avatar
      feat: Bring-your-own engine for dynemo-run (#43) · 1b96c2c4
      Graham King authored
      1. Create `my_engine.py`
      
      ```
      import asyncio
      
      async def generate(request):
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
      ```
      
      2. Build
      
      ```
      cargo build --release --feature python
      ```
      
      3. Run
      
      ```
      dynemo-run out=pystr:my_engine.py --name test
      ```
      
      And here's a distributed system, with your engine:
      
      - Node 1: `dynemo-run in=http out=dyn://test`
      - Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
      1b96c2c4
  5. 05 Mar, 2025 2 commits
  6. 04 Mar, 2025 1 commit
  7. 28 Feb, 2025 2 commits
  8. 27 Feb, 2025 1 commit
  9. 25 Feb, 2025 3 commits
  10. 21 Feb, 2025 2 commits
  11. 20 Feb, 2025 1 commit
  12. 18 Feb, 2025 1 commit
  13. 14 Feb, 2025 1 commit
    • Graham King's avatar
      feat: Add a mistralrs engine to tio (#178) · 2f700421
      Graham King authored
      This allows us to run a real model.
      
      Build:
      ```
      cargo build --release --features mistralrs,cuda
      ```
      
      Run:
      ```
      ./target/release/tio in=text out=mistralrs --model-path Llama-3.2-1B-Instruct-Q4_K_M.gguf
      ```
      
      Why [mistral.rs](https://github.com/EricLBuehler/mistral.rs)?
      
      - It has no dependencies. You don't need a container or a virtual env to get started.
      - It supports CUDA, Metal (MacOS) and CPU-only. Everyone can join the AI revolution.
      - It starts fast and serves fast (with CUDA). That makes it fun to experiment with.
      - It runs many models, not just Mistral, that's just it's name.
      2f700421
  14. 13 Feb, 2025 1 commit
    • Graham King's avatar
      feat: Add `tio` your friendly cmd line uncle to run triton-llm services (#174) · 418ae5e8
      Graham King authored
      This provides a simple example of how to write a triton-llm engine, and how to connect it to the OpenAI HTTP server.
      
      This is the tool previously called `nio` and `llmctl`.
      
      - **Inputs**: Text and HTTP.
      - **Engines**: Echo, which streams your prompt back with a slight delay.
      
      Build: `cargo build`
      
      Pre-requisites: `nats-server` and `etcd` must be running locally, even though they are not yet used by `tio`.
      
      Run with text input:
      ```
      ./target/debug/tio in=text out=echo_full --model-name test
      ```
      
      Run with the triton-llm HTTP server:
      ```
      ./target/debug/tio in=http out=echo_full --http-port 8080 --model-name Echo-0B
      ```
      
      List models:
      ```
      curl localhost:8080/v1/models | jq
      ```
      
      Will output
      ```
      {
        "object": "list",
        "data": [
          {
            "id": "Echo-0B",
            "object": "object",
            "created": 1739400430,
            "owned_by": "nvidia"
          }
        ]
      }
      ```
      
      #### What's next
      
      As triton-distributed gains features `tio` will be able to grow:
      - When we get the pre-processor we can have token-in token-out engines. 
      - When we get a pull-router we can have `in=nats` and `out=nats`.
      - When we get discovery we can have dynamic engines.
      418ae5e8