1. 07 Mar, 2025 1 commit
      feat: Bring-your-own engine for dynemo-run (#43) · 1b96c2c4
      Graham King authored
      1. Create `my_engine.py`
      
      ```
      import asyncio
      
      async def generate(request):
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
          await asyncio.sleep(0.1)
          yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
      ```
      
      2. Build
      
      ```
      cargo build --release --features python
      ```
      
      3. Run
      
      ```
      dynemo-run out=pystr:my_engine.py --name test
      ```
      
      And here's a distributed system, with your engine:
      
      - Node 1: `dynemo-run in=http out=dyn://test`
      - Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
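
Before wiring an engine into `dynemo-run`, it can help to drive its `generate` function directly and check the streamed text. The following is a minimal sketch and not part of this commit: `collect_text` is a hypothetical helper, the stand-in `generate` keeps only the chunk fields it needs from `my_engine.py`, and the request dict is a placeholder because the real request shape is not shown here.

```python
import asyncio


async def generate(request):
    # Stand-in engine (assumption): same chunk layout as my_engine.py, trimmed down.
    for word in ["The", " capital", " of", " France", " is", " Paris", "."]:
        yield {"id": "1", "choices": [{"index": 0, "delta": {"content": word, "role": "assistant"}}]}
        await asyncio.sleep(0.01)
    # Final chunk carries an empty delta and a finish_reason, as in my_engine.py.
    yield {"id": "1", "choices": [{"index": 0, "delta": {"content": "", "role": "assistant"}, "finish_reason": "stop"}]}


async def collect_text(engine, request):
    """Hypothetical helper: run an engine and join the streamed delta contents."""
    parts = []
    async for chunk in engine(request):
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)


if __name__ == "__main__":
    # Placeholder request; the real request shape is an assumption.
    print(asyncio.run(collect_text(generate, {"messages": []})))
```

To check a real engine file, import its `generate` in place of the stand-in and pass it to `collect_text`.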
  2. 05 Mar, 2025 2 commits
  3. 28 Feb, 2025 1 commit
  4. 27 Feb, 2025 2 commits
  5. 26 Feb, 2025 1 commit
  6. 25 Feb, 2025 4 commits
  7. 24 Feb, 2025 1 commit
  8. 21 Feb, 2025 1 commit
  9. 20 Feb, 2025 1 commit
  10. 18 Feb, 2025 1 commit
  11. 14 Feb, 2025 2 commits
      fix: Unique IDs for mistralrs requests (#186) · 45b3505c
      Graham King authored
      Upgrade mistralrs to latest.
      feat: Add a mistralrs engine to tio (#178) · 2f700421
      Graham King authored
      This allows us to run a real model.
      
      Build:
      ```
      cargo build --release --features mistralrs,cuda
      ```
      
      Run:
      ```
      ./target/release/tio in=text out=mistralrs --model-path Llama-3.2-1B-Instruct-Q4_K_M.gguf
      ```
      
      Why [mistral.rs](https://github.com/EricLBuehler/mistral.rs)?
      
      - It has no dependencies. You don't need a container or a virtual env to get started.
      - It supports CUDA, Metal (macOS) and CPU-only. Everyone can join the AI revolution.
      - It starts fast and serves fast (with CUDA). That makes it fun to experiment with.
      - It runs many models, not just Mistral; that's just its name.
  12. 13 Feb, 2025 2 commits
      feat: Add `tio` your friendly cmd line uncle to run triton-llm services (#174) · 418ae5e8
      Graham King authored
      This provides a simple example of how to write a triton-llm engine, and how to connect it to the OpenAI HTTP server.
      
      This is the tool previously called `nio` and `llmctl`.
      
      - **Inputs**: Text and HTTP.
      - **Engines**: Echo, which streams your prompt back with a slight delay.
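
To illustrate what "streams your prompt back with a slight delay" means, here is a rough Python sketch. It is not the engine from this commit, which lives in the Rust code base. The function name is borrowed from the `out=echo_full` option, and the word-by-word granularity and the delay value are assumptions.

```python
import asyncio


async def echo_full(prompt, delay=0.01):
    """Illustrative only: yield the prompt back one word at a time, pausing between words."""
    words = prompt.split(" ")
    for i, word in enumerate(words):
        # Re-attach the space that split() removed, except before the first word.
        yield word if i == 0 else " " + word
        await asyncio.sleep(delay)


async def main():
    # Print each piece as it arrives to make the streaming visible.
    async for token in echo_full("What is the capital of France?"):
        print(token, end="", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())
```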
      
      Build: `cargo build`
      
      Prerequisites: `nats-server` and `etcd` must be running locally, even though they are not yet used by `tio`.
      
      Run with text input:
      ```
      ./target/debug/tio in=text out=echo_full --model-name test
      ```
      
      Run with the triton-llm HTTP server:
      ```
      ./target/debug/tio in=http out=echo_full --http-port 8080 --model-name Echo-0B
      ```
      
      List models:
      ```
      curl localhost:8080/v1/models | jq
      ```
      
      This outputs:
      ```
      {
        "object": "list",
        "data": [
          {
            "id": "Echo-0B",
            "object": "object",
            "created": 1739400430,
            "owned_by": "nvidia"
          }
        ]
      }
      ```
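
The same check can be scripted. The sketch below pulls the model IDs out of a `/v1/models` response body shaped like the output above. `model_ids` and `fetch_model_ids` are hypothetical helper names, and the base URL assumes the `--http-port 8080` value from the run command.

```python
import json
from urllib.request import urlopen

# Sample body copied from the output shown in this commit message.
SAMPLE = """
{
  "object": "list",
  "data": [
    {
      "id": "Echo-0B",
      "object": "object",
      "created": 1739400430,
      "owned_by": "nvidia"
    }
  ]
}
"""


def model_ids(body):
    """Return the `id` of every entry in a /v1/models response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8080"):
    """Query a running server (assumed to listen on base_url) and return its model IDs."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return model_ids(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Parses the sample only; call fetch_model_ids() when a server is running.
    print(model_ids(SAMPLE))
```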
      
      #### What's next
      
      As triton-distributed gains features, `tio` will be able to grow:
      - When we get the pre-processor we can have token-in token-out engines.
      - When we get a pull-router we can have `in=nats` and `out=nats`.
      - When we get discovery we can have dynamic engines.
      fix: tcp updates + initial zmq (#176) · 2fd6592f
      Ryan Olson authored
  13. 12 Feb, 2025 2 commits
  14. 11 Feb, 2025 1 commit
  15. 10 Feb, 2025 3 commits
  16. 05 Feb, 2025 1 commit
  17. 04 Feb, 2025 1 commit