1. 11 Mar, 2025 4 commits
  2. 10 Mar, 2025 8 commits
  3. 09 Mar, 2025 8 commits
  4. 08 Mar, 2025 7 commits
  5. 07 Mar, 2025 10 commits
    • test: add gpu sanity test for ci job (#49) · 6705d483
      Anant Sharma authored
    • Ryan McCormick's avatar
    • fix: dynemo-run model discovery working again (#52) · 9f53922a
      Graham King authored
      There are two etcd keys:
      - The service
      - The model
      
      Model discovery needs the second one, the model key. Previously we confused the two.
    • Biswa Panda's avatar
      aacc5d76
    • refactor: Use library constant for kv-hit-rate subject (#48) · 2ee29443
      Ryan McCormick authored
      Replaces hard-coded "kv-hit-rate" string in multiple places with KV_HIT_RATE_SUBJECT constant in lib/llm.
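
The pattern behind this refactor can be sketched as a Python analogue (the actual change is in the Rust code under lib/llm). Only the name `KV_HIT_RATE_SUBJECT` and the value "kv-hit-rate" come from the commit message; the two functions and the `namespace.subject` layout are hypothetical call sites invented for illustration.

```python
# Single source of truth for the subject name (value taken from the commit message).
KV_HIT_RATE_SUBJECT = "kv-hit-rate"


def publish_subject(namespace):
    # Hypothetical publisher-side call site.
    return f"{namespace}.{KV_HIT_RATE_SUBJECT}"


def subscribe_subject(namespace):
    # Hypothetical subscriber-side call site; it cannot drift from the
    # publisher because both read the same constant.
    return f"{namespace}.{KV_HIT_RATE_SUBJECT}"
```

With the string defined once, renaming the subject is a one-line change instead of a search across files.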
    • chore: remove ucx-py from requirements and fix UCX env variable (#46) · 44bde250
      ptarasiewiczNV authored
      Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
    • feat: Python bring-your-own-engine with our tokenizer (#47) · 12714d90
      Graham King authored
      Instead of using `out=pystr:<my.py>` we can now do this:
      ```
      dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
      ```
      
      That engine receives tokens and responds with tokens. Here's an example engine file:
      ```
      import asyncio

      async def generate(request):
          # Stream one token ID per chunk, pausing briefly between chunks.
          token_ids = [791, 6864, 315, 9822, 374, 12366, 13]
          for i, token_id in enumerate(token_ids):
              if i > 0:
                  await asyncio.sleep(0.1)
              yield {"token_ids": [token_id]}
      ```
      
      Also reduce duplication by making the bindings engine use the llm lib engine.
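
Before wiring a token engine into `dynemo-run`, it can help to drive the generator directly and inspect what it streams. The following is a minimal sketch under stated assumptions: `collect_token_ids` and `run_engine` are hypothetical helpers, not part of dynemo-run; the engine is a shortened inline stand-in rather than a file loaded via `out=pytok:`; and the request is an empty placeholder dict because the commit message does not describe the request's shape.

```python
import asyncio


async def generate(request):
    # Stand-in engine with the same shape as the example above:
    # an async generator yielding dicts with a "token_ids" list.
    for token_id in [791, 6864, 315]:
        yield {"token_ids": [token_id]}
        await asyncio.sleep(0)


async def collect_token_ids(engine, request):
    # Hypothetical helper: flatten every streamed chunk into one list.
    token_ids = []
    async for chunk in engine(request):
        token_ids.extend(chunk["token_ids"])
    return token_ids


def run_engine(engine, request=None):
    # The request shape is an assumption; an empty dict is a placeholder.
    return asyncio.run(collect_token_ids(engine, request or {}))


if __name__ == "__main__":
    print(run_engine(generate))
```

A check like this only confirms that the generator yields well-formed chunks; it says nothing about how dynemo-run itself calls the engine.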
    • Piotr Marcinkiewicz's avatar
      d752a1a2
    • Neelay Shah's avatar
      ac13ed06
    • feat: Bring-your-own engine for dynemo-run (#43) · 1b96c2c4
      Graham King authored
      1. Create `my_engine.py`
      
      ```
      import asyncio

      # Fields shared by every streamed chat.completion.chunk.
      BASE = {
          "id": "1",
          "created": 1841762283,
          "model": "Llama-3.2-1B-Instruct",
          "system_fingerprint": "local",
          "object": "chat.completion.chunk",
      }

      def chunk(content, finish_reason=None):
          # Build one chunk; finish_reason is only set on the final chunk.
          choice = {"index": 0, "delta": {"content": content, "role": "assistant"}}
          if finish_reason is not None:
              choice["finish_reason"] = finish_reason
          return {**BASE, "choices": [choice]}

      async def generate(request):
          # Stream the reply piece by piece, pausing after each chunk.
          for piece in ["The", " capital", " of", " France", " is", " Paris", "."]:
              yield chunk(piece)
              await asyncio.sleep(0.1)
          # An empty delta with finish_reason="stop" ends the stream.
          yield chunk("", finish_reason="stop")
      ```
      
      2. Build
      
      ```
      cargo build --release --features python
      ```
      
      3. Run
      
      ```
      dynemo-run out=pystr:my_engine.py --name test
      ```
      
      The same engine also works in a distributed setup:
      
      - Node 1: `dynemo-run in=http out=dyn://test`
      - Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
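
A `pystr` engine such as `my_engine.py` can be sanity-checked without dynemo-run by iterating the generator and joining the streamed deltas. This is a sketch under assumptions: `collect_text` is a hypothetical helper, the engine below is a shortened inline stand-in that carries only the fields the helper reads, and the request is an empty placeholder dict since its real shape is not described in the commit message.

```python
import asyncio


async def generate(request):
    # Shortened stand-in for my_engine.py: two content chunks, then a stop chunk.
    for piece in ["The", " capital"]:
        yield {"choices": [{"index": 0, "delta": {"content": piece, "role": "assistant"}}]}
    yield {"choices": [{"index": 0, "delta": {"content": "", "role": "assistant"},
                        "finish_reason": "stop"}]}


async def collect_text(engine, request):
    # Hypothetical helper: join delta contents until a finish_reason appears.
    parts = []
    finish_reason = None
    async for item in engine(request):
        choice = item["choices"][0]
        parts.append(choice["delta"].get("content", ""))
        finish_reason = choice.get("finish_reason")
        if finish_reason is not None:
            break
    return "".join(parts), finish_reason


if __name__ == "__main__":
    print(asyncio.run(collect_text(generate, {})))
```

If the helper returns without a finish reason, the engine never signalled the end of the stream, which is worth fixing before running it behind `in=http`.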
  6. 06 Mar, 2025 3 commits