- 08 Mar, 2025 6 commits
-
-
Harrison Saturley-Hall authored
-
Dmitry Tokarev authored
-
Neelay Shah authored
-
Neelay Shah authored
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
-
Pavithra Vijayakrishnan authored
-
GuanLuo authored
-
- 07 Mar, 2025 10 commits
-
-
Anant Sharma authored
-
Ryan McCormick authored
-
Graham King authored
There are two etcd keys:
- The service
- The model

The second one is the interesting one for us. Previously we confused the two.
-
Biswa Panda authored
Co-authored-by: Neelay Shah <neelays@nvidia.com>
-
Ryan McCormick authored
Replaces hard-coded "kv-hit-rate" string in multiple places with KV_HIT_RATE_SUBJECT constant in lib/llm.
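The actual change is in the Rust lib/llm crate; the Python sketch below only illustrates the pattern described in the commit message. The constant name `KV_HIT_RATE_SUBJECT` and the `"kv-hit-rate"` literal come from the commit; the publisher/subscriber helpers and the namespace parameter are hypothetical.

```python
# Single source of truth for the subject string. The real constant lives
# in lib/llm (Rust); this sketch only shows why centralizing it helps.
KV_HIT_RATE_SUBJECT = "kv-hit-rate"

def publish_subject(namespace: str) -> str:
    # Publisher side derives its subject from the shared constant...
    return f"{namespace}.{KV_HIT_RATE_SUBJECT}"

def subscribe_subject(namespace: str) -> str:
    # ...and so does the subscriber, so a typo in one copy of the literal
    # can no longer silently break the pairing.
    return f"{namespace}.{KV_HIT_RATE_SUBJECT}"

print(publish_subject("metrics"))  # metrics.kv-hit-rate
assert publish_subject("metrics") == subscribe_subject("metrics")
```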
-
ptarasiewiczNV authored
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
-
Graham King authored
Instead of using `out=pystr:<my.py>` we can now do this:

```
dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout>
```

That engine will receive and respond with tokens. Here's an example engine file:

```
import asyncio

async def generate(request):
    yield {"token_ids":[791]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[6864]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[315]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[9822]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[374]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[12366]}
    await asyncio.sleep(0.1)
    yield {"token_ids":[13]}
```

Also reduce duplication by making the bindings engine use the llm lib engine.
-
Piotr Marcinkiewicz authored
-
Neelay Shah authored
-
Graham King authored
1. Create `my_engine.py`:

```
import asyncio

async def generate(request):
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"The","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" capital","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" of","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" France","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" is","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":" Paris","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":".","role":"assistant"}}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
    await asyncio.sleep(0.1)
    yield {"id":"1","choices":[{"index":0,"delta":{"content":"","role":"assistant"},"finish_reason":"stop"}],"created":1841762283,"model":"Llama-3.2-1B-Instruct","system_fingerprint":"local","object":"chat.completion.chunk"}
```

2. Build:

```
cargo build --release --features python
```

3. Run:

```
dynemo-run out=pystr:my_engine.py --name test
```

And here's a distributed system, with your engine:
- Node 1: `dynemo-run in=http out=dyn://test`
- Node 2: `dynemo-run in=dyn://test out=pystr:my_engine.py`
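Since the engine file is just a plain async generator, it can be smoke-tested without `dynemo-run` at all. A minimal sketch, assuming the same chunk shape as the engine above (the request payload passed in is a hypothetical stand-in; `dynemo-run` supplies the real one):

```python
import asyncio

# Same contract as my_engine.py above: an async generator named
# `generate` that yields one chat.completion.chunk-shaped dict at a time.
async def generate(request):
    for tok in ["The", " capital", " of", " France", " is", " Paris", "."]:
        yield {"choices": [{"index": 0, "delta": {"content": tok, "role": "assistant"}}]}

async def collect(request):
    # Concatenate the streamed delta contents into the final message.
    parts = []
    async for chunk in generate(request):
        parts.append(chunk["choices"][0]["delta"]["content"])
    return "".join(parts)

# Hypothetical request payload; the real engine receives it from dynemo-run.
text = asyncio.run(collect({"messages": []}))
print(text)  # The capital of France is Paris.
```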
-
- 06 Mar, 2025 7 commits
-
-
ptarasiewiczNV authored
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
-
ptarasiewiczNV authored
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
-
Ryan McCormick authored
-
Anant Sharma authored
-
Pawel Ziecina authored
Co-authored-by: ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>
-
Ryan McCormick authored
-
GuanLuo authored
-
- 05 Mar, 2025 8 commits
-
-
Neelay Shah authored
-
Graham King authored
Fixes a panic.
-
Neelay Shah authored
-
Neelay Shah authored
Co-authored-by: Graham King <grahamk@nvidia.com>
-
Harrison Saturley-Hall authored
-
Maksim Khadkevich authored
-
Graham King authored
-
NVShreyas authored
Co-authored-by: Tanmay Verma <tanmayv@nvidia.com>
Co-authored-by: Tanmay Verma <tanmayv@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
- 04 Mar, 2025 9 commits
-
-
Graham King authored
Needs more testing but good enough for now. I get the same results with this as with `vllm serve`.
-
Biswa Panda authored
-
ishandhanani authored
-
J Wyman authored
-
Neelay Shah authored
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: Neelay Shah <neelays@ipp2-0493.ipp2u1.colossus.nvidia.com>
Co-authored-by: Neelay Shah <neelays@ipp1-1941.ipp1a1.colossus.nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: Neelay Shah <neelays@4u8g-gen-0078.ipp3a2.colossus.nvidia.com>
Co-authored-by: ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>
-
Harrison Saturley-Hall authored
-
ptarasiewiczNV authored
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
-
Meenakshi Sharma authored
-
Harrison Saturley-Hall authored
Fixing GitHub CI for dynemo
-