- 26 Apr, 2025 1 commit
-
-
Hongkuan Zhou authored
Signed-off-by:
Hongkuan Zhou <tedzhouhk@gmail.com> Co-authored-by:
ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by:
ishandhanani <ishandhanani@gmail.com> Co-authored-by:
Ubuntu <ubuntu@dev-inst-2w1vokvyuts83rzn4n1k7mnzew9.us-central1-a.c.brevdevprod.internal> Co-authored-by:
Biswa Panda <biswa.panda@gmail.com> Co-authored-by:
Anant Sharma <anants@nvidia.com>
-
- 25 Apr, 2025 2 commits
-
-
Harrison Saturley-Hall authored
Signed-off-by:Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
-
Graham King authored
This will allow an ingress-side pre-processor to see it without needing a model checkout. Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires ingress side (i.e the HTTP server), on a different machine than the worker to be able to see those three files. To support that this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS urls to the key-value store. The key-value store has an interface so any store (nats, etcd, redis, etc) can be supported. Implementations for memory and NATS are provided. Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit. Part of #743
-
- 21 Apr, 2025 2 commits
-
-
Pankaj Gupta authored
-
ishandhanani authored
-
- 18 Apr, 2025 1 commit
-
-
Hongkuan Zhou authored
Co-authored-by:ishandhanani <82981111+ishandhanani@users.noreply.github.com>
-
- 17 Apr, 2025 2 commits
-
-
tlipoca9 authored
Co-authored-by:ishandhanani <82981111+ishandhanani@users.noreply.github.com>
-
Ryan McCormick authored
-
- 12 Apr, 2025 1 commit
-
-
Hongkuan Zhou authored
feat: ETCD prefix watcher + python binding + runtime reconfiguration for router and disagg router (#581) Signed-off-by:
Hongkuan Zhou <tedzhouhk@gmail.com> Co-authored-by:
Neelay Shah <neelays@nvidia.com>
-
- 11 Apr, 2025 1 commit
-
-
Cole authored
-
- 09 Apr, 2025 1 commit
-
-
Anant Sharma authored
-
- 04 Apr, 2025 2 commits
-
-
Yan Ru Pei authored
-
Graham King authored
Adds `@dynamo_worker(static = True)` to create a static worker which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio. This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint), and are discovered by ingress components using etcd. Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere. For NIM.
-
- 03 Apr, 2025 2 commits
-
-
Ryan Olson authored
Moved all of `lib/llm/src/engines` to their own crates as e.g. `lib/engines/mistralrs`. This will allow publishing of the `dynamo-llm` crate as it won't have any github dependencies. The only engines in dynamo-llm will be the demo `echo` ones. Co-authored-by:Graham King <grahamk@nvidia.com>
-
Ryan Olson authored
-
- 02 Apr, 2025 1 commit
-
-
Ryan Olson authored
-
- 01 Apr, 2025 1 commit
-
-
Ryan Olson authored
-
- 31 Mar, 2025 1 commit
-
-
Ryan Olson authored
-
- 28 Mar, 2025 1 commit
-
-
Biswa Panda authored
-
- 24 Mar, 2025 1 commit
-
-
Graham King authored
This lets us do: ``` dynamo-run out=llamacpp <gguf_file> ``` Previously a `--model-config <hf-repo>` was also required, to configure our tokenizer.
-
- 19 Mar, 2025 1 commit
-
-
Graham King authored
This makes the Rust parts all use ring / rustls library instead of local install of openssl. It's a step on the journey to being statically linked. Pieces: - `tokenizers` and `mistralrs` now support rustls (mistralrs by default, tokenizers with feature flag). - Move shared dependencies up into workspace - New `rand` crate has some renames for future rust - Ensure the dependency doesn't creep back in by enforcing it with cargo deny.
-
- 18 Mar, 2025 1 commit
-
-
Dmitry Tokarev authored
Co-authored-by:Anant Sharma <anants@nvidia.com>
-
- 17 Mar, 2025 1 commit
-
-
GuanLuo authored
-
- 16 Mar, 2025 1 commit
-
-
April Yang authored
Co-authored-by:
Julien Mancuso <jmancuso@nvidia.com> Co-authored-by:
Hannah Zhang <hannahz@nvidia.com> Co-authored-by:
Biswa Panda <biswa.panda@gmail.com> Co-authored-by:
Maksim Khadkevich <mkhadkevich@nvidia.com>
-
- 14 Mar, 2025 2 commits
-
-
Anant Sharma authored
-
Ryan Olson authored
-
- 13 Mar, 2025 1 commit
-
-
Anant Sharma authored
-
- 11 Mar, 2025 2 commits
-
-
Alec authored
-
Biswa Panda authored
-
- 10 Mar, 2025 1 commit
-
-
Anant Sharma authored
-
- 09 Mar, 2025 5 commits
-
-
Alec authored
-
Neelay Shah authored
Co-authored-by:Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
-
Neelay Shah authored
Co-authored-by:
Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com> Co-authored-by:
Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
-
GuanLuo authored
-
Hongkuan Zhou authored
Signed-off-by:
Hongkuan Zhou <tedzhouhk@gmail.com> Co-authored-by:
hongkuan <hongkuanz@nvidia.com> Co-authored-by:
Piotr Tarasiewicz <ptarasiewicz@nvidia.com> Co-authored-by:
Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local> Co-authored-by:
alec-flowers <aflowers@nvidia.com> Co-authored-by:
Neelay Shah <neelays@nvidia.com>
-
- 08 Mar, 2025 3 commits
-
-
Dmitry Tokarev authored
-
Neelay Shah authored
Co-authored-by:Biswa Panda <biswa.panda@gmail.com>
-
GuanLuo authored
-
- 07 Mar, 2025 1 commit
-
-
Graham King authored
Instead of using `out=pystr:<my.py>` we can now do this: ``` dynemo-run out=pytok:/home/graham/my_python_engine.py --model-path <hf-repo-checkout> ``` That engine will receive and respond with tokens. Here's an example engine file: ``` import asyncio async def generate(request): yield {"token_ids":[791]} await asyncio.sleep(0.1) yield {"token_ids":[6864]} await asyncio.sleep(0.1) yield {"token_ids":[315]} await asyncio.sleep(0.1) yield {"token_ids":[9822]} await asyncio.sleep(0.1) yield {"token_ids":[374]} await asyncio.sleep(0.1) yield {"token_ids":[12366]} await asyncio.sleep(0.1) yield {"token_ids":[13]} ``` Also reduce duplication by making the bindings engine use the llm lib engine.
-
- 06 Mar, 2025 1 commit
-
-
GuanLuo authored
-