- 07 May, 2025 1 commit
-
-
Hongkuan Zhou authored
-
- 06 May, 2025 4 commits
-
-
jthomson04 authored
-
Graham King authored
New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded python engines.
Why?
- Pure Python, does not require knowing Rust to work on it. Much simpler to maintain.
- No embedded Python interpreter, which avoids linking libpython and avoids the macOS virtualenv issues.
- Should have better performance as it's "native" vllm / sglang.
- Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
-
hhzhang16 authored
-
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:
```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```
Full vllm example, with pre-processing in dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`
This builds on top of the work to move the pre-processor to the ingress side. It means we can decouple Rust and Python using NATS as the bus.
The `register_llm` call does this:
- Download the model from HF if necessary
- Load the model deployment card from the HF folder or extract it from GGUF
- Push the tokenizer config etc. into the NATS object store so ingress can access it from a different machine
- Publish the model deployment card to ETCD
-
- 01 May, 2025 1 commit
-
-
Graham King authored
Part of https://github.com/ai-dynamo/dynamo/issues/743
-
- 29 Apr, 2025 1 commit
-
-
Graham King authored
In a distributed system we don't know if the remote workers need pre-processing done ingress-side or not. Previously Client required us to decide this before discovering the remote endpoints, which was fine because pre-processing was worker-side.
As part of moving pre-processing back to ingress-side we need to split this into two steps:
- Client discovers the endpoints, and (later PR) will fetch their Model Deployment Card.
- PushRouter will use the Model Deployment Card to decide if they need pre-processing or not, which affects the types of the generic parameters.
Part of #743
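The two-step split described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual Rust code: `ModelDeploymentCard.requires_preprocessing`, `Endpoint`, `discover_endpoints` and `build_router` are hypothetical names, and the real `Client` and `PushRouter` express the decision through generic type parameters rather than a returned string.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ModelDeploymentCard:
    # Hypothetical field: True if ingress must pre-process before sending.
    requires_preprocessing: bool


@dataclass
class Endpoint:
    name: str
    card: Optional[ModelDeploymentCard] = None  # may not be fetched yet


def discover_endpoints(registry: Dict[str, Optional[ModelDeploymentCard]]) -> List[Endpoint]:
    """Step 1: find the endpoints without deciding anything about pre-processing."""
    return [Endpoint(name, card) for name, card in registry.items()]


def build_router(endpoints: List[Endpoint]) -> str:
    """Step 2: pick the request pipeline once the cards are known."""
    if any(e.card and e.card.requires_preprocessing for e in endpoints):
        return "preprocess-then-push"
    return "push-raw"
```

Keeping discovery free of the pre-processing decision is what lets the card be fetched later without changing the discovery step.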
-
- 26 Apr, 2025 1 commit
-
-
Hongkuan Zhou authored
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: Ubuntu <ubuntu@dev-inst-2w1vokvyuts83rzn4n1k7mnzew9.us-central1-a.c.brevdevprod.internal>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: Anant Sharma <anants@nvidia.com>
-
- 25 Apr, 2025 2 commits
-
-
Harrison Saturley-Hall authored
Signed-off-by: Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
-
Graham King authored
This will allow an ingress-side pre-processor to see it without needing a model checkout.
Currently pre-processing is done in the worker, which has access to the model deployment card ("MDC") files (`config.json`, `tokenizer.json` and `tokenizer_config.json`) locally. We want to move the pre-processor to the ingress side to support KV routing. That requires the ingress side (i.e. the HTTP server), which runs on a different machine than the worker, to be able to see those three files.
To support that, this PR makes the worker upload the contents of those files to the NATS object store, and publishes the MDC with those NATS URLs to the key-value store.
The key-value store has an interface so any store (NATS, etcd, Redis, etc.) can be supported. Implementations for memory and NATS are provided.
Fetching the MDC from the store, doing pre-processing ingress side, and publishing a card backed by a GGUF, are all for a later commit.
Part of #743
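A pluggable key-value store of the kind described above might look like the following. This is a minimal Python sketch for illustration only: the real interface lives in the Rust code, and `KeyValueStore`, `MemoryStore`, `publish_card`, the `mdc/` key prefix and the `nats://` URL shape are all assumptions, not names taken from the PR.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueStore(ABC):
    """Hypothetical interface; a NATS, etcd or Redis backend would implement it."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class MemoryStore(KeyValueStore):
    """In-memory backend, handy for tests and single-process runs."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


def publish_card(store: KeyValueStore, model: str, file_urls: Dict[str, str]) -> None:
    """Publish a card that refers to uploaded files by URL, not by local path."""
    card = json.dumps({"model": model, "files": file_urls}).encode()
    store.put(f"mdc/{model}", card)  # key layout is an assumption
```

Because the card carries URLs instead of local paths, a reader on another machine only needs access to the store and the object store, not a model checkout.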
-
- 21 Apr, 2025 1 commit
-
-
Abrar Shivani authored
-
- 18 Apr, 2025 1 commit
-
-
Hongkuan Zhou authored
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
-
- 12 Apr, 2025 1 commit
-
-
Hongkuan Zhou authored
feat: ETCD prefix watcher + python binding + runtime reconfiguration for router and disagg router (#581)
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
-
- 09 Apr, 2025 1 commit
-
-
Anant Sharma authored
-
- 07 Apr, 2025 1 commit
-
-
Graham King authored
As a first step towards KV routing:
- Introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet.
- Make the vllm engine publish the KV events received from our patched vllm.
Now we "just" need to connect the two. Easy right?
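For illustration, the two router modes mentioned above can be sketched in a few lines of Python. This is not the dynamo-run implementation (which is in Rust); `make_router` is a hypothetical helper, and the mode strings `"random"` and `"round-robin"` are assumed spellings of the flag values.

```python
import itertools
import random
from typing import Callable, Sequence


def make_router(mode: str, workers: Sequence[str]) -> Callable[[], str]:
    """Return a function that picks the next worker for the given mode."""
    if not workers:
        raise ValueError("no workers available")
    if mode == "round-robin":
        # Cycle through the workers in order, wrapping around at the end.
        cycle = itertools.cycle(workers)
        return lambda: next(cycle)
    if mode == "random":
        # Pick uniformly at random on every request.
        return lambda: random.choice(workers)
    raise ValueError(f"unknown router mode: {mode}")
```

Neither mode looks at worker state; a KV-aware mode would instead need the KV events that the engine now publishes.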
-
- 04 Apr, 2025 2 commits
-
-
Graham King authored
Also upgrade the cargo resolver to v3, the default.
New clippy lints:
- `next_back()` instead of `last()` for a double-ended iterator. That avoids walking the whole list.
- `repeat_n` instead of `repeat().take()`. That avoids cloning.
- Doc indenting
-
Graham King authored
Adds `@dynamo_worker(static = True)` to create a static worker which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio. This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint), and are discovered by ingress components using etcd. Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere. For NIM.
-
- 03 Apr, 2025 1 commit
-
-
tlipoca9 authored
-
- 02 Apr, 2025 1 commit
-
-
Ryan Olson authored
-
- 01 Apr, 2025 1 commit
-
-
Ryan Olson authored
-
- 31 Mar, 2025 1 commit
-
-
Ryan Olson authored
-
- 19 Mar, 2025 2 commits
-
-
Anant Sharma authored
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
-
Graham King authored
This makes the Rust parts all use the ring / rustls libraries instead of a local install of openssl. It's a step on the journey to being statically linked.
Pieces:
- `tokenizers` and `mistralrs` now support rustls (mistralrs by default, tokenizers with feature flag).
- Move shared dependencies up into workspace
- New `rand` crate has some renames for future rust
- Ensure the dependency doesn't creep back in by enforcing it with cargo deny.
-
- 18 Mar, 2025 2 commits
-
-
Dmitry Tokarev authored
Co-authored-by: Anant Sharma <anants@nvidia.com>
-
Harrison Saturley-Hall authored
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
-
- 17 Mar, 2025 2 commits
-
-
Graham King authored
-
GuanLuo authored
-
- 14 Mar, 2025 3 commits
-
-
Ryan McCormick authored
-
Ryan McCormick authored
-
Ryan Olson authored
-
- 13 Mar, 2025 2 commits
-
-
Anant Sharma authored
-
Graham King authored
- Any engine can take the name of a Hugging Face repository. It will be downloaded before calling the engine.
- The default engine (previously always mistralrs) depends on what is compiled in.
- Text can be piped in and will result in a single run of the model.
All of those together mean if you build with `--features vllm` you can do this and it will download the model and run it with vllm, answer your question, and exit:
```
echo "What is the capital of Costa Rica?" | dynamo-run Qwen/Qwen2.5-3B-Instruct
```
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
- 11 Mar, 2025 1 commit
-
-
Neelay Shah authored
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
-
- 10 Mar, 2025 1 commit
-
-
Anant Sharma authored
-
- 09 Mar, 2025 2 commits
-
-
Neelay Shah authored
Co-authored-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
-
Neelay Shah authored
Co-authored-by: Harrison Saturley-Hall <454891+saturley-hall@users.noreply.github.com>
Co-authored-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
-
- 08 Mar, 2025 2 commits
-
-
Dmitry Tokarev authored
-
Neelay Shah authored
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
-
- 07 Mar, 2025 2 commits
-
-
Graham King authored
There are two etcd keys:
- The service
- The model
The second one is the interesting one for us. Previously we confused the two.
-
Ryan McCormick authored
Replaces hard-coded "kv-hit-rate" string in multiple places with KV_HIT_RATE_SUBJECT constant in lib/llm.
-