- 13 May, 2025 1 commit
-
-
Biswa Panda authored
Co-authored-by: Graham King <grahamk@nvidia.com>
Co-authored-by: hongkuan <hongkuanz@nvidia.com>
Co-authored-by: Ubuntu <ubuntu@crusoe-prod--inst-2wjuoekvfq72mlpdrcugujrtgfp.us-east1-a.compute.internal>
-
- 09 May, 2025 1 commit
-
-
ishandhanani authored
Co-authored-by: ishandhanani <ishandhananai@gmail.com>
-
- 07 May, 2025 2 commits
-
-
Graham King authored
Signed-off-by: Graham King <graham@gkgk.org>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
-
Graham King authored
vllm and sglang are now the sub-process engines from #954. Also updated the docs on running vllm and sglang multi-GPU (tensor parallel) and multi-node (pipeline parallel).
-
- 06 May, 2025 2 commits
-
-
Graham King authored
New vllm and sglang engines that run in a sub-process. Will hopefully replace the existing embedded Python engines.

Why?
- Pure Python; does not require knowing Rust to work on it. Much simpler to maintain.
- No embedded Python interpreter, which avoids linking libpython and avoids the macOS virtualenv issues.
- Should have better performance as it's "native" vllm / sglang.
- Works with any version of vllm (including v1!) and sglang. Less upgrade struggle.
-
Graham King authored
Adding this to a Python script makes it register on the network so that `dynamo-run` can discover it and send it requests:

```
from dynamo.llm import register_llm

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
await register_llm(endpoint, MODEL, 3)
```

Full vllm example, with pre-processing in dynamo:
- `dynamo-run in=text out=dyn://dynamo.backend.generate`
- `cd lib/bindings/python/examples/hello_world`
- `python server_vllm.py`

This builds on top of the work to move the pre-processor to the ingress side. It means we can decouple Rust and Python using NATS as the bus.

The `register_llm` call does this:
- Download the model from HF if necessary
- Load the model deployment card from the HF folder or extract it from GGUF
- Push the tokenizer config etc. into the NATS object store so ingress can access it from a different machine
- Publish the model deployment card to ETCD
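The registration steps above can be sketched with in-memory stand-ins. This is an illustrative sketch only: every name here (`FakeObjectStore`, `FakeEtcd`, `register_llm_steps`, the key layout) is hypothetical and not the real `dynamo.llm` API; it just mirrors the described flow of loading a deployment card, pushing the tokenizer config to an object store (NATS in the real system), and publishing the card (ETCD in the real system).

```python
class FakeObjectStore:
    """Hypothetical stand-in for the NATS object store."""
    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data


class FakeEtcd:
    """Hypothetical stand-in for the ETCD key-value store."""
    def __init__(self):
        self.kv = {}

    def publish(self, key, value):
        self.kv[key] = value


def register_llm_steps(model_name, tokenizer_config, store, etcd):
    # 1. (The real code would first download the model from HF if missing.)
    # 2. Build/load the model deployment card (schema here is invented).
    card = {"model": model_name, "tokenizer_key": f"tokenizer/{model_name}"}
    # 3. Push the tokenizer config so a remote ingress can fetch it.
    store.put(card["tokenizer_key"], tokenizer_config)
    # 4. Publish the card so the ingress side can discover the endpoint.
    etcd.publish(f"models/{model_name}", card)
    return card


store, etcd = FakeObjectStore(), FakeEtcd()
card = register_llm_steps(
    "Qwen/Qwen2.5-0.5B-Instruct", {"eos_token": "<|im_end|>"}, store, etcd
)
```

The point of the split is visible even in the stub: the worker only writes to shared stores, so the ingress on another machine needs no direct connection to it.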
-