Commits · 904730b9abdac12db8bb1b5db4966950b3e9b725 · OpenDAS / dynamo

28 Apr, 2025 1 commit
- feat: Adding completions endpoint support to `dynamo run in=http` (#777) · b495cd83
  Olga Andreeva authored Apr 28, 2025
```
Signed-off-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
```
  b495cd83
07 Apr, 2025 1 commit

feat(dynamo-run): Basic routing choice (#524) · ec2e7307

Graham King authored Apr 07, 2025

As a first step towards KV routing:
- introduce a `--router-mode` in dynamo-run that only does random and round-robin right now. Not that interesting yet.
- Make the vllm engine publish the KV events received from our patched vllm.

Now we "just" need to connect the two. Easy right?

ec2e7307

04 Apr, 2025 1 commit

feat: Python decorator dynamo_worker takes optional `static` parameter without etcd (#494) · 88ad3425

Graham King authored Apr 04, 2025

Adds `@dynamo_worker(static = True)` to create a static worker which has a predictable name and hence does not require discovery or `etcd` to be running. There can only be a single static worker per namespace / component / endpoint trio.

This contrasts with the default dynamic `dynamo_worker` endpoints we have now, which get a unique random name (based on namespace/component/endpoint), and are discovered by ingress components using etcd.

Also change the hello_world example to use `dynamo_worker(static = True)` so that it is exercised and demonstrated somewhere.

For NIM.

88ad3425

08 Mar, 2025 1 commit
- chore: rename dynamo (#44) · 602352ce
  Neelay Shah authored Mar 08, 2025
```
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
```
  602352ce
07 Mar, 2025 1 commit

fix: dynemo-run model discovery working again (#52) · 9f53922a

Graham King authored Mar 07, 2025

There are two etcd keys:
- The service
- The model

The second one is the interesting one for us. Previously we confused the two.

9f53922a

05 Mar, 2025 2 commits
- refactor: rename triton_distributed to dynemo (#22) · 1af7433b
  Neelay Shah authored Mar 05, 2025
```
Co-authored-by: Graham King <grahamk@nvidia.com>
```
  1af7433b
- refactor: Rename 'tio' to 'dynemo-run' (#18) · 14ce7e03
  Graham King authored Mar 04, 2025
  
  14ce7e03
04 Mar, 2025 1 commit
- feat: vllm engine tensor parallel and pipeline parallel (#16) · a657ec61
  Graham King authored Mar 04, 2025
```
Needs more testing but good enough for now. I get the same results with this as with `vllm serve`.
```
  a657ec61
27 Feb, 2025 2 commits
- refactor: rename ChatCompletionResponseDelta to NvCreateChatCompletionStreamResponse (#292) · 110f3f8c
  Paul Hendricks authored Feb 27, 2025
  
  110f3f8c
- refactor: rename ChatCompletionRequest to NvCreateChatCompletionRequest (#284) · 96866f43
  Paul Hendricks authored Feb 27, 2025
  
  96866f43
25 Feb, 2025 5 commits

feat: Add completion endpoint to http server and llmctl (#230) · b760c569
Alec authored Feb 25, 2025
```
Co-authored-by: aflowers <aflowers@nvidia.com>
```
b760c569
refactor: moving tio to launch dir · eb022ec9
Neelay Shah authored Feb 25, 2025

eb022ec9

feat: tio support preprocessor (#265) · 72064d84

Graham King authored Feb 25, 2025

Add backend type `EngineConfig::StaticCore` that wraps the engine in a preprocessor (prompt templating and tokenization).

Add example engine `echo_core` (`out=echo_core`) which takes and returns tokens. A nice side effect is that it echos the full prompt template with system prompt, whereas `echo_full` echos only user prompt.

![image](https://github.com/user-attachments/assets/27ec0a7b-a27d-4e69-96ea-1ffa0822ea90)

72064d84

ci: Add rust checks to missing directories (#239) · c06b95ff
Ryan McCormick authored Feb 25, 2025
```
Signed-off-by: Ryan McCormick <rmccormick@nvidia.com>
```
c06b95ff

refactor: move libs to lib dir · 08fcd7e9

Neelay Shah authored Feb 24, 2025


Signed-off-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

08fcd7e9

21 Feb, 2025 2 commits

feat(tio): Distributed inference! (#235) · 32a748e4

Graham King authored Feb 21, 2025

Add support in tio for distributed components and discovery.

Node 1:
```
tio in=http out=tdr://ns/backend/mistralrs
```

Node 2:
```
tio in=tdr://ns/backend/mistralrs out=mistralrs ~/llm_models/Llama-3.2-3B-Instruct
```

This will use etcd to auto-discover the model and NATS to talk to it. You can run multiple workers on the same endpoint and it will pick one at random each time.

The `ns/backend/mistralrs` are purely symbolic, pick anything as long as it has three parts, and it matches the other node.

32a748e4

feat: event plane + count · 3b7a462d

Ryan Olson authored Feb 21, 2025


Signed-off-by: Ryan Olson <ryanolson@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

3b7a462d

13 Feb, 2025 1 commit

feat: Add `tio` your friendly cmd line uncle to run triton-llm services (#174) · 418ae5e8

Graham King authored Feb 13, 2025

This provides a simple example of how to write a triton-llm engine, and how to connect it to the OpenAI HTTP server.

This is the tool previously called `nio` and `llmctl`.

- **Inputs**: Text and HTTP.
- **Engines**: Echo, which streams your prompt back with a slight delay.

Build: `cargo build`

Pre-requisites: `nats-server` and `etcd` must be running locally, even though they are not yet used by `tio`.

Run with text input:
```
./target/debug/tio in=text out=echo_full --model-name test
```

Run with the triton-llm HTTP server:
```
./target/debug/tio in=http out=echo_full --http-port 8080 --model-name Echo-0B
```

List models:
```
curl localhost:8080/v1/models | jq
```

Will output
```
{
  "object": "list",
  "data": [
    {
      "id": "Echo-0B",
      "object": "object",
      "created": 1739400430,
      "owned_by": "nvidia"
    }
  ]
}
```

#### What's next

As triton-distributed gains features `tio` will be able to grow:
- When we get the pre-processor we can have token-in token-out engines. 
- When we get a pull-router we can have `in=nats` and `out=nats`.
- When we get discovery we can have dynamic engines.

418ae5e8