docs: Cleanup index.rst (#2007)

Unverified Commit c49a13eb authored Jul 22, 2025 by atchernych Committed by GitHub Jul 22, 2025
5 changed files
--- a/docs/guides/cli_overview.md
+++ b/docs/guides/cli_overview.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# About the Dynamo Command Line Interface
-
-The Dynamo CLI serves, containerizes, and deploys Dynamo applications efficiently. It provides intuitive commands to manage your Dynamo services.
-
-## CLI Capabilities
-
-With the Dynamo CLI, you can:
-
-* Chat with models quickly using `run`
-* Serve multiple services locally using `serve`
-* Package your services into an archive (called a `dynamo artifact`) using `build`
-* Deploy pipelines to Dynamo Cloud using `deploy`
-
-## Commands
-
-### `run`
-
-Use `run` to start an interactive chat session with a model. This command executes the `dynamo-run` Rust binary under the hood. For more details, see [Running Dynamo](dynamo_run.md).
-
-#### Example
-```bash
-dynamo-run deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-```
-
-### `serve`
-
-Use `serve` to run your defined inference graph locally. You'll need to specify your file and intended class using the file:Class syntax. For more details, see [Serving Inference Graphs](dynamo_serve.md).
-
-#### Usage
-```bash
-dynamo serve [SERVICE]
-```
-
-#### Arguments
-* `SERVICE`: Specify the service to start using file:Class syntax
-
-#### Flags
-* `--file`/`-f`: Path to optional YAML configuration file. For configuration examples, see the [SDK docs](../API/sdk.md)
-* `--dry-run`: Print the dependency graph and values without starting services
-* `--service-name`: Start only the specified service name
-* `--working-dir`: Set the directory for finding the Service instance
-* Additional flags following Class.key=value pattern are passed to the service constructor. For details, see the configuration section of the [SDK docs](../API/sdk.md)
-
-#### Example
-```bash
-cd examples
-# Start the Frontend, Middle, and Backend components
-dynamo serve hello_world:Frontend
-
-# Start only the Middle component in the graph that is discoverable from the Frontend service
-dynamo serve --service-name Middle hello_world:Frontend
-```
-
-For a detailed deployment example, see [Operator Deployment](dynamo_deploy/operator_deployment.md).
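
The removed overview describes two conventions for `dynamo serve`: a `file:Class` target and `--Class.key=value` overrides. The sketch below illustrates how such arguments could be split apart. It is not Dynamo's actual argument handling; `parse_target` and `group_overrides` are hypothetical names, and values are kept as plain strings.

```python
from collections import defaultdict


def parse_target(target: str) -> tuple[str, str]:
    """Split a 'file:Class' target such as 'hello_world:Frontend'."""
    module, sep, cls = target.partition(":")
    if not sep or not module or not cls:
        raise ValueError(f"expected file:Class, got {target!r}")
    return module, cls


def group_overrides(flags: list[str]) -> dict[str, dict[str, str]]:
    """Group '--Class.key=value' flags by class name (hypothetical helper)."""
    grouped: dict[str, dict[str, str]] = defaultdict(dict)
    for flag in flags:
        name, sep, value = flag.removeprefix("--").partition("=")
        cls, dot, key = name.partition(".")
        if not sep or not dot or not cls or not key:
            raise ValueError(f"expected --Class.key=value, got {flag!r}")
        grouped[cls][key] = value
    return dict(grouped)


module, cls = parse_target("graphs.agg:Frontend")
overrides = group_overrides(["--Processor.router=random"])
print(module, cls, overrides)
```

Splitting on the first `:` and the first `.` keeps dotted module paths such as `graphs.agg` intact while still separating the class name from its key.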
--- a/docs/guides/dynamo_deploy/quickstart.md
+++ b/docs/guides/dynamo_deploy/quickstart.md
@@ -79,15 +79,38 @@ export DOCKER_USERNAME='$oauthtoken'  # your-username if not using nvcr.io
 export DOCKER_PASSWORD=YOUR_NGC_CLI_API_KEY  # your-password if not using nvcr.io
 ```

+### Pick the Dynamo Inference Image
+
+Export the tag of the Dynamo Runtime Image.
+If you are using a pre-defined release:
+
+```bash
+export IMAGE_TAG=RELEASE_VERSION # e.g., 0.3.2 (the release you are using)
+```
+
+Or build your own image first and tag it with `IMAGE_TAG`:
+
 ```bash
-export IMAGE_TAG=RELEASE_VERSION # i.e. 0.3.2 - the release you are using or your-image-tag if you have built your own Dynamo image.
-# The  Nvidia Cloud Operator image will be pulled from the `$DOCKER_SERVER/dynamo-operator:$IMAGE_TAG`.
+export IMAGE_TAG=<your-pick>
+./container/build.sh
+docker tag dynamo:latest-vllm <your-registry>/dynamo-base:$IMAGE_TAG
+docker login <your-registry>
+docker push <your-registry>/dynamo-base:$IMAGE_TAG
 ```

-The operator image will be pulled from `$DOCKER_SERVER/dynamo-operator:$IMAGE_TAG`.
+[More on image building](../../../../README.md)
+

 ### Install Dynamo Cloud

+Build and push the Dynamo Cloud Operator image by running:
+
+```bash
+earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG
+```
+
+The NVIDIA Cloud Operator image will be pulled from `$DOCKER_SERVER/dynamo-operator:$IMAGE_TAG`.
+
 You can run `deploy.sh` or use the manual commands under Step 1 and Step 2.

 **Installing with a script (alternative to Step 1 and Step 2)**

--- a/docs/guides/dynamo_run.md
+++ b/docs/guides/dynamo_run.md
-# Running Dynamo (`dynamo run`)
+# Running Dynamo CLI (`dynamo-run`)

-This guide explains the `dynamo run` command.

-`dynamo-run` is a CLI tool for exploring the Dynamo components. It's also an example of how to use components from Rust. If you use the Python wheel, it's available as `dynamo run`.
+With the Dynamo CLI, you can chat with models quickly using `dynamo-run`.
+`dynamo-run` is a CLI tool for exploring the Dynamo components. It's also an example of how to use components from Rust. If you use the Python wheel, it's available as `dynamo-run`.

 It supports these engines: mistralrs, llamacpp, sglang, vllm, and tensorrt-llm. `mistralrs` is the default.

@@ -11,7 +11,7 @@ Usage:
 dynamo-run in=[http|text|dyn://<path>|batch:<folder>] out=echo_core|echo_full|mistralrs|llamacpp|sglang|vllm|dyn [--http-port 8080] [--model-path <path>] [--model-name <served-model-name>] [--model-config <hf-repo>] [--tensor-parallel-size=1] [--context-length=N] [--num-nodes=1] [--node-rank=0] [--leader-addr=127.0.0.1:9876] [--base-gpu-id=0] [--extra-engine-args=args.json] [--router-mode random|round-robin|kv] [--kv-overlap-score-weight=1.0] [--router-temperature=0.0] [--use-kv-events=true] [--verbosity (-v|-vv)]
 ```

-Example: `dynamo run Qwen/Qwen3-0.6B`
+Example: `dynamo-run Qwen/Qwen3-0.6B`

 Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.

@@ -32,12 +32,12 @@ The vllm and sglang engines require [etcd](https://etcd.io/) and [nats](https://

 To automatically download Qwen3 4B from Hugging Face (16 GiB download) and to start it in interactive text mode:
 ```
-dynamo run out=vllm Qwen/Qwen3-4B
+dynamo-run out=vllm Qwen/Qwen3-4B
 ```

 The general format for HF download follows this pattern:
 ```
-dynamo run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
+dynamo-run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
 ```

 For gated models (such as meta-llama/Llama-3.2-3B-Instruct), you must set an `HF_TOKEN` environment variable.
@@ -65,12 +65,12 @@ To run the model:

 *Text interface*
 ```
-dynamo run Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF file
+dynamo-run Llama-3.2-3B-Instruct-Q4_K_M.gguf # or path to a Hugging Face repo checkout instead of the GGUF file
 ```

 *HTTP interface*
 ```
-dynamo run in=http out=mistralrs Llama-3.2-3B-Instruct-Q4_K_M.gguf
+dynamo-run in=http out=mistralrs Llama-3.2-3B-Instruct-Q4_K_M.gguf
 ```
 You can also list models or send a request:

@@ -211,7 +211,7 @@ The KV-aware routing arguments:

 ## Full usage details

-`dynamo run` executes `dynamo-run`. `dynamo-run` is also an example of what can be built in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide shows how to build from source with all the features.
+`dynamo-run` is also an example of what can be built in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide shows how to build from source with all the features.

 ### Getting Started

@@ -472,7 +472,7 @@ See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/t

 See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md#run-container) to run the built environment.

-##### Step 3: Execute `dynamo run` command
+##### Step 3: Execute `dynamo-run` command

 Execute the following to load the TensorRT-LLM model specified in the configuration.
 ```

--- a/docs/guides/dynamo_serve.md
+++ b/docs/guides/dynamo_serve.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Serving Inference Graphs (`dynamo serve`)
-
-This guide explains how to create, configure, and deploy inference graphs locally for large language models using the `dynamo serve` command.
-
-Inference graphs are compositions of service components that work together to handle LLM inference. A typical graph might include:
-
- Frontend: OpenAI-compatible HTTP server that handles incoming requests
- Processor: Processes requests before passing to workers
- Router: Routes requests to appropriate workers based on specified strategy
- Workers: Handle the actual LLM inference (prefill and decode phases)
-
-## Creating an inference graph
-
-Once you've written Dynamo services ([see the SDK](https://github.com/ai-dynamo/dynamo/blob/main/deploy/dynamo/sdk/docs/sdk/README.md)), create an inference graph by composing them together using the following mechanisms:
-1. Dependencies with `depends()`
-2. Dynamic composition with `.link()`
-
-See the following sections for more details.
-
-### Dependencies with `depends()`
-
-```python
-from components.worker import VllmWorker
-
-class Processor:
-    worker = depends(VllmWorker)
-
-    # Now you can call worker methods directly
-    async def process(self, request):
-        result = await self.worker.generate(request)
-```
-
-Benefits of `depends()`:
-
- Automatically ensures dependent services are deployed
- Creates type-safe client connections between services
- Allows calling dependent service methods directly
-
-### Dynamic composition with `.link()`
-
-```python
-# From examples/llm/graphs/agg.py
-from components.frontend import Frontend
-from components.processor import Processor
-from components.worker import VllmWorker
-
-Frontend.link(Processor).link(VllmWorker)
-```
-
-This creates a graph where:
-
- Frontend depends on Processor
- Processor depends on VllmWorker
-
-The `.link()` method is useful for:
-
- Dynamically building graphs at runtime
- Selectively activating specific dependencies
- Creating different graph configurations from the same components
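
To make the chaining behaviour of `.link()` concrete, here is a toy model of the graph described above. It is only a sketch: `Service` and `edges` are hypothetical stand-ins rather than SDK classes, and it assumes `link()` returns the linked service so that calls can be chained.

```python
class Service:
    """Toy stand-in for a Dynamo service; not the SDK class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.links: list["Service"] = []

    def link(self, other: "Service") -> "Service":
        # Record the edge, then return the target so calls can be chained.
        self.links.append(other)
        return other


def edges(root: Service) -> list[tuple[str, str]]:
    """Walk the linked graph depth-first and list (parent, child) pairs."""
    found: list[tuple[str, str]] = []
    for child in root.links:
        found.append((root.name, child.name))
        found.extend(edges(child))
    return found


frontend = Service("Frontend")
processor = Service("Processor")
worker = Service("VllmWorker")

# Mirrors Frontend.link(Processor).link(VllmWorker) from the guide.
frontend.link(processor).link(worker)
print(edges(frontend))
```

Under this assumption the chain yields two edges, Frontend to Processor and Processor to VllmWorker, which matches the dependency list in the guide.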
-
-## Deploying the inference graph
-
-Once you've defined your inference graph and its configuration, deploy it locally using the `dynamo serve` command. We recommend running the `--dry-run` command to see what arguments will be passed into your final graph.
-
-Consider the following example.
-
-### Guided Example
-
-The files referenced in this example can be found [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components). You need 1 GPU minimum to run this example. This example can be run from the `examples/llm` directory.
-
-This example walks through:
-1. [Defining your components](#define-your-components)
-2. [Defining your graph](#define-your-graph)
-3. [Defining your configuration](#define-your-configuration)
-4. [Serving your graph](#serve-your-graph)
-
-See the following sections for details.
-
-
-#### Define your components
-
-In this example we'll be deploying an aggregated serving graph. Our components include:
-
-1. Frontend - OpenAI-compatible HTTP server that handles incoming requests
-2. Processor - Runs processing steps and routes the request to a worker
-3. VllmWorker - Handles the prefill and decode phases of the request
-
-```python
-# components/frontend.py
-class Frontend:
-    worker = depends(VllmWorker)
-    worker_routerless = depends(VllmWorkerRouterLess)
-    processor = depends(Processor)
-
-    ...
-```
-
-```python
-# components/processor.py
-class Processor(ProcessMixIn):
-    worker = depends(VllmWorker)
-    router = depends(Router)
-
-    ...
-```
-
-```python
-# components/worker.py
-class VllmWorker:
-    prefill_worker = depends(PrefillWorker)
-
-    ...
-```
-
-Note that our prebuilt components have the maximal set of dependencies needed to run the component, which allows you to plug different components into the same graph to create different architectures. When writing your own components, you can be as flexible as you like.
-
-#### Define your graph
-
-```python
-# graphs/agg.py
-from components.frontend import Frontend
-from components.processor import Processor
-from components.worker import VllmWorker
-
-Frontend.link(Processor).link(VllmWorker)
-```
-
-#### Define your configuration
-
-We provide [basic configurations](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/configs/agg.yaml) that you can change; you can also override them by passing in CLI flags to `dynamo serve`.
-
-#### Serve your graph
-
-Before serving your graph, ensure that NATS and etcd are running using the [docker compose file](https://github.com/ai-dynamo/dynamo/blob/main/deploy/metrics/docker-compose.yml) in the deploy directory.
-
-```bash
-docker compose up -d
-```
-Note that we point toward the first node in our graph. In this case, it's the `Frontend` service.
-
-```bash
-# check out the configuration that will be used when we serve
-dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --dry-run
-```
-
-This returns output like:
-
-```bash
-Service Configuration:
-{
-  "Common": {
-    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
-    "block-size": 64,
-    "max-model-len": 16384,
-  },
-  "Frontend": {
-    "served_model_name": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
-    "endpoint": "dynamo.Processor.chat/completions",
-    "port": 8000
-  },
-  "Processor": {
-    "router": "round-robin",
-    "common-configs": [model, block-size, max-model-len]
-  },
-  "VllmWorker": {
-    "enforce-eager": true,
-    "max-num-batched-tokens": 16384,
-    "enable-prefix-caching": true,
-    "router": "random",
-    "tensor-parallel-size": 1,
-    "ServiceArgs": {
-      "workers": 1
-    },
-    "common-configs": [model, block-size, max-model-len]
-  }
-}
-
-Environment Variable that would be set:
-DYNAMO_SERVICE_CONFIG={"Common": {"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "block-size": 64, "max-model-len": 16384}, "Frontend": {"served_model_name": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "endpoint": "dynamo.Processor.chat/completions", "port": 8000}, "Processor": {"router": "round-robin", "common-configs": ["model", "block-size", "max-model-len"]}, "VllmWorker": {"enforce-eager": true, "max-num-batched-tokens": 16384, "enable-prefix-caching":
-true, "router": "random", "tensor-parallel-size": 1, "ServiceArgs": {"workers": 1}, "common-configs": ["model", "block-size", "max-model-len"]}}
-```
-
-You can override any of these configuration options by passing in CLI flags to serve. For example, to change the routing strategy, you can run
-
-```bash
-dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Processor.router=random --dry-run
-```
-
-Which prints out output like:
-
-```bash
-  #...
-  "Processor": {
-    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
-    "block-size": 64,
-    "max-model-len": 16384,
-    "router": "random"
-  },
-  #...
-```
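
The dry-run output above suggests that keys listed under `common-configs` are copied from `Common` into each service before CLI overrides are applied. The sketch below reproduces that behaviour for the `Processor` entry. It is an illustration only: `resolve` is a hypothetical helper, and the precedence (service settings over common values, CLI overrides last) is an assumption.

```python
import copy
from typing import Optional


def resolve(config: dict, overrides: Optional[dict] = None) -> dict:
    """Expand 'common-configs' from 'Common', then apply per-service overrides."""
    common = config.get("Common", {})
    resolved = {}
    for service, settings in config.items():
        if service == "Common":
            continue
        settings = copy.deepcopy(settings)
        # Start from the shared keys this service asked for.
        merged = {key: common[key] for key in settings.pop("common-configs", [])}
        # Assumed precedence: service settings, then CLI overrides.
        merged.update(settings)
        merged.update((overrides or {}).get(service, {}))
        resolved[service] = merged
    return resolved


config = {
    "Common": {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "block-size": 64,
        "max-model-len": 16384,
    },
    "Processor": {
        "router": "round-robin",
        "common-configs": ["model", "block-size", "max-model-len"],
    },
}

result = resolve(config, {"Processor": {"router": "random"}})
print(result["Processor"])
```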
-
-Once you're ready - simply remove the `--dry-run` flag to serve your graph!
-
-```bash
-dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
-```
-
-Once everything is running, you can test your graph by making a request to the frontend from a different window.
-
-```bash
-curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
-    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
-    "messages": [
-    {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-    }
-    ],
-    "stream":false,
-    "max_tokens": 30
-  }'
-```
-
-## Close deployment
-
-```{important}
-We are aware of an issue where vLLM subprocesses might not be killed when `ctrl-c` is pressed.
-We are working on addressing this. Relevant vLLM issues can be found [here](https://github.com/vllm-project/vllm/pull/8492) and [here](https://github.com/vllm-project/vllm/issues/6219#issuecomment-2439257824).
-
-To stop the server, press `ctrl-c`, which kills the components. To kill the remaining vLLM subprocesses, run `nvidia-smi` and `kill -9` the remaining processes, or run `pkill python3` from inside the container.
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -91,13 +91,10 @@ The examples below assume you build the latest image yourself from source. If us

 .. toctree::
   :hidden:
-   :caption: Dynamo Command Line Interface
+   :caption: Using Dynamo

-   CLI Overview <guides/cli_overview.md>
-   Running Dynamo (dynamo run) <guides/dynamo_run.md>
-   Serving Inference Graphs (dynamo serve) <guides/dynamo_serve.md>
-   Building Dynamo (dynamo build) <guides/dynamo_build.md>
-   Deploying Inference Graphs (dynamo deploy) <guides/dynamo_deploy/README.md>
+   Running Inference Graphs Locally (dynamo-run) <guides/dynamo_run.md>
+   Deploying Inference Graphs <guides/dynamo_deploy/README.md>

 .. toctree::
   :hidden:
@@ -114,7 +111,6 @@ The examples below assume you build the latest image yourself from source. If us

   Dynamo Deploy Quickstart <guides/dynamo_deploy/quickstart.md>
   Dynamo Cloud Kubernetes Platform <guides/dynamo_deploy/dynamo_cloud.md>
-   Deploying Dynamo Inference Graphs to Kubernetes using the Dynamo Cloud Platform <guides/dynamo_deploy/operator_deployment.md>
   Manual Helm Deployment <guides/dynamo_deploy/manual_helm_deployment.md>
   GKE Setup Guide <guides/dynamo_deploy/gke_setup.md>
   Minikube Setup Guide <guides/dynamo_deploy/minikube.md>
@@ -131,15 +127,13 @@ The examples below assume you build the latest image yourself from source. If us
   :hidden:
   :caption: API

-   SDK Reference <API/sdk.md>
   Python API <API/python_bindings.md>

 .. toctree::
   :hidden:
   :caption: Examples

-   Hello World Example: Basic <examples/hello_world.md>
-   Hello World Example: Aggregated and Disaggregated Deployment <examples/disagg_skeleton.md>
+   Aggregated and Disaggregated Deployment <examples/disagg_skeleton.md>
   LLM Deployment Examples <examples/llm_deployment.md>
   Multinode Examples <examples/multinode.md>
   LLM Deployment Examples using TensorRT-LLM <examples/trtllm.md>