@@ -15,29 +15,10 @@ See the License for the specific language governing permissions and
...
@@ -15,29 +15,10 @@ See the License for the specific language governing permissions and
limitations under the License.
limitations under the License.
-->
-->
# LLM Deployment Examples using TensorRT-LLM
# LLM Deployment using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregate or disaggregated serving.
This figure shows an overview of the major components to deploy:
## Single Node Examples
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
This figure shows an overview of the major components to deploy:
@@ -111,29 +119,23 @@ This figure shows an overview of the major components to deploy:
...
@@ -111,29 +119,23 @@ This figure shows an overview of the major components to deploy:
| +---------| kv-router |
| +---------| kv-router |
+------------->| |
+------------->| |
+------------------+
+------------------+
```
```
**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.
**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.
### Single-Node Deployments
### Aggregated
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `dynamo-run` to start up the ingress and using `python3` to start up the workers. You can easily take each command and run them in separate terminals.
#### Aggregated
```bash
```bash
cd $DYNAMO_HOME/components/backends/trtllm
cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg.sh
./launch/agg.sh
```
```
#### Aggregated with KV Routing
### Aggregated with KV Routing
```bash
```bash
cd $DYNAMO_HOME/components/backends/trtllm
cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg_router.sh
./launch/agg_router.sh
```
```
#### Disaggregated
### Disaggregated
> [!IMPORTANT]
> [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
...
@@ -143,7 +145,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
...
@@ -143,7 +145,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg.sh
./launch/disagg.sh
```
```
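As a minimal sketch of switching strategies from a Python launcher, the snippet below builds an environment with `DISAGGREGATION_STRATEGY` set. The two strategy names and the variable name come from the note above. The helper `build_launch_env` and its validation step are hypothetical and are not part of the launch scripts.

```python
import os

# Strategy names as quoted in the note above; the rest is illustrative only.
VALID_STRATEGIES = ("prefill_first", "decode_first")


def build_launch_env(strategy, base_env=None):
    """Return a copy of the environment with DISAGGREGATION_STRATEGY set.

    Hypothetical helper: the validation here is an assumption added for
    safety, not something the launch script is known to perform.
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}"
        )
    env = dict(os.environ if base_env is None else base_env)
    env["DISAGGREGATION_STRATEGY"] = strategy
    return env


if __name__ == "__main__":
    env = build_launch_env("prefill_first")
    print(env["DISAGGREGATION_STRATEGY"])
    # To launch with this environment, something like the following could be
    # run from $DYNAMO_HOME/components/backends/trtllm:
    #   import subprocess
    #   subprocess.run(["./launch/disagg.sh"], env=env, check=True)
```

Copying the environment instead of mutating `os.environ` keeps the setting scoped to the launched process.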
#### Disaggregated with KV Routing
### Disaggregated with KV Routing
> [!IMPORTANT]
> [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
...
@@ -153,7 +155,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
...
@@ -153,7 +155,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg_router.sh
./launch/disagg_router.sh
```
```
#### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
```bash
```bash
cd $DYNAMO_HOME/components/backends/trtllm
cd $DYNAMO_HOME/components/backends/trtllm
...
@@ -172,21 +174,16 @@ Notes:
...
@@ -172,21 +174,16 @@ Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
### Multinode Deployment
## Advanced Examples
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
### Client
See the [client](../llm/README.md#client) section to learn how to send requests to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
### Benchmarking
### Multinode Deployment
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
@@ -221,6 +218,13 @@ indicates a request to this model may be migrated up to 3 times to another Backe
...
@@ -221,6 +218,13 @@ indicates a request to this model may be migrated up to 3 times to another Backe
The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
## More Example Architectures
## Client
See the [client](../llm/README.md#client) section to learn how to send requests to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
## Deployment Architectures
## Use the Latest Release
See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. vLLM supports aggregated, disaggregated, and KV-routed serving patterns.
We recommend using the latest stable release of Dynamo to avoid breaking changes:
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |
## Quick Start
Below we provide a guide that lets you run all of our common deployment patterns on a single node.
### Start NATS and ETCD in the background
Start using [Docker Compose](../../../deploy/docker-compose.yml)
```bash
```bash
docker compose -f deploy/docker-compose.yml up -d
docker compose -f deploy/docker-compose.yml up -d
```
```
### Build and Run docker
### Pull or build container
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
## Run Deployment
## Run Single Node Examples
> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
This figure shows an overview of the major components to deploy:
This figure shows an overview of the major components to deploy:
...
@@ -53,12 +101,7 @@ This figure shows an overview of the major components to deploy:
...
@@ -53,12 +101,7 @@ This figure shows an overview of the major components to deploy:
Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
### Example Architectures
### Aggregated Serving
> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `dynamo run` to start the ingress and uses `python3 main.py` to start the vLLM workers. You can run each command in separate terminals for better log visibility.
#### Aggregated Serving
```bash
```bash
# requires one gpu
# requires one gpu
...
@@ -66,7 +109,7 @@ cd components/backends/vllm
...
@@ -66,7 +109,7 @@ cd components/backends/vllm
bash launch/agg.sh
bash launch/agg.sh
```
```
#### Aggregated Serving with KV Routing
### Aggregated Serving with KV Routing
```bash
```bash
# requires two gpus
# requires two gpus
...
@@ -74,7 +117,7 @@ cd components/backends/vllm
...
@@ -74,7 +117,7 @@ cd components/backends/vllm
bash launch/agg_router.sh
bash launch/agg_router.sh
```
```
#### Disaggregated Serving
### Disaggregated Serving
```bash
```bash
# requires two gpus
# requires two gpus
...
@@ -82,7 +125,7 @@ cd components/backends/vllm
...
@@ -82,7 +125,7 @@ cd components/backends/vllm
bash launch/disagg.sh
bash launch/disagg.sh
```
```
#### Disaggregated Serving with KV Routing
### Disaggregated Serving with KV Routing
```bash
```bash
# requires three gpus
# requires three gpus
...
@@ -90,9 +133,9 @@ cd components/backends/vllm
...
@@ -90,9 +133,9 @@ cd components/backends/vllm
bash launch/disagg_router.sh
bash launch/disagg_router.sh
```
```
#### Single Node Data Parallel Attention / Expert Parallelism
### Single Node Data Parallel Attention / Expert Parallelism
This example is not meant to be performant but showcases dynamo routing to data parallel workers
This example is not meant to be performant but showcases Dynamo routing to data parallel workers
```bash
```bash
# requires four gpus
# requires four gpus
...
@@ -100,10 +143,13 @@ cd components/backends/vllm
...
@@ -100,10 +143,13 @@ cd components/backends/vllm
bash launch/dep.sh
bash launch/dep.sh
```
```
> [!TIP]
> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
## Advanced Examples
Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!
### Kubernetes Deployment
### Kubernetes Deployment
For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
...
@@ -118,7 +164,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
...
@@ -118,7 +164,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm-runtime`. If you don't have access, build and push your own image:
- **Container Images**: We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
Dynamo `DistributedRuntime` is the core infrastructure in dynamo that enables distributed communication and coordination between different dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via binding (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (e.g., Python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolate between different model deployments.
- `Namespace`: A `Namespace` is a logical grouping of components that isolate between different model deployments.
...
@@ -28,13 +28,13 @@ Dynamo `DistributedRuntime` is the core infrastructure in dynamo that enables di
...
@@ -28,13 +28,13 @@ Dynamo `DistributedRuntime` is the core infrastructure in dynamo that enables di
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, the components share the same namespace so that they can discover each other.
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, the components share the same namespace so that they can discover each other.
For example, the deployment configuration `examples/llm/configs/disagg.yaml` have four workers:
For example, a typical deployment configuration (like `components/backends/vllm/deploy/agg.yaml` or `components/backends/sglang/deploy/agg.yaml`) has multiple workers:
- `Frontend`: Start an HTTP server and register a `chat/completions` endpoint. The HTTP server route the request to the `Processor`.
- `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the worker components.
- `Processor`: When a new request arrives, `Processor` applies the chat template and perform the tokenization. Then, it route the request to the `VllmWorker`.
- `VllmDecodeWorker`: Performs the actual decode computation using the vLLM engine through the `DecodeWorkerHandler`.
- `VllmWorker` and `PrefillWorker`: Perform the actual decode and prefill computation.
- `VllmPrefillWorker` (in disaggregated deployments): Performs prefill computation using the vLLM engine through the `PrefillWorkerHandler`.
Since the four workers are deployed in different processes, each of them have their own `DistributedRuntime`. Within their own `DistributedRuntime`, they all have their own `Namespace`s named `dynamo`. Then, under their own `dynamo` namespace, they have their own `Component`s named `Frontend/Processor/VllmWorker/PrefillWorker`. Lastly, for the `Endpoint`, `Frontend` has no `Endpoints`, `Processor` and `VllmWorker` each has a `generate` endpoint, and `PrefillWorker` has a placeholder `mock` endpoint.
Since the workers are deployed in different processes, each of them has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `vllm-disagg`, `sglang-agg`). Then, under their namespace, they have their own `Component`s: `Frontend` uses the `make_engine` function which handles HTTP serving and routing automatically, while worker components like `VllmDecodeWorker` and `VllmPrefillWorker` create components with names like `worker`, `decode`, or `prefill` and register endpoints like `generate` and `clear_kv_blocks`. The `Frontend` component doesn't explicitly create endpoints; instead, the `make_engine` function handles the HTTP server and worker discovery. Worker components create their endpoints programmatically using the `component.endpoint()` method and use their respective handler classes (`DecodeWorkerHandler` or `PrefillWorkerHandler`) to process requests. Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("vllm-agg").component("worker")`), and their `Endpoint`s are created using the `component.endpoint()` method.
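The hierarchy described above can be illustrated with a toy in-memory model. This is a sketch only: `ToyRuntime`, `ToyNamespace`, `ToyComponent`, and `ToyEndpoint` are hypothetical stand-ins that mimic the `runtime.namespace(...).component(...).endpoint(...)` call shape, and they are not the classes from Dynamo's Python bindings. No etcd or NATS connection is involved.

```python
class ToyEndpoint:
    def __init__(self, component, name):
        self.component = component
        self.name = name

    def path(self):
        # Namespace/Component/Endpoint, mirroring the hierarchy in the text.
        ns = self.component.namespace.name
        return f"{ns}/{self.component.name}/{self.name}"


class ToyComponent:
    def __init__(self, namespace, name):
        self.namespace = namespace
        self.name = name
        self._endpoints = {}

    def endpoint(self, name):
        # Endpoint names are unique within a component.
        if name not in self._endpoints:
            self._endpoints[name] = ToyEndpoint(self, name)
        return self._endpoints[name]


class ToyNamespace:
    def __init__(self, name):
        self.name = name
        self._components = {}

    def component(self, name):
        # Component names are unique within a namespace.
        if name not in self._components:
            self._components[name] = ToyComponent(self, name)
        return self._components[name]


class ToyRuntime:
    """One instance per process, like one DistributedRuntime per worker."""

    def __init__(self):
        self._namespaces = {}

    def namespace(self, name):
        if name not in self._namespaces:
            self._namespaces[name] = ToyNamespace(name)
        return self._namespaces[name]


# Two worker processes: separate runtimes, same namespace name.
decode_rt = ToyRuntime()
prefill_rt = ToyRuntime()
decode = decode_rt.namespace("vllm-disagg").component("decode").endpoint("generate")
prefill = prefill_rt.namespace("vllm-disagg").component("prefill").endpoint("generate")
print(decode.path())   # vllm-disagg/decode/generate
print(prefill.path())  # vllm-disagg/prefill/generate
```

Each toy runtime holds its own objects, yet both workers resolve to paths under the same namespace name, which is what lets separately deployed components find each other.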
## Initialization
## Initialization
...
@@ -55,7 +55,7 @@ The hierarchy and naming in etcd and NATS may change over time, and this documen
...
@@ -55,7 +55,7 @@ The hierarchy and naming in etcd and NATS may change over time, and this documen
- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
- `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
- `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
  - NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
  - NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
  - etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `PrefillWorker`s in one deployment) share the same `Namespace`, `Componenet`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
  - etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
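To make the two naming patterns concrete, here is a small sketch that formats them from their parts. The helper names `nats_subject` and `etcd_key` are hypothetical, the lease IDs are made-up values, and rendering the lease ID in hexadecimal for the etcd key is an assumption (the text only states hex for the NATS subject). As noted earlier, the naming may change over time, so treat this as an illustration and not as a stable contract.

```python
def nats_subject(namespace, service, endpoint, lease_id):
    """Format {namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}."""
    return f"{namespace}.{service}.{endpoint}-{lease_id:x}"


def etcd_key(namespace, component, endpoint, lease_id):
    """Format /services/{namespace}/{component}/{endpoint}-{lease_id}.

    Assumption: the lease ID is written in hex here, matching the NATS subject.
    """
    return f"/services/{namespace}/{component}/{endpoint}-{lease_id:x}"


# Two prefill workers share namespace, component, and endpoint names;
# only the (made-up) lease IDs of their runtimes differ.
worker_a = 0x1A2B
worker_b = 0x3C4D

print(nats_subject("vllm-disagg", "prefill", "generate", worker_a))
# vllm-disagg.prefill.generate-1a2b
print(etcd_key("vllm-disagg", "prefill", "generate", worker_a))
# /services/vllm-disagg/prefill/generate-1a2b
print(etcd_key("vllm-disagg", "prefill", "generate", worker_b))
# /services/vllm-disagg/prefill/generate-3c4d
```

Because the lease ID is the only varying part, a client that lists keys under `/services/vllm-disagg/prefill/` would see one entry per live worker.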
## Calling Endpoints
## Calling Endpoints
...
@@ -74,19 +74,6 @@ After selecting which endpoint to hit, the `Client` sends the serialized request
...
@@ -74,19 +74,6 @@ After selecting which endpoint to hit, the `Client` sends the serialized request
We provide native rust and python (through binding) examples for basic usage of `DistributedRuntime`:
We provide native rust and python (through binding) examples for basic usage of `DistributedRuntime`:
- Rust: `/lib/runtime/examples/`
- Rust: `/lib/runtime/examples/`
- Python: `/lib/bindings/python/examples/`. We also provide a complete example of using `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy a vllm-based service. Please refer to `lib/bindings/python/examples/hello_world/server_vllm.py` for details.
- Python: We also provide complete examples of using `DistributedRuntime`. Please refer to the engines in `/components/backends` for full implementation details.
```{note}
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and requires extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building vLLM from source
on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
For example, on an ARM machine with low system resources:
In this example, we create an EKS cluster consisting of 1 `g6e.48xlarge` compute node with 8 NVIDIA L40S GPUs and 1 `c5.2xlarge` CPU node as control plane. We also set up EFA between the compute nodes.
### a. Configure AWS CLI
```
aws configure
```
### b. Create a config file for EKS cluster creation
```
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <CLUSTER_NAME>
  version: "1.32"
  region: <REGION_NAME>
iam:
  withOIDC: true
managedNodeGroups:
  - name: sys-ng
    instanceType: c5.2xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
        awsLoadBalancerController: true
        cloudWatch: true
        albIngress: true
  - name: efa-compute-ng
    instanceType: g6e.48xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    volumeSize: 300
    efaEnabled: true
    privateNetworking: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
        awsLoadBalancerController: true
        cloudWatch: true
        albIngress: true
```
> [!NOTE]
> We set `minSize` and `desiredCapacity` to 1 because AWS does not create your cluster successfully if the requested nodes are not available. For example, if you set `desiredCapacity` to 2 but two nodes are not available, cluster creation fails with a timeout even though no errors are reported. The easiest way to avoid this is to create the cluster with 1 node and increase the number of nodes later in the EKS console. After you increase the number of nodes in your node groups, make sure the GPU nodes are in the same subnet. This is required for EFA to work.
### c. Create the EKS cluster
```
eksctl create cluster -f eks_cluster_config.yaml
```
## 3. Create an EFS file system
We'll need a common, shared storage location to enable pods deployed to multiple nodes to load shards of the same model. This way, they can be used in coordination to serve inference requests for models too large to be loaded by GPUs on a single node. In Kubernetes, these common, shared storage locations are referred to as persistent volumes. Persistent volumes can be volume-mapped into any number of pods and then accessed by processes running inside those pods as if they were part of the pod's file system. We will be using EFS as the persistent volume.
Additionally, we will need to create a persistent-volume claim, which we can use to assign the persistent volume to a pod.
### a. Create an IAM role
Follow the steps to create an IAM role for your EFS file system: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources. This role will be used later when you install the EFS CSI Driver.
### b. Install EFS CSI driver
Install the EFS CSI Driver through the Amazon EKS add-on in the AWS console: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-install-driver. Once it's done, check the Add-ons section in the EKS console; the driver should show `Active` under Status.
### c. Create EFS file system
Follow the steps to create an EFS file system: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/efs-create-filesystem.md. Make sure you mount subnets in the last step correctly. This will affect whether your nodes are able to access the created EFS file system.
## 4. Test
Follow the steps to check if your EFS file system is working properly with your nodes: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/multiple_pods. This test is going to mount your EFS file system on all of your available nodes and write a text file to the file system.
## 5. Create StorageClass
You can find your `fileSystemId` in AWS EFS. It usually starts with `fs-`.
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": false,
"max_tokens": 30
}'
```
You should see output similar to the following:
```
{"id":"chatcmpl-bbe52b36-90ed-4479-9872-89e1aa412aa7","choices":[{"index":0,"message":{"content":"<think>\nOkay, so the user wants me to develop a character background for an explorer named someone in Eldoria. The character is part of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753417848,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":29,"total_tokens":225,"prompt_tokens_details":null,"completion_tokens_details":null}}