@@ -183,7 +184,6 @@ Currently in vLLM for HPU we support four execution modes, depending on selected
| 0 | 0 | torch.compile |
| 0 | 1 | PyTorch eager mode |
| 1 | 0 | HPU Graphs |
<figcaption>vLLM execution modes</figcaption>
!!! warning
    In 1.18.0, all modes that use `PT_HPU_LAZY_MODE=0` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in upcoming releases. For the best performance in 1.18.0, use HPU Graphs or PyTorch lazy mode.
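
The mode selection in the table can be sketched as a small lookup. This is only an illustration, not part of vLLM: the helper names are hypothetical, the first table column is taken to be `PT_HPU_LAZY_MODE`, the second column is assumed to be the `enforce_eager` setting, and the `(1, 1)` entry is inferred from the mention of four execution modes and of PyTorch lazy mode in the warning.

```python
# Hypothetical helpers for illustration only; they are not part of vLLM.
# Keys are (PT_HPU_LAZY_MODE, enforce_eager); the second column name is an assumption.
EXECUTION_MODES = {
    (0, 0): "torch.compile",
    (0, 1): "PyTorch eager mode",
    (1, 0): "HPU Graphs",
    (1, 1): "PyTorch lazy mode",  # inferred fourth mode, not shown in the rows above
}


def execution_mode(pt_hpu_lazy_mode: int, enforce_eager: int) -> str:
    """Return the execution mode name for the given pair of settings."""
    return EXECUTION_MODES[(pt_hpu_lazy_mode, enforce_eager)]


def is_experimental(pt_hpu_lazy_mode: int) -> bool:
    """Per the warning, every mode with PT_HPU_LAZY_MODE=0 is experimental in 1.18.0."""
    return pt_hpu_lazy_mode == 0


if __name__ == "__main__":
    for (lazy, eager), name in sorted(EXECUTION_MODES.items()):
        note = " (experimental in 1.18.0)" if is_experimental(lazy) else ""
        print(f"PT_HPU_LAZY_MODE={lazy}, enforce_eager={eager}: {name}{note}")
```

Under these assumptions, the two non-experimental combinations are the ones the warning recommends for best performance.
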
- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
- AWS Neuron SDK 2.23
## Configure a new environment
# --8<-- [end:requirements]
# --8<-- [start:configure-a-new-environment]
### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies
...
@@ -37,7 +38,7 @@ for alternative setup instructions including using Docker and manually installin
NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
# --8<-- [end:configure-a-new-environment]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
...
@@ -102,8 +103,8 @@ Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dock
### Feature support through NxD Inference backend
The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
to perform most of the heavy lifting, which includes PyTorch model initialization, compilation, and runtime execution. Therefore, most
[features supported on Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html) are also available via the vLLM integration.
To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include
| QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
| TPU_NAME | The user-assigned name of the TPU that is created when the queued resource request is allocated. |
| PROJECT_ID | Your Google Cloud project |
| ZONE | The GCP zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see [TPU regions and zones]. |
| ACCELERATOR_TYPE | The TPU version you want to use. For example, `v5litepod-4` specifies a v5e TPU with 4 cores, and `v6e-1` specifies a v6e TPU with 1 core. For more information, see [TPU versions]. |
| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information, see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes). |
| SERVICE_ACCOUNT | The email address for your service account. You can find it in the IAM Cloud Console under *Service Accounts*. For example: `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com` |
<figcaption>Parameter descriptions</figcaption>
Connect to your TPU using SSH:
...
@@ -94,7 +96,11 @@ Connect to your TPU using SSH:
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE