[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
This section describes how to set up your environment to run vLLM on Neuron.
!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
## Requirements
# --8<-- [start:requirements]
- OS: Linux
- Python: 3.9 or newer
...
@@ -17,8 +16,7 @@
- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
- AWS Neuron SDK 2.23
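
The OS and Python requirements above can be checked with a short script. This is a minimal sketch, not part of vLLM or the Neuron SDK: the names `check_host`, `MIN_PYTHON`, and `NEURONCORE_BY_CHIP` are hypothetical, and the script does not verify the accelerator or the installed Neuron SDK version.

```python
import platform
import sys

# Minimum Python version from the requirements list.
MIN_PYTHON = (3, 9)

# NeuronCore generation per chip family, as stated in the requirements list.
NEURONCORE_BY_CHIP = {
    "trn1": "NeuronCore-v2",
    "inf2": "NeuronCore-v2",
    "trn2": "NeuronCore-v3",
}


def check_host(os_name, python_version):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if os_name != "Linux":
        problems.append(f"OS must be Linux, found {os_name}")
    if tuple(python_version[:2]) < MIN_PYTHON:
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"Python must be 3.9 or newer, found {found}")
    return problems


if __name__ == "__main__":
    # Check the machine this script is running on.
    issues = check_host(platform.system(), sys.version_info[:2])
    print("OK" if not issues else "\n".join(issues))
```

Passing the OS name and Python version as arguments, rather than reading them inside the function, keeps the check easy to exercise on a machine that is not a Neuron instance.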
# --8<-- [end:requirements]
## Configure a new environment
# --8<-- [start:configure-a-new-environment]
### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies
...
@@ -27,6 +25,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
- Once inside your instance, activate the pre-installed virtual environment for inference by running
@@ -38,20 +37,15 @@ for alternative setup instructions including using Docker and manually installin
NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
available on vLLM V0. Please use the AWS fork for the following features:
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
# --8<-- [end:pre-built-images]
### Build image from source
# --8<-- [start:build-image-from-source]
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to the "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have the `habana-container-runtime` package installed and that the `habana` container runtime is registered.
# --8<-- [end:build-image-from-source]
## Extra information
# --8<-- [start:extra-information]
### Supported features
- [Offline inference][offline-inference]
- Online serving via [OpenAI-Compatible Server][openai-compatible-server]
...
@@ -130,14 +121,14 @@ docker run \
for accelerating low-batch latency and throughput
- Attention with Linear Biases (ALiBi)
### Unsupported features
- Beam search
- LoRA adapters
- Quantization
- Prefill chunking (mixed-batch inferencing)
### Supported configurations
The following configurations have been validated to function with
Gaudi2 devices. Configurations that are not listed may or may not work.
...
@@ -401,4 +392,3 @@ the below:
higher batches. You can do that by adding the `--enforce-eager` flag to the
server (for online serving), or by passing the `enforce_eager=True`
argument to the LLM constructor (for offline inference).
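
The two ways of enabling eager mode described above can be sketched as follows. This is an illustration under stated assumptions rather than a recommended launcher: the helper names are hypothetical, `some-model` is a placeholder, and the server is assumed to be started through the `vllm serve` entry point. `vllm` is imported lazily so the helpers can be inspected on a machine without it installed.

```python
def build_serve_command(model, enforce_eager=True):
    """Build the argument list for launching the server (online serving)."""
    cmd = ["vllm", "serve", model]
    if enforce_eager:
        # Same flag the text describes for the server.
        cmd.append("--enforce-eager")
    return cmd


def build_llm_kwargs(model, enforce_eager=True):
    """Build keyword arguments for the LLM constructor (offline inference)."""
    return {"model": model, "enforce_eager": enforce_eager}


def make_llm(model):
    """Create an LLM in eager mode; requires vLLM to be installed."""
    from vllm import LLM  # imported here so the helpers above work without vLLM

    return LLM(**build_llm_kwargs(model))


if __name__ == "__main__":
    # "some-model" is a placeholder, not a model named in this document.
    print(" ".join(build_serve_command("some-model")))
    print(build_llm_kwargs("some-model"))
```

The command list can be passed to `subprocess.run` to start the server, and `make_llm` covers the offline case.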