[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
generative AI workloads on Amazon EC2 instances and UltraServers powered by AWS Inferentia and AWS Trainium (Inf1, Inf2, Trn1, Trn2,
and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully independent heterogeneous compute units called NeuronCores.
This section describes how to set up your environment to run vLLM on Neuron.
!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
## Requirements
- OS: Linux
- Python: 3.9 or newer
@@ -17,8 +16,7 @@
- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
- AWS Neuron SDK 2.23
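
The OS and Python requirements listed above can be checked before starting a source build. The following is a minimal sketch, not part of vLLM or the Neuron SDK: `check_requirements` is a hypothetical helper name, and it deliberately does not try to detect the accelerator or the installed Neuron SDK version.

```python
import platform
import sys

MIN_PYTHON = (3, 9)  # taken from the requirements list above


def check_requirements(system=None, version=None):
    """Return a list of human-readable problems; an empty list means OK.

    `system` and `version` default to the values of the running interpreter.
    They can be overridden, which makes the function easy to exercise.
    """
    system = platform.system() if system is None else system
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])

    problems = []
    if system != "Linux":
        problems.append(f"OS is {system}, but Linux is required")
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, but 3.9 or newer is required"
        )
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("requirement not met:", problem)
```

An empty result only means the OS and interpreter version match the list. The accelerator and SDK version still need to be confirmed separately.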
# --8<-- [end:requirements]
# --8<-- [start:configure-a-new-environment]
## Configure a new environment
### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies
@@ -27,6 +25,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
- Once inside your instance, activate the pre-installed virtual environment for inference by running
@@ -38,20 +37,15 @@ for alternative setup instructions including using Docker and manually installin
NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to those
available on vLLM V0. Use the AWS fork for the following features:
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to replace the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
### Build image from source
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
If you see the error `docker: Error response from daemon: Unknown runtime specified habana.`, refer to the "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure the `habana-container-runtime` package is installed and that the `habana` container runtime is registered.
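
One way to confirm that the `habana` runtime is registered is to inspect the runtimes reported by the Docker daemon. The sketch below assumes that `docker info --format '{{json .Runtimes}}'` prints a JSON object keyed by runtime name. The helper names `has_runtime` and `docker_runtimes_json` are hypothetical and are not part of Docker or the Gaudi software stack.

```python
import json
import subprocess


def has_runtime(runtimes_json, name="habana"):
    """Return True if `name` is a key in the JSON object of Docker runtimes."""
    # `or {}` covers the case where the daemon reports `null`.
    runtimes = json.loads(runtimes_json) or {}
    return name in runtimes


def docker_runtimes_json():
    """Ask the local Docker daemon for its registered runtimes as JSON text.

    Assumption: the `docker` CLI is on PATH and the daemon is reachable.
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example use on a machine with Docker installed:
#     print(has_runtime(docker_runtimes_json()))
```

If the check returns `False`, the runtime still needs to be registered as described in the linked installation guide.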
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
## Extra information
### Supported features
- [Offline inference][offline-inference]
- Online serving via [OpenAI-Compatible Server][openai-compatible-server]
@@ -130,14 +121,14 @@ docker run \
for accelerating low-batch latency and throughput
- Attention with Linear Biases (ALiBi)
### Unsupported features
- Beam search
- LoRA adapters
- Quantization
- Prefill chunking (mixed-batch inferencing)
### Supported configurations
The following configurations have been validated to function with
Gaudi2 devices. Configurations that are not listed may or may not work.
@@ -401,4 +392,3 @@ the below:
higher batches. You can do that by adding the `--enforce-eager` flag to
the server (for online serving), or by passing the `enforce_eager=True`
argument to the LLM constructor (for offline inference).
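
The two ways of enabling eager mode can be sketched as follows. This is a minimal illustration, assuming the `vllm serve` entry point for online serving and the `vllm.LLM` class for offline inference. The helper names and the model identifier `my-org/my-model` are hypothetical placeholders.

```python
def build_serve_command(model, enforce_eager=True):
    """Build the argv for online serving, optionally with --enforce-eager."""
    cmd = ["vllm", "serve", model]
    if enforce_eager:
        cmd.append("--enforce-eager")
    return cmd


def build_llm_kwargs(model, enforce_eager=True):
    """Build keyword arguments for the LLM constructor (offline inference)."""
    return {"model": model, "enforce_eager": enforce_eager}


def make_llm(model, enforce_eager=True):
    """Construct a vllm.LLM; requires vLLM and a supported device."""
    # Imported lazily so the helpers above can be used without vLLM installed.
    from vllm import LLM

    return LLM(**build_llm_kwargs(model, enforce_eager))


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model identifier.
    print(" ".join(build_serve_command("my-org/my-model")))
```

Only `make_llm` touches the device; the two builder functions just assemble arguments, so they can be inspected on any machine.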