.. _installation_tpu:

#####################
Installation with TPU
#####################

Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
are available in several versions, each with different hardware specifications.
For more information about TPUs, see `TPU System Architecture <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm>`_.
For more information on the TPU versions supported with vLLM, see:

* `TPU v6e <https://cloud.google.com/tpu/docs/v6e>`_
* `TPU v5e <https://cloud.google.com/tpu/docs/v5e>`_
* `TPU v5p <https://cloud.google.com/tpu/docs/v5p>`_
* `TPU v4 <https://cloud.google.com/tpu/docs/v4>`_

These TPU versions allow you to configure the physical arrangement (topology)
of the TPU chips, which can improve throughput and networking performance. For
more information, see:

* `TPU v6e topologies <https://cloud.google.com/tpu/docs/v6e#configurations>`_
* `TPU v5e topologies <https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config>`_
* `TPU v5p topologies <https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config>`_
* `TPU v4 topologies <https://cloud.google.com/tpu/docs/v4#tpu-v4-config>`_

To use Cloud TPUs, you need TPU quota granted to your Google Cloud Platform
(GCP) project. TPU quotas specify how many TPUs you can use in a GCP project
and are specified in terms of TPU version, the number of TPUs you want to use,
and quota type. For more information, see `TPU quota <https://cloud.google.com/tpu/docs/quota#tpu_quota>`_.

For TPU pricing information, see `Cloud TPU pricing <https://cloud.google.com/tpu/pricing>`_.

You may need additional persistent storage for your TPU VMs. For more
information, see `Storage options for Cloud TPU data <https://cloud.google.com/tpu/docs/storage-options>`_.

Requirements
------------

* Google Cloud TPU VM 
* TPU versions: v6e, v5e, v5p, v4
* Python: 3.10 or newer
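
To confirm the Python requirement before installing, a small helper along the
following lines can compare a version string against 3.10. This is a sketch for
Bash; the function name ``version_ok`` is hypothetical and not part of vLLM.

.. code-block:: bash

    # Succeed if the version string in $1 (for example "3.11.4") is 3.10 or newer.
    # Hypothetical helper for illustration; requires Bash (uses a here-string).
    version_ok() {
        local major minor
        # Split "MAJOR.MINOR.PATCH" on dots; the patch level is ignored.
        IFS=. read -r major minor _ <<< "$1"
        (( major > 3 || (major == 3 && minor >= 10) ))
    }

One way to use it is
``version_ok "$(python3 -c 'import platform; print(platform.python_version())')" && echo "Python is new enough"``.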

Provision Cloud TPUs
====================

You can provision Cloud TPUs using the `Cloud TPU API <https://cloud.google.com/tpu/docs/reference/rest>`_
or the `queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_
API. This section shows how to create TPUs using the queued resources API. For
more information about using the Cloud TPU API, see `Create a Cloud TPU using the Create Node API <https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api>`_.

Queued resources let you place a request for Cloud TPU resources in a queue
maintained by the Cloud TPU service. When the requested resource becomes
available, it's assigned to your Google Cloud project for your immediate,
exclusive use.

.. note::
   In all of the following commands, replace the ALL CAPS parameter names with 
   appropriate values. See the parameter descriptions table for more information.

Provision a Cloud TPU with the queued resource API
--------------------------------------------------
Create a TPU v5e with 4 TPU chips:

.. code-block:: console

    gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
    --node-id TPU_NAME \
    --project PROJECT_ID \
    --zone ZONE \
    --accelerator-type ACCELERATOR_TYPE \
    --runtime-version RUNTIME_VERSION \
    --service-account SERVICE_ACCOUNT

.. list-table:: Parameter descriptions
    :header-rows: 1

    * - Parameter name
      - Description
    * - QUEUED_RESOURCE_ID
      - The user-assigned ID of the queued resource request.
    * - TPU_NAME
      - The user-assigned name of the TPU which is created when the queued 
        resource request is allocated.
    * - PROJECT_ID
      - The ID of your Google Cloud project.
    * - ZONE
      - The GCP zone where you want to create your Cloud TPU. The value you use
        depends on the version of TPUs you are using. For more information, see
        `TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_.
    * - ACCELERATOR_TYPE
      - The TPU version and size you want to use. For example,
        ``v5litepod-4`` specifies a v5e TPU with 4 cores. For more information,
        see `TPU versions <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
    * - RUNTIME_VERSION
      - The TPU VM runtime version to use. For more information, see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
    * - SERVICE_ACCOUNT
      - The email address for your service account. You can find it in the IAM
        Cloud Console under *Service Accounts*. For example:
        ``tpu-service-account@<your_project_ID>.iam.gserviceaccount.com``
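
Because the request waits in a queue, the TPU may not exist yet when the create
command returns. The following sketch polls the request until it becomes
active. The function name ``wait_for_queued_resource`` is hypothetical, and the
``state.state`` field path and the ``FAILED`` and ``SUSPENDED`` state names are
assumptions based on the queued resources documentation; check them against the
output of ``describe`` in your environment.

.. code-block:: bash

    # Poll a queued resource request until it reports the ACTIVE state.
    # Arguments: QUEUED_RESOURCE_ID PROJECT_ID ZONE [MAX_ATTEMPTS] [SLEEP_SECONDS]
    # Hypothetical helper for illustration; not part of gcloud or vLLM.
    wait_for_queued_resource() {
        local id="$1" project="$2" zone="$3"
        local max_attempts="${4:-60}" delay="${5:-30}"
        local attempt state
        for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
            # Assumption: the state is exposed as the nested field state.state.
            state="$(gcloud alpha compute tpus queued-resources describe "$id" \
                --project "$project" --zone "$zone" \
                --format='value(state.state)')"
            echo "attempt ${attempt}: state=${state}"
            case "$state" in
                ACTIVE) return 0 ;;
                FAILED|SUSPENDED) return 1 ;;
            esac
            sleep "$delay"
        done
        # Gave up before the request became active.
        return 1
    }

For example, ``wait_for_queued_resource QUEUED_RESOURCE_ID PROJECT_ID ZONE``
checks every 30 seconds, up to 60 times, before giving up.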

Connect to your TPU using SSH:

.. code-block:: bash

    gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE

Install Miniconda:

.. code-block:: bash

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    source ~/.bashrc

Create and activate a Conda environment for vLLM:

.. code-block:: bash

    conda create -n vllm python=3.10 -y
    conda activate vllm

Clone the vLLM repository and go to the vLLM directory:

.. code-block:: bash

    git clone https://github.com/vllm-project/vllm.git && cd vllm

Uninstall the existing ``torch`` and ``torch_xla`` packages:

.. code-block:: bash

    pip uninstall torch torch-xla -y

Install build dependencies:

.. code-block:: bash

    pip install -r requirements-tpu.txt
    sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev 

Run the setup script:

.. code-block:: bash

    VLLM_TARGET_DEVICE="tpu" python setup.py develop
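
After the setup script finishes, one way to sanity-check the environment is to
confirm that the key packages import cleanly. The helper below is a sketch; the
name ``check_imports`` is hypothetical, and it assumes ``python`` resolves to
the interpreter of the active Conda environment.

.. code-block:: bash

    # Report whether each named module can be imported by the active interpreter.
    # Hypothetical helper for illustration; not part of vLLM.
    check_imports() {
        local module status=0
        for module in "$@"; do
            if python -c "import ${module}" 2>/dev/null; then
                echo "${module}: ok"
            else
                echo "${module}: missing"
                status=1
            fi
        done
        # Non-zero if at least one module failed to import.
        return "$status"
    }

Calling ``check_imports torch torch_xla vllm`` prints one line per module and
returns a non-zero status if any import fails.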


Provision Cloud TPUs with GKE 
-----------------------------

For more information about using TPUs with GKE, see:

* https://cloud.google.com/kubernetes-engine/docs/how-to/tpus
* https://cloud.google.com/kubernetes-engine/docs/concepts/tpus
* https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus

.. _build_docker_tpu:

Build a Docker image with :code:`Dockerfile.tpu`
------------------------------------------------

You can use `Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_ 
to build a Docker image with TPU support.

.. code-block:: console

    $ docker build -f Dockerfile.tpu -t vllm-tpu .

Run the Docker image with the following command:

.. code-block:: console

    $ # Make sure to add `--privileged --net host --shm-size=16G`.
    $ docker run --privileged --net host --shm-size=16G -it vllm-tpu

.. note::

    Because TPUs rely on XLA, which requires static shapes, vLLM bucketizes the
    possible input shapes and compiles an XLA graph for each shape. Compilation
    may take 20 to 30 minutes on the first run. On later runs it drops to about
    5 minutes because the XLA graphs are cached on disk (in the directory set by
    :code:`VLLM_XLA_CACHE_PATH`, or :code:`~/.cache/vllm/xla_cache` by default).
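
Since the faster startup depends on the cache directory surviving between
runs, it can help to set the location explicitly. The sketch below uses the
default path named in the note above; any other writable directory would work
the same way.

.. code-block:: bash

    # Assumption: reuse the default cache location named in the note above.
    export VLLM_XLA_CACHE_PATH="${HOME}/.cache/vllm/xla_cache"

    # Create the directory up front so later runs can write to it.
    mkdir -p "${VLLM_XLA_CACHE_PATH}"

    # Count the cached files currently stored there.
    find "${VLLM_XLA_CACHE_PATH}" -type f | wc -l

When vLLM runs inside the Docker container, the cache is written inside the
container's filesystem. To keep it across containers, one option is to
bind-mount a host directory with the standard ``docker run -v`` flag and pass
the variable with ``-e``; the exact container-side path is an assumption to
verify for your image.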

.. tip::

    If you encounter the following error:

    .. code-block:: console

        from torch._C import *  # noqa: F403
        ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory


    Install OpenBLAS with the following command:

    .. code-block:: console

        $ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
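
To check whether the dynamic linker can already see the library named in the
error above, a helper like the following can be used. It is a sketch that
assumes a glibc-based Linux system where ``ldconfig -p`` lists the cached
shared libraries; the function name ``has_shared_lib`` is hypothetical.

.. code-block:: bash

    # Succeed if the dynamic linker cache lists a library whose entry contains $1.
    # Hypothetical helper for illustration; assumes `ldconfig -p` is available.
    has_shared_lib() {
        ldconfig -p | grep -Fq "$1"
    }

For example, ``has_shared_lib libopenblas.so.0 || echo "OpenBLAS is not installed"``
prints a message only when the library is missing.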