You can provision Cloud TPUs using the `Cloud TPU API <https://cloud.google.com/tpu/docs/reference/rest>`_`
or the `queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_`
API. This section shows how to create TPUs using the queued resource API.
For more information about using the Cloud TPU API, see `Create a Cloud TPU using the Create Node API <https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api>`_.
enable you to request Cloud TPU resources in a queued manner. When you request
queued resources, the request is added to a queue maintained by the Cloud TPU
service. When the requested resource becomes available, it's assigned to your
Google Cloud project for your immediate exclusive use.
You can provision Cloud TPUs using the `Cloud TPU API <https://cloud.google.com/tpu/docs/reference/rest>`_
or the `queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_
API. This section shows how to create TPUs using the queued resource API. For
more information about using the Cloud TPU API, see `Create a Cloud TPU using the Create Node API <https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api>`_.
Queued resources enable you to request Cloud TPU resources in a queued manner.
When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use.
.. note::
In all of the following commands, replace the ALL CAPS parameter names with
appropriate values. See the parameter descriptions table for more information.
Provision a Cloud TPU with the queued resource API
@@ -162,9 +175,11 @@ Run the Docker image with the following command:
.. note::
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the possible input shapes and compiles an XLA graph for each different shape.
The compilation time may take 20~30 minutes in the first run.
However, the compilation time reduces to ~5 minutes afterwards because the XLA graphs are cached in the disk (in :code:`VLLM_XLA_CACHE_PATH` or :code:`~/.cache/vllm/xla_cache` by default).
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
possible input shapes and compiles an XLA graph for each shape. The
compilation time may take 20~30 minutes in the first run. However, the
compilation time reduces to ~5 minutes afterwards because the XLA graphs are
cached in the disk (in :code:`VLLM_XLA_CACHE_PATH` or :code:`~/.cache/vllm/xla_cache` by default).
.. tip::
...
...
@@ -173,7 +188,8 @@ Run the Docker image with the following command:
.. code-block:: console
from torch._C import * # noqa: F403
ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory
ImportError: libopenblas.so.0: cannot open shared object file: No such