* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

...

* Streaming outputs
* OpenAI-compatible API server

For more information, check out the following:
* `vLLM announcing blog post <https://vllm.ai>`_ (intro to PagedAttention)
* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.
vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with `Ray <https://github.com/ray-project/ray>`_. To run distributed inference, install Ray with:
.. code-block:: console
$ pip install ray
To run multi-GPU inference with the :code:`LLM` class, set the :code:`tensor_parallel_size` argument to the number of GPUs you want to use. For example, the sketch below runs inference on 4 GPUs (the model name and prompt are placeholders):
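.. code-block:: python

    from vllm import LLM

    # Shard the (placeholder) model across 4 GPUs on this machine.
    llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
    # Generate a completion for a sample prompt.
    output = llm.generate("San Francisco is a")
    print(output)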
To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:
.. code-block:: console
$ python -m vllm.entrypoints.api_server \
$ --model facebook/opt-13b \
$ --tensor-parallel-size 4
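As a quick sanity check, you can then send the server a request. The sketch below assumes the default host and port (:code:`localhost:8000`) and the :code:`/generate` endpoint of :code:`vllm.entrypoints.api_server`; adjust it to match your deployment:

.. code-block:: console

    $ # Placeholder prompt; the JSON fields map to vLLM's sampling parameters.
    $ curl http://localhost:8000/generate \
    $     -d '{"prompt": "San Francisco is a", "max_tokens": 16}'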
To scale vLLM beyond a single machine, start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
.. code-block:: console
$ # On head node
$ ray start --head
$ # On worker nodes
$ ray start --address=<ray-head-address>
After that, you can run inference and serving across multiple machines by launching the vLLM process on the head node and setting :code:`tensor_parallel_size` to the total number of GPUs across all machines.
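For example, a sketch for a cluster of two nodes with 4 GPUs each (8 GPUs in total); the model name is only a placeholder:

.. code-block:: console

    $ # On the head node, after `ray start` has joined all worker nodes.
    $ python -m vllm.entrypoints.api_server \
    $     --model facebook/opt-13b \
    $     --tensor-parallel-size 8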