vLLM is a fast and easy-to-use library for LLM inference and serving.
To build and install vLLM from source:

```bash
pip install -r requirements.txt
pip install -e .  # This may take several minutes.
```
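As a quick sanity check, you can try importing the package. This is only a sketch: it assumes the editable install exposes an importable package named `vllm`, which is not stated above.

```bash
# Hypothetical check that the build is importable; the package name is an assumption.
python -c "import vllm"
```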
## Latest News 🔥

- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post]().

## Test simple server
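A minimal sketch of launching the test server (this assumes `simple_server.py` sits at the repository root, which is not stated here):

```bash
# Start the demo server with its default settings (hypothetical invocation).
python simple_server.py
```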
The detailed arguments for `simple_server.py` can be found by:
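For example, assuming the script is run from the repository root:

```bash
# Print the full list of supported command-line arguments.
python simple_server.py --help
```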
Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
*Figure: Serving throughput when each request asks for 3 output completions.*
Since the LLaMA weights are not fully public, they cannot be downloaded directly from Hugging Face. You therefore need to follow the process below to load the LLaMA weights.
1. Convert the LLaMA weights to the Hugging Face format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py), for example as sketched below.
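A hypothetical invocation (the paths are placeholders, and the flags should be double-checked against the script's own `--help`):

```bash
# Convert raw LLaMA checkpoints into the Hugging Face format (example paths only).
python convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights \
    --model_size 7B \
    --output_dir /path/to/llama-7b-hf
```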
## Contributing
We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.