Unverified Commit 26c52a5e authored by Woosuk Kwon, committed by GitHub

[Docs] Add CUDA graph support to docs (#2148)

parent c3372e87
@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels
@@ -45,7 +46,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA GPUs and AMD GPUs.
+- Support NVIDIA GPUs and AMD GPUs
 vLLM seamlessly supports many Hugging Face models, including the following architectures:
...
@@ -30,6 +30,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
+* Fast model execution with CUDA/HIP graph
 * Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
 * Optimized CUDA kernels
@@ -40,7 +41,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs and AMD GPUs.
+* Support NVIDIA GPUs and AMD GPUs
 For more information, check out the following:
...
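The bullet added in this commit advertises fast model execution with CUDA/HIP graph. As a point of reference (not part of the diff itself), here is a minimal sketch of how graph capture is typically toggled through vLLM's offline Python API, assuming the `enforce_eager` flag on the `LLM` constructor: by default vLLM captures CUDA/HIP graphs for decoding, and setting `enforce_eager=True` falls back to eager PyTorch execution.

```python
# Minimal sketch, assuming vLLM's offline API and the `enforce_eager` flag.
from vllm import LLM, SamplingParams

# Default: CUDA/HIP graph capture enabled for faster model execution.
llm = LLM(model="facebook/opt-125m")

# Opt out of graph capture (e.g., for debugging or to reduce GPU memory use):
# llm = LLM(model="facebook/opt-125m", enforce_eager=True)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```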