Unverified Commit 26c52a5e authored by Woosuk Kwon, committed by GitHub

[Docs] Add CUDA graph support to docs (#2148)

parent c3372e87
@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels
@@ -45,7 +46,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA GPUs and AMD GPUs.
+- Support NVIDIA GPUs and AMD GPUs
 vLLM seamlessly supports many Hugging Face models, including the following architectures:
...
@@ -30,6 +30,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
+* Fast model execution with CUDA/HIP graph
 * Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
 * Optimized CUDA kernels
@@ -40,7 +41,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs and AMD GPUs.
+* Support NVIDIA GPUs and AMD GPUs
 For more information, check out the following:
...
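The bullet added in this commit advertises fast model execution with CUDA/HIP graph. As a point of reference (not part of the diff itself), here is a minimal sketch of how graph capture is typically toggled through vLLM's offline Python API, assuming the `enforce_eager` flag on the `LLM` constructor: by default vLLM captures CUDA/HIP graphs for decoding, and setting `enforce_eager=True` falls back to eager PyTorch execution.

```python
# Minimal sketch, assuming vLLM's offline API and the `enforce_eager` flag.
from vllm import LLM, SamplingParams

# Default: CUDA/HIP graph capture enabled for faster model execution.
llm = LLM(model="facebook/opt-125m")

# Opt out of graph capture (e.g., for debugging or to reduce GPU memory use):
# llm = LLM(model="facebook/opt-125m", enforce_eager=True)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```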