Welcome to vLLM!
================

**vLLM** is a fast and easy-to-use library for LLM inference and serving.
Its core features include:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Seamless integration with popular HuggingFace models
- Dynamic batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server

For more information, please refer to our `blog post <>`_.
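
The example below is a minimal sketch of offline batched inference with the
``LLM`` class; the model name ``facebook/opt-125m`` and the sampling settings
are illustrative placeholders, and the Quickstart guide linked below is the
authoritative walkthrough.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # A batch of prompts to complete in a single call.
   prompts = [
       "Hello, my name is",
       "The capital of France is",
   ]

   # Nucleus sampling settings (placeholders); SamplingParams also exposes
   # decoding algorithms such as parallel sampling and beam search.
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

   # Load a HuggingFace model by name (placeholder model here).
   llm = LLM(model="facebook/opt-125m")

   # Generate completions for the whole batch of prompts at once.
   outputs = llm.generate(prompts, sampling_params)
   for output in outputs:
       print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")

The same engine backs the OpenAI-compatible API server listed above; the
Quickstart guide covers serving as well as offline inference.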

Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/quickstart

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model