norm/vllm, commit b81a6a6b (unverified)
Authored by Woosuk Kwon on Dec 15, 2023; committed by GitHub on Dec 15, 2023

[Docs] Add supported quantization methods to docs (#2135)
Parent: 0fbfc4b8

Showing 2 changed files with 4 additions and 2 deletions:

README.md: +2 −1
docs/source/index.rst: +2 −1
README.md

@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels
 vLLM is flexible and easy to use with:

@@ -44,7 +45,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA CUDA and AMD ROCm.
+- Support NVIDIA GPUs and AMD GPUs.
 vLLM seamlessly supports many Hugging Face models, including the following architectures:
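The new README bullet only names the supported quantization methods (GPTQ, AWQ, SqueezeLLM) without showing usage. Below is a minimal sketch of how a quantized checkpoint might be run through vLLM's offline Python API; it is not part of this commit, the model name is a placeholder, and `quantization="awq"` is simply one of the methods listed above.

```python
# Sketch: offline inference with a quantized model in vLLM.
# Assumptions: vLLM is installed, a GPU is available, and the model
# below is a placeholder for any AWQ-quantized checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder example checkpoint
    quantization="awq",                    # could also be "gptq" or "squeezellm"
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

The same pattern should apply to GPTQ or SqueezeLLM checkpoints by swapping the model and the `quantization` argument.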
docs/source/index.rst

@@ -30,6 +30,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
 * Optimized CUDA kernels
 vLLM is flexible and easy to use with:

@@ -39,7 +40,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA CUDA and AMD ROCm.
+* Support NVIDIA GPUs and AMD GPUs.
 For more information, check out the following:
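Both files also advertise the OpenAI-compatible API server. As a hedged sketch that is likewise not part of this commit: assuming a vLLM server has already been started locally (for example via the `vllm.entrypoints.openai.api_server` module, listening on port 8000) and is serving the same placeholder model, it can be queried with the standard `openai` Python client.

```python
# Sketch: querying a locally running vLLM OpenAI-compatible server.
# Assumptions: the server is already running at http://localhost:8000
# and was started with the same placeholder model name used below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # no real key is needed unless auth is configured
)

completion = client.completions.create(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder model name
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```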