Commit ee77fdb5, authored by Cyrus Leung, committed by GitHub

[Doc][2/N] Reorganize Models and Usage sections (#11755)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent 996357e4
(quantization-index)=
# Quantization
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
```{toctree}
:caption: Contents
:maxdepth: 1
supported_hardware
auto_awq
bnb
gguf
int8
fp8
fp8_e5m2_kvcache
fp8_e4m3_kvcache
```
(supported-hardware-for-quantization)= (quantization-supported-hardware)=
# Supported Hardware for Quantization Kernels # Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM: The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
...@@ -120,12 +120,12 @@ The table below shows the compatibility of various quantization implementations ...@@ -120,12 +120,12 @@ The table below shows the compatibility of various quantization implementations
- ✗ - ✗
``` ```
## Notes:
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0. - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware. - "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware. - "✗" indicates that the quantization method is not supported on the specified hardware.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods. ```{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team. For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
```
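
The SM-to-architecture mapping listed in the notes of this hunk can be written as a small lookup table. The following is only an illustrative sketch and is not part of this commit or of vLLM: the names `SM_TO_ARCH` and `arch_name` are hypothetical, and the table contains only the compute capabilities named in the notes.

```python
from typing import Optional

# Compute capability (major, minor) -> architecture name, copied from the notes.
# Hypothetical helper for illustration; vLLM does not expose this table.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}


def arch_name(major: int, minor: int) -> Optional[str]:
    """Return the architecture name for a compute capability, or None if unlisted."""
    return SM_TO_ARCH.get((major, minor))


if __name__ == "__main__":
    print(arch_name(8, 9))  # Ada
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be unpacked into `arch_name`; the result still has to be checked against the compatibility table itself, since the table, not the architecture name, determines which quantization methods are supported.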
...@@ -79,6 +79,9 @@ serving/metrics ...@@ -79,6 +79,9 @@ serving/metrics
serving/integrations serving/integrations
serving/tensorizer serving/tensorizer
serving/runai_model_streamer serving/runai_model_streamer
serving/engine_args
serving/env_vars
serving/usage_stats
``` ```
```{toctree} ```{toctree}
...@@ -88,53 +91,28 @@ serving/runai_model_streamer ...@@ -88,53 +91,28 @@ serving/runai_model_streamer
models/supported_models models/supported_models
models/generative_models models/generative_models
models/pooling_models models/pooling_models
models/adding_model
models/enabling_multimodal_inputs
``` ```
```{toctree} ```{toctree}
:caption: Usage :caption: Features
:maxdepth: 1 :maxdepth: 1
usage/lora features/quantization/index
usage/multimodal_inputs features/lora
usage/tool_calling features/multimodal_inputs
usage/structured_outputs features/tool_calling
usage/spec_decode features/structured_outputs
usage/compatibility_matrix features/automatic_prefix_caching
usage/performance features/disagg_prefill
usage/engine_args features/spec_decode
usage/env_vars features/compatibility_matrix
usage/usage_stats
usage/disagg_prefill
```
```{toctree}
:caption: Quantization
:maxdepth: 1
quantization/supported_hardware
quantization/auto_awq
quantization/bnb
quantization/gguf
quantization/int8
quantization/fp8
quantization/fp8_e5m2_kvcache
quantization/fp8_e4m3_kvcache
```
```{toctree}
:caption: Automatic Prefix Caching
:maxdepth: 1
automatic_prefix_caching/apc
automatic_prefix_caching/details
``` ```
```{toctree} ```{toctree}
:caption: Performance :caption: Performance
:maxdepth: 1 :maxdepth: 1
performance/optimization
performance/benchmarks performance/benchmarks
``` ```
...@@ -148,10 +126,8 @@ community/meetups ...@@ -148,10 +126,8 @@ community/meetups
community/sponsors community/sponsors
``` ```
% API Documentation: API reference aimed at vllm library usage
```{toctree} ```{toctree}
:caption: API Documentation :caption: API Reference
:maxdepth: 2 :maxdepth: 2
dev/sampling_params dev/sampling_params
...@@ -160,30 +136,32 @@ dev/offline_inference/offline_index ...@@ -160,30 +136,32 @@ dev/offline_inference/offline_index
dev/engine/engine_index dev/engine/engine_index
``` ```
% Design: docs about vLLM internals % Design Documents: Details about vLLM internals
```{toctree} ```{toctree}
:caption: Design :caption: Design Documents
:maxdepth: 2 :maxdepth: 2
design/arch_overview design/arch_overview
design/huggingface_integration design/huggingface_integration
design/plugin_system design/plugin_system
design/input_processing/model_inputs_index
design/kernel/paged_attention design/kernel/paged_attention
design/input_processing/model_inputs_index
design/multimodal/multimodal_index design/multimodal/multimodal_index
design/automatic_prefix_caching
design/multiprocessing design/multiprocessing
``` ```
% For Developers: contributing to the vLLM project % Developer Guide: How to contribute to the vLLM project
```{toctree} ```{toctree}
:caption: For Developers :caption: Developer Guide
:maxdepth: 2 :maxdepth: 2
contributing/overview contributing/overview
contributing/profiling/profiling_index contributing/profiling/profiling_index
contributing/dockerfile/dockerfile contributing/dockerfile/dockerfile
contributing/model/index
``` ```
# Indices and tables # Indices and tables
......
...@@ -37,7 +37,7 @@ print(output) ...@@ -37,7 +37,7 @@ print(output)
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported. If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
```` ````
Otherwise, please refer to [Adding a New Model](#adding-a-new-model) and [Enabling Multimodal Inputs](#enabling-multimodal-inputs) for instructions on how to implement your model in vLLM. Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support. Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
### ModelScope ### ModelScope
......
(performance)= (optimization-and-tuning)=
# Performance and Tuning # Optimization and Tuning
## Preemption ## Preemption
......
...@@ -217,7 +217,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai ...@@ -217,7 +217,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters; [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information. see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
- *Note: `image_url.detail` parameter is not supported.* - *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/openai_chat_completion_client.py> Code example: <gh-file:examples/openai_chat_completion_client.py>
......
...@@ -430,7 +430,7 @@ class ROCmFlashAttentionImpl(AttentionImpl): ...@@ -430,7 +430,7 @@ class ROCmFlashAttentionImpl(AttentionImpl):
Returns: Returns:
shape = [num_tokens, num_heads * head_size] shape = [num_tokens, num_heads * head_size]
""" """
# Reminder: Please update docs/source/usage/compatibility_matrix.md # Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid # If the feature combo become valid
if attn_type != AttentionType.DECODER: if attn_type != AttentionType.DECODER:
raise NotImplementedError("Encoder self-attention and " raise NotImplementedError("Encoder self-attention and "
......
...@@ -644,7 +644,7 @@ class ModelConfig: ...@@ -644,7 +644,7 @@ class ModelConfig:
self.use_async_output_proc = False self.use_async_output_proc = False
return return
# Reminder: Please update docs/source/usage/compatibility_matrix.md # Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid # If the feature combo become valid
from vllm.platforms import current_platform from vllm.platforms import current_platform
if not current_platform.is_async_output_supported(self.enforce_eager): if not current_platform.is_async_output_supported(self.enforce_eager):
...@@ -665,7 +665,7 @@ class ModelConfig: ...@@ -665,7 +665,7 @@ class ModelConfig:
if self.runner_type == "pooling": if self.runner_type == "pooling":
self.use_async_output_proc = False self.use_async_output_proc = False
# Reminder: Please update docs/source/usage/compatibility_matrix.md # Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid # If the feature combo become valid
if speculative_config: if speculative_config:
logger.warning("Async output processing is not supported with" logger.warning("Async output processing is not supported with"
...@@ -2064,7 +2064,7 @@ class LoRAConfig: ...@@ -2064,7 +2064,7 @@ class LoRAConfig:
model_config.quantization) model_config.quantization)
def verify_with_scheduler_config(self, scheduler_config: SchedulerConfig): def verify_with_scheduler_config(self, scheduler_config: SchedulerConfig):
# Reminder: Please update docs/source/usage/compatibility_matrix.md # Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid # If the feature combo become valid
if scheduler_config.chunked_prefill_enabled: if scheduler_config.chunked_prefill_enabled:
logger.warning("LoRA with chunked prefill is still experimental " logger.warning("LoRA with chunked prefill is still experimental "
......
...@@ -1148,7 +1148,7 @@ class EngineArgs: ...@@ -1148,7 +1148,7 @@ class EngineArgs:
disable_logprobs=self.disable_logprobs_during_spec_decoding, disable_logprobs=self.disable_logprobs_during_spec_decoding,
) )
# Reminder: Please update docs/source/usage/compatibility_matrix.md # Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid # If the feature combo become valid
if self.num_scheduler_steps > 1: if self.num_scheduler_steps > 1:
if speculative_config is not None: if speculative_config is not None:
......
...@@ -65,7 +65,7 @@ class MultiStepOutputProcessor(SequenceGroupOutputProcessor): ...@@ -65,7 +65,7 @@ class MultiStepOutputProcessor(SequenceGroupOutputProcessor):
@staticmethod @staticmethod
@functools.lru_cache @functools.lru_cache
def _log_prompt_logprob_unsupported_warning_once(): def _log_prompt_logprob_unsupported_warning_once():
# Reminder: Please update docs/source/usage/compatibility_matrix.md # Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid # If the feature combo become valid
logger.warning( logger.warning(
"Prompt logprob is not supported by multi step workers. " "Prompt logprob is not supported by multi step workers. "
......
...@@ -22,7 +22,7 @@ class CPUExecutor(ExecutorBase): ...@@ -22,7 +22,7 @@ class CPUExecutor(ExecutorBase):
def _init_executor(self) -> None: def _init_executor(self) -> None:
assert self.device_config.device_type == "cpu" assert self.device_config.device_type == "cpu"
# Reminder: Please update docs/source/usage/compatibility_matrix.md # Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid # If the feature combo become valid
assert self.lora_config is None, "cpu backend doesn't support LoRA" assert self.lora_config is None, "cpu backend doesn't support LoRA"
......
...@@ -50,7 +50,7 @@ class CpuPlatform(Platform): ...@@ -50,7 +50,7 @@ class CpuPlatform(Platform):
import vllm.envs as envs import vllm.envs as envs
from vllm.utils import GiB_bytes from vllm.utils import GiB_bytes
model_config = vllm_config.model_config model_config = vllm_config.model_config
# Reminder: Please update docs/source/usage/compatibility_matrix.md # Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid # If the feature combo become valid
if not model_config.enforce_eager: if not model_config.enforce_eager:
logger.warning( logger.warning(
......