Unverified Commit ee77fdb5 authored by Cyrus Leung's avatar Cyrus Leung Committed by GitHub
Browse files

[Doc][2/N] Reorganize Models and Usage sections (#11755)


Signed-off-by: default avatarDarkLight1337 <tlleungac@connect.ust.hk>
parent 996357e4
(quantization-index)=
# Quantization
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
```{toctree}
:caption: Contents
:maxdepth: 1
supported_hardware
auto_awq
bnb
gguf
int8
fp8
fp8_e5m2_kvcache
fp8_e4m3_kvcache
```
(supported-hardware-for-quantization)=
(quantization-supported-hardware)=
# Supported Hardware for Quantization Kernels
# Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
......@@ -120,12 +120,12 @@ The table below shows the compatibility of various quantization implementations
- ✗
```
## Notes:
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
```{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
```
......@@ -79,6 +79,9 @@ serving/metrics
serving/integrations
serving/tensorizer
serving/runai_model_streamer
serving/engine_args
serving/env_vars
serving/usage_stats
```
```{toctree}
......@@ -88,53 +91,28 @@ serving/runai_model_streamer
models/supported_models
models/generative_models
models/pooling_models
models/adding_model
models/enabling_multimodal_inputs
```
```{toctree}
:caption: Usage
:caption: Features
:maxdepth: 1
usage/lora
usage/multimodal_inputs
usage/tool_calling
usage/structured_outputs
usage/spec_decode
usage/compatibility_matrix
usage/performance
usage/engine_args
usage/env_vars
usage/usage_stats
usage/disagg_prefill
```
```{toctree}
:caption: Quantization
:maxdepth: 1
quantization/supported_hardware
quantization/auto_awq
quantization/bnb
quantization/gguf
quantization/int8
quantization/fp8
quantization/fp8_e5m2_kvcache
quantization/fp8_e4m3_kvcache
```
```{toctree}
:caption: Automatic Prefix Caching
:maxdepth: 1
automatic_prefix_caching/apc
automatic_prefix_caching/details
features/quantization/index
features/lora
features/multimodal_inputs
features/tool_calling
features/structured_outputs
features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix
```
```{toctree}
:caption: Performance
:maxdepth: 1
performance/optimization
performance/benchmarks
```
......@@ -148,10 +126,8 @@ community/meetups
community/sponsors
```
% API Documentation: API reference aimed at vllm library usage
```{toctree}
:caption: API Documentation
:caption: API Reference
:maxdepth: 2
dev/sampling_params
......@@ -160,30 +136,32 @@ dev/offline_inference/offline_index
dev/engine/engine_index
```
% Design: docs about vLLM internals
% Design Documents: Details about vLLM internals
```{toctree}
:caption: Design
:caption: Design Documents
:maxdepth: 2
design/arch_overview
design/huggingface_integration
design/plugin_system
design/input_processing/model_inputs_index
design/kernel/paged_attention
design/input_processing/model_inputs_index
design/multimodal/multimodal_index
design/automatic_prefix_caching
design/multiprocessing
```
% For Developers: contributing to the vLLM project
% Developer Guide: How to contribute to the vLLM project
```{toctree}
:caption: For Developers
:caption: Developer Guide
:maxdepth: 2
contributing/overview
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
contributing/model/index
```
# Indices and tables
......
......@@ -37,7 +37,7 @@ print(output)
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
````
Otherwise, please refer to [Adding a New Model](#adding-a-new-model) and [Enabling Multimodal Inputs](#enabling-multimodal-inputs) for instructions on how to implement your model in vLLM.
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
### ModelScope
......
(performance)=
(optimization-and-tuning)=
# Performance and Tuning
# Optimization and Tuning
## Preemption
......
......@@ -217,7 +217,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
- *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/openai_chat_completion_client.py>
......
......@@ -430,7 +430,7 @@ class ROCmFlashAttentionImpl(AttentionImpl):
Returns:
shape = [num_tokens, num_heads * head_size]
"""
# Reminder: Please update docs/source/usage/compatibility_matrix.md
# Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid
if attn_type != AttentionType.DECODER:
raise NotImplementedError("Encoder self-attention and "
......
......@@ -644,7 +644,7 @@ class ModelConfig:
self.use_async_output_proc = False
return
# Reminder: Please update docs/source/usage/compatibility_matrix.md
# Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid
from vllm.platforms import current_platform
if not current_platform.is_async_output_supported(self.enforce_eager):
......@@ -665,7 +665,7 @@ class ModelConfig:
if self.runner_type == "pooling":
self.use_async_output_proc = False
# Reminder: Please update docs/source/usage/compatibility_matrix.md
# Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid
if speculative_config:
logger.warning("Async output processing is not supported with"
......@@ -2064,7 +2064,7 @@ class LoRAConfig:
model_config.quantization)
def verify_with_scheduler_config(self, scheduler_config: SchedulerConfig):
# Reminder: Please update docs/source/usage/compatibility_matrix.md
# Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid
if scheduler_config.chunked_prefill_enabled:
logger.warning("LoRA with chunked prefill is still experimental "
......
......@@ -1148,7 +1148,7 @@ class EngineArgs:
disable_logprobs=self.disable_logprobs_during_spec_decoding,
)
# Reminder: Please update docs/source/usage/compatibility_matrix.md
# Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid
if self.num_scheduler_steps > 1:
if speculative_config is not None:
......
......@@ -65,7 +65,7 @@ class MultiStepOutputProcessor(SequenceGroupOutputProcessor):
@staticmethod
@functools.lru_cache
def _log_prompt_logprob_unsupported_warning_once():
# Reminder: Please update docs/source/usage/compatibility_matrix.md
# Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid
logger.warning(
"Prompt logprob is not supported by multi step workers. "
......
......@@ -22,7 +22,7 @@ class CPUExecutor(ExecutorBase):
def _init_executor(self) -> None:
assert self.device_config.device_type == "cpu"
# Reminder: Please update docs/source/usage/compatibility_matrix.md
# Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid
assert self.lora_config is None, "cpu backend doesn't support LoRA"
......
......@@ -50,7 +50,7 @@ class CpuPlatform(Platform):
import vllm.envs as envs
from vllm.utils import GiB_bytes
model_config = vllm_config.model_config
# Reminder: Please update docs/source/usage/compatibility_matrix.md
# Reminder: Please update docs/source/features/compatibility_matrix.md
# If the feature combo become valid
if not model_config.enforce_eager:
logger.warning(
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment