[Doc][2/N] Reorganize Models and Usage sections (#11755)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Unverified commit ee77fdb5, authored Jan 06, 2025 by Cyrus Leung; committed by GitHub, Jan 06, 2025
20 changed files
--- a/docs/source/quantization/gguf.md
+++ b/docs/source/quantization/gguf.md
--- a/docs/source/features/quantization/index.md
+++ b/docs/source/features/quantization/index.md
+(quantization-index)=
+
+# Quantization
+
+Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
+
+```{toctree}
+:caption: Contents
+:maxdepth: 1
+
+supported_hardware
+auto_awq
+bnb
+gguf
+int8
+fp8
+fp8_e5m2_kvcache
+fp8_e4m3_kvcache
+```
--- a/docs/source/quantization/int8.md
+++ b/docs/source/quantization/int8.md
--- a/docs/source/quantization/supported_hardware.md
+++ b/docs/source/quantization/supported_hardware.md
-(supported-hardware-for-quantization)=
+(quantization-supported-hardware)=

-# Supported Hardware for Quantization Kernels
+# Supported Hardware

 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

@@ -120,12 +120,12 @@ The table below shows the compatibility of various quantization implementations
  - ✗
 ```

-## Notes:
-
 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
 - "✅︎" indicates that the quantization method is supported on the specified hardware.
 - "✗" indicates that the quantization method is not supported on the specified hardware.

-Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
+```{note}
+This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.

 For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
+```
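The architecture-to-SM mapping listed in the notes of `supported_hardware.md` can be sketched as a small lookup table. This is only an illustration: `SM_TO_ARCH` and `arch_name` are hypothetical names and are not part of vLLM; the values come directly from the note in the diff above.

```python
# Hypothetical helper (not a vLLM API): maps CUDA compute capability
# (major, minor) to the architecture names used in the compatibility table.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}


def arch_name(major: int, minor: int) -> str:
    """Return the architecture name for an SM version, or "unknown"."""
    return SM_TO_ARCH.get((major, minor), "unknown")


if __name__ == "__main__":
    print(arch_name(8, 9))
```

Capabilities not named in the note (for example, anything older than SM 7.0) fall through to `"unknown"` rather than being guessed.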
--- a/docs/source/usage/spec_decode.md
+++ b/docs/source/usage/spec_decode.md
--- a/docs/source/usage/structured_outputs.md
+++ b/docs/source/usage/structured_outputs.md
--- a/docs/source/usage/tool_calling.md
+++ b/docs/source/usage/tool_calling.md
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -79,6 +79,9 @@ serving/metrics
 serving/integrations
 serving/tensorizer
 serving/runai_model_streamer
+serving/engine_args
+serving/env_vars
+serving/usage_stats
 ```

 ```{toctree}
@@ -88,53 +91,28 @@ serving/runai_model_streamer
 models/supported_models
 models/generative_models
 models/pooling_models
-models/adding_model
-models/enabling_multimodal_inputs
 ```

 ```{toctree}
-:caption: Usage
+:caption: Features
 :maxdepth: 1

-usage/lora
-usage/multimodal_inputs
-usage/tool_calling
-usage/structured_outputs
-usage/spec_decode
-usage/compatibility_matrix
-usage/performance
-usage/engine_args
-usage/env_vars
-usage/usage_stats
-usage/disagg_prefill
-```
-
-```{toctree}
-:caption: Quantization
-:maxdepth: 1
-
-quantization/supported_hardware
-quantization/auto_awq
-quantization/bnb
-quantization/gguf
-quantization/int8
-quantization/fp8
-quantization/fp8_e5m2_kvcache
-quantization/fp8_e4m3_kvcache
-```
-
-```{toctree}
-:caption: Automatic Prefix Caching
-:maxdepth: 1
-
-automatic_prefix_caching/apc
-automatic_prefix_caching/details
+features/quantization/index
+features/lora
+features/multimodal_inputs
+features/tool_calling
+features/structured_outputs
+features/automatic_prefix_caching
+features/disagg_prefill
+features/spec_decode
+features/compatibility_matrix
 ```

 ```{toctree}
 :caption: Performance
 :maxdepth: 1

+performance/optimization
 performance/benchmarks
 ```

@@ -148,10 +126,8 @@ community/meetups
 community/sponsors
 ```

-% API Documentation: API reference aimed at vllm library usage
-
 ```{toctree}
-:caption: API Documentation
+:caption: API Reference
 :maxdepth: 2

 dev/sampling_params
@@ -160,30 +136,32 @@ dev/offline_inference/offline_index
 dev/engine/engine_index
 ```

-% Design: docs about vLLM internals
+% Design Documents: Details about vLLM internals

 ```{toctree}
-:caption: Design
+:caption: Design Documents
 :maxdepth: 2

 design/arch_overview
 design/huggingface_integration
 design/plugin_system
-design/input_processing/model_inputs_index
 design/kernel/paged_attention
+design/input_processing/model_inputs_index
 design/multimodal/multimodal_index
+design/automatic_prefix_caching
 design/multiprocessing
 ```

-% For Developers: contributing to the vLLM project
+% Developer Guide: How to contribute to the vLLM project

 ```{toctree}
-:caption: For Developers
+:caption: Developer Guide
 :maxdepth: 2

 contributing/overview
 contributing/profiling/profiling_index
 contributing/dockerfile/dockerfile
+contributing/model/index
 ```

 # Indices and tables

--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
@@ -37,7 +37,7 @@ print(output)
 If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
 ````

-Otherwise, please refer to [Adding a New Model](#adding-a-new-model) and [Enabling Multimodal Inputs](#enabling-multimodal-inputs) for instructions on how to implement your model in vLLM.
+Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
 Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.

 ### ModelScope

--- a/docs/source/usage/performance.md
+++ b/docs/source/usage/performance.md
-(performance)=
+(optimization-and-tuning)=

-# Performance and Tuning
+# Optimization and Tuning

 ## Preemption


--- a/docs/source/usage/engine_args.md
+++ b/docs/source/usage/engine_args.md
--- a/docs/source/usage/env_vars.md
+++ b/docs/source/usage/env_vars.md
--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/source/serving/openai_compatible_server.md
@@ -217,7 +217,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai

 We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
 [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
-see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
+see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
 - *Note: `image_url.detail` parameter is not supported.*

 Code example: <gh-file:examples/openai_chat_completion_client.py>

--- a/docs/source/usage/usage_stats.md
+++ b/docs/source/usage/usage_stats.md
--- a/vllm/attention/backends/rocm_flash_attn.py
+++ b/vllm/attention/backends/rocm_flash_attn.py
@@ -430,7 +430,7 @@ class ROCmFlashAttentionImpl(AttentionImpl):
        Returns:
            shape = [num_tokens, num_heads * head_size]
        """
-        # Reminder: Please update docs/source/usage/compatibility_matrix.md
+        # Reminder: Please update docs/source/features/compatibility_matrix.md
        # If the feature combo become valid
        if attn_type != AttentionType.DECODER:
            raise NotImplementedError("Encoder self-attention and "

--- a/vllm/config.py
+++ b/vllm/config.py
@@ -644,7 +644,7 @@ class ModelConfig:
            self.use_async_output_proc = False
            return

-        # Reminder: Please update docs/source/usage/compatibility_matrix.md
+        # Reminder: Please update docs/source/features/compatibility_matrix.md
        # If the feature combo become valid
        from vllm.platforms import current_platform
        if not current_platform.is_async_output_supported(self.enforce_eager):
@@ -665,7 +665,7 @@ class ModelConfig:
        if self.runner_type == "pooling":
            self.use_async_output_proc = False

-        # Reminder: Please update docs/source/usage/compatibility_matrix.md
+        # Reminder: Please update docs/source/features/compatibility_matrix.md
        # If the feature combo become valid
        if speculative_config:
            logger.warning("Async output processing is not supported with"
@@ -2064,7 +2064,7 @@ class LoRAConfig:
                           model_config.quantization)

    def verify_with_scheduler_config(self, scheduler_config: SchedulerConfig):
-        # Reminder: Please update docs/source/usage/compatibility_matrix.md
+        # Reminder: Please update docs/source/features/compatibility_matrix.md
        # If the feature combo become valid
        if scheduler_config.chunked_prefill_enabled:
            logger.warning("LoRA with chunked prefill is still experimental "

--- a/vllm/engine/arg_utils.py
+++ b/vllm/engine/arg_utils.py
@@ -1148,7 +1148,7 @@ class EngineArgs:
            disable_logprobs=self.disable_logprobs_during_spec_decoding,
        )

-        # Reminder: Please update docs/source/usage/compatibility_matrix.md
+        # Reminder: Please update docs/source/features/compatibility_matrix.md
        # If the feature combo become valid
        if self.num_scheduler_steps > 1:
            if speculative_config is not None:

--- a/vllm/engine/output_processor/multi_step.py
+++ b/vllm/engine/output_processor/multi_step.py
@@ -65,7 +65,7 @@ class MultiStepOutputProcessor(SequenceGroupOutputProcessor):
    @staticmethod
    @functools.lru_cache
    def _log_prompt_logprob_unsupported_warning_once():
-        # Reminder: Please update docs/source/usage/compatibility_matrix.md
+        # Reminder: Please update docs/source/features/compatibility_matrix.md
        # If the feature combo become valid
        logger.warning(
            "Prompt logprob is not supported by multi step workers. "

--- a/vllm/executor/cpu_executor.py
+++ b/vllm/executor/cpu_executor.py
@@ -22,7 +22,7 @@ class CPUExecutor(ExecutorBase):

    def _init_executor(self) -> None:
        assert self.device_config.device_type == "cpu"
-        # Reminder: Please update docs/source/usage/compatibility_matrix.md
+        # Reminder: Please update docs/source/features/compatibility_matrix.md
        # If the feature combo become valid
        assert self.lora_config is None, "cpu backend doesn't support LoRA"
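The repeated comment change in these hunks (pointing the reminder at the moved compatibility matrix) is the kind of edit that is easy to miss in one file. A minimal sketch of how stale references could be detected and rewritten is shown below, assuming plain substring matching is sufficient; `find_stale_references` and `update_references` are hypothetical helpers, not part of this commit or of vLLM.

```python
# Hypothetical helpers (not part of the commit): locate and rewrite
# references to the compatibility matrix's old documentation path.
OLD_PATH = "docs/source/usage/compatibility_matrix.md"
NEW_PATH = "docs/source/features/compatibility_matrix.md"


def find_stale_references(text: str) -> list[int]:
    """Return 1-based line numbers in `text` that still mention OLD_PATH."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if OLD_PATH in line
    ]


def update_references(text: str) -> str:
    """Return `text` with every occurrence of OLD_PATH replaced by NEW_PATH."""
    return text.replace(OLD_PATH, NEW_PATH)


if __name__ == "__main__":
    sample = (
        "# Reminder: Please update docs/source/usage/compatibility_matrix.md\n"
        "assert lora_config is None\n"
    )
    print(find_stale_references(sample))
    print(update_references(sample))
```

Running `find_stale_references` over each source file after a docs move gives a quick check that no reminder comment still points at the old location.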


--- a/vllm/platforms/cpu.py
+++ b/vllm/platforms/cpu.py
@@ -50,7 +50,7 @@ class CpuPlatform(Platform):
        import vllm.envs as envs
        from vllm.utils import GiB_bytes
        model_config = vllm_config.model_config
-        # Reminder: Please update docs/source/usage/compatibility_matrix.md
+        # Reminder: Please update docs/source/features/compatibility_matrix.md
        # If the feature combo become valid
        if not model_config.enforce_eager:
            logger.warning(