Remove unnecessary explicit title anchors and use relative links instead (#20620)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Unverified Commit b4bab816 authored Jul 08, 2025 by Harry Mellor Committed by GitHub Jul 08, 2025
20 changed files
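
The link rewrites in this commit follow one pattern: a reference-style link such as `[FAQs][faq]` becomes an inline link whose target is the path from the page being edited to the destination page. A minimal sketch of that computation is shown below. It is not the tooling used for this commit (the diff does not say how the edits were made); `relative_link`, `rewrite_links`, and the `ANCHOR_TO_PAGE` mapping are hypothetical names introduced for illustration, and the two mapping entries are taken from the hunks in this diff.

```python
import posixpath
import re

# Hypothetical mapping from old explicit anchors to the page they lived on.
# Both entries are read off the hunks in this diff.
ANCHOR_TO_PAGE = {
    "faq": "docs/usage/faq.md",
    "serving-openai-compatible-server": "docs/serving/openai_compatible_server.md",
}

# Matches reference-style links of the form [label][anchor].
REF_LINK = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def relative_link(source_page: str, target_page: str) -> str:
    """Return the path to write in source_page so that it points at target_page."""
    return posixpath.relpath(target_page, start=posixpath.dirname(source_page))


def rewrite_links(text: str, source_page: str, anchor_to_page: dict[str, str]) -> str:
    """Replace [label][anchor] with [label](relative/path.md) for known anchors."""

    def replace(match: re.Match) -> str:
        label, anchor = match.group(1), match.group(2)
        target = anchor_to_page.get(anchor)
        if target is None:
            # Unknown anchors (e.g. API references) are left untouched.
            return match.group(0)
        return f"[{label}]({relative_link(source_page, target)})"

    return REF_LINK.sub(replace, text)


if __name__ == "__main__":
    line = "in the [FAQs][faq]."
    print(rewrite_links(line, "docs/features/spec_decode.md", ANCHOR_TO_PAGE))
```

Anchors that are not in the mapping, such as the `[LLM][vllm.LLM]` API reference in `quickstart.md`, pass through unchanged, which matches what the diff keeps.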
--- a/docs/features/quantization/int8.md
+++ b/docs/features/quantization/int8.md
 ---
 title: INT8 W8A8
 ---
-[](){ #int8 }

 vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
 This quantization method is particularly useful for reducing model size while maintaining good performance.

--- a/docs/features/quantization/quantized_kvcache.md
+++ b/docs/features/quantization/quantized_kvcache.md
 ---
 title: Quantized KV Cache
 ---
-[](){ #quantized-kvcache }

 ## FP8 KV Cache


--- a/docs/features/quantization/quark.md
+++ b/docs/features/quantization/quark.md
 ---
 title: AMD Quark
 ---
-[](){ #quark }

 Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
 throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),

--- a/docs/features/quantization/supported_hardware.md
+++ b/docs/features/quantization/supported_hardware.md
 ---
 title: Supported Hardware
 ---
-[](){ #quantization-supported-hardware }

 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:


--- a/docs/features/reasoning_outputs.md
+++ b/docs/features/reasoning_outputs.md
 ---
 title: Reasoning Outputs
 ---
-[](){ #reasoning-outputs }

 vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.


--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode.md
 ---
 title: Speculative Decoding
 ---
-[](){ #spec-decode }

 !!! warning
     Please note that speculative decoding in vLLM is not yet optimized and does
@@ -269,7 +268,7 @@ speculative decoding, breaking down the guarantees into three key areas:
 3. **vLLM Logprob Stability**
    \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
    same request across runs. For more details, see the FAQ section
-   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
+   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).

 While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
 can occur due to following factors:
@@ -278,7 +277,7 @@ can occur due to following factors:
 - **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
   due to non-deterministic behavior in batched operations or numerical instability.

-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).

 ## Resources for vLLM contributors


--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
 ---
 title: Structured Outputs
 ---
-[](){ #structured-outputs }

 vLLM supports the generation of structured outputs using
 [xgrammar](https://github.com/mlc-ai/xgrammar) or
@@ -21,7 +20,7 @@ The following parameters are supported, which must be added as extra parameters:
 - `guided_grammar`: the output will follow the context free grammar.
 - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.

-You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.

 Structured outputs are supported by default in the OpenAI-Compatible Server. You
 may choose to specify the backend to use by setting the

--- a/docs/getting_started/installation/README.md
+++ b/docs/getting_started/installation/README.md
 ---
 title: Installation
 ---
-[](){ #installation-index }

 vLLM supports the following hardware platforms:


--- a/docs/getting_started/installation/intel_gaudi.md
+++ b/docs/getting_started/installation/intel_gaudi.md
@@ -109,8 +109,8 @@ docker run \

 ### Supported features

-- [Offline inference][offline-inference]
-- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server]
+- [Offline inference](../../serving/offline_inference.md)
+- Online serving via [OpenAI-Compatible Server](../../serving/openai_compatible_server.md)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,

--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
 ---
 title: Quickstart
 ---
-[](){ #quickstart }

 This guide will help you quickly get started with vLLM to perform:

@@ -43,7 +42,7 @@ uv pip install vllm --torch-backend=auto
 ```

 !!! note
-    For more detail and non-CUDA platforms, please refer [here][installation-index] for specific instructions on how to install vLLM.
+    For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.

 [](){ #quickstart-offline }

@@ -77,7 +76,7 @@ prompts = [
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 ```

-The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models].
+The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](../models/supported_models.md).

 ```python
 llm = LLM(model="facebook/opt-125m")

--- a/docs/mkdocs/hooks/generate_examples.py
+++ b/docs/mkdocs/hooks/generate_examples.py
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 import itertools
+import logging
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Literal

 import regex as re

+logger = logging.getLogger("mkdocs")
+
 ROOT_DIR = Path(__file__).parent.parent.parent.parent
 ROOT_DIR_RELATIVE = '../../../../..'
 EXAMPLE_DIR = ROOT_DIR / "examples"
 EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples"
-print(ROOT_DIR.resolve())
-print(EXAMPLE_DIR.resolve())
-print(EXAMPLE_DOC_DIR.resolve())


 def fix_case(text: str) -> str:
@@ -135,6 +135,11 @@ class Example:


 def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
+    logger.info("Generating example documentation")
+    logger.debug("Root directory: %s", ROOT_DIR.resolve())
+    logger.debug("Example directory: %s", EXAMPLE_DIR.resolve())
+    logger.debug("Example document directory: %s", EXAMPLE_DOC_DIR.resolve())
+
     # Create the EXAMPLE_DOC_DIR if it doesn't exist
     if not EXAMPLE_DOC_DIR.exists():
         EXAMPLE_DOC_DIR.mkdir(parents=True)
@@ -156,7 +161,7 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
     for example in sorted(examples, key=lambda e: e.path.stem):
         example_name = f"{example.path.stem}.md"
         doc_path = EXAMPLE_DOC_DIR / example.category / example_name
-        print(doc_path)
+        logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))
         if not doc_path.parent.exists():
             doc_path.parent.mkdir(parents=True)
         with open(doc_path, "w+") as f:
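
The hunks above swap unconditional `print` calls for `logging` calls at two levels: one `info` message per run and `debug` messages for paths. The sketch below illustrates the effect of that split with a standalone logger. It assumes a handler whose logger is set to `INFO`; the logger name `example_hook`, the `report` helper, and the sample paths are hypothetical, and how MkDocs itself configures the `mkdocs` logger is not shown here.

```python
import io
import logging
from pathlib import PurePosixPath

# Hypothetical standalone logger; the hook in the diff uses logging.getLogger("mkdocs").
logger = logging.getLogger("example_hook")
logger.propagate = False  # keep the demo output confined to our own handler

stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)


def report(doc_path: PurePosixPath, root: PurePosixPath) -> None:
    """Emit one high-level message and one per-file detail message."""
    logger.info("Generating example documentation")
    # %s-style arguments are only formatted if the record is actually emitted.
    logger.debug("Example generated: %s", doc_path.relative_to(root))


root = PurePosixPath("/repo")
doc = PurePosixPath("/repo/docs/examples/basic.md")

report(doc, root)                 # at INFO, only the summary line is written
logger.setLevel(logging.DEBUG)
report(doc, root)                 # at DEBUG, the per-file line is written too

print(stream.getvalue())
```

Logging the path relative to the root, as the new `logger.debug` call does, also keeps the message independent of where the repository is checked out.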

--- a/docs/models/extensions/runai_model_streamer.md
+++ b/docs/models/extensions/runai_model_streamer.md
 ---
 title: Loading models with Run:ai Model Streamer
 ---
-[](){ #runai-model-streamer }

 Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
 Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).

--- a/docs/models/extensions/tensorizer.md
+++ b/docs/models/extensions/tensorizer.md
 ---
 title: Loading models with CoreWeave's Tensorizer
 ---
-[](){ #tensorizer }

 vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
 vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized

--- a/docs/models/generative_models.md
+++ b/docs/models/generative_models.md
 ---
 title: Generative Models
 ---
-[](){ #generative-models }

 vLLM provides first-class support for generative models, which covers most of LLMs.

@@ -134,7 +133,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)

 ## Online Serving

-Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

 - [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
-- [Chat API][chat-api]  is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template.
+- [Chat API][chat-api]  is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
--- a/docs/models/hardware_supported_models/tpu.md
+++ b/docs/models/hardware_supported_models/tpu.md
 ---
 title: TPU
 ---
-[](){ #tpu-supported-models }

 # TPU Supported Models
 ## Text-only Language Models

--- a/docs/models/pooling_models.md
+++ b/docs/models/pooling_models.md
 ---
 title: Pooling Models
 ---
-[](){ #pooling-models }

 vLLM also supports pooling models, including embedding, reranking and reward models.

@@ -11,7 +10,7 @@ before returning them.

 !!! note
     We currently support pooling models primarily as a matter of convenience.
-    As shown in the [Compatibility Matrix][compatibility-matrix], most vLLM features are not applicable to
+    As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
     pooling models as they only work on the generation or decode stage, so performance may not improve as much.

 For pooling models, we support the following `--task` options.
@@ -113,10 +112,10 @@ A code example can be found here: <gh-file:examples/offline_inference/basic/scor

 ## Online Serving

-Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

 - [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
-- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs][multimodal-inputs] for embedding models.
+- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
 - [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
 - [Score API][score-api] is similar to `LLM.score` for cross-encoder models.


--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
 ---
 title: Supported Models
 ---
-[](){ #supported-models }

 vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
 If a model supports more than one task, you can set the task via the `--task` argument.
@@ -34,7 +33,7 @@ llm.apply_model(lambda model: print(type(model)))
 If it is `TransformersForCausalLM` then it means it's based on Transformers!

 !!! tip
-    You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][serving-openai-compatible-server].
+    You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference](../serving/offline_inference.md) or `--model-impl transformers` for the [openai-compatible-server](../serving/openai_compatible_server.md).

 !!! note
     vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
@@ -53,8 +52,8 @@ For a model to be compatible with the Transformers backend for vLLM it must:

 If the compatible model is:

-- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][serving-openai-compatible-server].
-- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference][offline-inference] or `vllm serve <MODEL_DIR>` for the [openai-compatible-server][serving-openai-compatible-server].
+- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference](../serving/offline_inference.md) or `--trust-remote-code` for the [openai-compatible-server](../serving/openai_compatible_server.md).
+- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference](../serving/offline_inference.md) or `vllm serve <MODEL_DIR>` for the [openai-compatible-server](../serving/openai_compatible_server.md).

 This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!

@@ -171,7 +170,7 @@ The [Transformers backend][transformers-backend] enables you to run models direc

     If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.

-Otherwise, please refer to [Adding a New Model][new-model] for instructions on how to implement your model in vLLM.
+Otherwise, please refer to [Adding a New Model](../contributing/model/README.md) for instructions on how to implement your model in vLLM.
 Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.

 #### Download a model
@@ -308,13 +307,13 @@ print(output)

 ### Generative Models

-See [this page][generative-models] for more information on how to use generative models.
+See [this page](generative_models.md) for more information on how to use generative models.

 #### Text Generation

 Specified using `--task generate`.

-| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
+| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ |
@@ -412,7 +411,7 @@ See [this page](./pooling_models.md) for more information on how to use pooling

 Specified using `--task embed`.

-| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
+| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
 | `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
@@ -448,7 +447,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding

 Specified using `--task reward`.

-| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
+| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
@@ -466,7 +465,7 @@ If your model is not in the above list, we will try to automatically convert the

 Specified using `--task classify`.

-| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
+| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
 | `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
@@ -527,7 +526,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.

 - e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.

-See [this page][multimodal-inputs] on how to pass multi-modal inputs to the model.
+See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inputs to the model.

 !!! important
     **To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
@@ -557,13 +556,13 @@ See [this page][multimodal-inputs] on how to pass multi-modal inputs to the mode

 ### Generative Models

-See [this page][generative-models] for more information on how to use generative models.
+See [this page](generative_models.md) for more information on how to use generative models.

 #### Text Generation

 Specified using `--task generate`.

-| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
+| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
 | `AyaVisionForConditionalGeneration` | Aya Vision | T + I<sup>+</sup> | `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc. | | ✅︎ | ✅︎ |
@@ -685,7 +684,7 @@ Specified using `--task transcription`.

 Speech2Text models trained specifically for Automatic Speech Recognition.

-| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
+| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `WhisperForConditionalGeneration` | Whisper | `openai/whisper-small`, `openai/whisper-large-v3-turbo`, etc. | | | |

@@ -708,7 +707,7 @@ Any text generation model can be converted into an embedding model by passing `-

 The following table lists those that are tested in vLLM.

-| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
+| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
 | `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |

--- a/docs/serving/distributed_serving.md
+++ b/docs/serving/distributed_serving.md
 ---
 title: Distributed Inference and Serving
 ---
-[](){ #distributed-serving }

 ## How to decide the distributed inference strategy?


--- a/docs/serving/integrations/langchain.md
+++ b/docs/serving/integrations/langchain.md
 ---
 title: LangChain
 ---
-[](){ #serving-langchain }

 vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .


--- a/docs/serving/integrations/llamaindex.md
+++ b/docs/serving/integrations/llamaindex.md
 ---
 title: LlamaIndex
 ---
-[](){ #serving-llamaindex }

 vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .