Unverified commit b942c094, authored by Harry Mellor, committed by GitHub

Stop using title frontmatter and fix doc that can only be reached by search (#20623)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent b4bab816
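
The same rewrite recurs in every file touched below: a frontmatter block holding only a `title:` key is replaced by a level-1 Markdown heading. A minimal sketch of that transformation follows. It is an illustration, not the tooling used for this commit; the function name `frontmatter_title_to_heading` is hypothetical, and the sketch assumes the frontmatter contains nothing but the `title:` line, as in the hunks shown here.

```python
import re

# Assumption: the file starts with a frontmatter block containing only `title:`.
_TITLE_FRONTMATTER = re.compile(r"\A---\ntitle: (?P<title>[^\n]+)\n---\n")


def frontmatter_title_to_heading(text: str) -> str:
    """Replace title-only frontmatter with a level-1 Markdown heading.

    Text that does not start with such a block is returned unchanged.
    """
    # A callable replacement avoids treating backslashes in the title as escapes.
    return _TITLE_FRONTMATTER.sub(
        lambda m: f"# {m.group('title')}\n", text, count=1
    )


if __name__ == "__main__":
    before = "---\ntitle: Architecture Overview\n---\n\nThis document provides an overview.\n"
    print(frontmatter_title_to_heading(before))
```

Files with additional frontmatter keys would need a real YAML parser rather than this regular expression.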
- ---
- title: Architecture Overview
- ---
+ # Architecture Overview
  This document provides an overview of the vLLM architecture.
......
- ---
- title: Automatic Prefix Caching
- ---
+ # Automatic Prefix Caching
  The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
......
- ---
- title: Integration with HuggingFace
- ---
+ # Integration with HuggingFace
  This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
......
- ---
- title: vLLM Paged Attention
- ---
+ # vLLM Paged Attention
  Currently, vLLM utilizes its own implementation of a multi-head query
  attention kernel (`csrc/attention/attention_kernels.cu`).
......
- ---
- title: Multi-Modal Data Processing
- ---
+ # Multi-Modal Data Processing
  To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
......
- ---
- title: vLLM's Plugin System
- ---
+ # vLLM's Plugin System
  The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
......
- ---
- title: Automatic Prefix Caching
- ---
+ # Automatic Prefix Caching
  ## Introduction
......
- ---
- title: Compatibility Matrix
- ---
+ # Compatibility Matrix
  The tables below show mutually exclusive features and the support on some hardware.
......
- ---
- title: Disaggregated Prefilling (experimental)
- ---
+ # Disaggregated Prefilling (experimental)
  This page introduces you the disaggregated prefilling feature in vLLM.
......
- ---
- title: LoRA Adapters
- ---
+ # LoRA Adapters
  This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
......
- ---
- title: Multimodal Inputs
- ---
+ # Multimodal Inputs
  This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
......
- ---
- title: Quantization
- ---
+ # Quantization
  Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
......
- ---
- title: AutoAWQ
- ---
+ # AutoAWQ
  To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
  Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
......
- ---
- title: BitBLAS
- ---
+ # BitBLAS
  vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
......
- ---
- title: BitsAndBytes
- ---
+ # BitsAndBytes
  vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
  BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
......
- ---
- title: FP8 W8A8
- ---
+ # FP8 W8A8
  vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
  Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
......
- ---
- title: GGUF
- ---
+ # GGUF
  !!! warning
  Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
......
- ---
- title: GPTQModel
- ---
+ # GPTQModel
  To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
......
- ---
- title: INT4 W4A16
- ---
+ # INT4 W4A16
  vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
......
- ---
- title: INT8 W8A8
- ---
+ # INT8 W8A8
  vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
  This quantization method is particularly useful for reducing model size while maintaining good performance.
......