Commit 58666cd7 authored by chenzk

vllm kvprune wo:v1.1.2

parent d4645504
...@@ -21,59 +21,170 @@ For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us. ...@@ -21,59 +21,170 @@ For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
## About ## About
vLLM is a fast and easy-to-use library for LLM inference and serving. KV cache pruning, a model compression feature, has been added to the official vLLM.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM prunes with:
vLLM is fast with: - [**SNAPKV**](https://arxiv.org/pdf/2404.14469)
- [**COMPACTOR**](https://arxiv.org/pdf/2507.08143)
- [**CRITICALADAKV**](https://arxiv.org/pdf/2502.03805)
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
vLLM seamlessly supports most popular open-source models on HuggingFace, including: vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama) - Transformer-like LLMs (e.g., Qwen3/Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html). ## Env
## Getting Started ```bash
cd vllm
python use_existing_torch.py
# then add torch to the `requires` list in pyproject.toml
export SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM="0.6.0"
pip install -e . --no-build-isolation -v -i https://mirrors.aliyun.com/pypi/simple/
pip install numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/
```
Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source): More related libraries:
```bash - flash_attn-2.8.3+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
pip install vllm - torchvision-0.24.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
- triton-3.1.0+das.opt1.dtk2604.torch271-cp310-cp310-manylinux_2_28_x86_64.whl
## Quick Start
Basic Chat Generation with Compression:
```
python test.py --schedule pdtriton
``` ```
test.py:
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# PYTHONPATH=/home/vllm-project/vllm python test.py --schedule pdtriton
from __future__ import annotations

import argparse
import os
import sys
from multiprocessing import freeze_support


def _apply_kvprune_attention_env(schedule: str | None) -> None:
    """Map CLI -> VLLM_KVPRUNE_ATTENTION_SCHEDULE (fa_triton | pdtriton | pdfa)."""
    if not schedule:
        return
    os.environ["VLLM_KVPRUNE_ATTENTION_SCHEDULE"] = schedule


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--schedule",
        type=str,
        default="pdtriton",
        choices=("fa_triton", "pdtriton", "pdfa"),
        help=(
            "fa_triton=FA prefill + Triton decode;"
            "pdtriton=Triton prefill + Triton decode;"
            "pdfa=FA prefill + FA decode (page KV writing is Triton);"
        ),
    )
    args, _unknown = parser.parse_known_args()
    _apply_kvprune_attention_env(args.schedule)

    from transformers import AutoTokenizer
    from vllm import CompressionParams, LLM, SamplingParams

    model_id = "Qwen/Qwen3-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.05,
        max_tokens=512,
    )
    llm = LLM(
        model=model_id,
        tensor_parallel_size=4,
        max_model_len=8192,
        gpu_memory_utilization=0.85,
        kvprune_compression=True,  # True, False
    )
    prompt = (
        "Write a 200-word English prompt for a creative writing task. The prompt should be "
        "a single coherent paragraph without any bullet points, numbered lists, or markdown "
        "formatting. It should describe a specific scenario, character, or conflict, and end "
        "with a clear question that invites the writer to continue the story. Do not use any "
        "special symbols or line breaks. The tone can be mysterious, tense, or reflective. "
        "After the paragraph, include the question on the same line directly following the "
        "period, without hitting enter."
    )
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,  # True
    )
    compression = [
        CompressionParams(
            compression_ratio=0.5,
            compression_method="compactor",
        ),
    ]
    outputs = llm.generate(
        [text],
        sampling_params=sampling_params,
        compression=compression,
    )
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")


if __name__ == "__main__":
    freeze_support()
    main()
```
Set `kvprune_compression=True` to enable pruning. This disables the CUDA graph mode of the vLLM v1 engine to reduce inference time and minimizes the v1 engine's GPU memory usage.
If a DCU kernel error occurs, run `export HIP_LAUNCH_BLOCKING=1` before the test command to work around the instability of the DCU Triton kernel. In the long term, Triton compilation engineers need to fix the underlying stability problem.
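As an alternative to exporting the variable in the shell, the workaround can be scoped to a single run. The following is a minimal sketch, assuming `test.py` sits in the current directory; the helper names `build_debug_env` and `run_test` are hypothetical and not part of this commit.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with HIP_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_LAUNCH_BLOCKING"] = "1"
    return env


def run_test(schedule="pdtriton", dry_run=True):
    """Build (and optionally run) the README test command.

    With dry_run=True the command and environment are only returned,
    so the sketch can be inspected without a DCU or vLLM installed.
    """
    cmd = [sys.executable, "test.py", "--schedule", schedule]
    env = build_debug_env()
    if dry_run:
        return cmd, env
    return subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    cmd, env = run_test()
    print(" ".join(cmd), "HIP_LAUNCH_BLOCKING=" + env["HIP_LAUNCH_BLOCKING"])
```

Because the variable is set only in the child process environment, other jobs started from the same shell are not affected.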
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html) To test on the RULER datasets:
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) ```
rm -rf ~/.triton ~/.cache/torch /tmp/triton* /tmp/torch*
export PYTHONPATH=/home/vllm-project/vllm
export HIP_LAUNCH_BLOCKING=1
python vllm/tests/kvprune/evaluate/eval_ruler.py \
--tensor-parallel-size 4 \
--dataset-parquet ruler/4096/test-00000-of-00001.parquet \
--dataset-split train \
--model Qwen/Qwen3-8B \
--compression-method snapkv \
--seq-compression-ratio 0.5 \
--attention-schedule pdtriton
```
## Contributing ## Contributing
We welcome and value any contributions and collaborations. We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
## Citation ## Citation
......
<!-- markdownlint-disable MD001 MD041 -->
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
</picture>
</p>
<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>
🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
---
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
## Getting Started
Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
```bash
pip install vllm
```
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
## Contributing
We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
## Citation
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
```
## Contact Us
<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->
## Media Kit
- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)
...@@ -130,7 +130,7 @@ def parse_args() -> argparse.Namespace: ...@@ -130,7 +130,7 @@ def parse_args() -> argparse.Namespace:
parser.add_argument( parser.add_argument(
"--kvprune-compression", "--kvprune-compression",
action=argparse.BooleanOptionalAction, action=argparse.BooleanOptionalAction,
default=True, default=True, # True
help="Enable kvprune_compression on LLM (skip v1 CUDA graphs, minimal v1 KV blocks). " help="Enable kvprune_compression on LLM (skip v1 CUDA graphs, minimal v1 KV blocks). "
"Default: True.", "Default: True.",
) )
......
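For readers unfamiliar with `argparse.BooleanOptionalAction` (Python 3.9+), the flag in the hunk above can be switched off with an automatically generated `--no-kvprune-compression` option. The sketch below reproduces only that one argument in a standalone parser; `build_parser` is a hypothetical name, and the help text is shortened.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Standalone parser carrying only the --kvprune-compression flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--kvprune-compression",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Enable kvprune_compression on LLM. Default: True.",
    )
    return parser


if __name__ == "__main__":
    parser = build_parser()
    # No flag given: the default applies.
    print(parser.parse_args([]).kvprune_compression)
    # BooleanOptionalAction also registers the negated form.
    print(parser.parse_args(["--no-kvprune-compression"]).kvprune_compression)
```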
...@@ -3,6 +3,7 @@ ...@@ -3,6 +3,7 @@
import itertools import itertools
import os import os
from copy import deepcopy
from collections.abc import Callable, Iterable, Sequence from collections.abc import Callable, Iterable, Sequence
from pathlib import Path from pathlib import Path
from typing import TYPE_CHECKING, Any from typing import TYPE_CHECKING, Any
...@@ -186,15 +187,13 @@ class LLM: ...@@ -186,15 +187,13 @@ class LLM:
enforce_eager: Whether to enforce eager execution. If True, we will enforce_eager: Whether to enforce eager execution. If True, we will
disable CUDA graph and always execute the model in eager mode. disable CUDA graph and always execute the model in eager mode.
If False, we will use CUDA graph and eager execution in hybrid. If False, we will use CUDA graph and eager execution in hybrid.
kvprune_compression: If True, sets ``enforce_eager=True`` for the **v1** kvprune_compression: Compatibility flag for the integrated kvprune path.
engine only (no v1 CUDA graph capture). If ``None`` (default), read If ``None`` (default), read ``VLLM_KVPRUNE_COMPRESSION_DEFAULT``.
``VLLM_KVPRUNE_COMPRESSION_DEFAULT`` (``"0"`` = allow v1 graphs; When enabled, requests with ``compression_ratio < 1.0`` automatically
``"1"`` = skip v1 graphs). This is independent of the compactor's rebuild the internal v1 engine into a kvprune-friendly mode
``LLMConfig.enforce_eager`` (see ``VLLM_KVPRUNE_COMPACTOR_CUDA_GRAPH`` / (``enforce_eager=True`` and ``num_gpu_blocks_override=1``). Requests
``VLLM_KVPRUNE_COMPACTOR_ENFORCE_EAGER``; default tries compactor graphs). with ``compression_ratio >= 1.0`` use the caller-provided normal v1
When True, v1's GPU KV pool defaults to **one** block (minimum allowed by engine configuration.
the scheduler) unless ``num_gpu_blocks_override`` is passed in ``**kwargs``
or ``VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS`` is set (``auto`` = profiled allocation).
enable_return_routed_experts: Whether to return routed experts. enable_return_routed_experts: Whether to return routed experts.
disable_custom_all_reduce: See disable_custom_all_reduce: See
[ParallelConfig][vllm.config.ParallelConfig]. [ParallelConfig][vllm.config.ParallelConfig].
...@@ -351,26 +350,12 @@ class LLM: ...@@ -351,26 +350,12 @@ class LLM:
"'examples/offline_inference/data_parallel.py'." "'examples/offline_inference/data_parallel.py'."
) )
# v1 ``enforce_eager`` is independent of kvprune compactor ``LLMConfig.enforce_eager``. # v1 ``enforce_eager`` is independent of kvprune compactor
# ``LLMConfig.enforce_eager``. ``kvprune_compression`` enables automatic
# switching between normal and compressed v1 engine init modes per request.
if kvprune_compression is None: if kvprune_compression is None:
_kvd = os.environ.get("VLLM_KVPRUNE_COMPRESSION_DEFAULT", "0").strip().lower() _kvd = os.environ.get("VLLM_KVPRUNE_COMPRESSION_DEFAULT", "0").strip().lower()
kvprune_compression = _kvd in ("1", "true", "yes") kvprune_compression = _kvd in ("1", "true", "yes")
if kvprune_compression:
enforce_eager = True
# Reserve minimal v1 GPU KV so compactor can use the rest of VRAM. v1
# scheduler requires num_gpu_blocks >= 1; profiling would allocate a
# large pool from gpu_memory_utilization. Override:
# VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS unset -> 1 block (default)
# VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS=auto -> profiled (no override)
# VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS=<int> -> max(1, int)
if "num_gpu_blocks_override" not in kwargs:
_v1_kv = os.environ.get("VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS", "").strip()
if _v1_kv.lower() in ("auto", "profile"):
pass
elif not _v1_kv:
kwargs["num_gpu_blocks_override"] = 1
else:
kwargs["num_gpu_blocks_override"] = max(1, int(_v1_kv))
engine_args = EngineArgs( engine_args = EngineArgs(
model=model, model=model,
runner=runner, runner=runner,
...@@ -411,13 +396,52 @@ class LLM: ...@@ -411,13 +396,52 @@ class LLM:
log_non_default_args(engine_args) log_non_default_args(engine_args)
self.llm_engine = LLMEngine.from_engine_args(
engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
)
self.engine_class = type(self.llm_engine)
self.request_counter = Counter() self.request_counter = Counter()
self.default_sampling_params: dict[str, Any] | None = None self.default_sampling_params: dict[str, Any] | None = None
# Cache for __repr__ to avoid repeated collective_rpc calls
self._cached_repr: str | None = None
# Lazy compactor engine (``vllm.kvprune``) when :meth:`generate` uses compression.
self._kvprune_compactor_engine: Any = None
self._kvprune_compression_enabled = bool(kvprune_compression)
self._engine_args_base = deepcopy(engine_args)
self._kvprune_v1_mode = "normal"
self.chat_template = load_chat_template(chat_template)
self.chat_template_config = ChatTemplateConfig(chat_template=self.chat_template)
if not self._kvprune_compression_enabled:
self._rebuild_llm_engine_for_kvprune_mode("normal")
def _ensure_llm_engine_initialized(self, mode: str = "normal") -> None:
if not hasattr(self, "llm_engine"):
self._rebuild_llm_engine_for_kvprune_mode(mode)
def _shutdown_llm_engine(self) -> None:
old_engine = getattr(self, "llm_engine", None)
if old_engine is None:
return
try:
old_engine.engine_core.shutdown()
except Exception:
logger.warning("Failed to shutdown previous LLMEngine cleanly.", exc_info=True)
try:
dp_group = getattr(old_engine, "dp_group", None)
if dp_group is not None and not getattr(old_engine, "external_launcher_dp", False):
from vllm.distributed import (
stateless_destroy_torch_distributed_process_group,
)
stateless_destroy_torch_distributed_process_group(dp_group)
except Exception:
logger.warning(
"Failed to destroy previous LLMEngine DP group cleanly.",
exc_info=True,
)
def _attach_llm_engine(self, llm_engine: LLMEngine) -> None:
self.llm_engine = llm_engine
self.engine_class = type(self.llm_engine)
self.default_sampling_params = None
self._cached_repr = None
self._kvprune_compactor_engine = None
supported_tasks = self.llm_engine.get_supported_tasks() supported_tasks = self.llm_engine.get_supported_tasks()
logger.info("Supported tasks: %s", supported_tasks) logger.info("Supported tasks: %s", supported_tasks)
...@@ -425,23 +449,44 @@ class LLM: ...@@ -425,23 +449,44 @@ class LLM:
self.model_config = self.llm_engine.model_config self.model_config = self.llm_engine.model_config
self.renderer = self.llm_engine.renderer self.renderer = self.llm_engine.renderer
self.chat_template = load_chat_template(chat_template)
self.io_processor = self.llm_engine.io_processor self.io_processor = self.llm_engine.io_processor
self.input_processor = self.llm_engine.input_processor self.input_processor = self.llm_engine.input_processor
self.chat_template_config = ChatTemplateConfig(chat_template=self.chat_template)
self.pooling_io_processors = init_pooling_io_processors( self.pooling_io_processors = init_pooling_io_processors(
supported_tasks=supported_tasks, supported_tasks=supported_tasks,
model_config=self.model_config, model_config=self.model_config,
renderer=self.renderer, renderer=self.renderer,
chat_template_config=self.chat_template_config, chat_template_config=self.chat_template_config,
) )
# Cache for __repr__ to avoid repeated collective_rpc calls
self._cached_repr: str | None = None def _rebuild_llm_engine_for_kvprune_mode(self, mode: str) -> None:
# Lazy compactor engine (``vllm.kvprune``) when :meth:`generate` uses compression. if mode not in ("normal", "compressed"):
self._kvprune_compactor_engine: Any = None raise ValueError(f"Unknown kvprune v1 mode: {mode!r}")
self._kvprune_compression_enabled = bool(kvprune_compression) if getattr(self, "_kvprune_v1_mode", None) == mode and hasattr(self, "llm_engine"):
return
engine_args = deepcopy(self._engine_args_base)
if mode == "compressed":
engine_args.enforce_eager = True
engine_args.num_gpu_blocks_override = 1
self._shutdown_llm_engine()
llm_engine = LLMEngine.from_engine_args(
engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
)
self._attach_llm_engine(llm_engine)
self._kvprune_v1_mode = mode
def _compression_needs_kvprune(self, compression: Any) -> bool:
if compression is None:
return False
from vllm.kvprune.integration.compression_params import CompressionParams
if isinstance(compression, CompressionParams):
return compression.compression_ratio < 1.0
return any(cp.compression_ratio < 1.0 for cp in compression)
def get_tokenizer(self) -> TokenizerLike: def get_tokenizer(self) -> TokenizerLike:
self._ensure_llm_engine_initialized("normal")
return self.llm_engine.get_tokenizer() return self.llm_engine.get_tokenizer()
def get_world_size(self, include_dp: bool = True) -> int: def get_world_size(self, include_dp: bool = True) -> int:
...@@ -456,16 +501,19 @@ class LLM: ...@@ -456,16 +501,19 @@ class LLM:
The world size (tensor_parallel_size * pipeline_parallel_size), The world size (tensor_parallel_size * pipeline_parallel_size),
optionally multiplied by data_parallel_size if include_dp is True. optionally multiplied by data_parallel_size if include_dp is True.
""" """
self._ensure_llm_engine_initialized("normal")
parallel_config = self.llm_engine.vllm_config.parallel_config parallel_config = self.llm_engine.vllm_config.parallel_config
if include_dp: if include_dp:
return parallel_config.world_size_across_dp return parallel_config.world_size_across_dp
return parallel_config.world_size return parallel_config.world_size
def reset_mm_cache(self) -> None: def reset_mm_cache(self) -> None:
self._ensure_llm_engine_initialized("normal")
self.renderer.clear_mm_cache() self.renderer.clear_mm_cache()
self.llm_engine.reset_mm_cache() self.llm_engine.reset_mm_cache()
def get_default_sampling_params(self) -> SamplingParams: def get_default_sampling_params(self) -> SamplingParams:
self._ensure_llm_engine_initialized("normal")
if self.default_sampling_params is None: if self.default_sampling_params is None:
self.default_sampling_params = self.model_config.get_diff_sampling_param() self.default_sampling_params = self.model_config.get_diff_sampling_param()
if self.default_sampling_params: if self.default_sampling_params:
...@@ -513,16 +561,31 @@ class LLM: ...@@ -513,16 +561,31 @@ class LLM:
prompt has ``compression_ratio < 1.0``, the batch is run on the integrated prompt has ``compression_ratio < 1.0``, the batch is run on the integrated
compactor engine with weights shared from this ``LLM``. Omit or use all compactor engine with weights shared from this ``LLM``. Omit or use all
``compression_ratio >= 1`` to use the standard v1 engine only. ``compression_ratio >= 1`` to use the standard v1 engine only.
Use ``kvprune_compression=True`` or ``VLLM_KVPRUNE_COMPRESSION_DEFAULT=1`` If ``kvprune_compression=True``, requests with
so the v1 engine skips CUDA graph capture. Compactor decode graphs ``compression_ratio < 1.0`` automatically rebuild the internal v1
default on (``VLLM_KVPRUNE_COMPACTOR_CUDA_GRAPH`` default ``1``) with engine into eager + 1-block mode before entering kvprune.
eager fallback if capture fails; set ``VLLM_KVPRUNE_COMPACTOR_ENFORCE_EAGER=1`` Compactor decode graphs default on
to skip compactor graph capture entirely. (``VLLM_KVPRUNE_COMPACTOR_CUDA_GRAPH`` default ``1``) with eager
fallback if capture fails; set
``VLLM_KVPRUNE_COMPACTOR_ENFORCE_EAGER=1`` to skip compactor graph
capture entirely.
Returns: Returns:
A list of `RequestOutput` objects containing the A list of `RequestOutput` objects containing the
generated completions in the same order as the input prompts. generated completions in the same order as the input prompts.
""" """
compression_eff = compression
if self._kvprune_compression_enabled:
target_mode = (
"compressed"
if self._compression_needs_kvprune(compression_eff)
else "normal"
)
self._rebuild_llm_engine_for_kvprune_mode(target_mode)
else:
self._ensure_llm_engine_initialized("normal")
runner_type = self.model_config.runner_type runner_type = self.model_config.runner_type
if runner_type != "generate": if runner_type != "generate":
raise ValueError( raise ValueError(
...@@ -530,23 +593,6 @@ class LLM: ...@@ -530,23 +593,6 @@ class LLM:
"Try passing `--runner generate` to use the model as a " "Try passing `--runner generate` to use the model as a "
"generative model." "generative model."
) )
compression_eff = compression
if compression is None and getattr(self, "_kvprune_compression_enabled", False):
pc = self.llm_engine.vllm_config.parallel_config
if (
pc.tensor_parallel_size > 1
and pc.pipeline_parallel_size == 1
and pc.data_parallel_size == 1
):
from vllm.kvprune.integration.compression_params import CompressionParams
from vllm.kvprune.integration.compressed_generate import (
_normalize_prompt_list,
)
_plist = _normalize_prompt_list(prompts)
compression_eff = [
CompressionParams(compression_ratio=1.0) for _ in _plist
]
if compression_eff is not None: if compression_eff is not None:
from vllm.kvprune.integration.compressed_generate import ( from vllm.kvprune.integration.compressed_generate import (
...@@ -605,6 +651,7 @@ class LLM: ...@@ -605,6 +651,7 @@ class LLM:
Returns: Returns:
A list of request IDs for the enqueued requests. A list of request IDs for the enqueued requests.
""" """
self._ensure_llm_engine_initialized("normal")
runner_type = self.model_config.runner_type runner_type = self.model_config.runner_type
if runner_type != "generate": if runner_type != "generate":
raise ValueError("LLM.enqueue() is only supported for generative models.") raise ValueError("LLM.enqueue() is only supported for generative models.")
...@@ -741,6 +788,7 @@ class LLM: ...@@ -741,6 +788,7 @@ class LLM:
and set up data-plane communication to pass data. and set up data-plane communication to pass data.
""" """
self._ensure_llm_engine_initialized("normal")
return self.llm_engine.collective_rpc(method, timeout, args, kwargs) return self.llm_engine.collective_rpc(method, timeout, args, kwargs)
def apply_model(self, func: Callable[[nn.Module], _R]) -> list[_R]: def apply_model(self, func: Callable[[nn.Module], _R]) -> list[_R]:
...@@ -754,6 +802,7 @@ class LLM: ...@@ -754,6 +802,7 @@ class LLM:
make sure you move them to CPU first to avoid taking up additional make sure you move them to CPU first to avoid taking up additional
VRAM! VRAM!
""" """
self._ensure_llm_engine_initialized("normal")
return self.llm_engine.apply_model(func) return self.llm_engine.apply_model(func)
def beam_search( def beam_search(
......
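The per-request mode switching introduced in the `LLM` hunks above can be summarized with a toy model. This is a simplified sketch, not the commit's code: `CompressionParams` here is a stand-in dataclass, `ToyLLM` replaces the real engine with a plain dict, and the base configuration values are assumed for illustration. Only the decision logic (flag resolution from `VLLM_KVPRUNE_COMPRESSION_DEFAULT`, the `compression_ratio < 1.0` test, and the rebuild-only-on-mode-change rule) follows the diff.

```python
from dataclasses import dataclass


@dataclass
class CompressionParams:
    """Stand-in for the real CompressionParams; only the ratio matters here."""
    compression_ratio: float = 1.0
    compression_method: str = "compactor"


def resolve_kvprune_flag(flag, env):
    """None falls back to VLLM_KVPRUNE_COMPRESSION_DEFAULT ("0" if unset)."""
    if flag is None:
        raw = env.get("VLLM_KVPRUNE_COMPRESSION_DEFAULT", "0").strip().lower()
        return raw in ("1", "true", "yes")
    return bool(flag)


def needs_kvprune(compression):
    """True if any request asks for an actual reduction (ratio < 1.0)."""
    if compression is None:
        return False
    if isinstance(compression, CompressionParams):
        return compression.compression_ratio < 1.0
    return any(cp.compression_ratio < 1.0 for cp in compression)


class ToyLLM:
    """Tracks which v1 engine mode would be active and how often it is rebuilt."""

    def __init__(self, kvprune_compression=None, env=None):
        self.enabled = resolve_kvprune_flag(kvprune_compression, env or {})
        self.mode = None          # no engine built yet
        self.engine_config = None
        self.rebuilds = 0
        if not self.enabled:
            self._rebuild("normal")  # eager init when the flag is off

    def _rebuild(self, mode):
        if mode not in ("normal", "compressed"):
            raise ValueError(f"Unknown kvprune v1 mode: {mode!r}")
        if self.mode == mode:
            return  # engine already in the requested mode
        # Assumed base config; the real code deep-copies the caller's EngineArgs.
        config = {"enforce_eager": False, "num_gpu_blocks_override": None}
        if mode == "compressed":
            config["enforce_eager"] = True
            config["num_gpu_blocks_override"] = 1
        self.engine_config = config
        self.mode = mode
        self.rebuilds += 1

    def generate(self, compression=None):
        if self.enabled:
            target = "compressed" if needs_kvprune(compression) else "normal"
            self._rebuild(target)
        elif self.mode is None:
            self._rebuild("normal")
        return self.mode


if __name__ == "__main__":
    llm = ToyLLM(kvprune_compression=True)
    print(llm.generate([CompressionParams(compression_ratio=0.5)]), llm.rebuilds)
    print(llm.generate(None), llm.rebuilds)
```

In the actual commit each mode change shuts down the previous `LLMEngine` and constructs a new one, so a workload that alternates compressed and uncompressed batches pays that rebuild on every switch.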
...@@ -377,14 +377,7 @@ def try_compressed_generate( ...@@ -377,14 +377,7 @@ def try_compressed_generate(
comps = _normalize_compression_params(compression, len(plist)) comps = _normalize_compression_params(compression, len(plist))
pc = llm.llm_engine.vllm_config.parallel_config pc = llm.llm_engine.vllm_config.parallel_config
# TP>1: every worker must run the same collective_rpc session. If all
# compression_ratio >= 1, the old code returned None and only the driver ran
# v1 _run_engine — other ranks never joined a matching collective, which can
# deadlock NCCL / leave workers unsynchronized (hang at "Processed prompts:").
if pc.tensor_parallel_size > 1:
if not _should_use_kvprune_compactor_path(comps): if not _should_use_kvprune_compactor_path(comps):
comps = [CompressionParams(compression_ratio=1.0) for _ in plist]
elif not _should_use_kvprune_compactor_path(comps):
return None return None
v1_eager = bool( v1_eager = bool(
...@@ -394,8 +387,8 @@ def try_compressed_generate( ...@@ -394,8 +387,8 @@ def try_compressed_generate(
logger.warning( logger.warning(
"KV-prune compression: v1 CUDA graphs are still enabled on this LLM. " "KV-prune compression: v1 CUDA graphs are still enabled on this LLM. "
"The compactor does not reuse v1 graphs; capture wastes VRAM. " "The compactor does not reuse v1 graphs; capture wastes VRAM. "
"Set kvprune_compression=True, enforce_eager=True, or " "Set enforce_eager=True on LLM() if you need to avoid the extra "
"VLLM_KVPRUNE_COMPRESSION_DEFAULT=1 before import vllm." "v1 graph capture overhead for compressed generation."
) )
if pc.tensor_parallel_size > 1: if pc.tensor_parallel_size > 1:
...@@ -449,4 +442,3 @@ def _sequences_to_request_outputs(seqs: list[Any], engine: LLMEngine) -> list[Re ...@@ -449,4 +442,3 @@ def _sequences_to_request_outputs(seqs: list[Any], engine: LLMEngine) -> list[Re
) )
out.append(ro) out.append(ro)
return out return out