vllm kvprune wo:v1.1.2

Commit 58666cd7 authored May 06, 2026 by chenzk
5 changed files
--- a/README.md
+++ b/README.md
@@ -21,59 +21,170 @@ For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
 ## About
-vLLM is a fast and easy-to-use library for LLM inference and serving.
+This repository adds KV cache pruning, a model compression feature, to the official vLLM.
-Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
+Supported pruning methods:
-vLLM is fast with:
+- [**SNAPKV**](https://arxiv.org/pdf/2404.14469)
+- [**COMPACTOR**](https://arxiv.org/pdf/2507.08143)
+- [**CRITICALADAKV**](https://arxiv.org/pdf/2502.03805)
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
-vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
 vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
+- Transformer-like LLMs (e.g., Qwen3/Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
-Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
+## Env
-## Getting Started
+```bash
+cd vllm
+python use_existing_torch.py
+# then add torch to the `requires` list in pyproject.toml
+export SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM="0.6.0"
+pip install -e . --no-build-isolation -v -i https://mirrors.aliyun.com/pypi/simple/
+pip install numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/
+```
-Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
+More related libraries:
-```bash
+- flash_attn-2.8.3+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
-pip install vllm
+- torchvision-0.24.0+das.opt1.dtk2604.torch290-cp310-cp310-manylinux_2_28_x86_64.whl
+- triton-3.1.0+das.opt1.dtk2604.torch271-cp310-cp310-manylinux_2_28_x86_64.whl
+## Quick Start
+Basic Chat Generation with Compression:
+```
+python test.py --schedule pdtriton 
 ```
+test.py:
+```python 
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+# PYTHONPATH=/home/vllm-project/vllm python test.py --schedule pdtriton 
+from __future__ import annotations
+import argparse
+import os
+import sys
+from multiprocessing import freeze_support
+def _apply_kvprune_attention_env(schedule: str | None) -> None:
+    """Map CLI -> VLLM_KVPRUNE_ATTENTION_SCHEDULE (fa_triton | pdtriton | pdfa)."""
+    if not schedule:
+        return
+    os.environ["VLLM_KVPRUNE_ATTENTION_SCHEDULE"] = schedule
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--schedule",
+        type=str,
+        default="pdtriton",
+        choices=("fa_triton", "pdtriton", "pdfa"),
+        help=(
+            # Trailing spaces keep the concatenated help text readable.
+            "fa_triton=FA prefill + Triton decode; "
+            "pdtriton=Triton prefill + Triton decode; "
+            "pdfa=FA prefill + FA decode (page KV writing is Triton)"
+        ),
+    )
+    args, _unknown = parser.parse_known_args()
+    _apply_kvprune_attention_env(args.schedule)
+    from transformers import AutoTokenizer
+    from vllm import CompressionParams, LLM, SamplingParams
+    model_id = "Qwen/Qwen3-8B"
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    sampling_params = SamplingParams(
+        temperature=0.7,
+        top_p=0.8,
+        repetition_penalty=1.05,
+        max_tokens=512,
+    )
+    llm = LLM(
+        model=model_id,
+        tensor_parallel_size=4,
+        max_model_len=8192,
+        gpu_memory_utilization=0.85,
+        kvprune_compression=True,  # set to False to disable KV cache pruning
+    )
+    prompt = (
+        "Write a 200-word English prompt for a creative writing task. The prompt should be "
+        "a single coherent paragraph without any bullet points, numbered lists, or markdown "
+        "formatting. It should describe a specific scenario, character, or conflict, and end "
+        "with a clear question that invites the writer to continue the story. Do not use any "
+        "special symbols or line breaks. The tone can be mysterious, tense, or reflective. "
+        "After the paragraph, include the question on the same line directly following the "
+        "period, without hitting enter."
+    )
+    messages = [{"role": "user", "content": prompt}]
+    text = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+        enable_thinking=True,
+    )
+    compression = [
+        CompressionParams(
+            compression_ratio=0.5,
+            compression_method="compactor",
+        ),
+    ]
+    outputs = llm.generate(
+        [text],
+        sampling_params=sampling_params,
+        compression=compression,
+    )
+    for output in outputs:
+        generated_text = output.outputs[0].text
+        print(f"Generated text: {generated_text!r}")
+if __name__ == "__main__":
+    freeze_support()
+    main()
+```
+Setting `kvprune_compression=True` enables pruning. It disables CUDA graph mode in the vLLM v1 engine to reduce inference time, and it minimizes the v1 engine's GPU memory usage.
+If a DCU kernel error occurs, run `export HIP_LAUNCH_BLOCKING=1` before the test command to work around instability in the DCU Triton kernels. A long-term fix requires Triton compiler engineers to improve kernel stability at the source.
-Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
+To test on the RULER dataset:
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
+```
+rm -rf ~/.triton ~/.cache/torch /tmp/triton* /tmp/torch*
+export PYTHONPATH=/home/vllm-project/vllm
+export HIP_LAUNCH_BLOCKING=1
+python vllm/tests/kvprune/evaluate/eval_ruler.py \
+  --tensor-parallel-size 4 \
+  --dataset-parquet ruler/4096/test-00000-of-00001.parquet \
+  --dataset-split train \
+  --model Qwen/Qwen3-8B \
+  --compression-method snapkv \
+  --seq-compression-ratio 0.5 \
+  --attention-schedule pdtriton
+```
 ## Contributing
 We welcome and value any contributions and collaborations.
-Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
 ## Citation
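
The `--schedule` help text in the README's `test.py` describes three prefill/decode backend combinations. A minimal sketch of that mapping follows, assuming "FA" stands for FlashAttention. The `Schedule` tuple, the `SCHEDULES` table, and `select_schedule` are hypothetical names used for illustration and are not part of this commit; only the environment variable name and the three schedule names come from the diff.

```python
from __future__ import annotations

import os
from collections.abc import MutableMapping
from typing import NamedTuple


class Schedule(NamedTuple):
    """Hypothetical record of which backend handles each phase."""
    prefill: str
    decode: str


# Transcribed from the --schedule help text; "FA" is assumed to mean FlashAttention.
SCHEDULES = {
    "fa_triton": Schedule(prefill="FlashAttention", decode="Triton"),
    "pdtriton": Schedule(prefill="Triton", decode="Triton"),
    "pdfa": Schedule(prefill="FlashAttention", decode="FlashAttention"),
}

ENV_VAR = "VLLM_KVPRUNE_ATTENTION_SCHEDULE"


def select_schedule(
    name: str | None,
    env: MutableMapping[str, str] | None = None,
) -> Schedule | None:
    """Validate a schedule name and export it, as test.py does before importing vllm."""
    if not name:
        return None  # leave the environment untouched, like the original helper
    if name not in SCHEDULES:
        raise ValueError(f"unknown schedule {name!r}; expected one of {sorted(SCHEDULES)}")
    target = os.environ if env is None else env
    target[ENV_VAR] = name
    return SCHEDULES[name]


if __name__ == "__main__":
    demo_env: dict[str, str] = {}
    print(select_schedule("pdfa", demo_env), demo_env)
```

Passing a plain dict as `env` keeps the sketch testable without touching the real process environment. `test.py` itself writes to `os.environ` before it imports `vllm`.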

--- a/README_vllm.md
+++ b/README_vllm.md
+<!-- markdownlint-disable MD001 MD041 -->
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
+    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
+  </picture>
+</p>
+<h3 align="center">
+Easy, fast, and cheap LLM serving for everyone
+</h3>
+<p align="center">
+| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
+</p>
+🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
+For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
+---
+## About
+vLLM is a fast and easy-to-use library for LLM inference and serving.
+Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
+vLLM is fast with:
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
+- Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
+- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
+- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
+- Speculative decoding
+- Chunked prefill
+vLLM is flexible and easy to use with:
+- Seamless integration with popular Hugging Face models
+- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
+- Tensor, pipeline, data and expert parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
+- Prefix caching support
+- Multi-LoRA support
+vLLM seamlessly supports most popular open-source models on HuggingFace, including:
+- Transformer-like LLMs (e.g., Llama)
+- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
+- Embedding Models (e.g., E5-Mistral)
+- Multi-modal LLMs (e.g., LLaVA)
+Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
+## Getting Started
+Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
+```bash
+pip install vllm
+```
+Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
+- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
+- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
+- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
+## Contributing
+We welcome and value any contributions and collaborations.
+Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
+## Citation
+If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
+```bibtex
+@inproceedings{kwon2023efficient,
+  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
+  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
+  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
+  year={2023}
+}
+```
+## Contact Us
+<!-- --8<-- [start:contact-us] -->
+- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
+- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
+- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
+- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
+- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
+<!-- --8<-- [end:contact-us] -->
+## Media Kit
+- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)
--- a/tests/kvprune/evaluate/eval_ruler.py
+++ b/tests/kvprune/evaluate/eval_ruler.py
@@ -130,7 +130,7 @@ def parse_args() -> argparse.Namespace:
    parser.add_argument(
        "--kvprune-compression",
        action=argparse.BooleanOptionalAction,
-        default=True,
+        default=True, # True
        help="Enable kvprune_compression on LLM (skip v1 CUDA graphs, minimal v1 KV blocks). "
        "Default: True.",
    )
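
The `--kvprune-compression` option above uses `argparse.BooleanOptionalAction` (Python 3.9+), which also generates a `--no-kvprune-compression` switch. A minimal standalone sketch of that behavior is shown here; the shortened help string and the `build_parser` name are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--kvprune-compression",
        action=argparse.BooleanOptionalAction,  # adds --no-kvprune-compression too
        default=True,
        help="Enable kvprune_compression on LLM. Default: True.",
    )
    return parser


parser = build_parser()
# No flag: the default applies.
print(parser.parse_args([]).kvprune_compression)
# Negated form generated by BooleanOptionalAction.
print(parser.parse_args(["--no-kvprune-compression"]).kvprune_compression)
```

With this action, a run can turn the feature off with `--no-kvprune-compression` instead of editing the default in the source.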

--- a/vllm/entrypoints/llm.py
+++ b/vllm/entrypoints/llm.py
@@ -3,6 +3,7 @@
 import itertools
 import os
+from copy import deepcopy
 from collections.abc import Callable, Iterable, Sequence
 from pathlib import Path
 from typing import TYPE_CHECKING, Any
@@ -186,15 +187,13 @@ class LLM:
        enforce_eager: Whether to enforce eager execution. If True, we will
            disable CUDA graph and always execute the model in eager mode.
            If False, we will use CUDA graph and eager execution in hybrid.
-        kvprune_compression: If True, sets ``enforce_eager=True`` for the **v1**
+        kvprune_compression: Compatibility flag for the integrated kvprune path.
-            engine only (no v1 CUDA graph capture). If ``None`` (default), read
+            If ``None`` (default), read ``VLLM_KVPRUNE_COMPRESSION_DEFAULT``.
-            ``VLLM_KVPRUNE_COMPRESSION_DEFAULT`` (``"0"`` = allow v1 graphs;
+            When enabled, requests with ``compression_ratio < 1.0`` automatically
-            ``"1"`` = skip v1 graphs). This is independent of the compactor's
+            rebuild the internal v1 engine into a kvprune-friendly mode
-            ``LLMConfig.enforce_eager`` (see ``VLLM_KVPRUNE_COMPACTOR_CUDA_GRAPH`` /
+            (``enforce_eager=True`` and ``num_gpu_blocks_override=1``). Requests
-            ``VLLM_KVPRUNE_COMPACTOR_ENFORCE_EAGER``; default tries compactor graphs).
+            with ``compression_ratio >= 1.0`` use the caller-provided normal v1
-            When True, v1's GPU KV pool defaults to **one** block (minimum allowed by
+            engine configuration.
-            the scheduler) unless ``num_gpu_blocks_override`` is passed in ``**kwargs``
-            or ``VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS`` is set (``auto`` = profiled allocation).
        enable_return_routed_experts: Whether to return routed experts.
        disable_custom_all_reduce: See
            [ParallelConfig][vllm.config.ParallelConfig].
@@ -351,26 +350,12 @@ class LLM:
                "'examples/offline_inference/data_parallel.py'."
            )
-        # v1 ``enforce_eager`` is independent of kvprune compactor ``LLMConfig.enforce_eager``.
+        # v1 ``enforce_eager`` is independent of kvprune compactor
+        # ``LLMConfig.enforce_eager``. ``kvprune_compression`` enables automatic
+        # switching between normal and compressed v1 engine init modes per request.
        if kvprune_compression is None:
            _kvd = os.environ.get("VLLM_KVPRUNE_COMPRESSION_DEFAULT", "0").strip().lower()
            kvprune_compression = _kvd in ("1", "true", "yes")
-        if kvprune_compression:
-            enforce_eager = True
-            # Reserve minimal v1 GPU KV so compactor can use the rest of VRAM. v1
-            # scheduler requires num_gpu_blocks >= 1; profiling would allocate a
-            # large pool from gpu_memory_utilization. Override:
-            #   VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS unset  -> 1 block (default)
-            #   VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS=auto   -> profiled (no override)
-            #   VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS=<int>   -> max(1, int)
-            if "num_gpu_blocks_override" not in kwargs:
-                _v1_kv = os.environ.get("VLLM_KVPRUNE_V1_NUM_GPU_BLOCKS", "").strip()
-                if _v1_kv.lower() in ("auto", "profile"):
-                    pass
-                elif not _v1_kv:
-                    kwargs["num_gpu_blocks_override"] = 1
-                else:
-                    kwargs["num_gpu_blocks_override"] = max(1, int(_v1_kv))
        engine_args = EngineArgs(
            model=model,
            runner=runner,
@@ -411,13 +396,52 @@ class LLM:
        log_non_default_args(engine_args)
-        self.llm_engine = LLMEngine.from_engine_args(
-            engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
-        )
-        self.engine_class = type(self.llm_engine)
        self.request_counter = Counter()
        self.default_sampling_params: dict[str, Any] | None = None
+        # Cache for __repr__ to avoid repeated collective_rpc calls
+        self._cached_repr: str | None = None
+        # Lazy compactor engine (``vllm.kvprune``) when :meth:`generate` uses compression.
+        self._kvprune_compactor_engine: Any = None
+        self._kvprune_compression_enabled = bool(kvprune_compression)
+        self._engine_args_base = deepcopy(engine_args)
+        self._kvprune_v1_mode = "normal"
+        self.chat_template = load_chat_template(chat_template)
+        self.chat_template_config = ChatTemplateConfig(chat_template=self.chat_template)
+        if not self._kvprune_compression_enabled:
+            self._rebuild_llm_engine_for_kvprune_mode("normal")
+    def _ensure_llm_engine_initialized(self, mode: str = "normal") -> None:
+        if not hasattr(self, "llm_engine"):
+            self._rebuild_llm_engine_for_kvprune_mode(mode)
+    def _shutdown_llm_engine(self) -> None:
+        old_engine = getattr(self, "llm_engine", None)
+        if old_engine is None:
+            return
+        try:
+            old_engine.engine_core.shutdown()
+        except Exception:
+            logger.warning("Failed to shutdown previous LLMEngine cleanly.", exc_info=True)
+        try:
+            dp_group = getattr(old_engine, "dp_group", None)
+            if dp_group is not None and not getattr(old_engine, "external_launcher_dp", False):
+                from vllm.distributed import (
+                    stateless_destroy_torch_distributed_process_group,
+                )
+                stateless_destroy_torch_distributed_process_group(dp_group)
+        except Exception:
+            logger.warning(
+                "Failed to destroy previous LLMEngine DP group cleanly.",
+                exc_info=True,
+            )
+    def _attach_llm_engine(self, llm_engine: LLMEngine) -> None:
+        self.llm_engine = llm_engine
+        self.engine_class = type(self.llm_engine)
+        self.default_sampling_params = None
+        self._cached_repr = None
+        self._kvprune_compactor_engine = None
        supported_tasks = self.llm_engine.get_supported_tasks()
        logger.info("Supported tasks: %s", supported_tasks)
@@ -425,23 +449,44 @@ class LLM:
        self.model_config = self.llm_engine.model_config
        self.renderer = self.llm_engine.renderer
-        self.chat_template = load_chat_template(chat_template)
        self.io_processor = self.llm_engine.io_processor
        self.input_processor = self.llm_engine.input_processor
-        self.chat_template_config = ChatTemplateConfig(chat_template=self.chat_template)
        self.pooling_io_processors = init_pooling_io_processors(
            supported_tasks=supported_tasks,
            model_config=self.model_config,
            renderer=self.renderer,
            chat_template_config=self.chat_template_config,
        )
-        # Cache for __repr__ to avoid repeated collective_rpc calls
-        self._cached_repr: str | None = None
+    def _rebuild_llm_engine_for_kvprune_mode(self, mode: str) -> None:
-        # Lazy compactor engine (``vllm.kvprune``) when :meth:`generate` uses compression.
+        if mode not in ("normal", "compressed"):
-        self._kvprune_compactor_engine: Any = None
+            raise ValueError(f"Unknown kvprune v1 mode: {mode!r}")
-        self._kvprune_compression_enabled = bool(kvprune_compression)
+        if getattr(self, "_kvprune_v1_mode", None) == mode and hasattr(self, "llm_engine"):
+            return
+        engine_args = deepcopy(self._engine_args_base)
+        if mode == "compressed":
+            engine_args.enforce_eager = True
+            engine_args.num_gpu_blocks_override = 1
+        self._shutdown_llm_engine()
+        llm_engine = LLMEngine.from_engine_args(
+            engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
+        )
+        self._attach_llm_engine(llm_engine)
+        self._kvprune_v1_mode = mode
+    def _compression_needs_kvprune(self, compression: Any) -> bool:
+        if compression is None:
+            return False
+        from vllm.kvprune.integration.compression_params import CompressionParams
+        if isinstance(compression, CompressionParams):
+            return compression.compression_ratio < 1.0
+        return any(cp.compression_ratio < 1.0 for cp in compression)
    def get_tokenizer(self) -> TokenizerLike:
+        self._ensure_llm_engine_initialized("normal")
        return self.llm_engine.get_tokenizer()
    def get_world_size(self, include_dp: bool = True) -> int:
@@ -456,16 +501,19 @@ class LLM:
            The world size (tensor_parallel_size * pipeline_parallel_size),
            optionally multiplied by data_parallel_size if include_dp is True.
        """
+        self._ensure_llm_engine_initialized("normal")
        parallel_config = self.llm_engine.vllm_config.parallel_config
        if include_dp:
            return parallel_config.world_size_across_dp
        return parallel_config.world_size
    def reset_mm_cache(self) -> None:
+        self._ensure_llm_engine_initialized("normal")
        self.renderer.clear_mm_cache()
        self.llm_engine.reset_mm_cache()
    def get_default_sampling_params(self) -> SamplingParams:
+        self._ensure_llm_engine_initialized("normal")
        if self.default_sampling_params is None:
            self.default_sampling_params = self.model_config.get_diff_sampling_param()
        if self.default_sampling_params:
@@ -513,16 +561,31 @@ class LLM:
                prompt has ``compression_ratio < 1.0``, the batch is run on the integrated
                compactor engine with weights shared from this ``LLM``. Omit or use all
                ``compression_ratio >= 1`` to use the standard v1 engine only.
-                Use ``kvprune_compression=True`` or ``VLLM_KVPRUNE_COMPRESSION_DEFAULT=1``
+                If ``kvprune_compression=True``, requests with
-                so the v1 engine skips CUDA graph capture. Compactor decode graphs
+                ``compression_ratio < 1.0`` automatically rebuild the internal v1
-                default on (``VLLM_KVPRUNE_COMPACTOR_CUDA_GRAPH`` default ``1``) with
+                engine into eager + 1-block mode before entering kvprune.
-                eager fallback if capture fails; set ``VLLM_KVPRUNE_COMPACTOR_ENFORCE_EAGER=1``
+                Compactor decode graphs default on
-                to skip compactor graph capture entirely.
+                (``VLLM_KVPRUNE_COMPACTOR_CUDA_GRAPH`` default ``1``) with eager
+                fallback if capture fails; set
+                ``VLLM_KVPRUNE_COMPACTOR_ENFORCE_EAGER=1`` to skip compactor graph
+                capture entirely.
        Returns:
            A list of `RequestOutput` objects containing the
            generated completions in the same order as the input prompts.
        """
+        compression_eff = compression
+        if self._kvprune_compression_enabled:
+            target_mode = (
+                "compressed"
+                if self._compression_needs_kvprune(compression_eff)
+                else "normal"
+            )
+            self._rebuild_llm_engine_for_kvprune_mode(target_mode)
+        else:
+            self._ensure_llm_engine_initialized("normal")
        runner_type = self.model_config.runner_type
        if runner_type != "generate":
            raise ValueError(
@@ -530,23 +593,6 @@ class LLM:
                "Try passing `--runner generate` to use the model as a "
                "generative model."
            )
-        compression_eff = compression
-        if compression is None and getattr(self, "_kvprune_compression_enabled", False):
-            pc = self.llm_engine.vllm_config.parallel_config
-            if (
-                pc.tensor_parallel_size > 1
-                and pc.pipeline_parallel_size == 1
-                and pc.data_parallel_size == 1
-            ):
-                from vllm.kvprune.integration.compression_params import CompressionParams
-                from vllm.kvprune.integration.compressed_generate import (
-                    _normalize_prompt_list,
-                )
-                _plist = _normalize_prompt_list(prompts)
-                compression_eff = [
-                    CompressionParams(compression_ratio=1.0) for _ in _plist
-                ]
        if compression_eff is not None:
            from vllm.kvprune.integration.compressed_generate import (
@@ -605,6 +651,7 @@ class LLM:
        Returns:
            A list of request IDs for the enqueued requests.
        """
+        self._ensure_llm_engine_initialized("normal")
        runner_type = self.model_config.runner_type
        if runner_type != "generate":
            raise ValueError("LLM.enqueue() is only supported for generative models.")
@@ -741,6 +788,7 @@ class LLM:
            and set up data-plane communication to pass data.
        """
+        self._ensure_llm_engine_initialized("normal")
        return self.llm_engine.collective_rpc(method, timeout, args, kwargs)
    def apply_model(self, func: Callable[[nn.Module], _R]) -> list[_R]:
@@ -754,6 +802,7 @@ class LLM:
            make sure you move them to CPU first to avoid taking up additional
            VRAM!
        """
+        self._ensure_llm_engine_initialized("normal")
        return self.llm_engine.apply_model(func)
    def beam_search(
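
The per-request mode switching introduced in `generate()` can be sketched without vLLM. The block below is an illustration under stated assumptions, not the commit's implementation: `CompressionParams` is a stand-in dataclass, `FakeLLM` is a hypothetical class that only records which mode it would rebuild into, and no real engine is created or shut down.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CompressionParams:
    """Stand-in for vllm's CompressionParams; only the ratio matters here."""
    compression_ratio: float = 1.0


def needs_kvprune(compression) -> bool:
    """Mirror of _compression_needs_kvprune: any ratio below 1.0 triggers kvprune."""
    if compression is None:
        return False
    if isinstance(compression, CompressionParams):
        return compression.compression_ratio < 1.0
    return any(cp.compression_ratio < 1.0 for cp in compression)


@dataclass
class FakeLLM:
    """Hypothetical model of the rebuild bookkeeping; no engine is involved."""
    kvprune_enabled: bool
    mode: str | None = None
    rebuilds: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # As in the diff, the engine is built eagerly only when kvprune is off.
        if not self.kvprune_enabled:
            self.rebuild("normal")

    def rebuild(self, mode: str) -> None:
        if mode not in ("normal", "compressed"):
            raise ValueError(f"Unknown kvprune v1 mode: {mode!r}")
        if self.mode == mode:
            return  # already in the requested mode, nothing to rebuild
        self.rebuilds.append(mode)  # the real code shuts down and recreates the engine
        self.mode = mode

    def generate(self, compression=None) -> str:
        if self.kvprune_enabled:
            self.rebuild("compressed" if needs_kvprune(compression) else "normal")
        return self.mode


llm = FakeLLM(kvprune_enabled=True)
llm.generate([CompressionParams(0.5)])
llm.generate([CompressionParams(0.5)])  # same mode, no second rebuild
llm.generate(None)                      # switches back to normal
print(llm.rebuilds)
```

In the diff, every change of mode goes through `_shutdown_llm_engine()` followed by `LLMEngine.from_engine_args(...)`. Workloads that alternate between compressed and uncompressed requests therefore pay an engine rebuild on each switch, so grouping requests by mode is likely to be cheaper.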

--- a/vllm/kvprune/integration/compressed_generate.py
+++ b/vllm/kvprune/integration/compressed_generate.py
@@ -377,14 +377,7 @@ def try_compressed_generate(
    comps = _normalize_compression_params(compression, len(plist))
    pc = llm.llm_engine.vllm_config.parallel_config
-    # TP>1: every worker must run the same collective_rpc session. If all
-    # compression_ratio >= 1, the old code returned None and only the driver ran
-    # v1 _run_engine — other ranks never joined a matching collective, which can
-    # deadlock NCCL / leave workers unsynchronized (hang at "Processed prompts:").
-    if pc.tensor_parallel_size > 1:
    if not _should_use_kvprune_compactor_path(comps):
-            comps = [CompressionParams(compression_ratio=1.0) for _ in plist]
-    elif not _should_use_kvprune_compactor_path(comps):
        return None
    v1_eager = bool(
@@ -394,8 +387,8 @@ def try_compressed_generate(
        logger.warning(
            "KV-prune compression: v1 CUDA graphs are still enabled on this LLM. "
            "The compactor does not reuse v1 graphs; capture wastes VRAM. "
-            "Set kvprune_compression=True, enforce_eager=True, or "
+            "Set enforce_eager=True on LLM() if you need to avoid the extra "
-            "VLLM_KVPRUNE_COMPRESSION_DEFAULT=1 before import vllm."
+            "v1 graph capture overhead for compressed generation."
        )
    if pc.tensor_parallel_size > 1:
@@ -449,4 +442,3 @@ def _sequences_to_request_outputs(seqs: list[Any], engine: LLMEngine) -> list[Re
        )
        out.append(ro)
    return out