Remove V0 Encoder-Decoder Support (#24907)

Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>

Commit 759ef49b authored Sep 15, 2025 by Woosuk Kwon; committed by GitHub Sep 15, 2025
20 changed files
--- a/.buildkite/scripts/hardware_ci/run-cpu-test.sh
+++ b/.buildkite/scripts/hardware_ci/run-cpu-test.sh
@@ -66,7 +66,6 @@ function cpu_tests() {

    pytest -x -v -s tests/models/language/pooling -m cpu_model
    pytest -x -v -s tests/models/multimodal/generation \
-                --ignore=tests/models/multimodal/generation/test_mllama.py \
                --ignore=tests/models/multimodal/generation/test_pixtral.py \
                -m cpu_model"


--- a/.buildkite/test-pipeline.yaml
+++ b/.buildkite/test-pipeline.yaml
@@ -549,15 +549,6 @@ steps:
  commands: # LMEval+Transcription WER check
  - pytest -s entrypoints/openai/correctness/

-- label: Encoder Decoder tests # 12min
-  timeout_in_minutes: 20
-  mirror_hardwares: [amdexperimental]
-  source_file_dependencies:
-  - vllm/
-  - tests/encoder_decoder
-  commands:
-    - pytest -v -s encoder_decoder
-
 - label: OpenAI-Compatible Tool Use # 23 min
  timeout_in_minutes: 35
  mirror_hardwares: [amdexperimental]

--- a/docs/contributing/model/multimodal.md
+++ b/docs/contributing/model/multimodal.md
@@ -840,7 +840,6 @@ Some HF processors directly insert feature tokens without replacing anything in
 Examples:

 - BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
-- Florence2 (insert at start of prompt): <gh-file:vllm/model_executor/models/florence2.py>
 - Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>

 ### Handling prompt updates unrelated to multi-modal data

--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
@@ -331,8 +331,6 @@ th {
 | `BailingMoeV2ForCausalLM` | Ling | `inclusionAI/Ling-mini-2.0`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | ✅︎ |
 | `BloomForCausalLM` | BLOOM, BLOOMZ, BLOOMChat | `bigscience/bloom`, `bigscience/bloomz`, etc. | | ✅︎ | ✅︎ |
-| `BartForConditionalGeneration` | BART | `facebook/bart-base`, `facebook/bart-large-cnn`, etc. | | | |
-| `MBartForConditionalGeneration` | mBART | `facebook/mbart-large-en-ro`, `facebook/mbart-large-50`, etc. | | | |
 | `ChatGLMModel`, `ChatGLMForConditionalGeneration` | ChatGLM | `zai-org/chatglm2-6b`, `zai-org/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `CohereForCausalLM`, `Cohere2ForCausalLM` | Command-R, Command-A | `CohereLabs/c4ai-command-r-v01`, `CohereLabs/c4ai-command-r7b-12-2024`, `CohereLabs/c4ai-command-a-03-2025`, `CohereLabs/command-a-reasoning-08-2025`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `DbrxForCausalLM` | DBRX | `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc. | | ✅︎ | ✅︎ |
@@ -426,9 +424,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
 !!! note
    Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.

-!!! note
-    Some mBART models' config files do not have an `architecture` defined. Therefore, you need to use `--hf-overrides '{"architectures": ["MBartForConditionalGeneration"]}'` to explicitly specify the use of the `MBartForConditionalGeneration` architecture.
-
 ### Pooling Models

 See [this page](./pooling_models.md) for more information on how to use pooling models.
@@ -625,9 +620,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `ChameleonForConditionalGeneration` | Chameleon | T + I | `facebook/chameleon-7b`, etc. | | ✅︎ | ✅︎ |
 | `Cohere2VisionForConditionalGeneration` | Command A Vision | T + I<sup>+</sup> | `CohereLabs/command-a-vision-07-2025`, etc. | | ✅︎ | ✅︎ |
 | `DeepseekVLV2ForCausalLM`<sup>^</sup> | DeepSeek-VL2 | T + I<sup>+</sup> | `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2`, etc. | | ✅︎ | ✅︎ |
-| `DonutForConditionalGeneration`<sup>^</sup> | Donut | T + I | `ByteDance/Dolphin`, `naver-clova-ix/donut-base-finetuned-docvqa`, etc. | | | |
 | `Ernie4_5_VLMoeForConditionalGeneration` | Ernie4.5-VL | T + I<sup>+</sup>/ V<sup>+</sup> | `baidu/ERNIE-4.5-VL-28B-A3B-PT`, `baidu/ERNIE-4.5-VL-424B-A47B-PT` | | ✅︎ | ✅︎ |
-| `Florence2ForConditionalGeneration` | Florence-2 | T + I | `microsoft/Florence-2-base`, `microsoft/Florence-2-large`, etc. | | | |
 | `FuyuForCausalLM` | Fuyu | T + I | `adept/fuyu-8b`, etc. | | ✅︎ | ✅︎ |
 | `Gemma3ForConditionalGeneration` | Gemma 3 | T + I<sup>+</sup> | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ | ⚠️ |
 | `Gemma3nForConditionalGeneration` | Gemma 3n | T + I + A | `google/gemma-3n-E2B-it`, `google/gemma-3n-E4B-it`, etc. | | | ✅︎ |
@@ -654,7 +647,6 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `MiniCPMV` | MiniCPM-V | T + I<sup>E+</sup> + V<sup>E+</sup> | `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, `openbmb/MiniCPM-V-4`, `openbmb/MiniCPM-V-4_5`, etc. | ✅︎ | | ✅︎ |
 | `MiniMaxVL01ForConditionalGeneration` | MiniMax-VL | T + I<sup>E+</sup> | `MiniMaxAI/MiniMax-VL-01`, etc. | | ✅︎ | ✅︎ |
 | `Mistral3ForConditionalGeneration` | Mistral3 (HF Transformers) | T + I<sup>+</sup> | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `MllamaForConditionalGeneration` | Llama 3.2 | T + I<sup>+</sup> | `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc. | | | |
 | `MolmoForCausalLM` | Molmo | T + I<sup>+</sup> | `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `NVLM_D_Model` | NVLM-D 1.0 | T + I<sup>+</sup> | `nvidia/NVLM-D-72B`, etc. | | ✅︎ | ✅︎ |
 | `Ovis` | Ovis2, Ovis1.6 | T + I<sup>+</sup> | `AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc. | | ✅︎ | ✅︎ |

--- a/docs/usage/v1_guide.md
+++ b/docs/usage/v1_guide.md
@@ -120,7 +120,7 @@ Please note that prefix caching is not yet supported for any of the above models

 Whisper is supported. Other models requiring cross-attention between separate
 encoder and decoder (e.g., `BartForConditionalGeneration`,
-`MllamaForConditionalGeneration`) are not yet supported.
+`MllamaForConditionalGeneration`) are not supported.

 ### Features


--- a/examples/offline_inference/dolphin.py
+++ b/examples/offline_inference/dolphin.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-import argparse
-import copy
-import os
-from dataclasses import dataclass
-
-import cv2
-import numpy as np
-import regex as re
-from PIL import Image
-from transformers import DonutProcessor
-
-from vllm import LLM, SamplingParams
-from vllm.inputs import ExplicitEncoderDecoderPrompt, TextPrompt, TokensPrompt
-from vllm.multimodal.utils import fetch_image
-
-
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-@dataclass
-class ImageDimensions:
-    original_w: int
-    original_h: int
-    padded_w: int
-    padded_h: int
-
-
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def map_to_original_coordinates(
-    x1, y1, x2, y2, dims: ImageDimensions
-) -> tuple[int, int, int, int]:
-    try:
-        top = (dims.padded_h - dims.original_h) // 2
-        left = (dims.padded_w - dims.original_w) // 2
-        orig_x1 = max(0, x1 - left)
-        orig_y1 = max(0, y1 - top)
-        orig_x2 = min(dims.original_w, x2 - left)
-        orig_y2 = min(dims.original_h, y2 - top)
-        if orig_x2 <= orig_x1:
-            orig_x2 = min(orig_x1 + 1, dims.original_w)
-        if orig_y2 <= orig_y1:
-            orig_y2 = min(orig_y1 + 1, dims.original_h)
-        return int(orig_x1), int(orig_y1), int(orig_x2), int(orig_y2)
-    except Exception as e:
-        print(f"map_to_original_coordinates error: {str(e)}")
-        return 0, 0, min(100, dims.original_w), min(100, dims.original_h)
-
-
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def adjust_box_edges(image, boxes: list[list[float]], max_pixels=15, threshold=0.2):
-    if isinstance(image, str):
-        image = cv2.imread(image)
-    img_h, img_w = image.shape[:2]
-    new_boxes = []
-    for box in boxes:
-        best_box = copy.deepcopy(box)
-
-        def check_edge(img, current_box, i, is_vertical):
-            edge = current_box[i]
-            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
-            _, binary = cv2.threshold(
-                gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
-            )
-            if is_vertical:
-                line = binary[current_box[1] : current_box[3] + 1, edge]
-            else:
-                line = binary[edge, current_box[0] : current_box[2] + 1]
-            transitions = np.abs(np.diff(line))
-            return np.sum(transitions) / len(transitions)
-
-        edges = [(0, -1, True), (2, 1, True), (1, -1, False), (3, 1, False)]
-        current_box = copy.deepcopy(box)
-        current_box[0] = min(max(current_box[0], 0), img_w - 1)
-        current_box[1] = min(max(current_box[1], 0), img_h - 1)
-        current_box[2] = min(max(current_box[2], 0), img_w - 1)
-        current_box[3] = min(max(current_box[3], 0), img_h - 1)
-
-        for i, direction, is_vertical in edges:
-            best_score = check_edge(image, current_box, i, is_vertical)
-            if best_score <= threshold:
-                continue
-            for step in range(max_pixels):
-                current_box[i] += direction
-                if i == 0 or i == 2:
-                    current_box[i] = min(max(current_box[i], 0), img_w - 1)
-                else:
-                    current_box[i] = min(max(current_box[i], 0), img_h - 1)
-                score = check_edge(image, current_box, i, is_vertical)
-                if score < best_score:
-                    best_score = score
-                    best_box = copy.deepcopy(current_box)
-                if score <= threshold:
-                    break
-        new_boxes.append(best_box)
-    return new_boxes
-
-
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def process_coordinates(coords, padded_image, dims: ImageDimensions, previous_box=None):
-    try:
-        x1, y1 = int(coords[0] * dims.padded_w), int(coords[1] * dims.padded_h)
-        x2, y2 = int(coords[2] * dims.padded_w), int(coords[3] * dims.padded_h)
-        x1, y1, x2, y2 = (
-            max(0, min(x1, dims.padded_w - 1)),
-            max(0, min(y1, dims.padded_h - 1)),
-            max(0, min(x2, dims.padded_w)),
-            max(0, min(y2, dims.padded_h)),
-        )
-        if x2 <= x1:
-            x2 = min(x1 + 1, dims.padded_w)
-        if y2 <= y1:
-            y2 = min(y1 + 1, dims.padded_h)
-        new_boxes = adjust_box_edges(padded_image, [[x1, y1, x2, y2]])
-        x1, y1, x2, y2 = new_boxes[0]
-        x1, y1, x2, y2 = (
-            max(0, min(x1, dims.padded_w - 1)),
-            max(0, min(y1, dims.padded_h - 1)),
-            max(0, min(x2, dims.padded_w)),
-            max(0, min(y2, dims.padded_h)),
-        )
-        if x2 <= x1:
-            x2 = min(x1 + 1, dims.padded_w)
-        if y2 <= y1:
-            y2 = min(y1 + 1, dims.padded_h)
-        if previous_box is not None:
-            prev_x1, prev_y1, prev_x2, prev_y2 = previous_box
-            if (x1 < prev_x2 and x2 > prev_x1) and (y1 < prev_y2 and y2 > prev_y1):
-                y1 = prev_y2
-                y1 = min(y1, dims.padded_h - 1)
-                if y2 <= y1:
-                    y2 = min(y1 + 1, dims.padded_h)
-        new_previous_box = [x1, y1, x2, y2]
-        orig_x1, orig_y1, orig_x2, orig_y2 = map_to_original_coordinates(
-            x1, y1, x2, y2, dims
-        )
-        return x1, y1, x2, y2, orig_x1, orig_y1, orig_x2, orig_y2, new_previous_box
-    except Exception as e:
-        print(f"process_coordinates error: {str(e)}")
-        orig_x1, orig_y1, orig_x2, orig_y2 = (
-            0,
-            0,
-            min(100, dims.original_w),
-            min(100, dims.original_h),
-        )
-        return 0, 0, 100, 100, orig_x1, orig_y1, orig_x2, orig_y2, [0, 0, 100, 100]
-
-
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def prepare_image(image) -> tuple[np.ndarray, ImageDimensions]:
-    try:
-        image_cv = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
-        original_h, original_w = image_cv.shape[:2]
-        max_size = max(original_h, original_w)
-        top = (max_size - original_h) // 2
-        bottom = max_size - original_h - top
-        left = (max_size - original_w) // 2
-        right = max_size - original_w - left
-        padded_image = cv2.copyMakeBorder(
-            image_cv, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(0, 0, 0)
-        )
-        padded_h, padded_w = padded_image.shape[:2]
-        dimensions = ImageDimensions(
-            original_w=original_w,
-            original_h=original_h,
-            padded_w=padded_w,
-            padded_h=padded_h,
-        )
-        return padded_image, dimensions
-    except Exception as e:
-        print(f"prepare_image error: {str(e)}")
-        h, w = image.height, image.width
-        dimensions = ImageDimensions(original_w=w, original_h=h, padded_w=w, padded_h=h)
-        return np.zeros((h, w, 3), dtype=np.uint8), dimensions
-
-
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def parse_layout_string(bbox_str):
-    """Parse layout string using regular expressions"""
-    pattern = r"\[(\d*\.?\d+),\s*(\d*\.?\d+),\s*(\d*\.?\d+),\s*(\d*\.?\d+)\]\s*(\w+)"
-    matches = re.finditer(pattern, bbox_str)
-
-    parsed_results = []
-    for match in matches:
-        coords = [float(match.group(i)) for i in range(1, 5)]
-        label = match.group(5).strip()
-        parsed_results.append((coords, label))
-
-    return parsed_results
-
-
-model_id = "ByteDance/Dolphin"
-
-# The input image size for Dolphin is 896 x 896,
-# and the patch_size is 4 x 4.
-# Therefore, the initial number of patches is:
-# Height: 896 / 4 = 224 patches
-# Width: 896 / 4 = 224 patches
-
-# The Dolphin model uses a staged downsampling approach,
-# defined by the "depths": [2, 2, 14, 2] configuration.
-# Before entering stages 2, 3, and 4, a "Patch Merging" operation is performed,
-# which halves the feature map's dimensions (dividing both height and width by 2).
-# Before Stage 2: The size changes from 224 x 224 to (224/2) x (224/2) = 112 x 112.
-# Before Stage 3: The size changes from 112 x 112 to (112/2) x (112/2) = 56 x 56.
-# Before Stage 4: The size changes from 56 x 56 to (56/2) x (56/2) = 28 x 28.
-
-# Because vLLM needs to fill the image features with an encoder_prompt,
-# and the encoder_prompt will have `<pad>` tokens added when tokenized,
-# we need to construct an encoder_prompt with a length of 28 x 28 - 1 = 783.
-encoder_prompt = "".join(["0"] * 783)
-sampling_params = SamplingParams(
-    temperature=0.0,
-    max_tokens=2048,
-)
-
-processor = DonutProcessor.from_pretrained(model_id)
-llm = LLM(
-    model=model_id,
-    dtype="float16",
-    max_num_seqs=8,
-    hf_overrides={"architectures": ["DonutForConditionalGeneration"]},
-)
-
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--image_path", type=str, default=None, help="Path to a local image file."
-)
-args = parser.parse_args()
-
-if args.image_path:
-    if not os.path.exists(args.image_path):
-        raise FileNotFoundError(f"Error: File not found at {args.image_path}")
-    image = Image.open(args.image_path).convert("RGB")
-else:
-    image = fetch_image(
-        "https://huggingface.co/datasets/hf-internal-testing/example-documents/resolve/main/jpeg_images/0.jpg"
-    )
-
-
-prompt = "Parse the reading order of this document. "
-decoder_prompt = f"<s>{prompt}<Answer/>"
-decoder_prompt_tokens = TokensPrompt(
-    prompt_token_ids=processor.tokenizer(decoder_prompt, add_special_tokens=False)[
-        "input_ids"
-    ]
-)
-enc_dec_prompt = ExplicitEncoderDecoderPrompt(
-    encoder_prompt=TextPrompt(prompt=encoder_prompt, multi_modal_data={"image": image}),
-    decoder_prompt=decoder_prompt_tokens,
-)
-layout_outputs = llm.generate(prompts=enc_dec_prompt, sampling_params=sampling_params)
-layout_result_str = layout_outputs[0].outputs[0].text
-print(f"Layout analysis output:\n{layout_result_str}")
-
-padded_image, dims = prepare_image(image)
-layout_results = parse_layout_string(layout_result_str)
-text_table_elements = []
-previous_box = None
-reading_order = 0
-for bbox_coords, label in layout_results:
-    if label == "fig":
-        continue
-    try:
-        x1, y1, x2, y2, orig_x1, orig_y1, orig_x2, orig_y2, previous_box = (
-            process_coordinates(bbox_coords, padded_image, dims, previous_box)
-        )
-        cropped = padded_image[y1:y2, x1:x2]
-        if cropped.size > 0 and cropped.shape[0] > 3 and cropped.shape[1] > 3:
-            pil_crop = Image.fromarray(cv2.cvtColor(cropped, cv2.COLOR_BGR2RGB))
-            prompt_ocr = (
-                "Parse the table in the image. "
-                if label == "tab"
-                else "Read text in the image. "
-            )
-            text_table_elements.append(
-                {
-                    "crop": pil_crop,
-                    "prompt": prompt_ocr,
-                    "reading_order": reading_order,
-                }
-            )
-        reading_order += 1
-    except Exception as e:
-        print(f"Error processing bbox (label: {label}): {str(e)}")
-        continue
-
-if text_table_elements:
-    batch_prompts = []
-    for elem in text_table_elements:
-        decoder_prompt_str = f"<s>{elem['prompt']}<Answer/>"
-        decoder_prompt_tokens = TokensPrompt(
-            prompt_token_ids=processor.tokenizer(
-                decoder_prompt_str, add_special_tokens=False
-            )["input_ids"]
-        )
-        enc_dec_prompt = ExplicitEncoderDecoderPrompt(
-            encoder_prompt=TextPrompt(
-                prompt=encoder_prompt, multi_modal_data={"image": elem["crop"]}
-            ),
-            decoder_prompt=decoder_prompt_tokens,
-        )
-        batch_prompts.append(enc_dec_prompt)
-    batch_outputs = llm.generate(prompts=batch_prompts, sampling_params=sampling_params)
-    for i, output in enumerate(batch_outputs):
-        text_table_elements[i]["text"] = output.outputs[0].text.strip()
-
-print("------" * 8)
-text_table_elements.sort(key=lambda x: x["reading_order"])
-for elem in text_table_elements:
-    print(elem.get("text", ""))
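
Note on the removed `dolphin.py` comments: they derive the encoder prompt length by hand (896 / 4 = 224 patches per side, three patch-merging steps down to 28 x 28, then minus one for the token added during tokenization). A minimal sketch of that arithmetic follows. The helper name `encoder_prompt_length` and its parameters are hypothetical and not part of vLLM; the sketch assumes only what the comments state, namely one halving before each of stages 2, 3, and 4 and one reserved token.

```python
def encoder_prompt_length(
    height: int,
    width: int,
    patch_size: int = 4,
    num_merges: int = 3,
    reserved_tokens: int = 1,
) -> int:
    """Hypothetical helper reproducing the arithmetic in the removed comments."""
    # Initial patch grid: image size divided by the patch size.
    rows = height // patch_size
    cols = width // patch_size
    # Each patch-merging step halves both dimensions of the feature map.
    for _ in range(num_merges):
        rows //= 2
        cols //= 2
    # Leave room for the token added when the encoder prompt is tokenized.
    return rows * cols - reserved_tokens


print(encoder_prompt_length(896, 896))    # Dolphin input size -> 783
print(encoder_prompt_length(1920, 2560))  # donut-base-finetuned-docvqa -> 4799
```

Both values match the lengths hard-coded in the removed examples (`783` in `dolphin.py`, `4799` in `run_donut`).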
--- a/examples/offline_inference/encoder_decoder.py
+++ b/examples/offline_inference/encoder_decoder.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-Demonstrate prompting of text-to-text
-encoder/decoder models, specifically BART and mBART.
-
-This script is refactored to allow model selection via command-line arguments.
-
-NOTE: This example is not yet supported in V1.
-"""
-
-import argparse
-from typing import NamedTuple, Optional
-
-from vllm import LLM, SamplingParams
-from vllm.inputs import (
-    ExplicitEncoderDecoderPrompt,
-    TextPrompt,
-    TokensPrompt,
-    zip_enc_dec_prompts,
-)
-
-
-class ModelRequestData(NamedTuple):
-    """
-    Holds the configuration for a specific model, including its
-    HuggingFace ID and the prompts to use for the demo.
-    """
-
-    model_id: str
-    encoder_prompts: list
-    decoder_prompts: list
-    hf_overrides: Optional[dict] = None
-
-
-def get_bart_config() -> ModelRequestData:
-    """
-    Returns the configuration for facebook/bart-large-cnn.
-    This uses the exact test cases from the original script.
-    """
-    encoder_prompts = [
-        "Hello, my name is",
-        "The president of the United States is",
-        "The capital of France is",
-        "An encoder prompt",
-    ]
-    decoder_prompts = [
-        "A decoder prompt",
-        "Another decoder prompt",
-    ]
-    return ModelRequestData(
-        model_id="facebook/bart-large-cnn",
-        encoder_prompts=encoder_prompts,
-        decoder_prompts=decoder_prompts,
-    )
-
-
-def get_mbart_config() -> ModelRequestData:
-    """
-    Returns the configuration for facebook/mbart-large-en-ro.
-    This uses prompts suitable for an English-to-Romanian translation task.
-    """
-    encoder_prompts = [
-        "The quick brown fox jumps over the lazy dog.",
-        "How are you today?",
-    ]
-    decoder_prompts = ["", ""]
-    hf_overrides = {"architectures": ["MBartForConditionalGeneration"]}
-    return ModelRequestData(
-        model_id="facebook/mbart-large-en-ro",
-        encoder_prompts=encoder_prompts,
-        decoder_prompts=decoder_prompts,
-        hf_overrides=hf_overrides,
-    )
-
-
-MODEL_GETTERS = {
-    "bart": get_bart_config,
-    "mbart": get_mbart_config,
-}
-
-
-def create_all_prompt_types(
-    encoder_prompts_raw: list,
-    decoder_prompts_raw: list,
-    tokenizer,
-) -> list:
-    """
-    Generates a list of diverse prompt types for demonstration.
-    This function is generic and uses the provided raw prompts
-    to create various vLLM input objects.
-    """
-    text_prompt_raw = encoder_prompts_raw[0]
-    text_prompt = TextPrompt(prompt=encoder_prompts_raw[1 % len(encoder_prompts_raw)])
-    tokens_prompt = TokensPrompt(
-        prompt_token_ids=tokenizer.encode(
-            encoder_prompts_raw[2 % len(encoder_prompts_raw)]
-        )
-    )
-
-    decoder_tokens_prompt = TokensPrompt(
-        prompt_token_ids=tokenizer.encode(decoder_prompts_raw[0])
-    )
-    single_prompt_examples = [
-        text_prompt_raw,
-        text_prompt,
-        tokens_prompt,
-    ]
-    explicit_pair_examples = [
-        ExplicitEncoderDecoderPrompt(
-            encoder_prompt=text_prompt_raw,
-            decoder_prompt=decoder_tokens_prompt,
-        ),
-        ExplicitEncoderDecoderPrompt(
-            encoder_prompt=text_prompt,
-            decoder_prompt=decoder_prompts_raw[1 % len(decoder_prompts_raw)],
-        ),
-        ExplicitEncoderDecoderPrompt(
-            encoder_prompt=tokens_prompt,
-            decoder_prompt=text_prompt,
-        ),
-    ]
-    zipped_prompt_list = zip_enc_dec_prompts(
-        encoder_prompts_raw,
-        decoder_prompts_raw,
-    )
-    return single_prompt_examples + explicit_pair_examples + zipped_prompt_list
-
-
-def create_sampling_params() -> SamplingParams:
-    """Create a sampling params object."""
-    return SamplingParams(
-        temperature=0,
-        top_p=1.0,
-        min_tokens=0,
-        max_tokens=30,
-    )
-
-
-def print_outputs(outputs: list):
-    """Formats and prints the generation outputs."""
-    print("-" * 80)
-    for i, output in enumerate(outputs):
-        prompt = output.prompt
-        encoder_prompt = output.encoder_prompt
-        generated_text = output.outputs[0].text
-        print(f"Output {i + 1}:")
-        print(f"Encoder Prompt: {encoder_prompt!r}")
-        print(f"Decoder Prompt: {prompt!r}")
-        print(f"Generated Text: {generated_text!r}")
-        print("-" * 80)
-
-
-def main(args):
-    """Main execution function."""
-    model_key = args.model
-    if model_key not in MODEL_GETTERS:
-        raise ValueError(
-            f"Unknown model: {model_key}. "
-            f"Available models: {list(MODEL_GETTERS.keys())}"
-        )
-    config_getter = MODEL_GETTERS[model_key]
-    model_config = config_getter()
-
-    print(f"🚀 Running demo for model: {model_config.model_id}")
-    llm = LLM(
-        model=model_config.model_id,
-        dtype="float",
-        hf_overrides=model_config.hf_overrides,
-    )
-    tokenizer = llm.llm_engine.get_tokenizer_group()
-    prompts = create_all_prompt_types(
-        encoder_prompts_raw=model_config.encoder_prompts,
-        decoder_prompts_raw=model_config.decoder_prompts,
-        tokenizer=tokenizer,
-    )
-    sampling_params = create_sampling_params()
-    outputs = llm.generate(prompts, sampling_params)
-    print_outputs(outputs)
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description="A flexible demo for vLLM encoder-decoder models."
-    )
-    parser.add_argument(
-        "--model",
-        "-m",
-        type=str,
-        default="bart",
-        choices=MODEL_GETTERS.keys(),
-        help="The short name of the model to run.",
-    )
-    args = parser.parse_args()
-    main(args)
--- a/examples/offline_inference/encoder_decoder_multimodal.py
+++ b/examples/offline_inference/encoder_decoder_multimodal.py
@@ -13,8 +13,6 @@ from typing import NamedTuple

 from vllm import LLM, EngineArgs, PromptType, SamplingParams
 from vllm.assets.audio import AudioAsset
-from vllm.assets.image import ImageAsset
-from vllm.multimodal.utils import fetch_image
 from vllm.utils import FlexibleArgumentParser


@@ -23,113 +21,6 @@ class ModelRequestData(NamedTuple):
    prompts: Sequence[PromptType]


-def run_donut():
-    engine_args = EngineArgs(
-        model="naver-clova-ix/donut-base-finetuned-docvqa",
-        max_num_seqs=2,
-        limit_mm_per_prompt={"image": 1},
-        dtype="float16",
-        hf_overrides={"architectures": ["DonutForConditionalGeneration"]},
-    )
-
-    # The input image size for donut-base-finetuned-docvqa is 2560 x 1920,
-    # and the patch_size is 4 x 4.
-    # Therefore, the initial number of patches is:
-    # Height: 1920 / 4 = 480 patches
-    # Width: 2560 / 4 = 640 patches
-    # The Swin model uses a staged downsampling approach,
-    # defined by the "depths": [2, 2, 14, 2] configuration.
-    # Before entering stages 2, 3, and 4, a "Patch Merging" operation is performed,
-    # which halves the feature map's dimensions (dividing both height and width by 2).
-    # Before Stage 2: The size changes from 480 x 640 to (480/2) x (640/2) = 240 x 320.
-    # Before Stage 3: The size changes from 240 x 320 to (240/2) x (320/2) = 120 x 160.
-    # Before Stage 4: The size changes from 120 x 160 to (120/2) x (160/2) = 60 x 80.
-    # Because vLLM needs to fill the image features with an encoder_prompt,
-    # and the encoder_prompt will have `<pad>` tokens added when tokenized,
-    # we need to construct an encoder_prompt with a length of 60 x 80 - 1 = 4799.
-    prompts = [
-        {
-            "encoder_prompt": {
-                "prompt": "".join(["$"] * 4799),
-                "multi_modal_data": {
-                    "image": fetch_image(
-                        "https://huggingface.co/datasets/hf-internal-testing/example-documents/resolve/main/jpeg_images/0.jpg"
-                    )  # noqa: E501
-                },
-            },
-            "decoder_prompt": "<s_docvqa><s_question>What time is the coffee break?</s_question><s_answer>",  # noqa: E501
-        },
-    ]
-
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
-
-
-def run_florence2():
-    engine_args = EngineArgs(
-        model="microsoft/Florence-2-large",
-        tokenizer="Isotr0py/Florence-2-tokenizer",
-        max_num_seqs=8,
-        trust_remote_code=True,
-        limit_mm_per_prompt={"image": 1},
-        dtype="half",
-    )
-
-    prompts = [
-        {  # implicit prompt with task token
-            "prompt": "<DETAILED_CAPTION>",
-            "multi_modal_data": {"image": ImageAsset("stop_sign").pil_image},
-        },
-        {  # explicit encoder/decoder prompt
-            "encoder_prompt": {
-                "prompt": "Describe in detail what is shown in the image.",
-                "multi_modal_data": {"image": ImageAsset("cherry_blossom").pil_image},
-            },
-            "decoder_prompt": "",
-        },
-    ]
-
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
-
-
-def run_mllama():
-    engine_args = EngineArgs(
-        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
-        max_model_len=8192,
-        max_num_seqs=2,
-        limit_mm_per_prompt={"image": 1},
-        dtype="half",
-    )
-
-    prompts = [
-        {  # Implicit prompt
-            "prompt": "<|image|><|begin_of_text|>What is the content of this image?",  # noqa: E501
-            "multi_modal_data": {
-                "image": ImageAsset("stop_sign").pil_image,
-            },
-        },
-        {  # Explicit prompt
-            "encoder_prompt": {
-                "prompt": "<|image|>",
-                "multi_modal_data": {
-                    "image": ImageAsset("stop_sign").pil_image,
-                },
-            },
-            "decoder_prompt": "<|image|><|begin_of_text|>Please describe the image.",  # noqa: E501
-        },
-    ]
-
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
-
-
 def run_whisper():
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

@@ -166,9 +57,6 @@ def run_whisper():


 model_example_map = {
-    "donut": run_donut,
-    "florence2": run_florence2,
-    "mllama": run_mllama,
    "whisper": run_whisper,
 }

@@ -182,7 +70,7 @@ def parse_args():
        "--model-type",
        "-m",
        type=str,
-        default="mllama",
+        default="whisper",
        choices=model_example_map.keys(),
        help='Huggingface "model_type".',
    )

--- a/examples/offline_inference/vision_language.py
+++ b/examples/offline_inference/vision_language.py
@@ -204,28 +204,6 @@ def run_ernie45_vl(questions: list[str], modality: str) -> ModelRequestData:
    )


-# Florence2
-def run_florence2(questions: list[str], modality: str) -> ModelRequestData:
-    assert modality == "image"
-
-    engine_args = EngineArgs(
-        model="microsoft/Florence-2-large",
-        tokenizer="Isotr0py/Florence-2-tokenizer",
-        max_model_len=4096,
-        max_num_seqs=2,
-        trust_remote_code=True,
-        dtype="bfloat16",
-        limit_mm_per_prompt={modality: 1},
-    )
-
-    prompts = ["<MORE_DETAILED_CAPTION>" for _ in questions]
-
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
-
-
 # Fuyu
 def run_fuyu(questions: list[str], modality: str) -> ModelRequestData:
    assert modality == "image"
@@ -1008,44 +986,6 @@ def run_mistral3(questions: list[str], modality: str) -> ModelRequestData:
    )


-# LLama 3.2
-def run_mllama(questions: list[str], modality: str) -> ModelRequestData:
-    assert modality == "image"
-
-    model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
-
-    # Note: The default setting of max_num_seqs (256) and
-    # max_model_len (131072) for this model may cause OOM.
-    # You may lower either to run this example on lower-end GPUs.
-
-    # The configuration below has been confirmed to launch on a single L40 GPU.
-    engine_args = EngineArgs(
-        model=model_name,
-        max_model_len=8192,
-        max_num_seqs=2,
-        limit_mm_per_prompt={modality: 1},
-    )
-
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-    messages = [
-        [
-            {
-                "role": "user",
-                "content": [{"type": "image"}, {"type": "text", "text": question}],
-            }
-        ]
-        for question in questions
-    ]
-    prompts = tokenizer.apply_chat_template(
-        messages, add_generation_prompt=True, tokenize=False
-    )
-
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
-
-
 # Molmo
 def run_molmo(questions: list[str], modality: str) -> ModelRequestData:
    assert modality == "image"
@@ -1665,7 +1605,6 @@ model_example_map = {
    "command_a_vision": run_command_a_vision,
    "deepseek_vl_v2": run_deepseek_vl2,
    "ernie45_vl": run_ernie45_vl,
-    "florence2": run_florence2,
    "fuyu": run_fuyu,
    "gemma3": run_gemma3,
    "gemma3n": run_gemma3n,
@@ -1691,7 +1630,6 @@ model_example_map = {
    "minicpmv": run_minicpmv,
    "minimax_vl_01": run_minimax_vl_01,
    "mistral3": run_mistral3,
-    "mllama": run_mllama,
    "molmo": run_molmo,
    "nemotron_vl": run_nemotron_vl,
    "NVLM_D": run_nvlm_d,

--- a/examples/offline_inference/vision_language_multi_image.py
+++ b/examples/offline_inference/vision_language_multi_image.py
@@ -637,26 +637,6 @@ def load_mistral3(question: str, image_urls: list[str]) -> ModelRequestData:
    )


-def load_mllama(question: str, image_urls: list[str]) -> ModelRequestData:
-    model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
-
-    # The configuration below has been confirmed to launch on a single L40 GPU.
-    engine_args = EngineArgs(
-        model=model_name,
-        max_model_len=8192,
-        max_num_seqs=2,
-        limit_mm_per_prompt={"image": len(image_urls)},
-    )
-
-    img_prompt = "Given the first image <|image|> and the second image<|image|>"
-    prompt = f"<|begin_of_text|>{img_prompt}, {question}?"
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompt=prompt,
-        image_data=[fetch_image(url) for url in image_urls],
-    )
-
-
 def load_nvlm_d(question: str, image_urls: list[str]) -> ModelRequestData:
    model_name = "nvidia/NVLM-D-72B"

@@ -1253,7 +1233,6 @@ model_example_map = {
    "llava-next": load_llava_next,
    "llava-onevision": load_llava_onevision,
    "mistral3": load_mistral3,
-    "mllama": load_mllama,
    "NVLM_D": load_nvlm_d,
    "ovis": load_ovis,
    "ovis2_5": load_ovis2_5,

--- a/tests/core/block/test_block_manager.py
+++ b/tests/core/block/test_block_manager.py
@@ -3,15 +3,12 @@

 import pytest

-from vllm.core.block.utils import (STR_NOT_IMPL_ENC_DEC_PREFIX_CACHE,
-                                   STR_NOT_IMPL_ENC_DEC_SWA)
 from vllm.core.block_manager import SelfAttnBlockSpaceManager
 from vllm.core.interfaces import AllocStatus
 from vllm.sequence import Logprob, SequenceStatus
 from vllm.utils import chunk_list

-from ..utils import (create_dummy_prompt, create_seq_group,
-                     create_seq_group_encoder_decoder)
+from ..utils import create_dummy_prompt, create_seq_group


 @pytest.mark.parametrize("block_size", [16])
@@ -58,156 +55,6 @@ def test_can_allocate_seq_group(block_size: int, num_seqs_per_group: int,
            assert can_allocate_result == AllocStatus.LATER


-@pytest.mark.parametrize("block_size", [16])
-@pytest.mark.parametrize("num_gpu_blocks", [16, 80, 160])
-@pytest.mark.parametrize("num_seqs_per_group", [1, 4])
-@pytest.mark.parametrize("watermark", [0.0, 0.5])
-def test_can_allocate_seq_group_encoder_decoder(block_size: int,
-                                                num_seqs_per_group: int,
-                                                num_gpu_blocks: int,
-                                                watermark: float):
-    block_manager = SelfAttnBlockSpaceManager(
-        block_size=block_size,
-        num_gpu_blocks=num_gpu_blocks,
-        num_cpu_blocks=1024,
-        watermark=watermark,
-    )
-    num_watermark_blocks = int(watermark * num_gpu_blocks)
-
-    num_output_blocks_per_seq = 1
-
-    # NOTE: This should be num_output_blocks_per_seq * num_seqs_per_group, but
-    # the current implementation assumes all seqs are new prompts / don't have
-    # different output lens.
-    num_output_blocks = num_output_blocks_per_seq
-
-    for bdx, num_prompt_blocks in enumerate(
-            range(1, num_gpu_blocks - num_output_blocks)):
-        num_cross_blocks_per_seq = num_prompt_blocks
-
-        seq_group = create_seq_group_encoder_decoder(
-            seq_prompt_len=block_size * num_prompt_blocks,
-            seq_output_lens=[
-                block_size * num_output_blocks_per_seq
-                for _ in range(num_seqs_per_group)
-            ],
-            request_id=str(bdx))
-
-        assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks
-
-        can_allocate_result = block_manager.can_allocate(seq_group)
-
-        num_required_blocks = num_prompt_blocks + \
-                              num_output_blocks + \
-                              num_cross_blocks_per_seq
-
-        if num_gpu_blocks - num_required_blocks < num_watermark_blocks:
-            assert can_allocate_result == AllocStatus.NEVER
-        elif num_gpu_blocks >= num_required_blocks:
-            assert can_allocate_result == AllocStatus.OK
-        else:
-            assert can_allocate_result == AllocStatus.LATER
-
-
-@pytest.mark.parametrize("block_size", [16])
-@pytest.mark.parametrize("num_gpu_blocks", [16])
-@pytest.mark.parametrize("num_seqs_per_group", [1])
-@pytest.mark.parametrize("watermark", [0.0, 0.5])
-def test_can_allocate_encoder_decoder_fails_with_swa(block_size: int,
-                                                     num_seqs_per_group: int,
-                                                     num_gpu_blocks: int,
-                                                     watermark: float):
-    '''
-    SWA short for Sliding Window Attention.
-
-    At time of writing block manager does not support SWA.
-
-    However even when SWA is implemented for block manager,
-    there will still most likely be a separate workstream required
-    to enable SWA for encoder/decoder models.
-
-    Therefore this test enforces that one of the following cases
-    hold true:
-    1. Block manager does not support SWA at all (true at time of writing)
-    2. Block manager fails with NotImplementError when SWA is enabled
-       AND a SequenceGroup with an encoder sequence (i.e. in support of an
-       encoder/decoder model) is passed into can_allocate() as an argument
-
-    The setup for this test is stripped down version of
-    test_can_allocate_seq_group_encoder_decoder()
-    '''
-
-    with pytest.raises((NotImplementedError, AssertionError)) as exc_info:
-        block_manager = SelfAttnBlockSpaceManager(
-            block_size=block_size,
-            num_gpu_blocks=num_gpu_blocks,
-            num_cpu_blocks=1024,
-            watermark=watermark,
-            sliding_window=5  # SWA
-        )
-
-        num_output_blocks_per_seq = 1
-        num_prompt_blocks = 1
-        num_output_blocks = num_output_blocks_per_seq
-        seq_group = create_seq_group_encoder_decoder(
-            seq_prompt_len=block_size * num_prompt_blocks,
-            seq_output_lens=[
-                block_size * num_output_blocks_per_seq
-                for _ in range(num_seqs_per_group)
-            ],
-            request_id="0")
-
-        assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks
-        block_manager.can_allocate(seq_group)
-
-    # Assert that either
-    # 1. Block manager constructor fails with assertion that sliding window
-    #    is not yet supported (most likely near-term outcome at time of
-    #    writing), or
-    # 2. can_allocate() fails with NotImplementedError due to combination of
-    #    encoder/decoder and sliding window attention
-    if isinstance(exc_info.value, NotImplementedError):
-        assert str(exc_info.value) == STR_NOT_IMPL_ENC_DEC_SWA
-    elif isinstance(exc_info.value, AssertionError):
-        assert str(exc_info.value) == "Sliding window not yet supported"
-
-
-@pytest.mark.parametrize("block_size", [16])
-@pytest.mark.parametrize("num_gpu_blocks", [16])
-@pytest.mark.parametrize("num_seqs_per_group", [1])
-@pytest.mark.parametrize("watermark", [0.0, 0.5])
-def test_can_allocate_encoder_decoder_fails_with_prefix_cache(
-        block_size: int, num_seqs_per_group: int, num_gpu_blocks: int,
-        watermark: float):
-
-    block_manager = SelfAttnBlockSpaceManager(
-        block_size=block_size,
-        num_gpu_blocks=num_gpu_blocks,
-        num_cpu_blocks=1024,
-        watermark=watermark,
-        enable_caching=True  # Prefix cache
-    )
-
-    num_output_blocks_per_seq = 1
-    num_prompt_blocks = 1
-    num_output_blocks = num_output_blocks_per_seq
-    seq_group = create_seq_group_encoder_decoder(
-        seq_prompt_len=block_size * num_prompt_blocks,
-        seq_output_lens=[
-            block_size * num_output_blocks_per_seq
-            for _ in range(num_seqs_per_group)
-        ],
-        request_id="0")
-
-    assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks
-
-    # Assert that either can_allocate() fails with NotImplementedError
-    # due to combination of encoder/decoder and prefix cache
-    with pytest.raises(NotImplementedError) as exc_info:
-        block_manager.can_allocate(seq_group)
-    assert str(exc_info.value) == STR_NOT_IMPL_ENC_DEC_PREFIX_CACHE
-
-
 @pytest.mark.parametrize("block_size", [1, 8])
 @pytest.mark.parametrize("prompt_len", [1, 7, 8])
 @pytest.mark.parametrize("num_slots_to_append", [1, 8, 129])

--- a/tests/core/test_scheduler_encoder_decoder.py
+++ b/tests/core/test_scheduler_encoder_decoder.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-import pytest  # noqa
-
-from vllm.config import CacheConfig, SchedulerConfig
-from vllm.core.scheduler import Scheduler
-from vllm.sequence import SequenceGroup
-
-from .utils import (append_new_token, create_dummy_prompt_encoder_decoder,
-                    get_sequence_groups, schedule_and_update_computed_tokens)
-
-
-def test_scheduler_schedule_simple_encoder_decoder():
-    '''
-    Test basic scheduler functionality in the context
-    of an encoder/decoder model. Focus on testing
-    enc/dec-specific functionality sense tests already
-    exist for decoder-only functionality
-
-    Test behavior:
-    * Construct Scheduler
-    * Construct dummy encoder/decoder sequence groups
-    * Add dummy seq groups to scheduler backlog
-    * Schedule the next seq group & validate:
-        * Cross-attn block tables
-        * Updated states of seq groups
-        * Number of batched tokens
-        * Number of blocks to copy/swap-in/swap-out
-        * Number of scheduled seq groups
-    * Repeat for both prefill- and decode-phase
-    * Abort scheduled seq groups
-    * Assert that aborted seq groups no longer appear in
-      cross-attention block table
-    '''
-
-    block_size = 4
-    num_seq_group = 4
-    max_model_len = 16
-    scheduler_config = SchedulerConfig(
-        "generate",
-        max_num_batched_tokens=64,
-        max_num_seqs=num_seq_group,
-        max_model_len=max_model_len,
-    )
-    cache_config = CacheConfig(block_size, 1.0, 1, "auto")
-    cache_config.num_cpu_blocks = 16  # enc and dec prompts per seq_group
-    cache_config.num_gpu_blocks = 16  # enc and dec prompts per seq_group
-    scheduler = Scheduler(scheduler_config, cache_config, None)
-    running: list[SequenceGroup] = []
-
-    # Add seq groups to scheduler.
-    req_id_list = []
-    for i in range(num_seq_group):
-        req_id = str(i)
-        req_id_list.append(req_id)
-        _, _, seq_group = create_dummy_prompt_encoder_decoder(
-            req_id, block_size, block_size, block_size)
-        scheduler.add_seq_group(seq_group)
-        running.append(seq_group)
-
-    # Schedule seq groups prefill.
-    num_tokens = block_size * num_seq_group
-    seq_group_meta_list, out = schedule_and_update_computed_tokens(scheduler)
-    # - Verify that sequence group cross-attention block tables are
-    #   registered with the block manager
-    assert all([(req_id in scheduler.block_manager.cross_block_tables)
-                for req_id in req_id_list])
-    # - Validate sequence-group status
-    assert set(get_sequence_groups(out)) == set(running)
-    # - Validate number of batched tokens
-    assert out.num_batched_tokens == num_tokens
-    # - Validate there are no remaining blocks to swap
-    assert (not out.blocks_to_copy and not out.blocks_to_swap_in
-            and not out.blocks_to_swap_out)
-    # - Validate all seq groups were scheduled
-    assert len(seq_group_meta_list) == num_seq_group
-    append_new_token(out, 1)
-
-    # Schedule seq groups decode.
-    seq_group_meta_list, out = schedule_and_update_computed_tokens(scheduler)
-    # - Verify that sequence group metadata includes encoder attention
-    #   and cross-attention metadata
-    assert all([
-        not ((seq_group_meta.encoder_seq_data is None) or
-             (seq_group_meta.cross_block_table is None))
-        for seq_group_meta in seq_group_meta_list
-    ])
-    # - Validate sequence-group status
-    assert set(get_sequence_groups(out)) == set(running)
-    # - Validate there is one batched token per seq group
-    assert out.num_batched_tokens == num_seq_group
-    # - Validate there are no remaining blocks to swap
-    assert (not out.blocks_to_copy and not out.blocks_to_swap_in
-            and not out.blocks_to_swap_out)
-    # - Validate that all seq groups were scheduled
-    assert len(seq_group_meta_list) == num_seq_group
-    append_new_token(out, 1)
-
-    # Abort sequences
-    for req_id in req_id_list:
-        scheduler.abort_seq_group(req_id)
-        # - Verify that sequence group cross-attention block tables are
-        #   NO LONGER registered with the block manager
-        assert req_id not in scheduler.block_manager.cross_block_tables
--- a/tests/distributed/test_pipeline_parallel.py
+++ b/tests/distributed/test_pipeline_parallel.py
@@ -242,9 +242,6 @@ MULTIMODAL_MODELS = {
    "Qwen/Qwen2-Audio-7B-Instruct": PPTestSettings.fast(),
    "Qwen/Qwen2-VL-2B-Instruct": PPTestSettings.fast(),
    "fixie-ai/ultravox-v0_5-llama-3_2-1b": PPTestSettings.fast(),
-    # [Encoder-decoder]
-    # TODO: Implement PP
-    # "meta-llama/Llama-3.2-11B-Vision-Instruct": PPTestSettings.fast(),
 }
 # yapf: enable


--- a/tests/encoder_decoder/__init__.py
+++ b/tests/encoder_decoder/__init__.py
--- a/tests/encoder_decoder/test_e2e_correctness.py
+++ b/tests/encoder_decoder/test_e2e_correctness.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""E2E tests to verify the correctness of the encoder-decoder framework
-
-Run `pytest tests/encoder_decoder/test_e2e_correctness.py`.
-"""
-from typing import Optional
-
-import pytest
-from transformers import AutoModelForSeq2SeqLM
-
-from vllm.attention.selector import (_Backend, _cached_get_attn_backend,
-                                     global_force_attn_backend_context_manager)
-from vllm.platforms import current_platform
-from vllm.sequence import SampleLogprobs
-
-from ..conftest import DecoderPromptType
-from ..models.utils import check_logprobs_close
-
-LIST_ENC_DEC_SUPPORTED_BACKENDS = [
-    _Backend.XFORMERS, _Backend.FLASH_ATTN, None
-]
-
-
-@pytest.fixture(scope="function", autouse=True)
-def use_v0_only(monkeypatch):
-    """
-    Since this module is V0 only, set VLLM_USE_V1=0 for
-    all tests in the module.
-    """
-    monkeypatch.setenv('VLLM_USE_V1', '0')
-
-
-def vllm_to_hf_output(
-    vllm_output: tuple[list[int], str, Optional[SampleLogprobs]],
-    decoder_prompt_type: DecoderPromptType,
-):
-    """Sanitize vllm output to be comparable with hf output."""
-    output_ids, output_str, out_logprobs = vllm_output
-
-    hf_output_str = output_str + "</s>"
-    if decoder_prompt_type == DecoderPromptType.NONE:
-        hf_output_str = "<s>" + hf_output_str
-
-    return output_ids, hf_output_str, out_logprobs
-
-
-@pytest.fixture(autouse=True)
-def clear_cache():
-    """Fixture to clear backend cache before each test."""
-    _cached_get_attn_backend.cache_clear()  # Clear the cache
-    yield  # This allows the test to run
-
-
-@pytest.mark.parametrize("model", ["facebook/bart-large-cnn"])
-@pytest.mark.parametrize("dtype", ["float"])
-@pytest.mark.parametrize("attn_backend", LIST_ENC_DEC_SUPPORTED_BACKENDS)
-@pytest.mark.parametrize("max_tokens", [128])
-@pytest.mark.parametrize("num_logprobs", [5])
-@pytest.mark.parametrize("decoder_prompt_type", list(DecoderPromptType))
-@pytest.mark.parametrize("enforce_eager", [True, False])
-@pytest.mark.skipif(
-    current_platform.is_cpu(),
-    reason="CPU backend is not currently supported with encoder/decoder models"
-)
-@pytest.mark.skip(reason="bart not supported in V1")
-def test_encoder_decoder_e2e(
-    hf_runner,
-    vllm_runner,
-    example_encoder_decoder_prompts,
-    model: str,
-    dtype: str,
-    max_tokens: int,
-    num_logprobs: int,
-    decoder_prompt_type: DecoderPromptType,
-    enforce_eager: bool,
-    attn_backend: _Backend,
-) -> None:
-    '''
-    End-to-End (E2E) test for the encoder-decoder framework.
-    This test evaluates the encoder-decoder functionality using the BART
-    model. We compare the outputs of the Hugging Face and vLLM
-    implementations to ensure that both implementations produce consistent
-    and correct results.
-    '''
-    with global_force_attn_backend_context_manager(attn_backend):
-        if attn_backend == _Backend.FLASH_ATTN:
-            # Flash Attention works only with bfloat16 data-type
-            dtype = 'bfloat16'
-        test_case_prompts = example_encoder_decoder_prompts[
-            decoder_prompt_type]
-
-        # Configuration settings for HF baseline
-        hf_kwargs = {
-            "top_k": None,
-            "num_beams": 1,
-            "repetition_penalty": 1.0,
-            "top_p": 1.0,
-            "length_penalty": 1.0,
-            "early_stopping": False,
-            "no_repeat_ngram_size": None,
-            "min_length": 0
-        }
-
-        with hf_runner(model, dtype=dtype,
-                       auto_cls=AutoModelForSeq2SeqLM) as hf_model:
-            hf_outputs = (
-                hf_model.generate_encoder_decoder_greedy_logprobs_limit(
-                    test_case_prompts,
-                    max_tokens,
-                    num_logprobs,
-                    **hf_kwargs,
-                ))
-        with vllm_runner(model, dtype=dtype,
-                         enforce_eager=enforce_eager) as vllm_model:
-            vllm_outputs = vllm_model.generate_encoder_decoder_greedy_logprobs(
-                test_case_prompts, max_tokens, num_logprobs)
-
-        hf_skip_tokens = (1 if decoder_prompt_type == DecoderPromptType.NONE
-                          else 0)
-
-        check_logprobs_close(
-            outputs_0_lst=hf_outputs,
-            outputs_1_lst=[
-                vllm_to_hf_output(vllm_output, decoder_prompt_type)
-                for vllm_output in vllm_outputs
-            ],
-            name_0="hf",
-            name_1="vllm",
-            num_outputs_0_skip_tokens=hf_skip_tokens,
-        )
--- a/tests/entrypoints/openai/test_encoder_decoder.py
+++ b/tests/entrypoints/openai/test_encoder_decoder.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-import openai
-import pytest
-import pytest_asyncio
-
-from ...utils import RemoteOpenAIServer
-
-MODEL_NAME = "facebook/bart-base"
-
-
-@pytest.fixture(scope="module")
-def server():
-    args = [
-        "--dtype",
-        "bfloat16",
-        "--enforce-eager",
-    ]
-
-    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
-        yield remote_server
-
-
-@pytest_asyncio.fixture
-async def client(server):
-    async with server.get_async_client() as async_client:
-        yield async_client
-
-
-@pytest.mark.asyncio
-@pytest.mark.parametrize("model_name", [MODEL_NAME])
-@pytest.mark.skip(reason="bart is not yet supported in V1")
-async def test_single_completion(client: openai.AsyncOpenAI, model_name: str):
-    completion = await client.completions.create(model=model_name,
-                                                 prompt="Hello, my name is",
-                                                 max_tokens=5,
-                                                 temperature=0.0)
-
-    assert completion.id is not None
-    assert completion.choices is not None and len(completion.choices) == 1
-
-    choice = completion.choices[0]
-    assert len(choice.text) >= 5
-    assert choice.finish_reason == "length"
-    assert completion.usage == openai.types.CompletionUsage(
-        completion_tokens=5, prompt_tokens=2, total_tokens=7)
-
-    # test using token IDs
-    completion = await client.completions.create(
-        model=model_name,
-        prompt=[0, 0, 0, 0, 0],
-        max_tokens=5,
-        temperature=0.0,
-    )
-    assert len(completion.choices[0].text) >= 1
--- a/tests/entrypoints/test_chat_utils.py
+++ b/tests/entrypoints/test_chat_utils.py
@@ -20,7 +20,6 @@ from vllm.entrypoints.chat_utils import (_try_extract_ast, load_chat_template,
                                         parse_chat_messages_futures,
                                         resolve_chat_template_content_format,
                                         resolve_hf_chat_template)
-from vllm.entrypoints.llm import apply_hf_chat_template
 from vllm.multimodal import MultiModalDataDict, MultiModalUUIDDict
 from vllm.multimodal.utils import (encode_audio_base64, encode_image_base64,
                                   encode_video_base64)
@@ -38,7 +37,6 @@ QWEN2AUDIO_MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"
 QWEN2VL_MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"
 QWEN25VL_MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
 QWEN25OMNI_MODEL_ID = "Qwen/Qwen2.5-Omni-7B"
-MLLAMA_MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"
 LLAMA_GUARD_MODEL_ID = "meta-llama/Llama-Guard-3-1B"
 HERMES_MODEL_ID = "NousResearch/Hermes-3-Llama-3.1-8B"
 MISTRAL_MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
@@ -125,27 +123,6 @@ def qwen25omni_tokenizer():
    )


-@pytest.fixture(scope="module")
-def mllama_model_config():
-    return ModelConfig(
-        MLLAMA_MODEL_ID,
-        runner="generate",
-        limit_mm_per_prompt={
-            "image": 2,
-        },
-    )
-
-
-@pytest.fixture(scope="module")
-def mllama_tokenizer():
-    return TokenizerGroup(
-        MLLAMA_MODEL_ID,
-        enable_lora=False,
-        max_num_seqs=5,
-        max_input_length=None,
-    )
-
-
 @pytest.fixture(scope="function")
 def mistral_model_config():
    return ModelConfig(
@@ -2249,180 +2226,6 @@ def test_parse_chat_messages_multiple_images_interleave_with_placeholders(
        )


-### Mllama currently wraps images / texts as interleaved dictionaries
-def test_mllama_single_image(
-    mllama_model_config,
-    mllama_tokenizer,
-    image_url,
-):
-    """Ensures that a single image is parsed correctly mllama."""
-    conversation, mm_data, mm_uuids = parse_chat_messages(
-        [{
-            "role":
-            "user",
-            "content": [
-                {
-                    "type": "text",
-                    "text": "The content of this image is:"
-                },
-                {
-                    "image_url": image_url
-                },
-            ],
-        }],
-        mllama_model_config,
-        mllama_tokenizer,
-        content_format="openai",
-    )
-    _assert_mm_data_is_image_input(mm_data, 1)
-    _assert_mm_uuids(mm_uuids, 1, expected_uuids=[None])
-    assert conversation == [{
-        "role":
-        "user",
-        "content": [
-            {
-                "type": "text",
-                "text": "The content of this image is:"
-            },
-            {
-                "type": "image"
-            },
-        ],
-    }]
-
-
-def test_mllama_interleaved_images(
-    mllama_model_config,
-    mllama_tokenizer,
-    image_url,
-):
-    """Ensures that multiple image are parsed as interleaved dicts."""
-    conversation, mm_data, mm_uuids = parse_chat_messages(
-        [{
-            "role":
-            "user",
-            "content": [
-                {
-                    "type": "text",
-                    "text": "The content of the first image is:",
-                },
-                {
-                    "image_url": image_url
-                },
-                {
-                    "type": "text",
-                    "text": "The content of the second image is:",
-                },
-                {
-                    "image_url": image_url
-                },
-            ],
-        }],
-        mllama_model_config,
-        mllama_tokenizer,
-        content_format="openai",
-    )
-    _assert_mm_data_is_image_input(mm_data, 2)
-    _assert_mm_uuids(mm_uuids, 2, expected_uuids=[None, None])
-    assert conversation == [{
-        "role":
-        "user",
-        "content": [
-            {
-                "type": "text",
-                "text": "The content of the first image is:"
-            },
-            {
-                "type": "image"
-            },
-            {
-                "type": "text",
-                "text": "The content of the second image is:"
-            },
-            {
-                "type": "image"
-            },
-        ],
-    }]
-
-
-@pytest.mark.parametrize("model", [MLLAMA_MODEL_ID])
-def test_multimodal_image_parsing_matches_hf(model, image_url):
-    """Checks end to end hf alignment for multimodal [image] parsing."""
-
-    def get_conversation(is_hf: bool):
-        img_part = {"type": "image_url", "image_url": {"url": image_url}}
-        if is_hf:
-            img_part = {"type": "image"}
-        return [{
-            "role":
-            "user",
-            "content": [
-                {
-                    "type": "text",
-                    "text": "The content of the first image is:",
-                },
-                img_part,
-                {
-                    "type": "text",
-                    "text": "The content of the second image is:",
-                },
-                img_part,
-                {
-                    "type": "text",
-                    "text": "What animal is in the first image?",
-                },
-            ],
-        }]
-
-    # Build a config for the model
-    model_config = ModelConfig(
-        model,
-        runner="generate",
-        limit_mm_per_prompt={
-            "image": 2,
-        },
-    )
-
-    # Build the tokenizer group and grab the underlying tokenizer
-    tokenizer_group = TokenizerGroup(
-        model,
-        enable_lora=False,
-        max_num_seqs=5,
-        max_input_length=None,
-        trust_remote_code=model_config.trust_remote_code,
-    )
-    tokenizer = tokenizer_group.tokenizer
-
-    # Build and parse a conversation with {"type": "image"} using the tokenizer
-    hf_conversation = get_conversation(is_hf=True)
-    hf_result = tokenizer.apply_chat_template(
-        hf_conversation,
-        tokenize=False,
-        add_generation_prompt=True,
-    )
-
-    # Now parse with vLLMs chat utils & apply the template
-    vllm_conversation = get_conversation(is_hf=False)
-    conversation, _, _ = parse_chat_messages(
-        vllm_conversation,
-        model_config,
-        tokenizer_group,
-        content_format="openai",
-    )
-
-    vllm_result = apply_hf_chat_template(
-        tokenizer=tokenizer,
-        conversation=conversation,
-        chat_template=None,
-        model_config=model_config,
-        tools=None,
-        add_generation_prompt=True,
-    )
-
-    assert hf_result == vllm_result
-
-
 @pytest.mark.parametrize(
    "model",
    [
@@ -2486,7 +2289,6 @@ def test_resolve_hf_chat_template(sample_json_schema, model, use_tools):
     (QWEN25VL_MODEL_ID, "openai"),
     (ULTRAVOX_MODEL_ID, "string"),
     (QWEN2AUDIO_MODEL_ID, "openai"),
-     (MLLAMA_MODEL_ID, "openai"),
     (LLAMA_GUARD_MODEL_ID, "openai")],
 )
 # yapf: enable
@@ -2545,7 +2347,6 @@ def test_resolve_content_format_hf_defined(model, expected_format):
    [("Salesforce/blip2-opt-2.7b", "string"),
     ("facebook/chameleon-7b", "string"),
     ("deepseek-ai/deepseek-vl2-tiny", "string"),
-     ("microsoft/Florence-2-base", "string"),
     ("adept/fuyu-8b", "string"),
     ("google/paligemma-3b-mix-224", "string"),
     ("Qwen/Qwen-VL", "string"),

--- a/tests/kernels/attention/test_encoder_decoder_attn.py
+++ b/tests/kernels/attention/test_encoder_decoder_attn.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-Tests:
-
-* E2E test of Encoder attention + Decoder self-attention +
-      Encoder/decoder cross-attention (collectively
-      "encoder/decoder attention")
-
-"""
-
-from typing import NamedTuple, Optional
-
-import pytest
-import torch
-
-from tests.kernels.utils import *
-from vllm.attention import Attention, AttentionMetadata, AttentionType
-from vllm.attention.backends.utils import STR_NOT_IMPL_ENC_DEC_ROCM_HIP
-from vllm.attention.selector import (_Backend, _cached_get_attn_backend,
-                                     global_force_attn_backend_context_manager)
-from vllm.config import VllmConfig, set_current_vllm_config
-from vllm.forward_context import set_forward_context
-from vllm.platforms import current_platform
-
-
-@pytest.fixture(scope="function", autouse=True)
-def use_v0_only(monkeypatch):
-    """
-    Encoder-decoder is only supported on V0, so set 
-    VLLM_USE_V1=0 for all tests in the module.
-    """
-    monkeypatch.setenv('VLLM_USE_V1', '0')
-
-
-# List of support backends for encoder/decoder models
-LIST_ENC_DEC_SUPPORTED_BACKENDS = [_Backend.XFORMERS, _Backend.FLASH_ATTN]
-HEAD_SIZES = [64, 256]
-
-NUM_HEADS = [1, 16]
-
-BATCH_SIZES = [1, 16]
-BLOCK_SIZES = [16]
-CUDA_DEVICE = "cuda:0"
-
-MAX_DEC_SEQ_LENS = [128]
-MAX_ENC_SEQ_LENS = [128]
-
-# Narrow test-cases for unsupported-scenario
-# tests
-HEAD_SIZES_FOR_UNSUPP = [HEAD_SIZES[0]]
-
-
-class TestPoint(NamedTuple):
-    """
-    Encapsulates the attributes which define a single invocation
-    of the test_e2e_enc_dec_attn() test
-
-    Attributes:
-        num_heads: The number of heads in the model.
-        head_size: Head dimension
-        backend_name: Name of the backend framework used.
-        batch_size: Number of samples per batch.
-        block_size: Size of each block of data processed.
-        max_dec_seq_len: Maximum sequence length for the decoder.
-        max_enc_seq_len: Maximum sequence length for the encoder.
-        num_blocks: Number of blocks in the model.
-    """
-
-    num_heads: int
-    head_size: int
-    backend_name: str
-    batch_size: int
-    block_size: int
-    max_dec_seq_len: int
-    max_enc_seq_len: int
-    num_blocks: int
-    attn_type: AttentionType
-
-
-class TestResources(NamedTuple):
-    '''
-    Encapsulates key components for performing an
-    encoder/decoder attention test
-
-    Note that
-    (1) attn automatically selects an attention backend
-        based on platform info & a set of canned
-        heuristics
-    (2) attn_backend is thus *not the same backend
-        instance* used by attn, but rather it is
-        intended to be a
-        *different instance* of the *same backend class*;
-        it is assumed that the user of TestResources
-        will leverage attn_backend for the purpose of
-        constructing backend-compatible attention
-        metadata instances
-
-    Attributes:
-
-    * scale: 1/sqrt(d) scale factor for attn
-    * attn_backend: implementations of abstraction
-                    attention interface using
-                    a particular kernel library
-                    i.e. XFormers
-    * attn: Attention layer instance
-    * kv_cache: shared key/value cache for all attention
-    '''
-
-    scale: float
-    attn: Attention
-    kv_cache: torch.Tensor
-
-
-def _make_test_resources(test_pt: TestPoint, ) -> TestResources:
-    '''
-    Build key components for performing encoder/decoder attention test.
-
-    Note that
-    (1) The Attention instance constructed here, automatically selects
-        an attention backend class based on platform info & a set of canned
-        heuristics, so
-    (2) The attention backend instance constructed here is thus *not
-        the same backend instance* used by attn, but rather it is
-        intended to be a *different instance* of the *same backend class*;
-        therefore,
-    (3) This function requires that test_pt.backend_name matches the backend
-        class that Attention will automatically select when it is constructed.
-
-
-    Arguments:
-
-    * test_pt: TestPoint data structure; this function relies on the
-               following fields: num_heads, head_size, num_blocks,
-               block_size, backend_name
-
-    Returns:
-
-    * TestResources data structure.
-    '''
-
-    scale = float(1.0 / (test_pt.head_size**0.5))
-    attn = Attention(
-        test_pt.num_heads,
-        test_pt.head_size,
-        scale=scale,
-        prefix=f"{test_pt.attn_type}",
-        attn_type=test_pt.attn_type,
-    )
-    if test_pt.num_blocks is None or test_pt.num_heads is None:
-        # Caller does not require a KV cache
-        return TestResources(
-            scale, attn,
-            torch.tensor([], dtype=torch.float32, device=CUDA_DEVICE))
-
-    # Construct KV cache
-    if test_pt.attn_type in (AttentionType.DECODER,
-                             AttentionType.ENCODER_DECODER):
-        kv_cache = make_kv_cache(test_pt.num_blocks,
-                                 test_pt.num_heads,
-                                 test_pt.head_size,
-                                 test_pt.block_size,
-                                 device=CUDA_DEVICE,
-                                 backend=test_pt.backend_name)
-    else:
-        kv_cache = torch.tensor([])
-
-    attn.kv_cache = [kv_cache]
-    return TestResources(scale, attn, kv_cache)
-
-
-def _encoder_attn_setup(
-    test_pt: TestPoint,
-    test_rsrcs: TestResources,
-) -> PhaseTestParameters:
-    '''
-    Set up test vectors & data structures for encoder attention test.
-
-    A triplet of synthetic query/key/value tensors are constructed.
-    Given this is an encoder attention test, the key & value
-    sequences will have the same length as the corresponding queries.
-
-    The query/key/value tensors are passed to an ideal reference
-    self-attention implementation to generate an ideal output tensor.
-
-    Encoder inference does not populate the KV cache, therefore
-    no KV cache memory mapping is constructed
-
-    Arguments:
-
-    * test_pt: TestPoint data structure; this function relies on the
-               following fields: batch_size, num_heads, head_size,
-               block_size, max_q_seq_len
-    * test_rsrcs: TestResources data structure; this function relies on the
-                  scale field
-
-
-    Returns:
-
-    * PhaseTestParameters data structure comprising (1) packed query/key/value
-      tensors, (2) the ideal output of attention computed using a naive
-      implementation, and (3) KVCache field set to None
-    '''
-
-    (
-        num_heads,
-        head_size,
-        _,
-        batch_size,
-        _,
-        _,
-        max_q_seq_len,
-        _,
-        _,
-    ) = test_pt
-
-    scale = test_rsrcs.scale
-
-    max_kv_seq_len = max_q_seq_len
-
-    # Make test tensors
-
-    qkv_in, _, _ = make_qkv(batch_size,
-                            max_q_seq_len,
-                            max_kv_seq_len,
-                            num_heads,
-                            head_size,
-                            attn_type=AttentionType.ENCODER,
-                            device=CUDA_DEVICE)
-
-    # Compute correct answer using naive non-causal attention
-    # implementation
-
-    ideal_output = ref_masked_attention(qkv_in.query,
-                                        qkv_in.key,
-                                        qkv_in.value,
-                                        scale=scale,
-                                        q_seq_lens=qkv_in.q_seq_lens,
-                                        kv_seq_lens=qkv_in.kv_seq_lens)
-
-    packed_ideal_output, _ = pack_tensor(ideal_output,
-                                         qkv_in.q_seq_lens,
-                                         device=CUDA_DEVICE)
-
-    packed_qkv = pack_qkv(qkv_in, device=CUDA_DEVICE)
-
-    return PhaseTestParameters(
-        PackedQKVO(packed_qkv, packed_ideal_output),
-        None  # No KV cache
-    )
-
-
-def _decoder_attn_setup(
-    test_pt: TestPoint,
-    test_rsrcs: TestResources,
-    block_base_addr: int = 0,
-) -> tuple[QKVInputs, PhaseTestParameters, PhaseTestParameters, int]:
-    '''
-    Set up test vectors & data structures for self-attention test.
-
-    A triplet of synthetic query/key/value tensors are constructed ("baseline"
-    query/key/value). Given this is a self-attention test, the key & value
-    sequences will have the same length as the corresponding queries.
-
-    "Prefill" query/key/value tensors are derived by masking out the last value
-    in each baseline query/key/value. These tensors are used to test prefill &
-    populate KV cache for a subsequent decode test.
-
-    "Decode" query/key/value tensors are derived by extracting *only* the last
-    value from each baseline query/key/value (i.e. complement of the prefill
-    tensors.) These tensors are used to test decode, conditional on the kv cache
-    being populated during the prefill test.
-
-    The baseline query/key/value tensors are passed to an ideal reference
-    self-attention implementation to generate a "Baseline" ideal output tensor.
-    This tensor is split into the "Prefill" ideal output tensor (all but the
-    last element of each output sequence) and the "Decode" ideal output tensor
-    (*only* the last element of each output sequence); the "Prefill" and
-    "Decode" ideal output tensors can be used to validate the prefill and decode
-    test results, respectively.
-
-    This function also constructs the self-attention KV cache memory mapping
-    (slot mapping and block table), ensuring that the block table starts at
-    block_base_addr
-
-    Arguments:
-
-    * test_pt: TestPoint data structure; this function relies on the
-               following fields: batch_size, num_heads, head_size,
-               block_size, max_q_seq_len
-    * test_rsrcs: TestResources data structure; this function relies on the
-                  scale field
-    * block_base_addr: decoder self-attention block-table base address
-
-    Returns:
-    * qkv: Unpacked (batch_size x padded_seq_len x num_heads x
-           head_size) query/key/value tensors
-    * Prefill-phase decoder self-attention PhaseTestParameters data structure,
-      including (1) packed (number_of_tokens x num_heads x head_size)
-      query/key/value tensors along with (2) ideal attention output
-      computed using a naive implementation, and (3) memory-mapping data
-      structures appropriate for prefill phase.
-    * Decode-phase decoder self-attention PhaseTestParameters data structure,
-      including (1) packed (number_of_tokens x num_heads x head_size)
-      query/key/value tensors along with (2) ideal attention output
-      computed using a naive implementation, and (3) memory-mapping data
-      structures appropriate for decode phase.
-    * max_block_idx: max physical address in decoder self-attention block-table
-                     (intended to be used as the base address for the encoder/
-                      decoder cross-attention block-table, which is not
-                      constructed in this function)
-    '''
-
-    (
-        num_heads,
-        head_size,
-        _,
-        batch_size,
-        block_size,
-        max_q_seq_len,
-        _,
-        _,
-        _,
-    ) = test_pt
-
-    scale = test_rsrcs.scale
-
-    max_kv_seq_len = max_q_seq_len
-
-    # Build test tensors
-
-    (
-        qkv,
-        prefill_qkv,
-        decode_qkv,
-    ) = make_qkv(batch_size,
-                 max_q_seq_len,
-                 max_kv_seq_len,
-                 num_heads,
-                 head_size,
-                 attn_type=AttentionType.DECODER,
-                 device=CUDA_DEVICE)
-
-    # Compute correct answer using naive attention implementation
-    # with causal attention mask
-
-    causal_mask = make_causal_mask(max_q_seq_len,
-                                   max_kv_seq_len).to(CUDA_DEVICE)
-
-    ideal_output = ref_masked_attention(qkv.query,
-                                        qkv.key,
-                                        qkv.value,
-                                        scale=scale,
-                                        custom_mask=causal_mask,
-                                        q_seq_lens=qkv.q_seq_lens,
-                                        kv_seq_lens=qkv.kv_seq_lens)
-
-    # Split out the prefill- & decode-phase ideal answers & pack them
-
-    prefill_ideal_output = torch.zeros_like(ideal_output)
-    decode_ideal_output = torch.zeros_like(ideal_output[:, 0:1])
-    for bdx, prefill_q_seq_len in enumerate(prefill_qkv.q_seq_lens):
-        prefill_ideal_output[bdx, :prefill_q_seq_len] = ideal_output[
-            bdx, :prefill_q_seq_len]
-        decode_ideal_output[bdx, :] = ideal_output[bdx, prefill_q_seq_len:(
-            prefill_q_seq_len + 1)]
-
-    prefill_packed_ideal_output, _ = pack_tensor(prefill_ideal_output,
-                                                 prefill_qkv.q_seq_lens,
-                                                 device=CUDA_DEVICE)
-    decode_packed_ideal_output, _ = pack_tensor(decode_ideal_output,
-                                                [1 for _ in range(batch_size)],
-                                                device=CUDA_DEVICE)
-
-    # Build prefill- & decode-phase data structures
-    # for decoder self-attention. Block tables and
-    # slot mapping must be in a format compatible
-    # with KV caching & attention kernels
-    #
-    # Prefill-phase:
-    #
-    # * Empty block-tables tensor
-    # * Slot-mapping with entries for prompt tokens
-    #
-    # Decode-phase:
-    # * Block-tables tensor with minimum number of blocks
-    #   required by total num. tokens in the entirety of all sequences
-    #   (including both prefill & decode)
-    # * Slot-mapping with entries for tokens that will be decoded in the
-    #   current decode iteration
-    #
-    #  Note: the format described above is simply mirroring what ModelRunner
-    #        produces
-
-    prefill_block_tables = make_empty_block_tables_tensor(device=CUDA_DEVICE)
-
-    (
-        decode_block_tables,
-        slot_mapping_list,
-        max_block_idx,
-    ) = make_block_tables_slot_mapping(block_size,
-                                       qkv.q_seq_lens,
-                                       device=CUDA_DEVICE,
-                                       block_base_addr=block_base_addr)
-
-    (
-        prefill_slot_mapping,
-        decode_slot_mapping,
-    ) = split_slot_mapping(slot_mapping_list,
-                           qkv.q_seq_lens,
-                           device=CUDA_DEVICE)
-
-    prefill_pckd_qkv = pack_qkv(prefill_qkv, device=CUDA_DEVICE)
-
-    decode_pckd_qkv = pack_qkv(decode_qkv, device=CUDA_DEVICE)
-
-    return (
-        qkv,
-        PhaseTestParameters(  # Prefill test params
-            PackedQKVO(prefill_pckd_qkv, prefill_packed_ideal_output),
-            KVMemoryMap(prefill_block_tables, prefill_slot_mapping)),
-        PhaseTestParameters(  # Decode test params
-            PackedQKVO(decode_pckd_qkv, decode_packed_ideal_output),
-            KVMemoryMap(decode_block_tables, decode_slot_mapping)),
-        max_block_idx)
-
-
-def _enc_dec_cross_attn_setup_reuses_query(
-    decoder_qkv: QKVInputs,
-    encoder_test_params: PhaseTestParameters,
-    prefill_decoder_phase_test_params: PhaseTestParameters,
-    test_pt: TestPoint,
-    test_rsrcs: TestResources,
-    block_base_addr: int = 0,
-) -> tuple[PhaseTestParameters, PhaseTestParameters]:
-    '''
-    Set up test vectors & data structures for cross-attention test.
-
-    A pair of synthetic cross-attention key/value tensors is constructed
-    ("baseline" key/value). Given this is a cross-attention test, we assume
-    query tensors were already synthesized for a prior self-attention test and
-    will be reused for cross-attention. The key & value sequences generated here
-    may have a different length than the corresponding queries (as is often
-    the case for cross-attention between decoder and encoder sequences.)
-
-    Cross attention key & value tensors do not grow during autoregressive
-    inference; thus this function obtains a single key/value pair suitable for
-    both prefill and decode.
-
-    The "baseline" query tensor is received as an argument. The "baseline"
-    query/key/value tensors are passed to an ideal reference cross-attention
-    implementation to generate a "baseline" ideal output tensor. This tensor is
-    split into the "Prefill" ideal output tensor (all but the last element of
-    each output sequence) and the "Decode" ideal output tensor (*only* the last
-    element of each output sequence); the "Prefill" and "Decode" ideal output
-    tensors can be used to validate the prefill and decode test results,
-    respectively.
-
-    This function also constructs the cross-attention KV cache memory mapping
-    (slot mapping and block table), ensuring that the block table starts at
-    block_base_addr.
-
-    Arguments:
-
-    * decoder_qkv: pre-existing unpacked (batch_size x padded_seq_len x
-                   num_heads x head_size) decoder self-attention inputs;
-                   this function relies on the query and q_seq_lens
-                   fields
-    * encoder_test_params: PhaseTestParameters data structure which was
-                           used for encoder inference; KV cache field
-                           is not used by this function
-    * prefill_decoder_phase_test_params: PhaseTestParameters data structure
-                                         used for prefill-phase decoder
-                                         self-attention; all fields
-                                         including KV cache required
-    * test_pt: TestPoint data structure; this function relies on the
-               following fields: batch_size, num_heads, head_size,
-               block_size, max_q_seq_len
-    * test_rsrcs: TestResources data structure; this function relies on the
-                  scale field
-    * block_base_addr: cross-attention block-table base address
-
-    Returns:
-
-    * Prefill-phase encoder/decoder cross-attention PhaseTestParameters data
-      structure, including (1) packed
-      (number_of_tokens x num_heads x head_size) query/key/value tensors
-      along with (2) ideal attention output computed using a
-      naive implementation, and (3) memory-mapping data structures appropriate
-      for prefill phase.
-    * Decode-phase encoder/decoder cross-attention PhaseTestParameters data
-      structure, including (1) packed
-      (number_of_tokens x num_heads x head_size) query/key/value tensors
-      along with (2) ideal attention output computed using a
-      naive implementation, and (3) memory-mapping data structures appropriate
-      for decode phase.
-    '''
-
-    assert encoder_test_params.packed_qkvo.packed_qkv is not None
-    assert prefill_decoder_phase_test_params.packed_qkvo.packed_qkv is not None
-
-    (
-        num_heads,
-        head_size,
-        _,
-        batch_size,
-        block_size,
-        max_decoder_seq_len,
-        max_encoder_seq_len,
-        _,
-        _,
-    ) = test_pt
-
-    scale = test_rsrcs.scale
-
-    decoder_query = decoder_qkv.query
-    decoder_seq_lens = decoder_qkv.q_seq_lens
-    encoder_seq_lens = encoder_test_params.packed_qkvo.packed_qkv.q_seq_lens
-    prefill_q_seq_lens = (
-        prefill_decoder_phase_test_params.packed_qkvo.packed_qkv.q_seq_lens)
-
-    assert prefill_q_seq_lens is not None
-
-    (
-        cross_kv,
-        _,
-        _,
-    ) = make_qkv(batch_size,
-                 max_decoder_seq_len,
-                 max_encoder_seq_len,
-                 num_heads,
-                 head_size,
-                 force_kv_seq_lens=encoder_seq_lens,
-                 attn_type=AttentionType.ENCODER_DECODER,
-                 device=CUDA_DEVICE)
-
-    ideal_output = ref_masked_attention(decoder_query,
-                                        cross_kv.key,
-                                        cross_kv.value,
-                                        scale=scale,
-                                        q_seq_lens=decoder_seq_lens,
-                                        kv_seq_lens=cross_kv.kv_seq_lens)
-
-    prefill_ideal_output = torch.zeros_like(ideal_output)
-    decode_ideal_output = torch.zeros_like(ideal_output[:, 0:1])
-    for bdx, prefill_q_seq_len in enumerate(prefill_q_seq_lens):
-        prefill_ideal_output[bdx, :prefill_q_seq_len] = ideal_output[
-            bdx, :prefill_q_seq_len]
-        decode_ideal_output[bdx, :] = ideal_output[bdx, prefill_q_seq_len:(
-            prefill_q_seq_len + 1)]
-
-    prefill_packed_ideal_output, _ = pack_tensor(prefill_ideal_output,
-                                                 prefill_q_seq_lens,
-                                                 device=CUDA_DEVICE)
-    decode_packed_ideal_output, _ = pack_tensor(decode_ideal_output,
-                                                [1 for _ in range(batch_size)],
-                                                device=CUDA_DEVICE)
-
-    # Build prefill- & decode-phase data structures
-    # for encoder/decoder cross-attention. Block tables and
-    # slot mapping must be in a format compatible
-    # with KV caching & attention kernels
-    #
-    # Whereas decoder self-attention extracts relationships between
-    # equal-length Q/K/V sequences, which mutually grow in length
-    # with each decoded token, cross-attention relates the Q sequence
-    # - which grows with each new decoded token - to fixed-length
-    # K and V sequences derived from the encoder hidden states.
-    #
-    # Prefill-phase:
-    #
-    # * Empty block-tables tensor
-    # * Slot-mapping with as many entries as there are tokens in the encoder
-    #   prompt.
-    #
-    # Decode-phase:
-    # * Block-tables tensor with minimum number of blocks to
-    #   accommodate K & V tensors which are equal in length
-    #   to the encoder prompt length
-    # * Empty slot-mapping tensor (since K & V are fixed in size,
-    #   new decoded tokens are not KV-cached and require no slot-
-    #   mapping)
-    #
-    # Note: the format above is simply an extension of what ModelRunner
-    #       produces for decoder-only models
-
-    prefill_block_tables = make_empty_block_tables_tensor(device=CUDA_DEVICE)
-    decode_slot_mapping = make_empty_slot_mapping_tensor(device=CUDA_DEVICE)
-
-    (
-        decode_block_tables,
-        prefill_slot_mapping_list,
-        _,
-    ) = make_block_tables_slot_mapping(block_size,
-                                       cross_kv.kv_seq_lens,
-                                       block_base_addr=block_base_addr,
-                                       device=CUDA_DEVICE)
-
-    prefill_slot_mapping = maybe_make_long_tensor(prefill_slot_mapping_list,
-                                                  device=CUDA_DEVICE)
-
-    # Packed key/value (query is already provided)
-    packed_cross_kv = pack_qkv(cross_kv, device=CUDA_DEVICE)
-
-    return (
-        PhaseTestParameters(  # Prefill-phase test params
-            PackedQKVO(packed_cross_kv, prefill_packed_ideal_output),
-            KVMemoryMap(prefill_block_tables, prefill_slot_mapping)),
-        PhaseTestParameters(  # Decode-phase test params
-            PackedQKVO(None, decode_packed_ideal_output),
-            KVMemoryMap(decode_block_tables, decode_slot_mapping)))
-
-
-def _run_encoder_attention_test(
-    attn: Attention,
-    encoder_test_params: PhaseTestParameters,
-    attn_metadata: AttentionMetadata,
-    test_pt: TestPoint,
-    vllm_config: VllmConfig,
-) -> torch.Tensor:
-    '''
-    Run encoder attention.
-
-    attn.forward() is passed attn_type=AttentionType.ENCODER in order
-    to configure the kernel invocation for encoder attention
-
-    Requires attn_metadata.num_decode_tokens == 0
-    (There is no encoder execution in the decode-phase)
-
-    Arguments:
-
-    * attn: Attention wrapper instance
-    * encoder_test_params: encoder PhaseTestParameters data structure;
-                           this function relies on the packed
-                           (number_of_tokens x num_heads x head_size)
-                           query/key/value fields
-    * attn_metadata: attention metadata for encoder/decoder-self attention
-    * test_pt: The TestPoint object containing test details like number of
-               model heads, head size, name of the backend being used etc.
-
-    Returns:
-    * Attention.forward() applied to packed_{query,key,value}
-      & attn_metadata
-    '''
-    assert attn_metadata.num_decode_tokens == 0
-    packed_qkv = encoder_test_params.packed_qkvo.packed_qkv
-    assert packed_qkv is not None
-    with set_forward_context(attn_metadata, vllm_config):
-        # In the test setup the shape of the query is
-        # [batch_size, seq_len, num_heads, head_size]. However
-        # the attention backend expects the shape to be
-        # [num_tokens, hidden_size]. Hence reshape the query before
-        # invoking the forward method.
-        # TODO - Update the way we construct the query so that it
-        # is shaped as [num_tokens, hidden_size] and we can skip the reshape.
-        reshaped_query = packed_qkv.query.view(
-            -1, test_pt.num_heads * test_pt.head_size)
-        return attn.forward(reshaped_query, packed_qkv.key, packed_qkv.value)
-
-
-def _run_decoder_self_attention_test(
-    test_rsrcs: TestResources,
-    decoder_test_params: PhaseTestParameters,
-    attn_metadata: AttentionMetadata,
-    test_pt: TestPoint,
-    vllm_config: VllmConfig,
-) -> torch.Tensor:
-    '''
-    Run decoder self-attention test.
-
-    attn.forward() is passed attn_type=AttentionType.DECODER
-    in order to configure the kernel invocation for decoder self-attention.
-
-    Arguments:
-
-    * test_rsrcs: TestResources instance; this function relies on the kv_cache
-                  and attn (Attention wrapper instance) fields
-    * decoder_test_params: decoder PhaseTestParameters data structure;
-                           this function relies on the packed
-                           (number_of_tokens x num_heads x head_size)
-                           query/key/value fields
-    * attn_metadata: attention metadata for decoder-self attention
-                     (contains KV cache memory-mapping)
-    * test_pt: The TestPoint object containing test details like number of
-               model heads, head size, name of the backend being used etc.
-
-    Returns:
-    * Attention.forward() applied to packed_{query,key,value}, kv_cache
-      & attn_metadata
-    '''
-    attn = test_rsrcs.attn
-    packed_qkv = decoder_test_params.packed_qkvo.packed_qkv
-    assert packed_qkv is not None
-    with set_forward_context(attn_metadata, vllm_config):
-        # In the test setup the shape of the query is
-        # [batch_size, seq_len, num_heads, head_size]. However
-        # the attention backend expects the shape to be
-        # [num_tokens, hidden_size]. Hence reshape the query before
-        # invoking the forward method.
-        # TODO - Update the way we construct the query so that it
-        # is shaped as [num_tokens, hidden_size] and we can skip the reshape.
-        reshaped_query = packed_qkv.query.view(
-            -1, test_pt.num_heads * test_pt.head_size)
-        return attn.forward(reshaped_query, packed_qkv.key, packed_qkv.value)
-
-
-def _run_encoder_decoder_cross_attention_test(
-    test_rsrcs: TestResources,
-    decoder_test_params: PhaseTestParameters,
-    cross_test_params: Optional[PhaseTestParameters],
-    attn_metadata: AttentionMetadata,
-    test_pt: TestPoint,
-    vllm_config: VllmConfig,
-) -> torch.Tensor:
-    '''
-    Run encoder/decoder cross-attention test.
-
-    Via PhaseTestParameters data structures, consumes the same query utilized
-    for decoder self-attention, plus a key/value specific to cross-attention.
-
-    if cross_test_params is None or cross_test_params.packed_qkvo.packed_qkv
-    is None, this reflects that in decode-phase cross attention there
-    is no growth in the key and value tensors.
-
-    attn.forward() is passed attn_type=AttentionType.ENCODER_DECODER
-    in order to configure the kernel invocation for encoder/decoder cross-
-    attention.
-
-    Arguments:
-
-    * test_rsrcs: TestResources instance; this function relies on the kv_cache
-                  and attn (Attention wrapper instance) fields
-    * decoder_test_params: decoder PhaseTestParameters data structure;
-                           this function relies on the packed
-                           (number_of_tokens x num_heads x head_size)
-                           query field
-    * cross_test_params: encoder/decoder PhaseTestParameters data structure;
-                         this function relies on the packed
-                         (number_of_tokens x num_heads x head_size)
-                         key/value fields
-    * attn_metadata: attention metadata for encoder/decoder-self attention
-    * test_pt: The TestPoint object containing test details like number of
-               model heads, head size, name of the backend being used etc.
-
-    Returns:
-    * Attention.forward() applied to packed_{query,key,value}, kv_cache
-      & attn_metadata
-    '''
-    assert decoder_test_params.packed_qkvo.packed_qkv is not None
-
-    attn = test_rsrcs.attn
-    if cross_test_params is None:
-        key = None
-        value = None
-    else:
-        cross_pckd_qkv = cross_test_params.packed_qkvo.packed_qkv
-        key = (None if cross_pckd_qkv is None else cross_pckd_qkv.key)
-        value = (None if cross_pckd_qkv is None else cross_pckd_qkv.value)
-    with set_forward_context(attn_metadata, vllm_config):
-        # In the test setup the shape of the query is
-        # [batch_size, seq_len, num_heads, head_size]. However
-        # the attention backend expects the shape to be
-        # [num_tokens, hidden_size]. Hence reshape the query before
-        # invoking the forward method.
-        # TODO - Update the way we construct the query so that it
-        # is shaped as [num_tokens, hidden_size] and we can skip the reshape.
-        reshaped_query = decoder_test_params.packed_qkvo.packed_qkv.query.view(
-            -1, test_pt.num_heads * test_pt.head_size)
-        return attn.forward(reshaped_query, key, value)
-
-
-@pytest.fixture(autouse=True)
-def set_reset_environment(attn_backend):
-    # Set the default torch datatype to bfloat16 to enable
-    # testing of the Flash Attention backend. Also clear the
-    # cached value of the backend.
-    default_dtype = torch.get_default_dtype()
-    if attn_backend.name == 'FLASH_ATTN':
-        torch.set_default_dtype(torch.bfloat16)
-    _cached_get_attn_backend.cache_clear()
-    yield
-    # Reset the torch datatype to what it was before the test
-    # so as not to impact the remaining tests.
-    torch.set_default_dtype(default_dtype)
-
-
-@pytest.mark.skipif(current_platform.is_rocm(),
-                    reason=STR_NOT_IMPL_ENC_DEC_ROCM_HIP)
-@pytest.mark.parametrize("num_heads", NUM_HEADS)
-@pytest.mark.parametrize("head_size", HEAD_SIZES)
-@pytest.mark.parametrize("attn_backend", LIST_ENC_DEC_SUPPORTED_BACKENDS)
-@pytest.mark.parametrize("batch_size", BATCH_SIZES)
-@pytest.mark.parametrize("block_size", BLOCK_SIZES)
-@pytest.mark.parametrize("max_dec_seq_len", MAX_DEC_SEQ_LENS)
-@pytest.mark.parametrize("max_enc_seq_len", MAX_ENC_SEQ_LENS)
-def test_encoder_only(
-    num_heads: int,
-    head_size: int,
-    attn_backend: _Backend,
-    batch_size: int,
-    block_size: int,
-    max_dec_seq_len: int,
-    max_enc_seq_len: int,
-):
-    '''
-    End-to-end encoder-only attention test:
-
-    * Construct fake test vectors for (1) encoder attention
-    * Construct (1) attention metadata structure with prefill-phase
-      encoder attention, and (2) an analogous attention metadata
-      structure but for decode-phase
-    * Test & validate encoder attention against ideal output
-
-    No KV cache is required for encoder-only attention.
-
-    Note on ROCm/HIP: currently encoder/decoder models are not supported on
-    AMD GPUs, therefore this test simply is skipped if
-    current_platform.is_rocm().
-
-    This test globally forces an override of the usual backend
-    auto-selection process, forcing the specific backend-under-test
-    to be utilized.
-
-    Arguments:
-
-    * num_heads
-    * head_size,
-    * attn_backend: The attention backend to employ for testing
-    * batch_size
-    * block_size: KV cache block size
-    * max_dec_seq_len: max length of decoder input sequences
-    * max_enc_seq_len: max length of encoder input sequences
-    '''
-    # Force Attention wrapper backend
-    with global_force_attn_backend_context_manager(attn_backend):
-        # Note: KV cache size of 4096 is arbitrary & chosen intentionally
-        # to be more than necessary, since exceeding the kv cache size
-        # is not part of this test
-        test_pt = TestPoint(num_heads, head_size, attn_backend.name,
-                            batch_size, block_size, max_dec_seq_len,
-                            max_enc_seq_len, 4096, AttentionType.ENCODER)
-
-        # Attention scale factor, attention backend instance, attention wrapper
-        # instance, KV cache init
-        vllm_config = VllmConfig()
-        with set_current_vllm_config(vllm_config):
-            test_rsrcs = _make_test_resources(test_pt)
-
-        # Construct encoder attention test params (only used
-        # during prefill)
-
-        enc_test_params = _encoder_attn_setup(test_pt, test_rsrcs)
-
-        # Shared prefill metadata structure
-
-        prephase_attn_metadata: AttentionMetadata = make_test_metadata(
-            attn_backend,
-            True,
-            None,
-            decoder_test_params=None,
-            encoder_test_params=enc_test_params,
-            cross_test_params=None,
-            device=CUDA_DEVICE)
-
-        # PREFILL: encoder attention
-
-        enc_pckd_act_out: torch.Tensor = (_run_encoder_attention_test(
-            test_rsrcs.attn,
-            enc_test_params,
-            prephase_attn_metadata,
-            test_pt=test_pt,
-            vllm_config=vllm_config))
-
-        # - Is encoder attention result correct?
-        assert_actual_matches_ideal(enc_test_params, enc_pckd_act_out,
-                                    attn_backend.name)
-
-
-@pytest.mark.skipif(current_platform.is_rocm(),
-                    reason=STR_NOT_IMPL_ENC_DEC_ROCM_HIP)
-@pytest.mark.parametrize("num_heads", NUM_HEADS)
-@pytest.mark.parametrize("head_size", HEAD_SIZES)
-@pytest.mark.parametrize("attn_backend", LIST_ENC_DEC_SUPPORTED_BACKENDS)
-@pytest.mark.parametrize("batch_size", BATCH_SIZES)
-@pytest.mark.parametrize("block_size", BLOCK_SIZES)
-@pytest.mark.parametrize("max_dec_seq_len", MAX_DEC_SEQ_LENS)
-@pytest.mark.parametrize("max_enc_seq_len", MAX_ENC_SEQ_LENS)
-def test_e2e_enc_dec_attn(
-    num_heads: int,
-    head_size: int,
-    attn_backend: _Backend,
-    batch_size: int,
-    block_size: int,
-    max_dec_seq_len: int,
-    max_enc_seq_len: int,
-) -> None:
-    '''
-    End-to-end encoder/decoder test:
-
-    * Construct fake test vectors for (1) encoder attention,
-      (2) decoder self-attention, and (3) encoder/decoder cross-attention
-    * Construct (1) attention metadata structure with self- and cross-attention
-      attributes for prefill-phase, and (2) an analogous attention metadata
-      structure but for decode-phase
-    * Test attention steps in the following order
-
-        * Encoder attention
-        * Prefill self-attention
-        * Prefill cross-attention
-        * Decode self-attention
-        * Decode cross-attention
-        * Besides being reflective of realistic use-cases, this order would
-          exacerbate any accidental overlap in the self-/cross-attention
-          block tables, which one hopes to avoid
-
-
-    * Validate output correctness against ideal reference attention
-      implementation
-
-    Block tables are constructed such that cross-attention KV cache is in a
-    higher, non-intersecting address-space than self-attention KV cache.
-
-    Self- and cross-attention share the same query tensor but not the K/V
-    tensors. Self-attention K/Vs must have the same seq len as Q while
-    cross-attention K/Vs are allowed to differ in seq len, as is often the case
-    for cross-attention.
-
-    This test globally forces an override of the usual backend
-    auto-selection process, forcing the specific backend-under-test
-    to be utilized.
-
-    Note on ROCm/HIP: currently encoder/decoder models are not supported on
-    AMD GPUs, therefore this test simply is skipped if
-    current_platform.is_rocm().
-
-    Note on metadata: there is a single attention metadata structure shared by
-    all prefill-phase attention operations (encoder, decoder, enc/dec cross),
-    and a single one shared by all decode-phase attention operations
-    (decoder & enc/dec cross.) This is intended to reflect the behavior
-    of EncoderDecoderModelRunner, which constructs a single attention metadata
-    structure for each prefill or decode run. A realistic scenario would rely
-    on the attention backend to utilize the appropriate attention metadata
-    fields according to the value of attn_metadata.attention_type. Thus,
-    this test is organized so as to confirm that the backend-under-test can
-    handle a shared prefill attention metadata structure & a shared decode\
-    attention metadata structure.
-
-    Arguments:
-
-    * num_heads
-    * head_size,
-    * attn_backend: The attention backend to employ for testing
-    * batch_size
-    * block_size: KV cache block size
-    * max_dec_seq_len: max length of decoder input sequences
-    * max_enc_seq_len: max length of encoder input sequences
-    '''
-    # Force Attention wrapper backend
-    with global_force_attn_backend_context_manager(attn_backend):
-        # Note: KV cache size of 4096 is arbitrary & chosen intentionally
-        # to be more than necessary, since exceeding the kv cache size
-        # is not part of this test
-        enc_test_pt = TestPoint(num_heads, head_size, attn_backend.name,
-                                batch_size, block_size, max_dec_seq_len,
-                                max_enc_seq_len, 4096, AttentionType.ENCODER)
-        enc_dec_test_pt = TestPoint(num_heads, head_size, attn_backend.name,
-                                    batch_size, block_size, max_dec_seq_len,
-                                    max_enc_seq_len, 4096,
-                                    AttentionType.ENCODER_DECODER)
-        dec_test_pt = TestPoint(num_heads, head_size, attn_backend.name,
-                                batch_size, block_size, max_dec_seq_len,
-                                max_enc_seq_len, 4096, AttentionType.DECODER)
-
-        # Attention scale factor, attention backend instance, attention wrapper
-        # instance, KV cache init
-        vllm_config = VllmConfig()
-        with set_current_vllm_config(vllm_config):
-            enc_test_rsrcs = _make_test_resources(enc_test_pt)
-            enc_dec_test_rsrcs = _make_test_resources(enc_dec_test_pt)
-            dec_test_rsrcs = _make_test_resources(dec_test_pt)
-
-        # Construct encoder attention test params (only used
-        # during prefill)
-
-        enc_test_params = _encoder_attn_setup(enc_test_pt, enc_test_rsrcs)
-
-        # Construct Decoder self-attention prefill-phase & decode-phase
-        # test params, including query/key/value tensors, decoder self-attention
-        # memory-mapping. cross_block_base_addr is the uppermost address in the
-        # decoder self-attention block-table, i.e. a base address which the
-        # encoder/decoder cross-attention block-table may build downward toward.
-
-        (
-            dec_qkv,
-            prephase_dec_test_params,
-            decphase_dec_test_params,
-            cross_block_base_addr,
-        ) = _decoder_attn_setup(dec_test_pt, dec_test_rsrcs)
-
-        # Construct encoder/decoder cross-attention prefill-phase
-        # & decode-phase test params, including key/value tensors,
-        # cross-attention memory-mapping
-
-        (
-            prephase_cross_test_params,
-            decphase_cross_test_params,
-        ) = _enc_dec_cross_attn_setup_reuses_query(
-            dec_qkv,
-            enc_test_params,
-            prephase_dec_test_params,
-            enc_dec_test_pt,
-            enc_dec_test_rsrcs,
-            block_base_addr=cross_block_base_addr)
-
-        # Shared prefill metadata structure
-        assert prephase_dec_test_params.packed_qkvo.packed_qkv is not None
-        prephase_attn_metadata: AttentionMetadata = make_test_metadata(
-            attn_backend,
-            True,
-            prephase_dec_test_params.packed_qkvo.packed_qkv.q_seq_lens,
-            decoder_test_params=prephase_dec_test_params,
-            encoder_test_params=enc_test_params,
-            cross_test_params=prephase_cross_test_params,
-            device=CUDA_DEVICE)
-
-        # PREFILL: encoder attention
-
-        enc_pckd_act_out = _run_encoder_attention_test(enc_test_rsrcs.attn,
-                                                       enc_test_params,
-                                                       prephase_attn_metadata,
-                                                       test_pt=enc_test_pt,
-                                                       vllm_config=vllm_config)
-
-        # - Is encoder attention result correct?
-        assert_actual_matches_ideal(enc_test_params, enc_pckd_act_out,
-                                    attn_backend.name)
-
-        # PREFILL: decoder self-attention test
-
-        prephase_dec_pckd_act_out = _run_decoder_self_attention_test(
-            dec_test_rsrcs,
-            prephase_dec_test_params,
-            prephase_attn_metadata,
-            test_pt=dec_test_pt,
-            vllm_config=vllm_config)
-
-        # - Is prefill decoder self-attention correct?
-        assert_actual_matches_ideal(prephase_dec_test_params,
-                                    prephase_dec_pckd_act_out,
-                                    attn_backend.name)
-
-        # PREFILL: encoder/decoder cross-attention test
-
-        prephase_cross_pckd_act_out = _run_encoder_decoder_cross_attention_test(
-            enc_dec_test_rsrcs,
-            prephase_dec_test_params,
-            prephase_cross_test_params,
-            prephase_attn_metadata,
-            test_pt=enc_dec_test_pt,
-            vllm_config=vllm_config)
-
-        # - Is prefill encoder/decoder cross-attention correct?
-        assert_actual_matches_ideal(prephase_cross_test_params,
-                                    prephase_cross_pckd_act_out,
-                                    attn_backend.name)
-
-        # DECODE: build decode-phase attention metadata
-
-        decphase_attn_metadata: AttentionMetadata = make_test_metadata(
-            attn_backend,
-            False,
-            dec_qkv.q_seq_lens,
-            decoder_test_params=decphase_dec_test_params,
-            encoder_test_params=enc_test_params,
-            cross_test_params=decphase_cross_test_params,
-            device=CUDA_DEVICE)
-
-        # DECODE: decoder self-attention test
-
-        decphase_dec_pckd_act_out = _run_decoder_self_attention_test(
-            dec_test_rsrcs,
-            decphase_dec_test_params,
-            decphase_attn_metadata,
-            test_pt=dec_test_pt,
-            vllm_config=vllm_config)
-
-        # - Is decode-phase decoder self-attention correct?
-        assert_actual_matches_ideal(decphase_dec_test_params,
-                                    decphase_dec_pckd_act_out,
-                                    attn_backend.name)
-
-        # DECODE: encoder/decoder cross-attention test
-
-        decphase_cross_pckd_act_out = _run_encoder_decoder_cross_attention_test(
-            enc_dec_test_rsrcs,
-            decphase_dec_test_params,
-            None,
-            decphase_attn_metadata,
-            test_pt=enc_dec_test_pt,
-            vllm_config=vllm_config)
-
-        # - Is decode-phase encoder/decoder cross-attention correct?
-        assert_actual_matches_ideal(decphase_cross_test_params,
-                                    decphase_cross_pckd_act_out,
-                                    attn_backend.name)
--- a/tests/models/language/generation/test_bart.py
+++ b/tests/models/language/generation/test_bart.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-from typing import Optional
-
-import pytest
-from transformers import AutoModelForSeq2SeqLM
-
-from vllm.sequence import SampleLogprobs
-
-from ....conftest import (DecoderPromptType, ExplicitEncoderDecoderPrompt,
-                          HfRunner, VllmRunner)
-from ....utils import multi_gpu_test
-from ...utils import check_logprobs_close
-
-
-def vllm_to_hf_output(
-    vllm_output: tuple[list[int], str, Optional[SampleLogprobs]],
-    decoder_prompt_type: DecoderPromptType,
-):
-    """Sanitize vllm output to be comparable with hf output."""
-    output_ids, output_str, out_logprobs = vllm_output
-
-    hf_output_str = output_str + "</s>"
-    if decoder_prompt_type == DecoderPromptType.NONE:
-        hf_output_str = "<s>" + hf_output_str
-
-    return output_ids, hf_output_str, out_logprobs
-
-
-def run_test(
-    hf_runner: type[HfRunner],
-    vllm_runner: type[VllmRunner],
-    prompts: list[ExplicitEncoderDecoderPrompt[str, str]],
-    decoder_prompt_type: DecoderPromptType,
-    model: str,
-    *,
-    dtype: str,
-    max_tokens: int,
-    num_logprobs: int,
-    tensor_parallel_size: int,
-    distributed_executor_backend: Optional[str] = None,
-) -> None:
-    '''
-    Test the vLLM BART model for a variety of encoder/decoder input prompts,
-    by validating it against HuggingFace (HF) BART.
-
-    Arguments:
-
-    * hf_runner: HuggingFace (HF) test model runner
-    * vllm_runner: vLLM test model runner
-    * example_encoder_decoder_prompts: test fixture which provides a 
-                                       dictionary of dummy prompts
-    * model: the HF ID of the specific BART variant under test
-    * dtype: the tensor datatype to employ
-    * max_tokens
-    * num_logprobs
-    * decoder_prompt_type: key into the example_encoder_decoder_prompts
-                           dictionary; selects specific encoder/decoder
-                           prompt scenarios to test
-
-    A note on using HF BART as a baseline for validating vLLM BART,
-    specifically when the decoder prompt is None. 
-    
-    The HF GenerationMixin's default behavior is to force the first
-    decoded token to be <BOS> if the prompt does not already contain
-    <BOS> (this is accomplished using a logit
-    processor setting.)
-    
-    So when we use HF BART as our baseline for comparison, note that
-    when the user provides a request with a None decoder prompt
-    (i.e. a singleton encoder prompt, or else an explicit encoder/
-    decoder prompt with the decoder sub-prompt set to None), HF and
-    vLLM handle this in different ways:
-    
-    * HF will (1) tokenize the None prompt as an empty token-list, 
-      (2) append <decoder-start-token> to the beginning, yielding
-      [<decoder-start-token>], (3) pass this token list to the model, and
-      then (4) after computing logits during prefill, override the model
-      logits & force <BOS> to be the first generated token.
-    
-    * vLLM will (1) tokenize the None prompt as [<BOS>], (2) append decoder-
-      start-token to the beginning, yielding [<decoder-start-token><BOS>],
-      (3) pass these tokens to the model & proceed with generation.
-    
-    The net effect is that compared to vLLM, the list of HF *decoded* tokens
-    will contain one more initial <BOS> than the vLLM generated tokens,
-    because vLLM's <BOS> token is injected into the prompt rather than into
-    the generated output. This is in spite of the fact that overall, the
-    complete sequences (prompt + decoded tokens) produced by vLLM will match
-    HF.
-    
-    So when we use HF decoded token output to validate vLLM's decoded token
-    output, the testing process must account for the difference in decoded
-    token sequences between vLLM and HF specifically in the
-    decoder-prompt-is-None case. 
-    
-    One option is to disable the logit processor feature that forces the
-    <BOS> token to be decoded (forced_bos_token_id = None), eliminating
-    the problem entirely. However this is not "normal" BART usage.
-    
-    The other option is - only in the decoder-prompt-is-None case - to
-    discard the first decoded token from the HF output before comparing it
-    to vLLM.
-
-    To that end, when testing the scenario where the decoder prompt is None
-    (and only in that one scenario), this test skips the first HF decoded
-    token during the process of validating the vLLM decoded output.
-    '''
-
-    # NOTE: take care of the order. run vLLM first, and then run HF.
-    # vLLM needs a fresh new process without cuda initialization.
-    # if we run HF first, the cuda initialization will be done and it
-    # will hurt multiprocessing backend with fork method (the default).
-
-    # Note: currently encoder/decoder models are only compatible with
-    # enforce_eager=True. Normally this is not a problem because
-    # for encoder/decoder models vLLM will
-    # default to enforce_eager=True if enforce_eager
-    # is left unspecified. However, the
-    # VllmRunner test fixture (which wraps around the LLM class) defaults to
-    # enforce_eager=False (a behavior which a number of already-existing
-    # decoder-only unit tests expect), so when testing an encoder/decoder
-    # model we must explicitly specify enforce_eager=True in the VllmRunner
-    # constructor.
-    with vllm_runner(model,
-                     dtype=dtype,
-                     tensor_parallel_size=tensor_parallel_size,
-                     distributed_executor_backend=distributed_executor_backend,
-                     enforce_eager=True) as vllm_model:
-        vllm_outputs = vllm_model.generate_encoder_decoder_greedy_logprobs(
-            prompts, max_tokens, num_logprobs)
-
-    # Configuration settings for HF baseline
-    hf_kwargs = {
-        "top_k": None,
-        "num_beams": 1,
-        "repetition_penalty": 1.0,
-        "top_p": 1.0,
-        "length_penalty": 1.0,
-        "early_stopping": False,
-        "no_repeat_ngram_size": None,
-        "min_length": 0
-    }
-
-    with hf_runner(model, dtype=dtype,
-                   auto_cls=AutoModelForSeq2SeqLM) as hf_model:
-        hf_outputs = (hf_model.generate_encoder_decoder_greedy_logprobs_limit(
-            prompts,
-            max_tokens,
-            num_logprobs,
-            **hf_kwargs,
-        ))
-
-    hf_skip_tokens = (1
-                      if decoder_prompt_type == DecoderPromptType.NONE else 0)
-
-    check_logprobs_close(
-        outputs_0_lst=hf_outputs,
-        outputs_1_lst=[
-            vllm_to_hf_output(vllm_output, decoder_prompt_type)
-            for vllm_output in vllm_outputs
-        ],
-        name_0="hf",
-        name_1="vllm",
-        num_outputs_0_skip_tokens=hf_skip_tokens,
-    )
-
-
-@pytest.mark.parametrize(
-    "model",
-    [
-        pytest.param("facebook/bart-base",
-                     marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
-        pytest.param("facebook/bart-large-cnn"),
-    ],
-)
-@pytest.mark.parametrize("dtype", ["float", "bfloat16"])
-@pytest.mark.parametrize("max_tokens", [64])
-@pytest.mark.parametrize("num_logprobs", [5])
-@pytest.mark.parametrize("decoder_prompt_type", list(DecoderPromptType))
-@pytest.mark.skip(reason="bart not supported in V1")
-def test_models(hf_runner, vllm_runner, example_encoder_decoder_prompts, model,
-                dtype, max_tokens, num_logprobs, decoder_prompt_type) -> None:
-
-    run_test(
-        hf_runner,
-        vllm_runner,
-        example_encoder_decoder_prompts[decoder_prompt_type],
-        decoder_prompt_type,
-        model,
-        dtype=dtype,
-        max_tokens=max_tokens,
-        num_logprobs=num_logprobs,
-        tensor_parallel_size=1,
-    )
-
-
-@multi_gpu_test(num_gpus=2)
-@pytest.mark.parametrize("distributed_executor_backend", ["ray", "mp"])
-@pytest.mark.parametrize("model", ["facebook/bart-large-cnn"])
-@pytest.mark.parametrize("dtype", ["float"])
-@pytest.mark.parametrize("max_tokens", [64])
-@pytest.mark.parametrize("num_logprobs", [5])
-@pytest.mark.parametrize("decoder_prompt_type", [DecoderPromptType.CUSTOM])
-@pytest.mark.skip(reason="bart not supported in V1")
-def test_models_distributed(hf_runner, vllm_runner,
-                            example_encoder_decoder_prompts,
-                            distributed_executor_backend, model, dtype,
-                            max_tokens, num_logprobs,
-                            decoder_prompt_type) -> None:
-    run_test(
-        hf_runner,
-        vllm_runner,
-        example_encoder_decoder_prompts[decoder_prompt_type],
-        decoder_prompt_type,
-        model,
-        dtype=dtype,
-        max_tokens=max_tokens,
-        num_logprobs=num_logprobs,
-        tensor_parallel_size=2,
-        distributed_executor_backend=distributed_executor_backend,
-    )
--- a/tests/models/language/generation/test_mbart.py
+++ b/tests/models/language/generation/test_mbart.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-from typing import Optional
-
-import pytest
-from transformers import AutoModelForSeq2SeqLM
-
-from vllm.sequence import SampleLogprobs
-
-from ....conftest import DecoderPromptType, HfRunner, VllmRunner
-from ...utils import check_logprobs_close
-
-
-def vllm_to_hf_output(
-    vllm_output: tuple[list[int], str, Optional[SampleLogprobs]],
-    decoder_prompt_type: DecoderPromptType,
-):
-    """Sanitize vllm output to be comparable with hf output."""
-    output_ids, output_str, out_logprobs = vllm_output
-    hf_output_str = output_str + "</s>"
-    return output_ids, hf_output_str, out_logprobs
-
-
-def run_test(
-    hf_runner: type[HfRunner],
-    vllm_runner: type[VllmRunner],
-    prompts: list[dict[str, str]],
-    decoder_prompt_type: DecoderPromptType,
-    model: str,
-    *,
-    dtype: str,
-    max_tokens: int,
-    num_logprobs: int,
-    tensor_parallel_size: int,
-    distributed_executor_backend: Optional[str] = None,
-) -> None:
-    '''
-    Test the vLLM mBART model by validating it against HuggingFace (HF).
-    (Docstring content is omitted for brevity)
-    '''
-
-    vllm_prompts = prompts
-    if decoder_prompt_type == DecoderPromptType.NONE:
-        vllm_prompts = [{
-            "encoder_prompt": p['encoder_prompt'],
-            "decoder_prompt": ""
-        } for p in prompts]
-
-    vllm_kwargs = {
-        "hf_overrides": {
-            "architectures": ["MBartForConditionalGeneration"]
-        }
-    }
-
-    with vllm_runner(model,
-                     dtype=dtype,
-                     tensor_parallel_size=tensor_parallel_size,
-                     distributed_executor_backend=distributed_executor_backend,
-                     enforce_eager=True,
-                     **vllm_kwargs) as vllm_model:  # type: ignore
-        vllm_outputs = vllm_model.generate_encoder_decoder_greedy_logprobs(
-            vllm_prompts, max_tokens, num_logprobs)
-
-    hf_kwargs = {
-        "top_k": None,
-        "num_beams": 1,
-        "repetition_penalty": 1.0,
-        "top_p": 1.0,
-        "length_penalty": 1.0,
-        "early_stopping": False,
-        "no_repeat_ngram_size": None,
-        "min_length": 0
-    }
-
-    with hf_runner(model, dtype=dtype,
-                   auto_cls=AutoModelForSeq2SeqLM) as hf_model:
-        hf_kwargs["decoder_start_token_id"] = (
-            hf_model.tokenizer.lang_code_to_id["ro_RO"])
-
-        hf_outputs = (
-            hf_model.generate_encoder_decoder_greedy_logprobs_limit(
-                prompts,  # HF runner still uses the original prompts
-                max_tokens,
-                num_logprobs,
-                **hf_kwargs,
-            ))
-
-    hf_skip_tokens = 0
-
-    check_logprobs_close(
-        outputs_0_lst=hf_outputs,
-        outputs_1_lst=[
-            vllm_to_hf_output(vllm_output, decoder_prompt_type)
-            for vllm_output in vllm_outputs
-        ],
-        name_0="hf",
-        name_1="vllm",
-        num_outputs_0_skip_tokens=hf_skip_tokens,
-    )
-
-
-@pytest.mark.parametrize(
-    "model",
-    [pytest.param("facebook/mbart-large-en-ro")],
-)
-@pytest.mark.parametrize("dtype", ["float", "bfloat16"])
-@pytest.mark.parametrize("max_tokens", [64])
-@pytest.mark.parametrize("num_logprobs", [5])
-@pytest.mark.parametrize("decoder_prompt_type", list(DecoderPromptType))
-def test_models(hf_runner, vllm_runner, example_encoder_decoder_prompts, model,
-                dtype, max_tokens, num_logprobs, decoder_prompt_type) -> None:
-
-    run_test(
-        hf_runner,
-        vllm_runner,
-        example_encoder_decoder_prompts[decoder_prompt_type],
-        decoder_prompt_type,
-        model,
-        dtype=dtype,
-        max_tokens=max_tokens,
-        num_logprobs=num_logprobs,
-        tensor_parallel_size=1,
-    )