Remove V0 Encoder-Decoder Support (#24907)

Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>

Commit 759ef49b, authored Sep 15, 2025 by Woosuk Kwon, committed by GitHub Sep 15, 2025
20 changed files
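
Reading aid for the diff that follows: the `docs/models/supported_models.md` hunks drop the rows for five encoder-decoder architectures. The sketch below shows one way a deployment script might flag a model config that declares one of them. The helper name `removed_architectures` and the assumption that `config` is a Hugging Face style `config.json` mapping are illustrative only; neither is part of vLLM or of this commit.

```python
# Architecture names whose rows this commit removes from
# docs/models/supported_models.md (taken from the diff below).
REMOVED_ARCHITECTURES = frozenset({
    "BartForConditionalGeneration",
    "MBartForConditionalGeneration",
    "DonutForConditionalGeneration",
    "Florence2ForConditionalGeneration",
    "MllamaForConditionalGeneration",
})


def removed_architectures(config: dict) -> list[str]:
    """Return the declared architectures that appear in the removed set.

    Hypothetical helper: `config` is assumed to be a mapping shaped like a
    Hugging Face config.json, which may carry an "architectures" list.
    """
    declared = config.get("architectures") or []
    return sorted(name for name in declared if name in REMOVED_ARCHITECTURES)


if __name__ == "__main__":
    # A Llama 3.2 Vision style config is flagged; a Whisper one is not.
    print(removed_architectures({"architectures": ["MllamaForConditionalGeneration"]}))
    print(removed_architectures({"architectures": ["WhisperForConditionalGeneration"]}))
```

An empty result only means the config names none of the five architectures listed here; it says nothing about whether the model is otherwise supported.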
--- a/.buildkite/scripts/hardware_ci/run-cpu-test.sh
+++ b/.buildkite/scripts/hardware_ci/run-cpu-test.sh
@@ -66,7 +66,6 @@ function cpu_tests() {
    pytest -x -v -s tests/models/language/pooling -m cpu_model
    pytest -x -v -s tests/models/multimodal/generation \
-                --ignore=tests/models/multimodal/generation/test_mllama.py \
                --ignore=tests/models/multimodal/generation/test_pixtral.py \
                -m cpu_model"

--- a/.buildkite/test-pipeline.yaml
+++ b/.buildkite/test-pipeline.yaml
@@ -549,15 +549,6 @@ steps:
  commands: # LMEval+Transcription WER check
  - pytest -s entrypoints/openai/correctness/
-- label: Encoder Decoder tests # 12min
-  timeout_in_minutes: 20
-  mirror_hardwares: [amdexperimental]
-  source_file_dependencies:
-  - vllm/
-  - tests/encoder_decoder
-  commands:
-    - pytest -v -s encoder_decoder
 - label: OpenAI-Compatible Tool Use # 23 min
  timeout_in_minutes: 35
  mirror_hardwares: [amdexperimental]

--- a/docs/contributing/model/multimodal.md
+++ b/docs/contributing/model/multimodal.md
@@ -840,7 +840,6 @@ Some HF processors directly insert feature tokens without replacing anything in
 Examples:
 - BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
-- Florence2 (insert at start of prompt): <gh-file:vllm/model_executor/models/florence2.py>
 - Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>
 ### Handling prompt updates unrelated to multi-modal data

--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
@@ -331,8 +331,6 @@ th {
 | `BailingMoeV2ForCausalLM` | Ling | `inclusionAI/Ling-mini-2.0`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | ✅︎ |
 | `BloomForCausalLM` | BLOOM, BLOOMZ, BLOOMChat | `bigscience/bloom`, `bigscience/bloomz`, etc. | | ✅︎ | ✅︎ |
-| `BartForConditionalGeneration` | BART | `facebook/bart-base`, `facebook/bart-large-cnn`, etc. | | | |
-| `MBartForConditionalGeneration` | mBART | `facebook/mbart-large-en-ro`, `facebook/mbart-large-50`, etc. | | | |
 | `ChatGLMModel`, `ChatGLMForConditionalGeneration` | ChatGLM | `zai-org/chatglm2-6b`, `zai-org/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `CohereForCausalLM`, `Cohere2ForCausalLM` | Command-R, Command-A | `CohereLabs/c4ai-command-r-v01`, `CohereLabs/c4ai-command-r7b-12-2024`, `CohereLabs/c4ai-command-a-03-2025`, `CohereLabs/command-a-reasoning-08-2025`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `DbrxForCausalLM` | DBRX | `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc. | | ✅︎ | ✅︎ |
@@ -426,9 +424,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
 !!! note
    Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
-!!! note
-    Some mBART models' config files do not have an `architecture` defined. Therefore, you need to use `--hf-overrides '{"architectures": ["MBartForConditionalGeneration"]}'` to explicitly specify the use of the `MBartForConditionalGeneration` architecture.
 ### Pooling Models
 See [this page](./pooling_models.md) for more information on how to use pooling models.
@@ -625,9 +620,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `ChameleonForConditionalGeneration` | Chameleon | T + I | `facebook/chameleon-7b`, etc. | | ✅︎ | ✅︎ |
 | `Cohere2VisionForConditionalGeneration` | Command A Vision | T + I<sup>+</sup> | `CohereLabs/command-a-vision-07-2025`, etc. | | ✅︎ | ✅︎ |
 | `DeepseekVLV2ForCausalLM`<sup>^</sup> | DeepSeek-VL2 | T + I<sup>+</sup> | `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2`, etc. | | ✅︎ | ✅︎ |
-| `DonutForConditionalGeneration`<sup>^</sup> | Donut | T + I | `ByteDance/Dolphin`, `naver-clova-ix/donut-base-finetuned-docvqa`, etc. | | | |
 | `Ernie4_5_VLMoeForConditionalGeneration` | Ernie4.5-VL | T + I<sup>+</sup>/ V<sup>+</sup> | `baidu/ERNIE-4.5-VL-28B-A3B-PT`, `baidu/ERNIE-4.5-VL-424B-A47B-PT` | | ✅︎ | ✅︎ |
-| `Florence2ForConditionalGeneration` | Florence-2 | T + I | `microsoft/Florence-2-base`, `microsoft/Florence-2-large`, etc. | | | |
 | `FuyuForCausalLM` | Fuyu | T + I | `adept/fuyu-8b`, etc. | | ✅︎ | ✅︎ |
 | `Gemma3ForConditionalGeneration` | Gemma 3 | T + I<sup>+</sup> | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ | ⚠️ |
 | `Gemma3nForConditionalGeneration` | Gemma 3n | T + I + A | `google/gemma-3n-E2B-it`, `google/gemma-3n-E4B-it`, etc. | | | ✅︎ |
@@ -654,7 +647,6 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `MiniCPMV` | MiniCPM-V | T + I<sup>E+</sup> + V<sup>E+</sup> | `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, `openbmb/MiniCPM-V-4`, `openbmb/MiniCPM-V-4_5`, etc. | ✅︎ | | ✅︎ |
 | `MiniMaxVL01ForConditionalGeneration` | MiniMax-VL | T + I<sup>E+</sup> | `MiniMaxAI/MiniMax-VL-01`, etc. | | ✅︎ | ✅︎ |
 | `Mistral3ForConditionalGeneration` | Mistral3 (HF Transformers) | T + I<sup>+</sup> | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `MllamaForConditionalGeneration` | Llama 3.2 | T + I<sup>+</sup> | `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc. | | | |
 | `MolmoForCausalLM` | Molmo | T + I<sup>+</sup> | `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `NVLM_D_Model` | NVLM-D 1.0 | T + I<sup>+</sup> | `nvidia/NVLM-D-72B`, etc. | | ✅︎ | ✅︎ |
 | `Ovis` | Ovis2, Ovis1.6 | T + I<sup>+</sup> | `AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc. | | ✅︎ | ✅︎ |

--- a/docs/usage/v1_guide.md
+++ b/docs/usage/v1_guide.md
@@ -120,7 +120,7 @@ Please note that prefix caching is not yet supported for any of the above models
 Whisper is supported. Other models requiring cross-attention between separate
 encoder and decoder (e.g., `BartForConditionalGeneration`,
-`MllamaForConditionalGeneration`) are not yet supported.
+`MllamaForConditionalGeneration`) are not supported.
 ### Features

--- a/examples/offline_inference/dolphin.py
+++ b/examples/offline_inference/dolphin.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-import argparse
-import copy
-import os
-from dataclasses import dataclass
-import cv2
-import numpy as np
-import regex as re
-from PIL import Image
-from transformers import DonutProcessor
-from vllm import LLM, SamplingParams
-from vllm.inputs import ExplicitEncoderDecoderPrompt, TextPrompt, TokensPrompt
-from vllm.multimodal.utils import fetch_image
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-@dataclass
-class ImageDimensions:
-    original_w: int
-    original_h: int
-    padded_w: int
-    padded_h: int
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def map_to_original_coordinates(
-    x1, y1, x2, y2, dims: ImageDimensions
-) -> tuple[int, int, int, int]:
-    try:
-        top = (dims.padded_h - dims.original_h) // 2
-        left = (dims.padded_w - dims.original_w) // 2
-        orig_x1 = max(0, x1 - left)
-        orig_y1 = max(0, y1 - top)
-        orig_x2 = min(dims.original_w, x2 - left)
-        orig_y2 = min(dims.original_h, y2 - top)
-        if orig_x2 <= orig_x1:
-            orig_x2 = min(orig_x1 + 1, dims.original_w)
-        if orig_y2 <= orig_y1:
-            orig_y2 = min(orig_y1 + 1, dims.original_h)
-        return int(orig_x1), int(orig_y1), int(orig_x2), int(orig_y2)
-    except Exception as e:
-        print(f"map_to_original_coordinates error: {str(e)}")
-        return 0, 0, min(100, dims.original_w), min(100, dims.original_h)
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def adjust_box_edges(image, boxes: list[list[float]], max_pixels=15, threshold=0.2):
-    if isinstance(image, str):
-        image = cv2.imread(image)
-    img_h, img_w = image.shape[:2]
-    new_boxes = []
-    for box in boxes:
-        best_box = copy.deepcopy(box)
-        def check_edge(img, current_box, i, is_vertical):
-            edge = current_box[i]
-            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
-            _, binary = cv2.threshold(
-                gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
-            )
-            if is_vertical:
-                line = binary[current_box[1] : current_box[3] + 1, edge]
-            else:
-                line = binary[edge, current_box[0] : current_box[2] + 1]
-            transitions = np.abs(np.diff(line))
-            return np.sum(transitions) / len(transitions)
-        edges = [(0, -1, True), (2, 1, True), (1, -1, False), (3, 1, False)]
-        current_box = copy.deepcopy(box)
-        current_box[0] = min(max(current_box[0], 0), img_w - 1)
-        current_box[1] = min(max(current_box[1], 0), img_h - 1)
-        current_box[2] = min(max(current_box[2], 0), img_w - 1)
-        current_box[3] = min(max(current_box[3], 0), img_h - 1)
-        for i, direction, is_vertical in edges:
-            best_score = check_edge(image, current_box, i, is_vertical)
-            if best_score <= threshold:
-                continue
-            for step in range(max_pixels):
-                current_box[i] += direction
-                if i == 0 or i == 2:
-                    current_box[i] = min(max(current_box[i], 0), img_w - 1)
-                else:
-                    current_box[i] = min(max(current_box[i], 0), img_h - 1)
-                score = check_edge(image, current_box, i, is_vertical)
-                if score < best_score:
-                    best_score = score
-                    best_box = copy.deepcopy(current_box)
-                if score <= threshold:
-                    break
-        new_boxes.append(best_box)
-    return new_boxes
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def process_coordinates(coords, padded_image, dims: ImageDimensions, previous_box=None):
-    try:
-        x1, y1 = int(coords[0] * dims.padded_w), int(coords[1] * dims.padded_h)
-        x2, y2 = int(coords[2] * dims.padded_w), int(coords[3] * dims.padded_h)
-        x1, y1, x2, y2 = (
-            max(0, min(x1, dims.padded_w - 1)),
-            max(0, min(y1, dims.padded_h - 1)),
-            max(0, min(x2, dims.padded_w)),
-            max(0, min(y2, dims.padded_h)),
-        )
-        if x2 <= x1:
-            x2 = min(x1 + 1, dims.padded_w)
-        if y2 <= y1:
-            y2 = min(y1 + 1, dims.padded_h)
-        new_boxes = adjust_box_edges(padded_image, [[x1, y1, x2, y2]])
-        x1, y1, x2, y2 = new_boxes[0]
-        x1, y1, x2, y2 = (
-            max(0, min(x1, dims.padded_w - 1)),
-            max(0, min(y1, dims.padded_h - 1)),
-            max(0, min(x2, dims.padded_w)),
-            max(0, min(y2, dims.padded_h)),
-        )
-        if x2 <= x1:
-            x2 = min(x1 + 1, dims.padded_w)
-        if y2 <= y1:
-            y2 = min(y1 + 1, dims.padded_h)
-        if previous_box is not None:
-            prev_x1, prev_y1, prev_x2, prev_y2 = previous_box
-            if (x1 < prev_x2 and x2 > prev_x1) and (y1 < prev_y2 and y2 > prev_y1):
-                y1 = prev_y2
-                y1 = min(y1, dims.padded_h - 1)
-                if y2 <= y1:
-                    y2 = min(y1 + 1, dims.padded_h)
-        new_previous_box = [x1, y1, x2, y2]
-        orig_x1, orig_y1, orig_x2, orig_y2 = map_to_original_coordinates(
-            x1, y1, x2, y2, dims
-        )
-        return x1, y1, x2, y2, orig_x1, orig_y1, orig_x2, orig_y2, new_previous_box
-    except Exception as e:
-        print(f"process_coordinates error: {str(e)}")
-        orig_x1, orig_y1, orig_x2, orig_y2 = (
-            0,
-            0,
-            min(100, dims.original_w),
-            min(100, dims.original_h),
-        )
-        return 0, 0, 100, 100, orig_x1, orig_y1, orig_x2, orig_y2, [0, 0, 100, 100]
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def prepare_image(image) -> tuple[np.ndarray, ImageDimensions]:
-    try:
-        image_cv = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
-        original_h, original_w = image_cv.shape[:2]
-        max_size = max(original_h, original_w)
-        top = (max_size - original_h) // 2
-        bottom = max_size - original_h - top
-        left = (max_size - original_w) // 2
-        right = max_size - original_w - left
-        padded_image = cv2.copyMakeBorder(
-            image_cv, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(0, 0, 0)
-        )
-        padded_h, padded_w = padded_image.shape[:2]
-        dimensions = ImageDimensions(
-            original_w=original_w,
-            original_h=original_h,
-            padded_w=padded_w,
-            padded_h=padded_h,
-        )
-        return padded_image, dimensions
-    except Exception as e:
-        print(f"prepare_image error: {str(e)}")
-        h, w = image.height, image.width
-        dimensions = ImageDimensions(original_w=w, original_h=h, padded_w=w, padded_h=h)
-        return np.zeros((h, w, 3), dtype=np.uint8), dimensions
-# Copied from https://github.com/bytedance/Dolphin/utils/utils.py
-def parse_layout_string(bbox_str):
-    """Parse layout string using regular expressions"""
-    pattern = r"\[(\d*\.?\d+),\s*(\d*\.?\d+),\s*(\d*\.?\d+),\s*(\d*\.?\d+)\]\s*(\w+)"
-    matches = re.finditer(pattern, bbox_str)
-    parsed_results = []
-    for match in matches:
-        coords = [float(match.group(i)) for i in range(1, 5)]
-        label = match.group(5).strip()
-        parsed_results.append((coords, label))
-    return parsed_results
-model_id = "ByteDance/Dolphin"
-# The input image size for Dolphin is 896 x 896,
-# and the patch_size is 4 x 4.
-# Therefore, the initial number of patches is:
-# Height: 896 / 4 = 224 patches
-# Width: 896 / 4 = 224 patches
-# The Dolphin model uses a staged downsampling approach,
-# defined by the "depths": [2, 2, 14, 2] configuration.
-# Before entering stages 2, 3, and 4, a "Patch Merging" operation is performed,
-# which halves the feature map's dimensions (dividing both height and width by 2).
-# Before Stage 2: The size changes from 224 x 224 to (224/2) x (224/2) = 112 x 112.
-# Before Stage 3: The size changes from 112 x 112 to (112/2) x (112/2) = 56 x 56.
-# Before Stage 4: The size changes from 56 x 56 to (56/2) x (56/2) = 28 x 28.
-# Because vLLM needs to fill the image features with an encoder_prompt,
-# and the encoder_prompt will have `<pad>` tokens added when tokenized,
-# we need to construct an encoder_prompt with a length of 28 x 28 - 1 = 783.
-encoder_prompt = "".join(["0"] * 783)
-sampling_params = SamplingParams(
-    temperature=0.0,
-    max_tokens=2048,
-)
-processor = DonutProcessor.from_pretrained(model_id)
-llm = LLM(
-    model=model_id,
-    dtype="float16",
-    max_num_seqs=8,
-    hf_overrides={"architectures": ["DonutForConditionalGeneration"]},
-)
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--image_path", type=str, default=None, help="Path to a local image file."
-)
-args = parser.parse_args()
-if args.image_path:
-    if not os.path.exists(args.image_path):
-        raise FileNotFoundError(f"Error: File not found at {args.image_path}")
-    image = Image.open(args.image_path).convert("RGB")
-else:
-    image = fetch_image(
-        "https://huggingface.co/datasets/hf-internal-testing/example-documents/resolve/main/jpeg_images/0.jpg"
-    )
-prompt = "Parse the reading order of this document. "
-decoder_prompt = f"<s>{prompt}<Answer/>"
-decoder_prompt_tokens = TokensPrompt(
-    prompt_token_ids=processor.tokenizer(decoder_prompt, add_special_tokens=False)[
-        "input_ids"
-    ]
-)
-enc_dec_prompt = ExplicitEncoderDecoderPrompt(
-    encoder_prompt=TextPrompt(prompt=encoder_prompt, multi_modal_data={"image": image}),
-    decoder_prompt=decoder_prompt_tokens,
-)
-layout_outputs = llm.generate(prompts=enc_dec_prompt, sampling_params=sampling_params)
-layout_result_str = layout_outputs[0].outputs[0].text
-print(f"Layout analysis output:\n{layout_result_str}")
-padded_image, dims = prepare_image(image)
-layout_results = parse_layout_string(layout_result_str)
-text_table_elements = []
-previous_box = None
-reading_order = 0
-for bbox_coords, label in layout_results:
-    if label == "fig":
-        continue
-    try:
-        x1, y1, x2, y2, orig_x1, orig_y1, orig_x2, orig_y2, previous_box = (
-            process_coordinates(bbox_coords, padded_image, dims, previous_box)
-        )
-        cropped = padded_image[y1:y2, x1:x2]
-        if cropped.size > 0 and cropped.shape[0] > 3 and cropped.shape[1] > 3:
-            pil_crop = Image.fromarray(cv2.cvtColor(cropped, cv2.COLOR_BGR2RGB))
-            prompt_ocr = (
-                "Parse the table in the image. "
-                if label == "tab"
-                else "Read text in the image. "
-            )
-            text_table_elements.append(
-                {
-                    "crop": pil_crop,
-                    "prompt": prompt_ocr,
-                    "reading_order": reading_order,
-                }
-            )
-        reading_order += 1
-    except Exception as e:
-        print(f"Error processing bbox (label: {label}): {str(e)}")
-        continue
-if text_table_elements:
-    batch_prompts = []
-    for elem in text_table_elements:
-        decoder_prompt_str = f"<s>{elem['prompt']}<Answer/>"
-        decoder_prompt_tokens = TokensPrompt(
-            prompt_token_ids=processor.tokenizer(
-                decoder_prompt_str, add_special_tokens=False
-            )["input_ids"]
-        )
-        enc_dec_prompt = ExplicitEncoderDecoderPrompt(
-            encoder_prompt=TextPrompt(
-                prompt=encoder_prompt, multi_modal_data={"image": elem["crop"]}
-            ),
-            decoder_prompt=decoder_prompt_tokens,
-        )
-        batch_prompts.append(enc_dec_prompt)
-    batch_outputs = llm.generate(prompts=batch_prompts, sampling_params=sampling_params)
-    for i, output in enumerate(batch_outputs):
-        text_table_elements[i]["text"] = output.outputs[0].text.strip()
-print("------" * 8)
-text_table_elements.sort(key=lambda x: x["reading_order"])
-for elem in text_table_elements:
-    print(elem.get("text", ""))
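
Note on the removed `dolphin.py` above: its comment derives the placeholder encoder prompt length from the Swin patch grid. Each image side is divided by the patch size, halved once per patch-merging step (before stages 2, 3, and 4), the two sides are multiplied, and one is subtracted, which the comment attributes to the token added when the prompt is tokenized. A minimal sketch of that arithmetic follows; `encoder_prompt_length` is a hypothetical helper name, and the defaults (patch size 4, three merges) are read off the comment rather than from any model config.

```python
def encoder_prompt_length(height: int, width: int,
                          patch_size: int = 4, num_merges: int = 3) -> int:
    """Placeholder prompt length for a Swin-style encoder (illustrative)."""
    # Initial patch grid: one cell per patch_size x patch_size block.
    rows = height // patch_size
    cols = width // patch_size
    # Each patch-merging step halves both grid dimensions.
    for _ in range(num_merges):
        rows //= 2
        cols //= 2
    # One position is left for the token added at tokenization time.
    return rows * cols - 1


if __name__ == "__main__":
    print(encoder_prompt_length(896, 896))    # 783
    print(encoder_prompt_length(1920, 2560))  # 4799
```

The two results match the constants in the removed examples: 783 in `dolphin.py` and 4799 in `run_donut` from `encoder_decoder_multimodal.py`.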
--- a/examples/offline_inference/encoder_decoder.py
+++ b/examples/offline_inference/encoder_decoder.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-Demonstrate prompting of text-to-text
-encoder/decoder models, specifically BART and mBART.
-This script is refactored to allow model selection via command-line arguments.
-NOTE: This example is not yet supported in V1.
-"""
-import argparse
-from typing import NamedTuple, Optional
-from vllm import LLM, SamplingParams
-from vllm.inputs import (
-    ExplicitEncoderDecoderPrompt,
-    TextPrompt,
-    TokensPrompt,
-    zip_enc_dec_prompts,
-)
-class ModelRequestData(NamedTuple):
-    """
-    Holds the configuration for a specific model, including its
-    HuggingFace ID and the prompts to use for the demo.
-    """
-    model_id: str
-    encoder_prompts: list
-    decoder_prompts: list
-    hf_overrides: Optional[dict] = None
-def get_bart_config() -> ModelRequestData:
-    """
-    Returns the configuration for facebook/bart-large-cnn.
-    This uses the exact test cases from the original script.
-    """
-    encoder_prompts = [
-        "Hello, my name is",
-        "The president of the United States is",
-        "The capital of France is",
-        "An encoder prompt",
-    ]
-    decoder_prompts = [
-        "A decoder prompt",
-        "Another decoder prompt",
-    ]
-    return ModelRequestData(
-        model_id="facebook/bart-large-cnn",
-        encoder_prompts=encoder_prompts,
-        decoder_prompts=decoder_prompts,
-    )
-def get_mbart_config() -> ModelRequestData:
-    """
-    Returns the configuration for facebook/mbart-large-en-ro.
-    This uses prompts suitable for an English-to-Romanian translation task.
-    """
-    encoder_prompts = [
-        "The quick brown fox jumps over the lazy dog.",
-        "How are you today?",
-    ]
-    decoder_prompts = ["", ""]
-    hf_overrides = {"architectures": ["MBartForConditionalGeneration"]}
-    return ModelRequestData(
-        model_id="facebook/mbart-large-en-ro",
-        encoder_prompts=encoder_prompts,
-        decoder_prompts=decoder_prompts,
-        hf_overrides=hf_overrides,
-    )
-MODEL_GETTERS = {
-    "bart": get_bart_config,
-    "mbart": get_mbart_config,
-}
-def create_all_prompt_types(
-    encoder_prompts_raw: list,
-    decoder_prompts_raw: list,
-    tokenizer,
-) -> list:
-    """
-    Generates a list of diverse prompt types for demonstration.
-    This function is generic and uses the provided raw prompts
-    to create various vLLM input objects.
-    """
-    text_prompt_raw = encoder_prompts_raw[0]
-    text_prompt = TextPrompt(prompt=encoder_prompts_raw[1 % len(encoder_prompts_raw)])
-    tokens_prompt = TokensPrompt(
-        prompt_token_ids=tokenizer.encode(
-            encoder_prompts_raw[2 % len(encoder_prompts_raw)]
-        )
-    )
-    decoder_tokens_prompt = TokensPrompt(
-        prompt_token_ids=tokenizer.encode(decoder_prompts_raw[0])
-    )
-    single_prompt_examples = [
-        text_prompt_raw,
-        text_prompt,
-        tokens_prompt,
-    ]
-    explicit_pair_examples = [
-        ExplicitEncoderDecoderPrompt(
-            encoder_prompt=text_prompt_raw,
-            decoder_prompt=decoder_tokens_prompt,
-        ),
-        ExplicitEncoderDecoderPrompt(
-            encoder_prompt=text_prompt,
-            decoder_prompt=decoder_prompts_raw[1 % len(decoder_prompts_raw)],
-        ),
-        ExplicitEncoderDecoderPrompt(
-            encoder_prompt=tokens_prompt,
-            decoder_prompt=text_prompt,
-        ),
-    ]
-    zipped_prompt_list = zip_enc_dec_prompts(
-        encoder_prompts_raw,
-        decoder_prompts_raw,
-    )
-    return single_prompt_examples + explicit_pair_examples + zipped_prompt_list
-def create_sampling_params() -> SamplingParams:
-    """Create a sampling params object."""
-    return SamplingParams(
-        temperature=0,
-        top_p=1.0,
-        min_tokens=0,
-        max_tokens=30,
-    )
-def print_outputs(outputs: list):
-    """Formats and prints the generation outputs."""
-    print("-" * 80)
-    for i, output in enumerate(outputs):
-        prompt = output.prompt
-        encoder_prompt = output.encoder_prompt
-        generated_text = output.outputs[0].text
-        print(f"Output {i + 1}:")
-        print(f"Encoder Prompt: {encoder_prompt!r}")
-        print(f"Decoder Prompt: {prompt!r}")
-        print(f"Generated Text: {generated_text!r}")
-        print("-" * 80)
-def main(args):
-    """Main execution function."""
-    model_key = args.model
-    if model_key not in MODEL_GETTERS:
-        raise ValueError(
-            f"Unknown model: {model_key}. "
-            f"Available models: {list(MODEL_GETTERS.keys())}"
-        )
-    config_getter = MODEL_GETTERS[model_key]
-    model_config = config_getter()
-    print(f"🚀 Running demo for model: {model_config.model_id}")
-    llm = LLM(
-        model=model_config.model_id,
-        dtype="float",
-        hf_overrides=model_config.hf_overrides,
-    )
-    tokenizer = llm.llm_engine.get_tokenizer_group()
-    prompts = create_all_prompt_types(
-        encoder_prompts_raw=model_config.encoder_prompts,
-        decoder_prompts_raw=model_config.decoder_prompts,
-        tokenizer=tokenizer,
-    )
-    sampling_params = create_sampling_params()
-    outputs = llm.generate(prompts, sampling_params)
-    print_outputs(outputs)
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description="A flexible demo for vLLM encoder-decoder models."
-    )
-    parser.add_argument(
-        "--model",
-        "-m",
-        type=str,
-        default="bart",
-        choices=MODEL_GETTERS.keys(),
-        help="The short name of the model to run.",
-    )
-    args = parser.parse_args()
-    main(args)
--- a/examples/offline_inference/encoder_decoder_multimodal.py
+++ b/examples/offline_inference/encoder_decoder_multimodal.py
@@ -13,8 +13,6 @@ from typing import NamedTuple
 from vllm import LLM, EngineArgs, PromptType, SamplingParams
 from vllm.assets.audio import AudioAsset
-from vllm.assets.image import ImageAsset
-from vllm.multimodal.utils import fetch_image
 from vllm.utils import FlexibleArgumentParser
@@ -23,113 +21,6 @@ class ModelRequestData(NamedTuple):
    prompts: Sequence[PromptType]
-def run_donut():
-    engine_args = EngineArgs(
-        model="naver-clova-ix/donut-base-finetuned-docvqa",
-        max_num_seqs=2,
-        limit_mm_per_prompt={"image": 1},
-        dtype="float16",
-        hf_overrides={"architectures": ["DonutForConditionalGeneration"]},
-    )
-    # The input image size for donut-base-finetuned-docvqa is 2560 x 1920,
-    # and the patch_size is 4 x 4.
-    # Therefore, the initial number of patches is:
-    # Height: 1920 / 4 = 480 patches
-    # Width: 2560 / 4 = 640 patches
-    # The Swin model uses a staged downsampling approach,
-    # defined by the "depths": [2, 2, 14, 2] configuration.
-    # Before entering stages 2, 3, and 4, a "Patch Merging" operation is performed,
-    # which halves the feature map's dimensions (dividing both height and width by 2).
-    # Before Stage 2: The size changes from 480 x 640 to (480/2) x (640/2) = 240 x 320.
-    # Before Stage 3: The size changes from 240 x 320 to (240/2) x (320/2) = 120 x 160.
-    # Before Stage 4: The size changes from 120 x 160 to (120/2) x (160/2) = 60 x 80.
-    # Because vLLM needs to fill the image features with an encoder_prompt,
-    # and the encoder_prompt will have `<pad>` tokens added when tokenized,
-    # we need to construct an encoder_prompt with a length of 60 x 80 - 1 = 4799.
-    prompts = [
-        {
-            "encoder_prompt": {
-                "prompt": "".join(["$"] * 4799),
-                "multi_modal_data": {
-                    "image": fetch_image(
-                        "https://huggingface.co/datasets/hf-internal-testing/example-documents/resolve/main/jpeg_images/0.jpg"
-                    )  # noqa: E501
-                },
-            },
-            "decoder_prompt": "<s_docvqa><s_question>What time is the coffee break?</s_question><s_answer>",  # noqa: E501
-        },
-    ]
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
-def run_florence2():
-    engine_args = EngineArgs(
-        model="microsoft/Florence-2-large",
-        tokenizer="Isotr0py/Florence-2-tokenizer",
-        max_num_seqs=8,
-        trust_remote_code=True,
-        limit_mm_per_prompt={"image": 1},
-        dtype="half",
-    )
-    prompts = [
-        {  # implicit prompt with task token
-            "prompt": "<DETAILED_CAPTION>",
-            "multi_modal_data": {"image": ImageAsset("stop_sign").pil_image},
-        },
-        {  # explicit encoder/decoder prompt
-            "encoder_prompt": {
-                "prompt": "Describe in detail what is shown in the image.",
-                "multi_modal_data": {"image": ImageAsset("cherry_blossom").pil_image},
-            },
-            "decoder_prompt": "",
-        },
-    ]
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
-def run_mllama():
-    engine_args = EngineArgs(
-        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
-        max_model_len=8192,
-        max_num_seqs=2,
-        limit_mm_per_prompt={"image": 1},
-        dtype="half",
-    )
-    prompts = [
-        {  # Implicit prompt
-            "prompt": "<|image|><|begin_of_text|>What is the content of this image?",  # noqa: E501
-            "multi_modal_data": {
-                "image": ImageAsset("stop_sign").pil_image,
-            },
-        },
-        {  # Explicit prompt
-            "encoder_prompt": {
-                "prompt": "<|image|>",
-                "multi_modal_data": {
-                    "image": ImageAsset("stop_sign").pil_image,
-                },
-            },
-            "decoder_prompt": "<|image|><|begin_of_text|>Please describe the image.",  # noqa: E501
-        },
-    ]
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
 def run_whisper():
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
@@ -166,9 +57,6 @@ def run_whisper():
 model_example_map = {
-    "donut": run_donut,
-    "florence2": run_florence2,
-    "mllama": run_mllama,
    "whisper": run_whisper,
 }
@@ -182,7 +70,7 @@ def parse_args():
        "--model-type",
        "-m",
        type=str,
-        default="mllama",
+        default="whisper",
        choices=model_example_map.keys(),
        help='Huggingface "model_type".',
    )

--- a/examples/offline_inference/vision_language.py
+++ b/examples/offline_inference/vision_language.py
@@ -204,28 +204,6 @@ def run_ernie45_vl(questions: list[str], modality: str) -> ModelRequestData:
    )
-# Florence2
-def run_florence2(questions: list[str], modality: str) -> ModelRequestData:
-    assert modality == "image"
-    engine_args = EngineArgs(
-        model="microsoft/Florence-2-large",
-        tokenizer="Isotr0py/Florence-2-tokenizer",
-        max_model_len=4096,
-        max_num_seqs=2,
-        trust_remote_code=True,
-        dtype="bfloat16",
-        limit_mm_per_prompt={modality: 1},
-    )
-    prompts = ["<MORE_DETAILED_CAPTION>" for _ in questions]
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
 # Fuyu
 def run_fuyu(questions: list[str], modality: str) -> ModelRequestData:
    assert modality == "image"
@@ -1008,44 +986,6 @@ def run_mistral3(questions: list[str], modality: str) -> ModelRequestData:
    )
-# LLama 3.2
-def run_mllama(questions: list[str], modality: str) -> ModelRequestData:
-    assert modality == "image"
-    model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
-    # Note: The default setting of max_num_seqs (256) and
-    # max_model_len (131072) for this model may cause OOM.
-    # You may lower either to run this example on lower-end GPUs.
-    # The configuration below has been confirmed to launch on a single L40 GPU.
-    engine_args = EngineArgs(
-        model=model_name,
-        max_model_len=8192,
-        max_num_seqs=2,
-        limit_mm_per_prompt={modality: 1},
-    )
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-    messages = [
-        [
-            {
-                "role": "user",
-                "content": [{"type": "image"}, {"type": "text", "text": question}],
-            }
-        ]
-        for question in questions
-    ]
-    prompts = tokenizer.apply_chat_template(
-        messages, add_generation_prompt=True, tokenize=False
-    )
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompts=prompts,
-    )
 # Molmo
 def run_molmo(questions: list[str], modality: str) -> ModelRequestData:
    assert modality == "image"
@@ -1665,7 +1605,6 @@ model_example_map = {
    "command_a_vision": run_command_a_vision,
    "deepseek_vl_v2": run_deepseek_vl2,
    "ernie45_vl": run_ernie45_vl,
-    "florence2": run_florence2,
    "fuyu": run_fuyu,
    "gemma3": run_gemma3,
    "gemma3n": run_gemma3n,
@@ -1691,7 +1630,6 @@ model_example_map = {
    "minicpmv": run_minicpmv,
    "minimax_vl_01": run_minimax_vl_01,
    "mistral3": run_mistral3,
-    "mllama": run_mllama,
    "molmo": run_molmo,
    "nemotron_vl": run_nemotron_vl,
    "NVLM_D": run_nvlm_d,

--- a/examples/offline_inference/vision_language_multi_image.py
+++ b/examples/offline_inference/vision_language_multi_image.py
@@ -637,26 +637,6 @@ def load_mistral3(question: str, image_urls: list[str]) -> ModelRequestData:
    )
-def load_mllama(question: str, image_urls: list[str]) -> ModelRequestData:
-    model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
-    # The configuration below has been confirmed to launch on a single L40 GPU.
-    engine_args = EngineArgs(
-        model=model_name,
-        max_model_len=8192,
-        max_num_seqs=2,
-        limit_mm_per_prompt={"image": len(image_urls)},
-    )
-    img_prompt = "Given the first image <|image|> and the second image<|image|>"
-    prompt = f"<|begin_of_text|>{img_prompt}, {question}?"
-    return ModelRequestData(
-        engine_args=engine_args,
-        prompt=prompt,
-        image_data=[fetch_image(url) for url in image_urls],
-    )
 def load_nvlm_d(question: str, image_urls: list[str]) -> ModelRequestData:
    model_name = "nvidia/NVLM-D-72B"
@@ -1253,7 +1233,6 @@ model_example_map = {
    "llava-next": load_llava_next,
    "llava-onevision": load_llava_onevision,
    "mistral3": load_mistral3,
-    "mllama": load_mllama,
    "NVLM_D": load_nvlm_d,
    "ovis": load_ovis,
    "ovis2_5": load_ovis2_5,

--- a/tests/core/block/test_block_manager.py
+++ b/tests/core/block/test_block_manager.py
@@ -3,15 +3,12 @@
 import pytest
-from vllm.core.block.utils import (STR_NOT_IMPL_ENC_DEC_PREFIX_CACHE,
-                                   STR_NOT_IMPL_ENC_DEC_SWA)
 from vllm.core.block_manager import SelfAttnBlockSpaceManager
 from vllm.core.interfaces import AllocStatus
 from vllm.sequence import Logprob, SequenceStatus
 from vllm.utils import chunk_list
-from ..utils import (create_dummy_prompt, create_seq_group,
-                     create_seq_group_encoder_decoder)
+from ..utils import create_dummy_prompt, create_seq_group
 @pytest.mark.parametrize("block_size", [16])
@@ -58,156 +55,6 @@ def test_can_allocate_seq_group(block_size: int, num_seqs_per_group: int,
            assert can_allocate_result == AllocStatus.LATER
-@pytest.mark.parametrize("block_size", [16])
-@pytest.mark.parametrize("num_gpu_blocks", [16, 80, 160])
-@pytest.mark.parametrize("num_seqs_per_group", [1, 4])
-@pytest.mark.parametrize("watermark", [0.0, 0.5])
-def test_can_allocate_seq_group_encoder_decoder(block_size: int,
-                                                num_seqs_per_group: int,
-                                                num_gpu_blocks: int,
-                                                watermark: float):
-    block_manager = SelfAttnBlockSpaceManager(
-        block_size=block_size,
-        num_gpu_blocks=num_gpu_blocks,
-        num_cpu_blocks=1024,
-        watermark=watermark,
-    )
-    num_watermark_blocks = int(watermark * num_gpu_blocks)
-    num_output_blocks_per_seq = 1
-    # NOTE: This should be num_output_blocks_per_seq * num_seqs_per_group, but
-    # the current implementation assumes all seqs are new prompts / don't have
-    # different output lens.
-    num_output_blocks = num_output_blocks_per_seq
-    for bdx, num_prompt_blocks in enumerate(
-            range(1, num_gpu_blocks - num_output_blocks)):
-        num_cross_blocks_per_seq = num_prompt_blocks
-        seq_group = create_seq_group_encoder_decoder(
-            seq_prompt_len=block_size * num_prompt_blocks,
-            seq_output_lens=[
-                block_size * num_output_blocks_per_seq
-                for _ in range(num_seqs_per_group)
-            ],
-            request_id=str(bdx))
-        assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks
-        can_allocate_result = block_manager.can_allocate(seq_group)
-        num_required_blocks = num_prompt_blocks + \
-                              num_output_blocks + \
-                              num_cross_blocks_per_seq
-        if num_gpu_blocks - num_required_blocks < num_watermark_blocks:
-            assert can_allocate_result == AllocStatus.NEVER
-        elif num_gpu_blocks >= num_required_blocks:
-            assert can_allocate_result == AllocStatus.OK
-        else:
-            assert can_allocate_result == AllocStatus.LATER
-@pytest.mark.parametrize("block_size", [16])
-@pytest.mark.parametrize("num_gpu_blocks", [16])
-@pytest.mark.parametrize("num_seqs_per_group", [1])
-@pytest.mark.parametrize("watermark", [0.0, 0.5])
-def test_can_allocate_encoder_decoder_fails_with_swa(block_size: int,
-                                                     num_seqs_per_group: int,
-                                                     num_gpu_blocks: int,
-                                                     watermark: float):
-    '''
-    SWA short for Sliding Window Attention.
-    At time of writing block manager does not support SWA.
-    However even when SWA is implemented for block manager,
-    there will still most likely be a separate workstream required
-    to enable SWA for encoder/decoder models.
-    Therefore this test enforces that one of the following cases
-    hold true:
-    1. Block manager does not support SWA at all (true at time of writing)
-    2. Block manager fails with NotImplementError when SWA is enabled
-       AND a SequenceGroup with an encoder sequence (i.e. in support of an
-       encoder/decoder model) is passed into can_allocate() as an argument
-    The setup for this test is stripped down version of
-    test_can_allocate_seq_group_encoder_decoder()
-    '''
-    with pytest.raises((NotImplementedError, AssertionError)) as exc_info:
-        block_manager = SelfAttnBlockSpaceManager(
-            block_size=block_size,
-            num_gpu_blocks=num_gpu_blocks,
-            num_cpu_blocks=1024,
-            watermark=watermark,
-            sliding_window=5  # SWA
-        )
-        num_output_blocks_per_seq = 1
-        num_prompt_blocks = 1
-        num_output_blocks = num_output_blocks_per_seq
-        seq_group = create_seq_group_encoder_decoder(
-            seq_prompt_len=block_size * num_prompt_blocks,
-            seq_output_lens=[
-                block_size * num_output_blocks_per_seq
-                for _ in range(num_seqs_per_group)
-            ],
-            request_id="0")
-        assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks
-        block_manager.can_allocate(seq_group)
-    # Assert that either
-    # 1. Block manager constructor fails with assertion that sliding window
-    #    is not yet supported (most likely near-term outcome at time of
-    #    writing), or
-    # 2. can_allocate() fails with NotImplementedError due to combination of
-    #    encoder/decoder and sliding window attention
-    if isinstance(exc_info.value, NotImplementedError):
-        assert str(exc_info.value) == STR_NOT_IMPL_ENC_DEC_SWA
-    elif isinstance(exc_info.value, AssertionError):
-        assert str(exc_info.value) == "Sliding window not yet supported"
-@pytest.mark.parametrize("block_size", [16])
-@pytest.mark.parametrize("num_gpu_blocks", [16])
-@pytest.mark.parametrize("num_seqs_per_group", [1])
-@pytest.mark.parametrize("watermark", [0.0, 0.5])
-def test_can_allocate_encoder_decoder_fails_with_prefix_cache(
-        block_size: int, num_seqs_per_group: int, num_gpu_blocks: int,
-        watermark: float):
-    block_manager = SelfAttnBlockSpaceManager(
-        block_size=block_size,
-        num_gpu_blocks=num_gpu_blocks,
-        num_cpu_blocks=1024,
-        watermark=watermark,
-        enable_caching=True  # Prefix cache
-    )
-    num_output_blocks_per_seq = 1
-    num_prompt_blocks = 1
-    num_output_blocks = num_output_blocks_per_seq
-    seq_group = create_seq_group_encoder_decoder(
-        seq_prompt_len=block_size * num_prompt_blocks,
-        seq_output_lens=[
-            block_size * num_output_blocks_per_seq
-            for _ in range(num_seqs_per_group)
-        ],
-        request_id="0")
-    assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks
-    # Assert that either can_allocate() fails with NotImplementedError
-    # due to combination of encoder/decoder and prefix cache
-    with pytest.raises(NotImplementedError) as exc_info:
-        block_manager.can_allocate(seq_group)
-    assert str(exc_info.value) == STR_NOT_IMPL_ENC_DEC_PREFIX_CACHE
 @pytest.mark.parametrize("block_size", [1, 8])
 @pytest.mark.parametrize("prompt_len", [1, 7, 8])
 @pytest.mark.parametrize("num_slots_to_append", [1, 8, 129])

--- a/tests/core/test_scheduler_encoder_decoder.py
+++ b/tests/core/test_scheduler_encoder_decoder.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-import pytest  # noqa
-from vllm.config import CacheConfig, SchedulerConfig
-from vllm.core.scheduler import Scheduler
-from vllm.sequence import SequenceGroup
-from .utils import (append_new_token, create_dummy_prompt_encoder_decoder,
-                    get_sequence_groups, schedule_and_update_computed_tokens)
-def test_scheduler_schedule_simple_encoder_decoder():
-    '''
-    Test basic scheduler functionality in the context
-    of an encoder/decoder model. Focus on testing
-    enc/dec-specific functionality sense tests already
-    exist for decoder-only functionality
-    Test behavior:
-    * Construct Scheduler
-    * Construct dummy encoder/decoder sequence groups
-    * Add dummy seq groups to scheduler backlog
-    * Schedule the next seq group & validate:
-        * Cross-attn block tables
-        * Updated states of seq groups
-        * Number of batched tokens
-        * Number of blocks to copy/swap-in/swap-out
-        * Number of scheduled seq groups
-    * Repeat for both prefill- and decode-phase
-    * Abort scheduled seq groups
-    * Assert that aborted seq groups no longer appear in
-      cross-attention block table
-    '''
-    block_size = 4
-    num_seq_group = 4
-    max_model_len = 16
-    scheduler_config = SchedulerConfig(
-        "generate",
-        max_num_batched_tokens=64,
-        max_num_seqs=num_seq_group,
-        max_model_len=max_model_len,
-    )
-    cache_config = CacheConfig(block_size, 1.0, 1, "auto")
-    cache_config.num_cpu_blocks = 16  # enc and dec prompts per seq_group
-    cache_config.num_gpu_blocks = 16  # enc and dec prompts per seq_group
-    scheduler = Scheduler(scheduler_config, cache_config, None)
-    running: list[SequenceGroup] = []
-    # Add seq groups to scheduler.
-    req_id_list = []
-    for i in range(num_seq_group):
-        req_id = str(i)
-        req_id_list.append(req_id)
-        _, _, seq_group = create_dummy_prompt_encoder_decoder(
-            req_id, block_size, block_size, block_size)
-        scheduler.add_seq_group(seq_group)
-        running.append(seq_group)
-    # Schedule seq groups prefill.
-    num_tokens = block_size * num_seq_group
-    seq_group_meta_list, out = schedule_and_update_computed_tokens(scheduler)
-    # - Verify that sequence group cross-attention block tables are
-    #   registered with the block manager
-    assert all([(req_id in scheduler.block_manager.cross_block_tables)
-                for req_id in req_id_list])
-    # - Validate sequence-group status
-    assert set(get_sequence_groups(out)) == set(running)
-    # - Validate number of batched tokens
-    assert out.num_batched_tokens == num_tokens
-    # - Validate there are no remaining blocks to swap
-    assert (not out.blocks_to_copy and not out.blocks_to_swap_in
-            and not out.blocks_to_swap_out)
-    # - Validate all seq groups were scheduled
-    assert len(seq_group_meta_list) == num_seq_group
-    append_new_token(out, 1)
-    # Schedule seq groups decode.
-    seq_group_meta_list, out = schedule_and_update_computed_tokens(scheduler)
-    # - Verify that sequence group metadata includes encoder attention
-    #   and cross-attention metadata
-    assert all([
-        not ((seq_group_meta.encoder_seq_data is None) or
-             (seq_group_meta.cross_block_table is None))
-        for seq_group_meta in seq_group_meta_list
-    ])
-    # - Validate sequence-group status
-    assert set(get_sequence_groups(out)) == set(running)
-    # - Validate there is one batched token per seq group
-    assert out.num_batched_tokens == num_seq_group
-    # - Validate there are no remaining blocks to swap
-    assert (not out.blocks_to_copy and not out.blocks_to_swap_in
-            and not out.blocks_to_swap_out)
-    # - Validate that all seq groups were scheduled
-    assert len(seq_group_meta_list) == num_seq_group
-    append_new_token(out, 1)
-    # Abort sequences
-    for req_id in req_id_list:
-        scheduler.abort_seq_group(req_id)
-        # - Verify that sequence group cross-attention block tables are
-        #   NO LONGER registered with the block manager
-        assert req_id not in scheduler.block_manager.cross_block_tables
--- a/tests/distributed/test_pipeline_parallel.py
+++ b/tests/distributed/test_pipeline_parallel.py
@@ -242,9 +242,6 @@ MULTIMODAL_MODELS = {
    "Qwen/Qwen2-Audio-7B-Instruct": PPTestSettings.fast(),
    "Qwen/Qwen2-VL-2B-Instruct": PPTestSettings.fast(),
    "fixie-ai/ultravox-v0_5-llama-3_2-1b": PPTestSettings.fast(),
-    # [Encoder-decoder]
-    # TODO: Implement PP
-    # "meta-llama/Llama-3.2-11B-Vision-Instruct": PPTestSettings.fast(),
 }
 # yapf: enable

--- a/tests/encoder_decoder/__init__.py
+++ b/tests/encoder_decoder/__init__.py
--- a/tests/encoder_decoder/test_e2e_correctness.py
+++ b/tests/encoder_decoder/test_e2e_correctness.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""E2E tests to verify the correctness of the encoder-decoder framework
-Run `pytest tests/encoder_decoder/test_e2e_correctness.py`.
-"""
-from typing import Optional
-import pytest
-from transformers import AutoModelForSeq2SeqLM
-from vllm.attention.selector import (_Backend, _cached_get_attn_backend,
-                                     global_force_attn_backend_context_manager)
-from vllm.platforms import current_platform
-from vllm.sequence import SampleLogprobs
-from ..conftest import DecoderPromptType
-from ..models.utils import check_logprobs_close
-LIST_ENC_DEC_SUPPORTED_BACKENDS = [
-    _Backend.XFORMERS, _Backend.FLASH_ATTN, None
-]
-@pytest.fixture(scope="function", autouse=True)
-def use_v0_only(monkeypatch):
-    """
-    Since this module is V0 only, set VLLM_USE_V1=0 for
-    all tests in the module.
-    """
-    monkeypatch.setenv('VLLM_USE_V1', '0')
-def vllm_to_hf_output(
-    vllm_output: tuple[list[int], str, Optional[SampleLogprobs]],
-    decoder_prompt_type: DecoderPromptType,
-):
-    """Sanitize vllm output to be comparable with hf output."""
-    output_ids, output_str, out_logprobs = vllm_output
-    hf_output_str = output_str + "</s>"
-    if decoder_prompt_type == DecoderPromptType.NONE:
-        hf_output_str = "<s>" + hf_output_str
-    return output_ids, hf_output_str, out_logprobs
-@pytest.fixture(autouse=True)
-def clear_cache():
-    """Fixture to clear backend cache before each test."""
-    _cached_get_attn_backend.cache_clear()  # Clear the cache
-    yield  # This allows the test to run
-@pytest.mark.parametrize("model", ["facebook/bart-large-cnn"])
-@pytest.mark.parametrize("dtype", ["float"])
-@pytest.mark.parametrize("attn_backend", LIST_ENC_DEC_SUPPORTED_BACKENDS)
-@pytest.mark.parametrize("max_tokens", [128])
-@pytest.mark.parametrize("num_logprobs", [5])
-@pytest.mark.parametrize("decoder_prompt_type", list(DecoderPromptType))
-@pytest.mark.parametrize("enforce_eager", [True, False])
-@pytest.mark.skipif(
-    current_platform.is_cpu(),
-    reason="CPU backend is not currently supported with encoder/decoder models"
-)
-@pytest.mark.skip(reason="bart not supported in V1")
-def test_encoder_decoder_e2e(
-    hf_runner,
-    vllm_runner,
-    example_encoder_decoder_prompts,
-    model: str,
-    dtype: str,
-    max_tokens: int,
-    num_logprobs: int,
-    decoder_prompt_type: DecoderPromptType,
-    enforce_eager: bool,
-    attn_backend: _Backend,
-) -> None:
-    '''
-    End-to-End (E2E) test for the encoder-decoder framework.
-    This test evaluates the encoder-decoder functionality using the BART
-    model. We compare the outputs of the Hugging Face and vLLM
-    implementations to ensure that both implementations produce consistent
-    and correct results.
-    '''
-    with global_force_attn_backend_context_manager(attn_backend):
-        if attn_backend == _Backend.FLASH_ATTN:
-            # Flash Attention works only with bfloat16 data-type
-            dtype = 'bfloat16'
-        test_case_prompts = example_encoder_decoder_prompts[
-            decoder_prompt_type]
-        # Configuration settings for HF baseline
-        hf_kwargs = {
-            "top_k": None,
-            "num_beams": 1,
-            "repetition_penalty": 1.0,
-            "top_p": 1.0,
-            "length_penalty": 1.0,
-            "early_stopping": False,
-            "no_repeat_ngram_size": None,
-            "min_length": 0
-        }
-        with hf_runner(model, dtype=dtype,
-                       auto_cls=AutoModelForSeq2SeqLM) as hf_model:
-            hf_outputs = (
-                hf_model.generate_encoder_decoder_greedy_logprobs_limit(
-                    test_case_prompts,
-                    max_tokens,
-                    num_logprobs,
-                    **hf_kwargs,
-                ))
-        with vllm_runner(model, dtype=dtype,
-                         enforce_eager=enforce_eager) as vllm_model:
-            vllm_outputs = vllm_model.generate_encoder_decoder_greedy_logprobs(
-                test_case_prompts, max_tokens, num_logprobs)
-        hf_skip_tokens = (1 if decoder_prompt_type == DecoderPromptType.NONE
-                          else 0)
-        check_logprobs_close(
-            outputs_0_lst=hf_outputs,
-            outputs_1_lst=[
-                vllm_to_hf_output(vllm_output, decoder_prompt_type)
-                for vllm_output in vllm_outputs
-            ],
-            name_0="hf",
-            name_1="vllm",
-            num_outputs_0_skip_tokens=hf_skip_tokens,
-        )
--- a/tests/entrypoints/openai/test_encoder_decoder.py
+++ b/tests/entrypoints/openai/test_encoder_decoder.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-import openai
-import pytest
-import pytest_asyncio
-from ...utils import RemoteOpenAIServer
-MODEL_NAME = "facebook/bart-base"
-@pytest.fixture(scope="module")
-def server():
-    args = [
-        "--dtype",
-        "bfloat16",
-        "--enforce-eager",
-    ]
-    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
-        yield remote_server
-@pytest_asyncio.fixture
-async def client(server):
-    async with server.get_async_client() as async_client:
-        yield async_client
-@pytest.mark.asyncio
-@pytest.mark.parametrize("model_name", [MODEL_NAME])
-@pytest.mark.skip(reason="bart is not yet supported in V1")
-async def test_single_completion(client: openai.AsyncOpenAI, model_name: str):
-    completion = await client.completions.create(model=model_name,
-                                                 prompt="Hello, my name is",
-                                                 max_tokens=5,
-                                                 temperature=0.0)
-    assert completion.id is not None
-    assert completion.choices is not None and len(completion.choices) == 1
-    choice = completion.choices[0]
-    assert len(choice.text) >= 5
-    assert choice.finish_reason == "length"
-    assert completion.usage == openai.types.CompletionUsage(
-        completion_tokens=5, prompt_tokens=2, total_tokens=7)
-    # test using token IDs
-    completion = await client.completions.create(
-        model=model_name,
-        prompt=[0, 0, 0, 0, 0],
-        max_tokens=5,
-        temperature=0.0,
-    )
-    assert len(completion.choices[0].text) >= 1
--- a/tests/entrypoints/test_chat_utils.py
+++ b/tests/entrypoints/test_chat_utils.py
@@ -20,7 +20,6 @@ from vllm.entrypoints.chat_utils import (_try_extract_ast, load_chat_template,
                                         parse_chat_messages_futures,
                                         resolve_chat_template_content_format,
                                         resolve_hf_chat_template)
-from vllm.entrypoints.llm import apply_hf_chat_template
 from vllm.multimodal import MultiModalDataDict, MultiModalUUIDDict
 from vllm.multimodal.utils import (encode_audio_base64, encode_image_base64,
                                   encode_video_base64)
@@ -38,7 +37,6 @@ QWEN2AUDIO_MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"
 QWEN2VL_MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"
 QWEN25VL_MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
 QWEN25OMNI_MODEL_ID = "Qwen/Qwen2.5-Omni-7B"
-MLLAMA_MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"
 LLAMA_GUARD_MODEL_ID = "meta-llama/Llama-Guard-3-1B"
 HERMES_MODEL_ID = "NousResearch/Hermes-3-Llama-3.1-8B"
 MISTRAL_MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
@@ -125,27 +123,6 @@ def qwen25omni_tokenizer():
    )
-@pytest.fixture(scope="module")
-def mllama_model_config():
-    return ModelConfig(
-        MLLAMA_MODEL_ID,
-        runner="generate",
-        limit_mm_per_prompt={
-            "image": 2,
-        },
-    )
-@pytest.fixture(scope="module")
-def mllama_tokenizer():
-    return TokenizerGroup(
-        MLLAMA_MODEL_ID,
-        enable_lora=False,
-        max_num_seqs=5,
-        max_input_length=None,
-    )
 @pytest.fixture(scope="function")
 def mistral_model_config():
    return ModelConfig(
@@ -2249,180 +2226,6 @@ def test_parse_chat_messages_multiple_images_interleave_with_placeholders(
        )
-### Mllama currently wraps images / texts as interleaved dictionaries
-def test_mllama_single_image(
-    mllama_model_config,
-    mllama_tokenizer,
-    image_url,
-):
-    """Ensures that a single image is parsed correctly mllama."""
-    conversation, mm_data, mm_uuids = parse_chat_messages(
-        [{
-            "role":
-            "user",
-            "content": [
-                {
-                    "type": "text",
-                    "text": "The content of this image is:"
-                },
-                {
-                    "image_url": image_url
-                },
-            ],
-        }],
-        mllama_model_config,
-        mllama_tokenizer,
-        content_format="openai",
-    )
-    _assert_mm_data_is_image_input(mm_data, 1)
-    _assert_mm_uuids(mm_uuids, 1, expected_uuids=[None])
-    assert conversation == [{
-        "role":
-        "user",
-        "content": [
-            {
-                "type": "text",
-                "text": "The content of this image is:"
-            },
-            {
-                "type": "image"
-            },
-        ],
-    }]
-def test_mllama_interleaved_images(
-    mllama_model_config,
-    mllama_tokenizer,
-    image_url,
-):
-    """Ensures that multiple image are parsed as interleaved dicts."""
-    conversation, mm_data, mm_uuids = parse_chat_messages(
-        [{
-            "role":
-            "user",
-            "content": [
-                {
-                    "type": "text",
-                    "text": "The content of the first image is:",
-                },
-                {
-                    "image_url": image_url
-                },
-                {
-                    "type": "text",
-                    "text": "The content of the second image is:",
-                },
-                {
-                    "image_url": image_url
-                },
-            ],
-        }],
-        mllama_model_config,
-        mllama_tokenizer,
-        content_format="openai",
-    )
-    _assert_mm_data_is_image_input(mm_data, 2)
-    _assert_mm_uuids(mm_uuids, 2, expected_uuids=[None, None])
-    assert conversation == [{
-        "role":
-        "user",
-        "content": [
-            {
-                "type": "text",
-                "text": "The content of the first image is:"
-            },
-            {
-                "type": "image"
-            },
-            {
-                "type": "text",
-                "text": "The content of the second image is:"
-            },
-            {
-                "type": "image"
-            },
-        ],
-    }]
-@pytest.mark.parametrize("model", [MLLAMA_MODEL_ID])
-def test_multimodal_image_parsing_matches_hf(model, image_url):
-    """Checks end to end hf alignment for multimodal [image] parsing."""
-    def get_conversation(is_hf: bool):
-        img_part = {"type": "image_url", "image_url": {"url": image_url}}
-        if is_hf:
-            img_part = {"type": "image"}
-        return [{
-            "role":
-            "user",
-            "content": [
-                {
-                    "type": "text",
-                    "text": "The content of the first image is:",
-                },
-                img_part,
-                {
-                    "type": "text",
-                    "text": "The content of the second image is:",
-                },
-                img_part,
-                {
-                    "type": "text",
-                    "text": "What animal is in the first image?",
-                },
-            ],
-        }]
-    # Build a config for the model
-    model_config = ModelConfig(
-        model,
-        runner="generate",
-        limit_mm_per_prompt={
-            "image": 2,
-        },
-    )
-    # Build the tokenizer group and grab the underlying tokenizer
-    tokenizer_group = TokenizerGroup(
-        model,
-        enable_lora=False,
-        max_num_seqs=5,
-        max_input_length=None,
-        trust_remote_code=model_config.trust_remote_code,
-    )
-    tokenizer = tokenizer_group.tokenizer
-    # Build and parse a conversation with {"type": "image"} using the tokenizer
-    hf_conversation = get_conversation(is_hf=True)
-    hf_result = tokenizer.apply_chat_template(
-        hf_conversation,
-        tokenize=False,
-        add_generation_prompt=True,
-    )
-    # Now parse with vLLMs chat utils & apply the template
-    vllm_conversation = get_conversation(is_hf=False)
-    conversation, _, _ = parse_chat_messages(
-        vllm_conversation,
-        model_config,
-        tokenizer_group,
-        content_format="openai",
-    )
-    vllm_result = apply_hf_chat_template(
-        tokenizer=tokenizer,
-        conversation=conversation,
-        chat_template=None,
-        model_config=model_config,
-        tools=None,
-        add_generation_prompt=True,
-    )
-    assert hf_result == vllm_result
 @pytest.mark.parametrize(
    "model",
    [
@@ -2486,7 +2289,6 @@ def test_resolve_hf_chat_template(sample_json_schema, model, use_tools):
     (QWEN25VL_MODEL_ID, "openai"),
     (ULTRAVOX_MODEL_ID, "string"),
     (QWEN2AUDIO_MODEL_ID, "openai"),
-     (MLLAMA_MODEL_ID, "openai"),
     (LLAMA_GUARD_MODEL_ID, "openai")],
 )
 # yapf: enable
@@ -2545,7 +2347,6 @@ def test_resolve_content_format_hf_defined(model, expected_format):
    [("Salesforce/blip2-opt-2.7b", "string"),
     ("facebook/chameleon-7b", "string"),
     ("deepseek-ai/deepseek-vl2-tiny", "string"),
-     ("microsoft/Florence-2-base", "string"),
     ("adept/fuyu-8b", "string"),
     ("google/paligemma-3b-mix-224", "string"),
     ("Qwen/Qwen-VL", "string"),

--- a/tests/kernels/attention/test_encoder_decoder_attn.py
+++ b/tests/kernels/attention/test_encoder_decoder_attn.py
--- a/tests/models/language/generation/test_bart.py
+++ b/tests/models/language/generation/test_bart.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-from typing import Optional
-import pytest
-from transformers import AutoModelForSeq2SeqLM
-from vllm.sequence import SampleLogprobs
-from ....conftest import (DecoderPromptType, ExplicitEncoderDecoderPrompt,
-                          HfRunner, VllmRunner)
-from ....utils import multi_gpu_test
-from ...utils import check_logprobs_close
-def vllm_to_hf_output(
-    vllm_output: tuple[list[int], str, Optional[SampleLogprobs]],
-    decoder_prompt_type: DecoderPromptType,
-):
-    """Sanitize vllm output to be comparable with hf output."""
-    output_ids, output_str, out_logprobs = vllm_output
-    hf_output_str = output_str + "</s>"
-    if decoder_prompt_type == DecoderPromptType.NONE:
-        hf_output_str = "<s>" + hf_output_str
-    return output_ids, hf_output_str, out_logprobs
-def run_test(
-    hf_runner: type[HfRunner],
-    vllm_runner: type[VllmRunner],
-    prompts: list[ExplicitEncoderDecoderPrompt[str, str]],
-    decoder_prompt_type: DecoderPromptType,
-    model: str,
-    *,
-    dtype: str,
-    max_tokens: int,
-    num_logprobs: int,
-    tensor_parallel_size: int,
-    distributed_executor_backend: Optional[str] = None,
-) -> None:
-    '''
-    Test the vLLM BART model for a variety of encoder/decoder input prompts,
-    by validating it against HuggingFace (HF) BART.
-    Arguments:
-    * hf_runner: HuggingFace (HF) test model runner
-    * vllm_runner: vLLM test model runner
-    * example_encoder_decoder_prompts: test fixture which provides a 
-                                       dictionary of dummy prompts
-    * model: the HF ID of the specific BART variant under test
-    * dtype: the tensor datatype to employ
-    * max_tokens
-    * num_logprobs
-    * decoder_prompt_type: key into the example_encoder_decoder_prompts
-                           dictionary; selects specific encoder/decoder
-                           prompt scenarios to test
-    A note on using HF BART as a baseline for validating vLLM BART,
-    specifically when the decoder prompt is None. 
-    The HF GenerationMixin's default behavior is to force the first
-    decoded token to be <BOS> if the prompt does not already contain
-    <BOS> (this is accomplished using a logit
-    processor setting.)
-    So when we use HF BART as our baseline for comparison, note that
-    when the user provides a request with a None decoder prompt
-    (i.e. a singleton encoder prompt, or else an explicit encoder/
-    decoder prompt with the decoder sub-prompt set to None), HF and
-    vLLM handle this in different ways:
-    * HF will (1) tokenize the None prompt as an empty token-list, 
-      (2) append <decoder-start-token> to the beginning, yielding
-      [<decoder-start-token>], (3) pass this token list to the model, and
-      then (4) after computing logits during prefill, override the model
-      logits & force <BOS> to be the first generated token.
-    * vLLM will (1) tokenize the None prompt as [<BOS>], (2) append decoder-
-      start-token to the beginning, yielding [<decoder-start-token><BOS>],
-      (3) pass these tokens to the model & proceed with generation.
-    The net effect is that compared to vLLM, the list of HF *decoded* tokens
-    will contain one more initial <BOS> than the vLLM generated tokens,
-    because vLLM's <BOS> token is injected into the prompt rather than into
-    the generated output. This is in spite of the fact that overall, the
-    complete sequences (prompt + decoded tokens) produced by vLLM will match
-    HF.
-    So when we use HF decoded token output to validate vLLM's decoded token
-    output, the testing process must account for the difference in decoded
-    token sequences between vLLM and HF specifically in the
-    decoder-prompt-is-None case. 
-    One option is to disable the logit processor feature that forces the
-    <BOS> token to be decoded (forced_bos_token_id = None), eliminating
-    the problem entirely. However this is not "normal" BART usage.
-    The other option is - only in the decoder-prompt-is-None case - to
-    discard the first decoded token from the HF output before comparing it
-    to vLLM.
-    To that end, when testing the scenario where the decoder prompt is None
-    (and only in that one scenario), this test skips the first HF decoded
-    token during the process of validating the vLLM decoded output.
-    '''
-    # NOTE: Order matters. Run vLLM first, then HF.
-    # vLLM needs a fresh process in which CUDA has not been initialized.
-    # Running HF first initializes CUDA, which breaks the multiprocessing
-    # backend when it uses the fork start method (the default).
-    # Note: encoder/decoder models are currently compatible only with
-    # enforce_eager=True. Normally this is not a problem, because vLLM
-    # defaults to enforce_eager=True for encoder/decoder models when
-    # enforce_eager is left unspecified. However, the VllmRunner test
-    # fixture (which wraps the LLM class) defaults to enforce_eager=False
-    # (a behavior that a number of existing decoder-only unit tests
-    # expect), so when testing an encoder/decoder model we must explicitly
-    # pass enforce_eager=True to the VllmRunner constructor.
-    with vllm_runner(model,
-                     dtype=dtype,
-                     tensor_parallel_size=tensor_parallel_size,
-                     distributed_executor_backend=distributed_executor_backend,
-                     enforce_eager=True) as vllm_model:
-        vllm_outputs = vllm_model.generate_encoder_decoder_greedy_logprobs(
-            prompts, max_tokens, num_logprobs)
-    # Configuration settings for HF baseline
-    hf_kwargs = {
-        "top_k": None,
-        "num_beams": 1,
-        "repetition_penalty": 1.0,
-        "top_p": 1.0,
-        "length_penalty": 1.0,
-        "early_stopping": False,
-        "no_repeat_ngram_size": None,
-        "min_length": 0
-    }
-    with hf_runner(model, dtype=dtype,
-                   auto_cls=AutoModelForSeq2SeqLM) as hf_model:
-        hf_outputs = (hf_model.generate_encoder_decoder_greedy_logprobs_limit(
-            prompts,
-            max_tokens,
-            num_logprobs,
-            **hf_kwargs,
-        ))
-    hf_skip_tokens = (1
-                      if decoder_prompt_type == DecoderPromptType.NONE else 0)
-    check_logprobs_close(
-        outputs_0_lst=hf_outputs,
-        outputs_1_lst=[
-            vllm_to_hf_output(vllm_output, decoder_prompt_type)
-            for vllm_output in vllm_outputs
-        ],
-        name_0="hf",
-        name_1="vllm",
-        num_outputs_0_skip_tokens=hf_skip_tokens,
-    )
-@pytest.mark.parametrize(
-    "model",
-    [
-        pytest.param("facebook/bart-base",
-                     marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
-        pytest.param("facebook/bart-large-cnn"),
-    ],
-)
-@pytest.mark.parametrize("dtype", ["float", "bfloat16"])
-@pytest.mark.parametrize("max_tokens", [64])
-@pytest.mark.parametrize("num_logprobs", [5])
-@pytest.mark.parametrize("decoder_prompt_type", list(DecoderPromptType))
-@pytest.mark.skip(reason="bart not supported in V1")
-def test_models(hf_runner, vllm_runner, example_encoder_decoder_prompts, model,
-                dtype, max_tokens, num_logprobs, decoder_prompt_type) -> None:
-    run_test(
-        hf_runner,
-        vllm_runner,
-        example_encoder_decoder_prompts[decoder_prompt_type],
-        decoder_prompt_type,
-        model,
-        dtype=dtype,
-        max_tokens=max_tokens,
-        num_logprobs=num_logprobs,
-        tensor_parallel_size=1,
-    )
-@multi_gpu_test(num_gpus=2)
-@pytest.mark.parametrize("distributed_executor_backend", ["ray", "mp"])
-@pytest.mark.parametrize("model", ["facebook/bart-large-cnn"])
-@pytest.mark.parametrize("dtype", ["float"])
-@pytest.mark.parametrize("max_tokens", [64])
-@pytest.mark.parametrize("num_logprobs", [5])
-@pytest.mark.parametrize("decoder_prompt_type", [DecoderPromptType.CUSTOM])
-@pytest.mark.skip(reason="bart not supported in V1")
-def test_models_distributed(hf_runner, vllm_runner,
-                            example_encoder_decoder_prompts,
-                            distributed_executor_backend, model, dtype,
-                            max_tokens, num_logprobs,
-                            decoder_prompt_type) -> None:
-    run_test(
-        hf_runner,
-        vllm_runner,
-        example_encoder_decoder_prompts[decoder_prompt_type],
-        decoder_prompt_type,
-        model,
-        dtype=dtype,
-        max_tokens=max_tokens,
-        num_logprobs=num_logprobs,
-        tensor_parallel_size=2,
-        distributed_executor_backend=distributed_executor_backend,
-    )
--- a/tests/models/language/generation/test_mbart.py
+++ b/tests/models/language/generation/test_mbart.py
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-from typing import Optional
-import pytest
-from transformers import AutoModelForSeq2SeqLM
-from vllm.sequence import SampleLogprobs
-from ....conftest import DecoderPromptType, HfRunner, VllmRunner
-from ...utils import check_logprobs_close
-def vllm_to_hf_output(
-    vllm_output: tuple[list[int], str, Optional[SampleLogprobs]],
-    decoder_prompt_type: DecoderPromptType,
-):
-    """Sanitize vllm output to be comparable with hf output."""
-    output_ids, output_str, out_logprobs = vllm_output
-    hf_output_str = output_str + "</s>"
-    return output_ids, hf_output_str, out_logprobs
-def run_test(
-    hf_runner: type[HfRunner],
-    vllm_runner: type[VllmRunner],
-    prompts: list[dict[str, str]],
-    decoder_prompt_type: DecoderPromptType,
-    model: str,
-    *,
-    dtype: str,
-    max_tokens: int,
-    num_logprobs: int,
-    tensor_parallel_size: int,
-    distributed_executor_backend: Optional[str] = None,
-) -> None:
-    '''
-    Test the vLLM mBART model by validating it against HuggingFace (HF).
-    (Docstring content is omitted for brevity)
-    '''
-    vllm_prompts = prompts
-    if decoder_prompt_type == DecoderPromptType.NONE:
-        vllm_prompts = [{
-            "encoder_prompt": p['encoder_prompt'],
-            "decoder_prompt": ""
-        } for p in prompts]
-    vllm_kwargs = {
-        "hf_overrides": {
-            "architectures": ["MBartForConditionalGeneration"]
-        }
-    }
-    with vllm_runner(model,
-                     dtype=dtype,
-                     tensor_parallel_size=tensor_parallel_size,
-                     distributed_executor_backend=distributed_executor_backend,
-                     enforce_eager=True,
-                     **vllm_kwargs) as vllm_model:  # type: ignore
-        vllm_outputs = vllm_model.generate_encoder_decoder_greedy_logprobs(
-            vllm_prompts, max_tokens, num_logprobs)
-    hf_kwargs = {
-        "top_k": None,
-        "num_beams": 1,
-        "repetition_penalty": 1.0,
-        "top_p": 1.0,
-        "length_penalty": 1.0,
-        "early_stopping": False,
-        "no_repeat_ngram_size": None,
-        "min_length": 0
-    }
-    with hf_runner(model, dtype=dtype,
-                   auto_cls=AutoModelForSeq2SeqLM) as hf_model:
-        hf_kwargs["decoder_start_token_id"] = (
-            hf_model.tokenizer.lang_code_to_id["ro_RO"])
-        hf_outputs = (
-            hf_model.generate_encoder_decoder_greedy_logprobs_limit(
-                prompts,  # HF runner still uses the original prompts
-                max_tokens,
-                num_logprobs,
-                **hf_kwargs,
-            ))
-    hf_skip_tokens = 0
-    check_logprobs_close(
-        outputs_0_lst=hf_outputs,
-        outputs_1_lst=[
-            vllm_to_hf_output(vllm_output, decoder_prompt_type)
-            for vllm_output in vllm_outputs
-        ],
-        name_0="hf",
-        name_1="vllm",
-        num_outputs_0_skip_tokens=hf_skip_tokens,
-    )
-@pytest.mark.parametrize(
-    "model",
-    [pytest.param("facebook/mbart-large-en-ro")],
-)
-@pytest.mark.parametrize("dtype", ["float", "bfloat16"])
-@pytest.mark.parametrize("max_tokens", [64])
-@pytest.mark.parametrize("num_logprobs", [5])
-@pytest.mark.parametrize("decoder_prompt_type", list(DecoderPromptType))
-def test_models(hf_runner, vllm_runner, example_encoder_decoder_prompts, model,
-                dtype, max_tokens, num_logprobs, decoder_prompt_type) -> None:
-    run_test(
-        hf_runner,
-        vllm_runner,
-        example_encoder_decoder_prompts[decoder_prompt_type],
-        decoder_prompt_type,
-        model,
-        dtype=dtype,
-        max_tokens=max_tokens,
-        num_logprobs=num_logprobs,
-        tensor_parallel_size=1,
-    )