Commit 2937f02a authored by laibao

Update README.md: bump the Docker image version and the deep-learning library dependencies, remove example files that are no longer used, and adjust the inference example commands to support new features.

parent 35bfbbd8
@@ -44,7 +44,7 @@ The ChatGLM series of models is developed on the GLM architecture. GLM is a Transformer-based …
An inference Docker image can be pulled from [光源](https://www.sourcefind.cn/#/image/dcu/custom):
```
- docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
+ docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.8.5-ubuntu22.04-dtk25.04.1-rc5-das1.6-py3.10-20250724
# <Image ID>: replace with the ID of the image pulled above
# <Host Path>: path on the host side
# <Container Path>: mapped path inside the container
@@ -69,14 +69,15 @@ docker run -it --name chatglm_vllm --privileged --shm-size=64G --device=/dev/kf
conda create -n chatglm_vllm python=3.10
```
- The special deep-learning libraries this project requires for DCU GPUs can be downloaded from the [光合](https://developer.sourcefind.cn/tool/) developer community.
+ The special deep-learning libraries this project requires for DCU GPUs can be downloaded from the [光合](https://developer.hpccube.com/tool/) developer community.
- * DTK driver: dtk24.04.3
- * Pytorch: 2.3.0
- * triton: 2.1.0
- * lmslim: 0.1.2
+ * DTK driver: dtk25.04.01
+ * Pytorch: 2.4.0
+ * triton: 3.0.0
+ * lmslim: 0.2.1
* flash_attn: 2.6.1
- * vllm: 0.6.2
+ * flash_mla: 1.0.0
+ * vllm: 0.8.5
* python: python3.10
`Tip: install the other dependencies first and install the vllm package last.`
@@ -110,7 +111,7 @@ export VLLM_RANK7_NUMA=7
### Offline batched inference
```bash
- python examples/offline_inference.py
+ VLLM_USE_FLASH_ATTN_PA=1 python examples/offline_inference/basic/basic.py
```
Here `prompts` is the list of input prompts; `temperature` controls sampling randomness: smaller values make generation more deterministic, larger values make it more random, and 0 means greedy sampling (default 1); `max_tokens=16` is the generation length (the default is 16).
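For reference, a minimal sketch of how these parameters map onto vLLM's Python API (the model path and prompt here are placeholders):
```python
from vllm import LLM, SamplingParams

# temperature=0 would select greedy sampling; max_tokens bounds the output length.
sampling_params = SamplingParams(temperature=0.8, max_tokens=16)

llm = LLM(model="THUDM/glm-4-9b-chat", trust_remote_code=True, dtype="float16")
outputs = llm.generate(["晚上睡不着怎么办"], sampling_params)
print(outputs[0].outputs[0].text)
```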
@@ -121,7 +122,7 @@ python examples/offline_inference.py
1. Specify input and output lengths
```bash
- python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model THUDM/glm-4-9b-chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
+ VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model THUDM/glm-4-9b-chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
Here `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of generated tokens, `--model` is the model path, `-tp` is the number of cards to use, and `--dtype float16` is the inference data type; if the model weights are bfloat16, switch to float16 for inference. Passing `--output-len 1` measures the first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.
@@ -129,9 +130,10 @@ The glm-4-9b-chat-1m model's default model_max_length is 1024000, which official vLLM also does not yet …
2. Use a dataset
Download the dataset:
[sharegpt_v3_unfiltered_cleaned_split](https://huggingface.co/datasets/learnanything/sharegpt_v3_unfiltered_cleaned_split)
```bash
wget http://113.200.138.88:18080/aidatasets/vllm_data/-/raw/main/ShareGPT_V3_unfiltered_cleaned_split.json
+ VLLM_USE_FLASH_ATTN_PA=1 python benchmarks/benchmark_throughput.py --num-prompts 1 --model THUDM/glm-4-9b-chat --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
@@ -145,13 +147,13 @@ python benchmarks/benchmark_throughput.py --num-prompts 1 --model THUDM/glm-4-9b
1. Start the server:
```bash
- python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4-9b-chat --dtype float16 --enforce-eager -tp 1
+ VLLM_USE_FLASH_ATTN_PA=1 vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code --tensor-parallel-size 1
```
2. Start the client:
```bash
- python benchmarks/benchmark_serving.py --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --trust-remote-code
+ python benchmarks/benchmark_serving.py --model THUDM/glm-4-9b-chat --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --trust-remote-code
```
The parameters are the same as for the dataset-based offline batch throughput test; see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py) for details.
@@ -161,7 +163,7 @@ python benchmarks/benchmark_serving.py --model THUDM/glm-4-9b-chat --dataset Sha
Start the service:
```bash
- vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code --chat-template template_chatglm2.jinja --port 8000
+ VLLM_USE_FLASH_ATTN_PA=1 vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code
```
The argument after `serve` is the model load path; `--dtype` sets the data type (float16). By default the predefined chat template from the tokenizer is used, and `--chat-template` can supply a new template that overrides the default. `-q gptq` runs inference with a GPTQ-quantized model.
@@ -191,17 +193,21 @@ curl http://localhost:8000/v1/completions \
```bash
curl http://localhost:8000/v1/chat/completions \
-     -H "Content-Type: application/json" \
-     -d '{
-         "model": "THUDM/glm-4-9b-chat",
-         "messages": [
-             {"role": "system", "content": "晚上睡不着怎么办"},
-             {"role": "user", "content": "晚上睡不着怎么办"}
-         ]
-     }'
+     -X POST \
+     -H "Content-Type: application/json" \
+     -d '{
+         "model": "THUDM/glm-4-9b-chat",
+         "max_tokens": 128,
+         "messages": [
+             {
+                 "role": "user",
+                 "content": "晚上睡不着怎么办"
+             }
+         ]
+     }'
```
- Or use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py)
+ Or use [examples/online_serving/openai_chat_completion_client.py](examples/online_serving/openai_chat_completion_client.py)
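For reference, a minimal sketch of such a client using the `openai` package, assuming the server started above is listening on localhost:8000:
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the API key; "EMPTY" is conventional.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/glm-4-9b-chat",
    max_tokens=128,
    messages=[{"role": "user", "content": "晚上睡不着怎么办"}],
)
print(response.choices[0].message.content)
```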
### **Using gradio together with vllm**
@@ -216,7 +222,7 @@ pip install gradio
2.1 Start the gradio service and follow the prompts
```
- python gradio_openai_chatbot_webserver.py --model "THUDM/glm-4-9b-chat" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
+ python examples/online_serving/gradio_openai_chatbot_webserver.py --model "THUDM/glm-4-9b-chat" --model-url http://localhost:8000/v1 --temp 0.8 --stop-token-ids ""
```
2.2 Change file permissions
@@ -236,7 +242,7 @@ ssh -L 8000:<compute node IP>:8000 -L 8001:<compute node IP>:8001 <username>@<login no…
3. Start the OpenAI-compatible service
```
- vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code --chat-template template_chatglm2.jinja --port 8000
+ VLLM_USE_FLASH_ATTN_PA=1 vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code --host "0.0.0.0"
```
4. Start the gradio service
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
# FP8 KV Cache
This utility extracts the KV cache scaling factors from a quantized HF (Hugging Face) model. The extracted scaling factors are saved to a JSON file, which can later be used by vLLM at runtime. This tool is particularly useful when the KV cache data type is FP8 and is intended for use on ROCm (hcu) platforms.
## Prerequisites
- Python 3.x
- PyTorch
- NumPy
- Hugging Face Transformers
- Hugging Face Hub
- AMMO
Before incorporating the FP8 datatype for inference workloads, you must adhere to the following steps:
1. Install all necessary prerequisites and dependencies.
2. Convert HF model into a quantized HF model.
3. Extract KV Cache Scaling Factors from quantized HF model.
4. Load KV Cache Scaling Factors into vLLM.
### 2. Convert HF model into a quantized HF model.
Note: The following steps are adapted from the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/README.md).
`quantize.py` (examples/fp8/quantizer/quantize.py) uses the quantization toolkit (AMMO) to calibrate the PyTorch models and export TensorRT-LLM checkpoints. Each TensorRT-LLM checkpoint contains a config file (in .json format) and one or several rank weight files (in .safetensors format).
The detailed quantization toolkit (AMMO) conversion guide for FP8 can be found at `examples/fp8/quantizer/README.md`.
### 3. Extract KV Cache Scaling Factors from quantized HF model.
`extract_scales.py` (examples/fp8/extract_scales.py) can be utilized to extract the KV cache scaling factors from your quantized HF model, however at the moment, this tool exclusively supports Llama 2 models. It is also important to note the following:
1. **File Structure**: The utility operates under the assumption that all parameters, including KV cache scaling factors, corresponding to a particular Tensor Parallelism (TP) rank are stored in a single file. These files must adhere to a specific naming convention where the TP rank is immediately identified after a specific keyword (e.g., "rank") in the filename.
2. **TP Decomposition**: The utility assumes consistency between the TP decomposition employed by the quantizer tool and that used by vLLM.
3. **AMMO Compatibility**: Currently, the generated KV cache scaling factors for AMMO remain uniform across all TP ranks.
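As an illustration of the naming convention in point 1, a minimal sketch of how a TP rank can be recovered from a tensor filename (the filename below is hypothetical; the real parsing logic lives in `extract_scales.py`, reproduced further below):
```python
import re

def tp_rank_from_filename(path: str, rank_keyword: str = "rank") -> int:
    """Return the TP rank that immediately follows `rank_keyword` in a filename."""
    match = re.search(re.escape(rank_keyword) + r"(\d+)", path)
    if match is None:
        raise ValueError(f"no '{rank_keyword}<N>' found in {path!r}")
    return int(match.group(1))

print(tp_rank_from_filename("llama-2-70b_rank3.safetensors"))  # -> 3
```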
```
# prerequisites:
# - Quantized HF LLaMa 2 model
python3 examples/fp8/extract_scales.py --help
Usage: extract_scales.py [-h] --quantized_model QUANTIZED_MODEL [--load_format {auto,safetensors,npz,pt}] [--output_dir OUTPUT_DIR] [--output_name OUTPUT_NAME] [--tp_size TP_SIZE]
KV Scale Extraction Example
Required arguments:
--quantized_model: Specify either the local path to, or name of, a quantized HF model. It is expected that the quantization format is FP8_E4M3, for use on ROCm (hcu).
Optional arguments:
--cache_dir: Specify a cache directory to use in the event of a HF model download. (Default: None)
--load_format: Specify the format of the model's tensor files containing the KV cache scaling factors. (Choices: auto, safetensors, npz, pt; Default: auto)
--revision: Specify the model's revision number. (Default: None)
--output_dir: Specify the output directory. By default the KV cache scaling factors will be saved in the model directory. (Default: None)
--output_name: Specify the output filename. (Default: kv_cache_scales.json)
--tp_size: Specify the tensor-parallel (TP) size that the quantized model should correspond to. If specified, during KV cache scaling factor extraction the observed TP size will be checked against this and an error will be raised if there is a mismatch. (Default: None)
```
```bash
# Example:
python3 examples/fp8/extract_scales.py --quantized_model <QUANTIZED_MODEL_DIR> --tp_size <TENSOR_PARALLEL_SIZE> --output_dir <PATH_TO_OUTPUT_DIR>
```
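The resulting `kv_cache_scales.json` follows the schema assembled in `main()` of the extractor (reproduced further below); a sketch of its shape for a hypothetical two-layer model at TP size 2 (all values made up):
```python
import json

# Per TP rank, a map from layer index to a scalar KV cache scaling factor.
kv_cache_scales = {
    "model_type": "llama",
    "kv_cache": {
        "dtype": "float8_e4m3fn",
        "scaling_factor": {
            "0": {"0": 0.021, "1": 0.019},  # TP rank 0
            "1": {"0": 0.022, "1": 0.018},  # TP rank 1
        },
    },
}
print(json.dumps(kv_cache_scales, indent=4))
```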
### 4. Load KV Cache Scaling Factors into vLLM.
This script evaluates the inference throughput of language models using various backends such as vLLM. It measures the time taken to process a given number of prompts and generate sequences for each prompt. The recently generated KV cache scaling factors are now integrated into the benchmarking process and allow for KV cache scaling factors to be utilized for FP8.
```
# prerequisites:
# - LLaMa 2 kv_cache_scales.json file
python3 benchmarks/benchmark_throughput.py --help
usage: benchmark_throughput.py [-h] [--backend {vllm,hf,mii}] [--dataset DATASET] [--input-len INPUT_LEN] [--output-len OUTPUT_LEN] [--model MODEL]
[--tokenizer TOKENIZER] [--quantization {awq,gptq,None}] [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--n N]
[--use-beam-search] [--num-prompts NUM_PROMPTS] [--seed SEED] [--hf-max-batch-size HF_MAX_BATCH_SIZE] [--trust-remote-code]
[--max-model-len MAX_MODEL_LEN] [--dtype {auto,half,float16,bfloat16,float,float32}] [--enforce-eager] [--kv-cache-dtype {auto,fp8}]
[--quantization-param-path KV_CACHE_quantization_param_path]
Benchmark Throughput Example
optional arguments:
-h, --help show this help message and exit
--backend {vllm,hf,mii}
--dataset DATASET Path to the dataset.
--input-len INPUT_LEN Input prompt length for each request
--output-len OUTPUT_LEN Output length for each request. Overrides the output length from the dataset.
--model MODEL
--tokenizer TOKENIZER
--quantization {awq,gptq,None}, -q {awq,gptq,None}
--tensor-parallel-size TENSOR_PARALLEL_SIZE, -tp TENSOR_PARALLEL_SIZE
--n N Number of generated sequences per prompt.
--use-beam-search
--num-prompts NUM_PROMPTS Number of prompts to process.
--seed SEED
--hf-max-batch-size HF_MAX_BATCH_SIZE Maximum batch size for HF backend.
--trust-remote-code trust remote code from huggingface
--max-model-len MAX_MODEL_LEN Maximum length of a sequence (including prompt and output). If None, will be derived from the model.
--dtype {auto,half,float16,bfloat16,float,float32} data type for model weights and activations. The "auto" option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
--enforce-eager enforce eager execution
--kv-cache-dtype {auto,fp8} Data type for kv cache storage. If "auto", will use model data type. FP8_E5M2 (without scaling) is only supported on cuda version greater than 11.8. On ROCm (hcu), FP8_E4M3 is instead supported for common inference criteria.
--quantization-param-path QUANT_PARAM_JSON Path to the JSON file containing the KV cache scaling factors. This should generally be supplied, when KV cache dtype is FP8. Otherwise, KV cache scaling factors default to 1.0, which may cause accuracy issues. FP8_E5M2 (without scaling) is only supported on cuda version greater than 11.8. On ROCm (hcu), FP8_E4M3 is instead supported for common inference criteria.
```
```bash
# Example:
python3 benchmarks/benchmark_throughput.py --input-len <INPUT_LEN> --output-len <OUTPUT_LEN> -tp <TENSOR_PARALLEL_SIZE> --kv-cache-dtype fp8 --quantization-param-path <path/to/kv_cache_scales.json> --model <path-to-llama2>
```
import argparse
import glob
import json
import os
from typing import Any, Callable, Dict, List, Optional, Tuple
import numpy as np
import torch
from safetensors.torch import safe_open
from vllm.model_executor.layers.quantization.schema import QuantParamSchema
# Adapted from vllm/model_executor/model_loader/weight_utils.py
# The main differences are that we add the NPZ format and simplify
# its functionality drastically for our purposes (e.g. we assume that
# the quantized model exists locally and there is no need to download it)
def _prepare_hf_weights(
quantized_model_dir: str,
load_format: str = "auto",
fall_back_to_pt: bool = True,
) -> Tuple[List[str], bool]:
if not os.path.isdir(quantized_model_dir):
raise FileNotFoundError(
f"The quantized model directory `{quantized_model_dir}` "
"does not exist.")
use_safetensors = False
# Some quantized models use .pt files for storing the weights.
if load_format == "auto":
allow_patterns = ["*.safetensors", "*.bin"]
elif load_format == "safetensors":
use_safetensors = True
allow_patterns = ["*.safetensors"]
elif load_format == "pt":
allow_patterns = ["*.pt"]
elif load_format == "npz":
allow_patterns = ["*.npz"]
else:
raise ValueError(f"Unknown load_format: {load_format}")
if fall_back_to_pt:
allow_patterns += ["*.pt"]
hf_weights_files: List[str] = []
for pattern in allow_patterns:
hf_weights_files += glob.glob(
os.path.join(quantized_model_dir, pattern))
if len(hf_weights_files) > 0:
if pattern == "*.safetensors":
use_safetensors = True
break
if not use_safetensors:
# Exclude files that are not needed for inference.
# https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/trainer.py#L227-L233
blacklist = [
"training_args.bin",
"optimizer.bin",
"optimizer.pt",
"scheduler.pt",
"scaler.pt",
]
hf_weights_files = [
f for f in hf_weights_files
if not any(f.endswith(x) for x in blacklist)
]
if len(hf_weights_files) == 0:
raise RuntimeError(
f"Cannot find any model weights with `{quantized_model_dir}`")
return hf_weights_files, use_safetensors
# Adapted from vllm/model_executor/model_loader/weight_utils.py
def _hf_tensorfile_iterator(filename: str, load_format: str,
use_safetensors: bool):
if load_format == "npz":
assert not use_safetensors
with np.load(filename) as data:
for name in data.files:
param = torch.from_numpy(data[name])
yield name, param
elif use_safetensors:
with safe_open(filename, framework="pt") as f:
for name in f.keys(): # NOQA: SIM118
param = f.get_tensor(name)
yield name, param
else:
state = torch.load(filename, map_location="cpu")
for name, param in state.items():
yield name, param
del state
torch.cuda.empty_cache()
def _kv_scales_extractor(
hf_tensor_files: List[str],
use_safetensors: bool,
rank_keyword: str = "rank",
expected_tp_size: Optional[int] = None) -> Dict[int, Dict[int, float]]:
"""
Given a list of files containing tensor data, attempt to extract KV cache
scales from these files. Intended as a helper function taking in the output
from _prepare_hf_weights.
Args:
rank_keyword Matches the number immediately after this keyword in the
tensor filename to determine the TP rank corresponding
to said tensor file
expected_tp_size If specified, the TP size of the tensor files is checked
against this and an error is raised if they don't match.
Returns a dictionary mapping TP ranks to their relevant KV cache scales.
The per-rank scales are themselves represented as a dictionary of layer
indices to the respective per-layer scale.
"""
for char in rank_keyword:
assert not char.isdecimal(
), f"Rank keyword {rank_keyword} contains a numeric character!"
rank_scales_map: Dict[int, Dict[int, float]] = {}
for tensor_file in hf_tensor_files:
try:
rank_idx = tensor_file.find(rank_keyword)
if rank_idx != -1:
start_idx = rank_idx + len(rank_keyword)
stop_idx = start_idx
while stop_idx < len(
tensor_file) and tensor_file[stop_idx].isdecimal():
stop_idx += 1
if stop_idx == start_idx:
raise RuntimeError("Did not find rank # in filename.")
rank = int(tensor_file[start_idx:stop_idx])
elif len(hf_tensor_files) == 1:
# Since there is only one tensor file, we can assume
# that it's intended for TP rank 0
rank = 0
else:
raise RuntimeError(
f"Filename does not contain '{rank_keyword}'.")
except RuntimeError:
print("Unable to determine TP rank "
f"corresponding to file '{tensor_file}'")
raise
if rank not in rank_scales_map:
layer_scales_map: Dict[int, float] = {}
rank_scales_map[rank] = layer_scales_map
else:
raise RuntimeError(
f"Tensor file '{tensor_file}' shares TP rank {rank} "
"with another tensor file.")
module_delimiter = ":" if args.load_format == "npz" else "."
for name, param in _hf_tensorfile_iterator(tensor_file,
args.load_format,
use_safetensors):
if "kv_cache_scaling_factor" in name:
nums = [
int(s) for s in name.split(module_delimiter)
if s.isdecimal()
]
assert len(
nums) == 1, f"Could not determine layer idx for {name}"
layer_idx = nums[0]
assert layer_idx not in layer_scales_map, f"Duplicate scaling"\
f" factor corresponding to layer {layer_idx}"
try:
layer_scales_map[layer_idx] = param.item()
except RuntimeError:
print(
"This utility supports only per-tensor scalar scales "
f"for now. The tensor\n {name} = {param} \nis an "
"invalid scale factor.")
raise
if all(
len(layer_scales_map) == 0
for layer_scales_map in rank_scales_map.values()):
# Note: this is true even if the rank_scales_map is empty
print("WARNING: No KV cache scale factors found. No output saved.")
return None
empirical_tp_world_size = max(rank_scales_map.keys()) + 1
if expected_tp_size is not None:
assert expected_tp_size == empirical_tp_world_size, \
f"User expected TP world size = {expected_tp_size} " \
"from model but tool is expecting TP world size = " \
f"{empirical_tp_world_size} from model instead."
for i in range(empirical_tp_world_size):
assert i in rank_scales_map, "Expected TP world size = "\
f"{empirical_tp_world_size} but did not find KV " \
f"cache scaling factors for TP rank {i}"
print(f"Found TP world size = {empirical_tp_world_size} "
"when extracting KV cache scales!")
return rank_scales_map
def _metadata_extractor(quantized_model_dir: str,
metadata_extract_fns: \
Dict[str, Callable[[Dict[str, Any]], Any]]) \
-> Dict[str, Any]:
"""
Given a directory containing quantized model files, this function
aims to extract metadata from the JSON files within this directory.
Each JSON file is expected to represent a dictionary in JSON
format (referred to as a "JSON-dictionary"). Metadata extraction is
defined by a dictionary called metadata_extract_fns, where each
metadata field name is mapped to an extraction function.
These extraction functions are designed to take a JSON-dictionary
as their only argument and return the corresponding metadata.
While extraction functions are permitted to raise exceptions, they
should only raise a KeyError or ValueError if the metadata field
cannot be extracted from the current JSON-dictionary, yet there's
a possibility of finding it in another JSON-dictionary.
The function returns a dictionary that maps metadata fields to
their extracted data. The keys of this dictionary correspond exactly
to those in metadata_extract_fns. If any fields fail to be extracted,
their corresponding values are set to None, and a warning is printed.
"""
if not os.path.isdir(quantized_model_dir):
raise FileNotFoundError(
f"The quantized model directory `{quantized_model_dir}` "
"does not exist.")
metadata_files = glob.glob(os.path.join(quantized_model_dir, "*.json"))
result: Dict[str, Any] = {}
for file in metadata_files:
with open(file) as f:
try:
metadata = json.load(f)
except json.JSONDecodeError:
print(f"Could not parse `{file}` as a valid metadata file,"
" skipping it.")
continue
if not isinstance(metadata, dict):
print(f"The file `{file}` does not correspond to a "
"JSON-serialized dictionary, skipping it.")
continue
for metadata_name, extract_fn in metadata_extract_fns.items():
try:
metadata_info = extract_fn(metadata)
if metadata_name not in result:
result[metadata_name] = metadata_info
elif metadata_info != result[metadata_name]:
raise RuntimeError(
"Metadata mismatch! Originally found "
f"{metadata_name} = {result[metadata_name]} but "
f"now found {metadata_name} = {metadata_info} in "
f"`{file}`")
except KeyError:
# It is possible that a given file does not contain some
# of our selected metadata as it could be located in some
# other metadata file.
# 'EFINAE': extract_fn failure is not an error.
pass
except ValueError:
# See above.
pass
# Warn if we cannot find any of the requested metadata
for metadata_name in metadata_extract_fns:
if metadata_name not in result:
print("WARNING: Unable to find requested metadata field "
f"`{metadata_name}`, setting it to None.")
result[metadata_name] = None
return result
def main(args):
metadata_extract_fns = {
"model_type": lambda json_dict: json_dict["layers"][0]["decoder_type"],
"tp_size": lambda json_dict: int(json_dict["tensor_parallel"]),
"model_dtype": lambda json_dict: json_dict["dtype"]
}
recovered_metadata = _metadata_extractor(args.quantized_model,
metadata_extract_fns)
if args.tp_size is not None:
metadata_tp_size = recovered_metadata["tp_size"]
if metadata_tp_size is not None:
assert args.tp_size == metadata_tp_size, \
f"User expected TP world size = {args.tp_size} " \
f"but found TP world size = {metadata_tp_size} from metadata!"
expected_tp_size = args.tp_size or recovered_metadata["tp_size"]
rank_keyword = "rank"
hf_tensor_files, use_safetensors = _prepare_hf_weights(
args.quantized_model, args.load_format)
rank_scales_map = _kv_scales_extractor(hf_tensor_files, use_safetensors,
rank_keyword, expected_tp_size)
# Postprocess: formatting to the current schema. Consider pulling it
# out into a dedicated function should it ever become more complicated.
rank_scales_map = {
rank: {k: scale[k]
for k in sorted(scale.keys())}
for rank, scale in rank_scales_map.items()
}
# TODO: Expand this with activation and weights scaling factors when
# they are used in the future
schema = QuantParamSchema(
model_type=recovered_metadata["model_type"],
kv_cache={
"dtype": ("float8_e4m3fn" if len(rank_scales_map) > 0 else
recovered_metadata["model_dtype"]),
"scaling_factor":
rank_scales_map
},
)
if args.output_dir is None:
output_file = os.path.join(args.quantized_model, args.output_name)
else:
if not os.path.isdir(args.output_dir):
os.makedirs(args.output_dir, exist_ok=True)
output_file = os.path.join(args.output_dir, args.output_name)
with open(output_file, 'w') as f:
f.write(schema.model_dump_json(indent=4))
print(f"Completed! KV cache scaling factors saved to {output_file}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="This simple utility extracts the "
"KV cache scaling factors from a quantized HF model "
"and saves them to a JSON file compatible with later "
"use by vLLM (pass this file to the appropriate "
"runtime typically using the argument "
"--quantization-param-path <filename>). This is only used "
"if the KV cache dtype is FP8 and on ROCm (hcu).")
parser.add_argument(
"--quantized-model",
help="Specify the directory containing a single quantized HF model. "
"It is expected that the quantization format is FP8_E4M3, for use "
"on ROCm (hcu).",
required=True)
parser.add_argument(
"--load_format",
help="Optionally specify the format of the model's tensor files "
"containing the KV cache scaling factors.",
choices=["auto", "safetensors", "npz", "pt"],
default="auto")
parser.add_argument(
"--output-dir",
help="Optionally specify the output directory. By default the "
"KV cache scaling factors will be saved in the model directory, "
"however you can override this behavior here.",
default=None)
parser.add_argument(
"--output-name",
help="Optionally specify the output filename.",
# TODO: Change this once additional scaling factors are enabled
default="kv_cache_scales.json")
parser.add_argument(
"--tp-size",
help="Optionally specify the tensor-parallel (TP) size that the "
"quantized model should correspond to. If specified, during KV "
"cache scaling factor extraction the observed TP size will be "
"checked against this and an error will be raised if there is "
"a mismatch. If not specified, the quantized model's expected "
"TP size is instead inferred from the largest TP rank observed. "
"The expected TP size is cross-checked against the TP ranks "
"observed in the quantized model and an error is raised if any "
"discrepancies are found.",
default=None,
type=int)
args = parser.parse_args()
main(args)
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams
def run_gguf_inference(model_path):
PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n" # noqa: E501
system_message = "You are a friendly chatbot who always responds in the style of a pirate." # noqa: E501
# Sample prompts.
prompts = [
"How many helicopters can a human eat in one sitting?",
"What's the future of AI?",
]
prompts = [
PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
for prompt in prompts
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0, max_tokens=128)
# Create an LLM.
llm = LLM(model=model_path,
tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
gpu_memory_utilization=0.95)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
if __name__ == "__main__":
repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
model = hf_hub_download(repo_id, filename=filename)
run_gguf_inference(model)
import argparse
import gradio as gr
from openai import OpenAI
# Argument parser setup
parser = argparse.ArgumentParser(
description='Chatbot Interface with Customizable Parameters')
parser.add_argument('--model-url',
type=str,
default='http://localhost:8000/v1',
help='Model URL')
parser.add_argument('-m',
'--model',
type=str,
required=True,
help='Model name for the chatbot')
parser.add_argument('--temp',
type=float,
default=0.8,
help='Temperature for text generation')
parser.add_argument('--stop-token-ids',
type=str,
default='',
help='Comma-separated stop token IDs')
parser.add_argument("--host", type=str, default=None)
parser.add_argument("--port", type=int, default=8001)
# Parse the arguments
args = parser.parse_args()
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = args.model_url
# Create an OpenAI client to interact with the API server
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
def predict(message, history):
# Convert chat history to OpenAI format
history_openai_format = [{
"role": "system",
"content": "You are a great ai assistant."
}]
for human, assistant in history:
history_openai_format.append({"role": "user", "content": human})
history_openai_format.append({
"role": "assistant",
"content": assistant
})
history_openai_format.append({"role": "user", "content": message})
# Create a chat completion request and send it to the API server
stream = client.chat.completions.create(
model=args.model, # Model name to use
messages=history_openai_format, # Chat history
temperature=args.temp, # Temperature for text generation
stream=True, # Stream response
extra_body={
'repetition_penalty':
1,
'stop_token_ids': [
int(id.strip()) for id in args.stop_token_ids.split(',')
if id.strip()
] if args.stop_token_ids else []
})
# Read and return generated text from response stream
partial_message = ""
for chunk in stream:
partial_message += (chunk.choices[0].delta.content or "")
yield partial_message
# Create and launch a chat interface with Gradio
gr.ChatInterface(predict).queue().launch(server_name=args.host,
server_port=args.port,
share=True)
@@ -42,9 +42,6 @@ def sample_requests(
# Only keep the first two turns of each conversation.
dataset = [data["prompt"] for data in dataset]
- # Shuffle the dataset.
- random.shuffle(dataset)
# Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = []
for i in range(len(dataset)):
@@ -95,7 +92,6 @@ def run_vllm(
spec_decoding_acceptance_method: str = None,
enable_lora: bool = False,
max_lora_rank: int = 32,
- merge_lora: bool = False,
lora_extra_vocab_size: int = 0,
lora_target_modules: List[str] = None,
num_speculative_heads: int = 5,
@@ -134,7 +130,6 @@ spec_decoding_acceptance_method=spec_decoding_acceptance_method,
spec_decoding_acceptance_method=spec_decoding_acceptance_method,
enable_lora=enable_lora,
max_lora_rank=max_lora_rank,
- merge_lora=merge_lora,
lora_extra_vocab_size=lora_extra_vocab_size,
lora_target_modules=lora_target_modules,
num_speculative_heads=num_speculative_heads,
@@ -229,7 +224,6 @@ async def run_vllm_async(
spec_decoding_acceptance_method: str = None,
enable_lora: bool = False,
max_lora_rank: int = 32,
- merge_lora: bool = False,
lora_extra_vocab_size: int = 0,
lora_target_modules: List[str] = None,
num_speculative_heads: int = 5,
@@ -270,7 +264,6 @@ spec_decoding_acceptance_method=spec_decoding_acceptance_method,
spec_decoding_acceptance_method=spec_decoding_acceptance_method,
enable_lora=enable_lora,
max_lora_rank=max_lora_rank,
- merge_lora=merge_lora,
lora_extra_vocab_size=lora_extra_vocab_size,
lora_target_modules=lora_target_modules,
num_speculative_heads=num_speculative_heads,
@@ -345,7 +338,7 @@ def main(args: argparse.Namespace):
args.disable_async_output_proc, False, args.max_num_seqs,
args.speculative_model, args.speculative_draft_tensor_parallel_size,
args.speculative_disable_by_batch_size, args.spec_decoding_acceptance_method,
- args.enable_lora, args.max_lora_rank, args.merge_lora, args.lora_extra_vocab_size,
+ args.enable_lora, args.max_lora_rank, args.lora_extra_vocab_size,
args.lora_target_modules, args.num_speculative_heads,
args.num_speculative_tokens
]
@@ -363,7 +356,7 @@ def main(args: argparse.Namespace):
args.disable_async_output_proc, args.max_num_seqs,
args.speculative_model, args.speculative_draft_tensor_parallel_size,
args.speculative_disable_by_batch_size, args.spec_decoding_acceptance_method,
- args.enable_lora, args.max_lora_rank, args.merge_lora, args.lora_extra_vocab_size,
+ args.enable_lora, args.max_lora_rank, args.lora_extra_vocab_size,
args.lora_target_modules, args.num_speculative_heads,
args.num_speculative_tokens
]
# SPDX-License-Identifier: Apache-2.0
import os
import json
import pytest
import torch
import triton
from triton_decode_attention import decode_attentionv1_fwd, decode_attentionv2_fwd
def cdiv(a, b):
return (a + b - 1) // b
@pytest.mark.parametrize("B", [1])
# @pytest.mark.parametrize("L", [100])
@pytest.mark.parametrize("L", [1,100,400,700,1000,1300,1600,1900,2200,2500,2800,3100,3400,3700,4000,4300,4600,4900,5000,5500,6000,6500,7000,7500,8000,8500,9000,9500,10000,10500,11000,11500,12000,12500,13000,13500,14000,14500,15000,15500,16000,16500,17000,17500,18000,18500,19000,19500,20000,20500,21000,21500,22000,22500,23000,23500,24000,24500,25000,25500,26000,26500,27000,27500,28000,28500,29000,29500,30000,30500,31000,31500,32000,32500])
@pytest.mark.parametrize("H_Q", [4, 8, 16])
@pytest.mark.parametrize("H_KV", [1])
@pytest.mark.parametrize("D_QK", [576])
@pytest.mark.parametrize("D_V", [512])
@pytest.mark.parametrize("CACHE_SIZE", [16384])
@pytest.mark.parametrize("PAGE_SIZE", [16])
def test_decode_attention(B, L, H_Q, H_KV, D_QK, D_V, CACHE_SIZE, PAGE_SIZE):
assert CACHE_SIZE % PAGE_SIZE == 0
dtype = torch.bfloat16
seq_len = L # This represents the number of tokens already in the sequence
sm_scale = 1.0 / (D_QK**0.5)
num_kv_splits = 4
num_pages_per_batch = cdiv(seq_len, PAGE_SIZE)  # ceiling division, e.g. (1027 + 16 - 1) // 16 = 65
req_to_page = torch.randint(0,
CACHE_SIZE // PAGE_SIZE,
(B, num_pages_per_batch, 1),  # tensor of shape (B, num_pages_per_batch, 1) with values in [0, CACHE_SIZE // PAGE_SIZE)
device="cuda")
req_to_token = req_to_page * PAGE_SIZE
req_to_token = req_to_token.expand(B, num_pages_per_batch, PAGE_SIZE)
req_to_token = req_to_token + torch.arange(PAGE_SIZE, device="cuda").view(
1, 1, -1)
req_to_token = req_to_token.view(B, -1)
req_to_token = req_to_token[:, :seq_len].contiguous()
# q represents the new token being generated, one per batch
q = torch.randn(B, H_Q, D_QK, dtype=dtype, device="cuda")
# k_buffer and v_buffer represent all previous tokens
# Page size is 1.
k_buffer = torch.randn(CACHE_SIZE, H_KV, D_QK, dtype=dtype, device="cuda")
v_buffer = torch.randn(CACHE_SIZE, H_KV, D_V, dtype=dtype, device="cuda")
# o will have the same shape as q
o = torch.zeros(B, H_Q, D_V, dtype=dtype, device="cuda")
b_seq_len = torch.full((B, ), seq_len, device="cuda")
attn_logits = torch.empty(
(B, H_Q, num_kv_splits, D_V + 1),
dtype=torch.float32,
device="cuda",
)
b_req_idx = torch.arange(B, device="cuda").to(torch.int32)
# Call the original implementation.
decode_attentionv2_fwd(
q,
k_buffer,
v_buffer,
o,
req_to_token,
b_seq_len,
attn_logits,
num_kv_splits,
sm_scale,
)
# Page size can be larger than 1.
k_buffer = k_buffer.view(CACHE_SIZE // PAGE_SIZE, PAGE_SIZE, H_KV, D_QK)
v_buffer = v_buffer.view(CACHE_SIZE // PAGE_SIZE, PAGE_SIZE, H_KV, D_V)
o1 = torch.zeros_like(o)
configs = {
"v2_tc": {"stage1": {}, "stage2": {}},
"v1_2stages_tc": {"stage1": {}, "stage2": {}},
}
ms = {
"v1_2stages_tc": 10000.0,
"v2_tc": 10000.0,
}
final_best_config = {
"kernel_kind": "",
"best_config": {},
"best_us": 0.0,
}
v2_tc_stage1_best_config, v2_tc_stage2_best_config = decode_attentionv2_fwd(
q,
k_buffer,
v_buffer,
o1,
req_to_page,
b_seq_len,
attn_logits,
num_kv_splits,
sm_scale,
PAGE_SIZE,
)
assert torch.allclose(o, o1, atol=1e-2, rtol=1e-2)
quantiles = [0.5, 0.2, 0.8]
v2_tc_ms, v2_tc_min_ms, v2_tc_max_ms = triton.testing.do_bench(lambda:
decode_attentionv2_fwd(
q,
k_buffer,
v_buffer,
o1,
req_to_page,
b_seq_len,
attn_logits,
num_kv_splits,
sm_scale,
PAGE_SIZE,
), quantiles=quantiles)
for key, value in v2_tc_stage1_best_config.kwargs.items():
configs["v2_tc"]["stage1"][key] = value
configs["v2_tc"]["stage1"]["num_stages"] = v2_tc_stage1_best_config.num_stages
configs["v2_tc"]["stage1"]["num_warps"] = v2_tc_stage1_best_config.num_warps
for key, value in v2_tc_stage2_best_config.kwargs.items():
configs["v2_tc"]["stage2"][key] = value
configs["v2_tc"]["stage2"]["num_stages"] = v2_tc_stage2_best_config.num_stages
configs["v2_tc"]["stage2"]["num_warps"] = v2_tc_stage2_best_config.num_warps
ms["v2_tc"] = v2_tc_ms
print(f"v2_tc best configs is {configs['v2_tc']}")
print("print mla decode attention v2 kernel [B, L, H_Q, H_KV, D_QK, D_V, CACHE_SIZE, PAGE_SIZE] min cost :",[B, L, H_Q, H_KV, D_QK, D_V, CACHE_SIZE, PAGE_SIZE], v2_tc_ms)
o2 = torch.zeros_like(o)
v1_tc_stage1_best_config, v1_tc_stage2_best_config = decode_attentionv1_fwd(
q,
k_buffer,
v_buffer,
o2,
req_to_page,
b_seq_len,
attn_logits,
num_kv_splits,
sm_scale,
PAGE_SIZE,
)
assert torch.allclose(o, o2, atol=1e-2, rtol=1e-2)
v1_tc_ms, v1_tc_min_ms, v1_tc_max_ms = triton.testing.do_bench(lambda:
decode_attentionv1_fwd(
q,
k_buffer,
v_buffer,
o1,
req_to_page,
b_seq_len,
attn_logits,
num_kv_splits,
sm_scale,
PAGE_SIZE,
), quantiles=quantiles)
for key, value in v1_tc_stage1_best_config.kwargs.items():
configs["v1_2stages_tc"]["stage1"][key] = value
configs["v1_2stages_tc"]["stage1"]["num_stages"] = v1_tc_stage1_best_config.num_stages
configs["v1_2stages_tc"]["stage1"]["num_warps"] = v1_tc_stage1_best_config.num_warps
configs["v1_2stages_tc"]["stage1"]["num_ldmatrixes"] = v1_tc_stage1_best_config.num_ldmatrixes
for key, value in v1_tc_stage2_best_config.kwargs.items():
configs["v1_2stages_tc"]["stage2"][key] = value
configs["v1_2stages_tc"]["stage2"]["num_stages"] = v1_tc_stage2_best_config.num_stages
configs["v1_2stages_tc"]["stage2"]["num_warps"] = v1_tc_stage2_best_config.num_warps
configs["v1_2stages_tc"]["stage2"]["num_ldmatrixes"] = v1_tc_stage1_best_config.num_ldmatrixes
ms["v1_2stages_tc"] = v1_tc_ms
min_key, min_ms = min(ms.items(), key=lambda x: x[1])
final_best_config["kernel_kind"] = min_key
final_best_config["best_config"] = configs[min_key]
final_best_config["best_us"] = min_ms * 1000
print(f"v1_2stages_tc best configs is {configs['v1_2stages_tc']}")
print("print mla decode attention v1 kernel [B, L, H_Q, H_KV, D_QK, D_V, CACHE_SIZE, PAGE_SIZE] min cost :",[B, L, H_Q, H_KV, D_QK, D_V, CACHE_SIZE, PAGE_SIZE], v1_tc_ms)
print(f"Tuned_decode_attention choose {min_key} kernel, min cost {min_ms} ms, best config of {min_key} kernel is {configs[min_key]}")
assert torch.allclose(o, o2, atol=1e-2, rtol=1e-2)
#**************save config**************#
batch = b_req_idx.shape[0]
mean_seq_len = int((b_seq_len.sum() / max(1, batch)).item())
device_name = torch.cuda.get_device_name().replace(" ", "_")
if "K100_AI" in device_name:
# return f"QH={QH}_KVH={KVH}_QKD={QKD}_VD={VD}_{cache_dtype}_K100AI.json"
file_name = f"QH={H_Q}_KVH={H_KV}_QKD={D_QK}_VD={D_V}_fp16_K100AI.json"
elif "BW" in device_name:
# return f"QH={QH}_KVH={KVH}_QKD={QKD}_VD={VD}_{cache_dtype}_BW.json"
file_name = f"QH={H_Q}_KVH={H_KV}_QKD={D_QK}_VD={D_V}_fp16_BW.json"
else:
raise ValueError(f"Unsupported device name: {device_name}")
if os.path.exists(file_name):
with open(file_name, 'r') as file:
config_info = json.load(file)
else:
config_info = {}
# If config_info does not yet contain the current batch, initialize it as an empty dict
# if f"{batch}" not in config_info:
# config_info[f"{batch}"] = {}
# Add the new mean_seq_len configuration under the current batch
# config_info[f"{batch}"][f"{mean_seq_len}"] = final_best_config
config_info[f"{mean_seq_len}"] = final_best_config
# Save the best config
with open(file_name, 'w') as file:
json.dump(config_info, file, indent=1)
#**************save config**************#
from vllm import LLM, SamplingParams
if __name__ == '__main__':
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)
# Create an LLM.
llm = LLM(model="facebook/opt-125m",tensor_parallel_size=1, distributed_executor_backend="ray", dtype="float16",trust_remote_code=True, enforce_eager=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
This example shows how to use vLLM for running offline inference
with the correct prompt format on audio language models.
For most models, the prompt format should follow corresponding examples
on HuggingFace model repository.
"""
import os
from dataclasses import asdict
from typing import NamedTuple, Optional
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
from vllm import LLM, EngineArgs, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest
from vllm.utils import FlexibleArgumentParser
audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")]
question_per_audio_count = {
0: "What is 1+1?",
1: "What is recited in the audio?",
2: "What sport and what nursery rhyme are referenced?",
}
class ModelRequestData(NamedTuple):
engine_args: EngineArgs
prompt: str
stop_token_ids: Optional[list[int]] = None
lora_requests: Optional[list[LoRARequest]] = None
# NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
# lower-end GPUs.
# Unless specified, these settings have been tested to work on a single L4.
# Granite Speech
def run_granite_speech(question: str, audio_count: int) -> ModelRequestData:
# NOTE - the settings in this example are somewhat different from what is
# optimal for granite speech, and it is generally recommended to use beam
# search. Check the model README for suggested settings.
# https://huggingface.co/ibm-granite/granite-speech-3.3-8b
model_name = "ibm-granite/granite-speech-3.3-8b"
engine_args = EngineArgs(
model=model_name,
trust_remote_code=True,
max_model_len=2048,
max_num_seqs=2,
enable_lora=True,
max_lora_rank=64,
limit_mm_per_prompt={"audio": audio_count},
)
# The model has an audio-specific lora directly in its model dir;
# it should be enabled whenever you pass audio inputs to the model.
speech_lora_path = model_name
audio_placeholder = "<|audio|>" * audio_count
prompts = f"<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>{audio_placeholder}{question}<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>" # noqa: E501
return ModelRequestData(
engine_args=engine_args,
prompt=prompts,
lora_requests=[LoRARequest("speech", 1, speech_lora_path)],
)
# MiniCPM-O
def run_minicpmo(question: str, audio_count: int) -> ModelRequestData:
model_name = "openbmb/MiniCPM-o-2_6"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
engine_args = EngineArgs(
model=model_name,
trust_remote_code=True,
max_model_len=4096,
max_num_seqs=2,
limit_mm_per_prompt={"audio": audio_count},
)
stop_tokens = ["<|im_end|>", "<|endoftext|>"]
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
audio_placeholder = "(<audio>./</audio>)" * audio_count
audio_chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n<|spk_bos|><|spk|><|spk_eos|><|tts_bos|>' }}{% endif %}" # noqa: E501
messages = [{"role": "user", "content": f"{audio_placeholder}\n{question}"}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
chat_template=audio_chat_template,
)
return ModelRequestData(
engine_args=engine_args,
prompt=prompt,
stop_token_ids=stop_token_ids,
)
# Phi-4-multimodal-instruct
def run_phi4mm(question: str, audio_count: int) -> ModelRequestData:
"""
Phi-4-multimodal-instruct supports both image and audio inputs. Here, we
show how to process audio inputs.
"""
model_path = snapshot_download("microsoft/Phi-4-multimodal-instruct")
# Since the vision-lora and speech-lora co-exist with the base model,
# we have to manually specify the path of the lora weights.
speech_lora_path = os.path.join(model_path, "speech-lora")
placeholders = "".join([f"<|audio_{i + 1}|>" for i in range(audio_count)])
prompts = f"<|user|>{placeholders}{question}<|end|><|assistant|>"
engine_args = EngineArgs(
model=model_path,
trust_remote_code=True,
max_model_len=12800,
max_num_seqs=2,
enable_lora=True,
max_lora_rank=320,
limit_mm_per_prompt={"audio": audio_count},
)
return ModelRequestData(
engine_args=engine_args,
prompt=prompts,
lora_requests=[LoRARequest("speech", 1, speech_lora_path)],
)
# Qwen2-Audio
def run_qwen2_audio(question: str, audio_count: int) -> ModelRequestData:
model_name = "Qwen/Qwen2-Audio-7B-Instruct"
engine_args = EngineArgs(
model=model_name,
max_model_len=4096,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count},
)
audio_in_prompt = "".join(
[
f"Audio {idx + 1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
for idx in range(audio_count)
]
)
prompt = (
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n"
f"{audio_in_prompt}{question}<|im_end|>\n"
"<|im_start|>assistant\n"
)
return ModelRequestData(
engine_args=engine_args,
prompt=prompt,
)
# Qwen2.5-Omni
def run_qwen2_5_omni(question: str, audio_count: int):
model_name = "Qwen/Qwen2.5-Omni-7B"
engine_args = EngineArgs(
model=model_name,
max_model_len=4096,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count},
)
audio_in_prompt = "".join(
["<|audio_bos|><|AUDIO|><|audio_eos|>\n" for idx in range(audio_count)]
)
default_system = (
"You are Qwen, a virtual human developed by the Qwen Team, Alibaba "
"Group, capable of perceiving auditory and visual inputs, as well as "
"generating text and speech."
)
prompt = (
f"<|im_start|>system\n{default_system}<|im_end|>\n"
"<|im_start|>user\n"
f"{audio_in_prompt}{question}<|im_end|>\n"
"<|im_start|>assistant\n"
)
return ModelRequestData(
engine_args=engine_args,
prompt=prompt,
)
# Ultravox 0.5-1B
def run_ultravox(question: str, audio_count: int) -> ModelRequestData:
model_name = "fixie-ai/ultravox-v0_5-llama-3_2-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
messages = [{"role": "user", "content": "<|audio|>\n" * audio_count + question}]
prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
engine_args = EngineArgs(
model=model_name,
max_model_len=4096,
max_num_seqs=5,
trust_remote_code=True,
limit_mm_per_prompt={"audio": audio_count},
)
return ModelRequestData(
engine_args=engine_args,
prompt=prompt,
)
# Whisper
def run_whisper(question: str, audio_count: int) -> ModelRequestData:
assert audio_count == 1, "Whisper only supports a single audio input per prompt"
model_name = "openai/whisper-large-v3-turbo"
prompt = "<|startoftranscript|>"
engine_args = EngineArgs(
model=model_name,
max_model_len=448,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count},
)
return ModelRequestData(
engine_args=engine_args,
prompt=prompt,
)
model_example_map = {
"granite_speech": run_granite_speech,
"minicpmo": run_minicpmo,
"phi4_mm": run_phi4mm,
"qwen2_audio": run_qwen2_audio,
"qwen2_5_omni": run_qwen2_5_omni,
"ultravox": run_ultravox,
"whisper": run_whisper,
}
def parse_args():
parser = FlexibleArgumentParser(
description="Demo on using vLLM for offline inference with "
"audio language models"
)
parser.add_argument(
"--model-type",
"-m",
type=str,
default="ultravox",
choices=model_example_map.keys(),
help='Huggingface "model_type".',
)
parser.add_argument(
"--num-prompts", type=int, default=1, help="Number of prompts to run."
)
parser.add_argument(
"--num-audios",
type=int,
default=1,
choices=[0, 1, 2],
help="Number of audio items per prompt.",
)
parser.add_argument(
"--seed",
type=int,
default=None,
help="Set the seed when initializing `vllm.LLM`.",
)
return parser.parse_args()
def main(args):
model = args.model_type
if model not in model_example_map:
raise ValueError(f"Model type {model} is not supported.")
audio_count = args.num_audios
req_data = model_example_map[model](
question_per_audio_count[audio_count], audio_count
)
# Disable other modalities to save memory
default_limits = {"image": 0, "video": 0, "audio": 0}
req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
req_data.engine_args.limit_mm_per_prompt or {}
)
engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
llm = LLM(**engine_args)
# We set temperature to 0.2 so that outputs can be different
# even when all prompts are identical when running batch inference.
sampling_params = SamplingParams(
temperature=0.2, max_tokens=64, stop_token_ids=req_data.stop_token_ids
)
mm_data = {}
if audio_count > 0:
mm_data = {
"audio": [
asset.audio_and_sample_rate for asset in audio_assets[:audio_count]
]
}
assert args.num_prompts > 0
inputs = {"prompt": req_data.prompt, "multi_modal_data": mm_data}
if args.num_prompts > 1:
# Batch inference
inputs = [inputs] * args.num_prompts
# Add LoRA request if applicable
lora_request = (
req_data.lora_requests * args.num_prompts if req_data.lora_requests else None
)
outputs = llm.generate(
inputs,
sampling_params=sampling_params,
lora_request=lora_request,
)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
if __name__ == "__main__":
args = parse_args()
main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Demonstration script for Automatic Prefix Caching (APC) in vLLM.
Automatic Prefix Caching (APC) allows the vLLM engine to reuse cached
KV (key-value) pairs from previous prompts if a new query shares the same
prefix. This reduces redundant computation and improves inference speed.
To enable APC, set `enable_prefix_caching=True` when initializing the
vLLM engine.
This script uses a long Markdown table as the shared prompt prefix and
compares the generation time for two queries that share the same prefix
but ask different questions.
Run:
python examples/offline_inference/automatic_prefix_caching.py
"""
import time
from vllm import LLM, SamplingParams
# ruff: noqa: E501
# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = (
"You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n"
+ """
| ID | Name | Age | Occupation | Country | Email | Phone Number | Address |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1 | John Doe | 29 | Engineer | USA | john.doe@example.com | 555-1234 | 123 Elm St, Springfield, IL |
| 2 | Jane Smith | 34 | Doctor | Canada | jane.smith@example.com | 555-5678 | 456 Oak St, Toronto, ON |
| 3 | Alice Johnson | 27 | Teacher | UK | alice.j@example.com | 555-8765 | 789 Pine St, London, UK |
| 4 | Bob Brown | 45 | Artist | Australia | bob.b@example.com | 555-4321 | 321 Maple St, Sydney, NSW |
| 5 | Carol White | 31 | Scientist | New Zealand | carol.w@example.com | 555-6789 | 654 Birch St, Wellington, NZ |
| 6 | Dave Green | 28 | Lawyer | Ireland | dave.g@example.com | 555-3456 | 987 Cedar St, Dublin, IE |
| 7 | Emma Black | 40 | Musician | USA | emma.b@example.com | 555-1111 | 246 Ash St, New York, NY |
| 8 | Frank Blue | 37 | Chef | Canada | frank.b@example.com | 555-2222 | 135 Spruce St, Vancouver, BC |
| 9 | Grace Yellow | 50 | Engineer | UK | grace.y@example.com | 555-3333 | 864 Fir St, Manchester, UK |
| 10 | Henry Violet | 32 | Artist | Australia | henry.v@example.com | 555-4444 | 753 Willow St, Melbourne, VIC|
| 11 | Irene Orange | 26 | Scientist | New Zealand | irene.o@example.com | 555-5555 | 912 Poplar St, Auckland, NZ |
| 12 | Jack Indigo | 38 | Teacher | Ireland | jack.i@example.com | 555-6666 | 159 Elm St, Cork, IE |
| 13 | Karen Red | 41 | Lawyer | USA | karen.r@example.com | 555-7777 | 357 Cedar St, Boston, MA |
| 14 | Leo Brown | 30 | Chef | Canada | leo.b@example.com | 555-8888 | 246 Oak St, Calgary, AB |
| 15 | Mia Green | 33 | Musician | UK | mia.g@example.com | 555-9999 | 975 Pine St, Edinburgh, UK |
| 16 | Noah Yellow | 29 | Doctor | Australia | noah.y@example.com | 555-0000 | 864 Birch St, Brisbane, QLD |
| 17 | Olivia Blue | 35 | Engineer | New Zealand | olivia.b@example.com | 555-1212 | 753 Maple St, Hamilton, NZ |
| 18 | Peter Black | 42 | Artist | Ireland | peter.b@example.com | 555-3434 | 912 Fir St, Limerick, IE |
| 19 | Quinn White | 28 | Scientist | USA | quinn.w@example.com | 555-5656 | 159 Willow St, Seattle, WA |
| 20 | Rachel Red | 31 | Teacher | Canada | rachel.r@example.com | 555-7878 | 357 Poplar St, Ottawa, ON |
| 21 | Steve Green | 44 | Lawyer | UK | steve.g@example.com | 555-9090 | 753 Elm St, Birmingham, UK |
| 22 | Tina Blue | 36 | Musician | Australia | tina.b@example.com | 555-1213 | 864 Cedar St, Perth, WA |
| 23 | Umar Black | 39 | Chef | New Zealand | umar.b@example.com | 555-3435 | 975 Spruce St, Christchurch, NZ|
| 24 | Victor Yellow | 43 | Engineer | Ireland | victor.y@example.com | 555-5657 | 246 Willow St, Galway, IE |
| 25 | Wendy Orange | 27 | Artist | USA | wendy.o@example.com | 555-7879 | 135 Elm St, Denver, CO |
| 26 | Xavier Green | 34 | Scientist | Canada | xavier.g@example.com | 555-9091 | 357 Oak St, Montreal, QC |
| 27 | Yara Red | 41 | Teacher | UK | yara.r@example.com | 555-1214 | 975 Pine St, Leeds, UK |
| 28 | Zack Blue | 30 | Lawyer | Australia | zack.b@example.com | 555-3436 | 135 Birch St, Adelaide, SA |
| 29 | Amy White | 33 | Musician | New Zealand | amy.w@example.com | 555-5658 | 159 Maple St, Wellington, NZ |
| 30 | Ben Black | 38 | Chef | Ireland | ben.b@example.com | 555-7870 | 246 Fir St, Waterford, IE |
"""
)
def get_generation_time(llm, sampling_params, prompts):
# time the generation
start_time = time.time()
output = llm.generate(prompts, sampling_params=sampling_params)
end_time = time.time()
# print the output and generation time
print("-" * 30)
print(f"Output: {output[0].outputs[0].text}")
print(f"Generation time: {end_time - start_time} seconds.")
print("-" * 30)
def main():
# set enable_prefix_caching=True to enable APC
llm = LLM(model="lmsys/longchat-13b-16k", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0, max_tokens=100)
# Querying the age of John Doe
get_generation_time(
llm,
sampling_params,
LONG_PROMPT
+ "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
)
# Querying the age of Zack Blue
# This query will be faster since vllm avoids computing the KV cache of LONG_PROMPT again.
get_generation_time(
llm,
sampling_params,
LONG_PROMPT
+ "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
)
if __name__ == "__main__":
main()
# Basic
The `LLM` class provides the primary Python interface for doing offline inference, which is interacting with a model without using a separate model inference server.
## Usage
The first script in this example shows the most basic usage of vLLM. If you are new to Python and vLLM, you should start here.
```bash
python examples/offline_inference/basic/basic.py
```
The rest of the scripts include an [argument parser](https://docs.python.org/3/library/argparse.html), which you can use to pass any arguments that are compatible with [`LLM`](https://docs.vllm.ai/en/latest/api/offline_inference/llm.html). Try running the script with `--help` for a list of all available arguments.
```bash
python examples/offline_inference/basic/classify.py
```
```bash
python examples/offline_inference/basic/embed.py
```
```bash
python examples/offline_inference/basic/score.py
```
The chat and generate scripts also accept the [sampling parameters](https://docs.vllm.ai/en/latest/api/inference_params.html#sampling-parameters): `max_tokens`, `temperature`, `top_p` and `top_k`.
```bash
python examples/offline_inference/basic/chat.py
```
```bash
python examples/offline_inference/basic/generate.py
```
## Features
In the scripts that support passing arguments, you can experiment with the following features.
### Default generation config
The `--generation-config` argument specifies where the generation config will be loaded from when calling `LLM.get_default_sampling_params()`. If set to ‘auto’, the generation config will be loaded from model path. If set to a folder path, the generation config will be loaded from the specified folder path. If it is not provided, vLLM defaults will be used.
> If max_new_tokens is specified in generation config, then it sets a server-wide limit on the number of output tokens for all requests.
Try it yourself with the following argument:
```bash
--generation-config auto
```
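A minimal sketch of the same behavior from Python, assuming a model that ships a generation_config.json with sampling defaults (`generation_config` is the engine argument behind the CLI flag):
```python
from vllm import LLM

# "auto" loads sampling defaults from the model's own generation_config.json.
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", generation_config="auto")

# SamplingParams pre-filled with the model's defaults; vLLM defaults fill the rest.
params = llm.get_default_sampling_params()
print(params)
```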
### Quantization
#### AQLM
vLLM supports models that are quantized using AQLM.
Try one yourself by passing one of the following models to the `--model` argument:
- `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf`
- `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-2x8-hf`
- `ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf`
- `ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf`
- `BlackSamorez/TinyLlama-1_1B-Chat-v1_0-AQLM-2Bit-1x16-hf`
> Some of these models are likely to be too large for a single GPU. You can split them across multiple GPUs by setting `--tensor-parallel-size` to the number of required GPUs.
#### GGUF
vLLM supports models that are quantized using GGUF.
Try one yourself by downloading a quantized GGUF model and using the following arguments:
```python
from huggingface_hub import hf_hub_download
repo_id = "bartowski/Phi-3-medium-4k-instruct-GGUF"
filename = "Phi-3-medium-4k-instruct-IQ2_M.gguf"
print(hf_hub_download(repo_id, filename=filename))
```
```bash
--model {local-path-printed-above} --tokenizer microsoft/Phi-3-medium-4k-instruct
```
### CPU offload
The `--cpu-offload-gb` argument can be seen as a virtual way to increase GPU memory. For example, if you have one 24 GB GPU and set this to 10, you can effectively treat it as a 34 GB GPU and load a 13B model with BF16 weights, which requires at least 26 GB of GPU memory (13B parameters × 2 bytes per BF16 weight ≈ 26 GB, for the weights alone). Note that this requires a fast CPU-GPU interconnect, as part of the model is streamed from CPU memory to GPU memory on the fly during each forward pass.
Try it yourself with the following arguments:
```bash
--model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
```
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
"Hello, my name is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)
def main():
# Create an LLM.
    llm = LLM(
        model="/mnt/data/llm-models/qwen3/Qwen3-8B",
        tensor_parallel_size=1,
        dtype="float16",
        trust_remote_code=True,
        enforce_eager=True,
        block_size=64,
        enable_prefix_caching=False,
    )
# Generate texts from the prompts.
# The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
print("\nGenerated Outputs:\n" + "-" * 60)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Output: {generated_text!r}")
print("-" * 60)
if __name__ == "__main__":
main()
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def create_parser():
parser = FlexibleArgumentParser()
# Add engine args
EngineArgs.add_cli_args(parser)
parser.set_defaults(model="meta-llama/Llama-3.2-1B-Instruct")
# Add sampling params
sampling_group = parser.add_argument_group("Sampling parameters")
sampling_group.add_argument("--max-tokens", type=int)
sampling_group.add_argument("--temperature", type=float)
sampling_group.add_argument("--top-p", type=float)
sampling_group.add_argument("--top-k", type=int)
# Add example params
parser.add_argument("--chat-template-path", type=str)
return parser
def main(args: dict):
# Pop arguments not used by LLM
max_tokens = args.pop("max_tokens")
temperature = args.pop("temperature")
top_p = args.pop("top_p")
top_k = args.pop("top_k")
chat_template_path = args.pop("chat_template_path")
# Create an LLM
llm = LLM(**args)
# Create sampling params object
sampling_params = llm.get_default_sampling_params()
if max_tokens is not None:
sampling_params.max_tokens = max_tokens
if temperature is not None:
sampling_params.temperature = temperature
if top_p is not None:
sampling_params.top_p = top_p
if top_k is not None:
sampling_params.top_k = top_k
def print_outputs(outputs):
print("\nGenerated Outputs:\n" + "-" * 80)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}\n")
print(f"Generated text: {generated_text!r}")
print("-" * 80)
print("=" * 80)
# In this script, we demonstrate how to pass input to the chat method:
conversation = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I assist you today?"},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
outputs = llm.chat(conversation, sampling_params, use_tqdm=False)
print_outputs(outputs)
# You can run batch inference with llm.chat API
conversations = [conversation for _ in range(10)]
    # Turn on the tqdm progress bar to verify that batch inference is indeed running
outputs = llm.chat(conversations, sampling_params, use_tqdm=True)
print_outputs(outputs)
# A chat template can be optionally supplied.
# If not, the model will use its default chat template.
if chat_template_path is not None:
with open(chat_template_path) as f:
chat_template = f.read()
outputs = llm.chat(
conversations,
sampling_params,
use_tqdm=False,
chat_template=chat_template,
)
if __name__ == "__main__":
parser = create_parser()
args: dict = vars(parser.parse_args())
main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from argparse import Namespace
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def parse_args():
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(
model="jason9693/Qwen2.5-1.5B-apeach", task="classify", enforce_eager=True
)
return parser.parse_args()
def main(args: Namespace):
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create an LLM.
# You should pass task="classify" for classification models
model = LLM(**vars(args))
# Generate logits. The output is a list of ClassificationRequestOutputs.
outputs = model.classify(prompts)
# Print the outputs.
print("\nGenerated Outputs:\n" + "-" * 60)
for prompt, output in zip(prompts, outputs):
probs = output.outputs.probs
probs_trimmed = (str(probs[:16])[:-1] + ", ...]") if len(probs) > 16 else probs
print(
f"Prompt: {prompt!r} \n"
f"Class Probabilities: {probs_trimmed} (size={len(probs)})"
)
print("-" * 60)
if __name__ == "__main__":
args = parse_args()
main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from argparse import Namespace
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def parse_args():
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(
model="intfloat/e5-mistral-7b-instruct",
task="embed",
enforce_eager=True,
max_model_len=1024,
)
return parser.parse_args()
def main(args: Namespace):
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create an LLM.
# You should pass task="embed" for embedding models
model = LLM(**vars(args))
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.embed(prompts)
# Print the outputs.
print("\nGenerated Outputs:\n" + "-" * 60)
for prompt, output in zip(prompts, outputs):
embeds = output.outputs.embedding
embeds_trimmed = (
(str(embeds[:16])[:-1] + ", ...]") if len(embeds) > 16 else embeds
)
print(f"Prompt: {prompt!r} \nEmbeddings: {embeds_trimmed} (size={len(embeds)})")
print("-" * 60)
if __name__ == "__main__":
args = parse_args()
main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def create_parser():
parser = FlexibleArgumentParser()
# Add engine args
EngineArgs.add_cli_args(parser)
parser.set_defaults(model="meta-llama/Llama-3.2-1B-Instruct")
# Add sampling params
sampling_group = parser.add_argument_group("Sampling parameters")
sampling_group.add_argument("--max-tokens", type=int)
sampling_group.add_argument("--temperature", type=float)
sampling_group.add_argument("--top-p", type=float)
sampling_group.add_argument("--top-k", type=int)
return parser
def main(args: dict):
# Pop arguments not used by LLM
max_tokens = args.pop("max_tokens")
temperature = args.pop("temperature")
top_p = args.pop("top_p")
top_k = args.pop("top_k")
# Create an LLM
llm = LLM(**args)
# Create a sampling params object
sampling_params = llm.get_default_sampling_params()
if max_tokens is not None:
sampling_params.max_tokens = max_tokens
if temperature is not None:
sampling_params.temperature = temperature
if top_p is not None:
sampling_params.top_p = top_p
if top_k is not None:
sampling_params.top_k = top_k
# Generate texts from the prompts. The output is a list of RequestOutput
# objects that contain the prompt, generated text, and other information.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
print("-" * 50)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
print("-" * 50)
if __name__ == "__main__":
parser = create_parser()
args: dict = vars(parser.parse_args())
main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from argparse import Namespace
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def parse_args():
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(
model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
)
return parser.parse_args()
def main(args: Namespace):
# Sample prompts.
text_1 = "What is the capital of France?"
texts_2 = [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
]
# Create an LLM.
# You should pass task="score" for cross-encoder models
model = LLM(**vars(args))
# Generate scores. The output is a list of ScoringRequestOutputs.
outputs = model.score(text_1, texts_2)
# Print the outputs.
print("\nGenerated Outputs:\n" + "-" * 60)
for text_2, output in zip(texts_2, outputs):
score = output.outputs.score
print(f"Pair: {[text_1, text_2]!r} \nScore: {score}")
print("-" * 60)
if __name__ == "__main__":
args = parse_args()
main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
This example shows how to use Ray Data for data parallel batch inference.
Ray Data is a data processing framework that can handle large datasets
and integrates tightly with vLLM for data-parallel inference.
As of Ray 2.44, Ray Data has a native integration with
vLLM (under ray.data.llm).
Ray Data provides functionality for:
* Reading and writing to cloud storage (S3, GCS, etc.)
* Automatic sharding and load-balancing across a cluster
* Optimized configuration of vLLM using continuous batching
* Compatibility with tensor/pipeline parallel inference
Learn more about Ray Data's LLM integration:
https://docs.ray.io/en/latest/data/working-with-llms.html
"""
import ray
from packaging.version import Version
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig
assert Version(ray.__version__) >= Version("2.44.1"), (
"Ray version must be at least 2.44.1"
)
# Uncomment to reduce clutter in stdout
# ray.init(log_to_driver=False)
# ray.data.DataContext.get_current().enable_progress_bars = False
# Read one text file from S3. Ray Data supports reading multiple files
# from cloud storage (such as JSONL, Parquet, CSV, binary format).
ds = ray.data.read_text("s3://anonymous@air-example-data/prompts.txt")
print(ds.schema())
size = ds.count()
print(f"Size of dataset: {size} prompts")
# Configure vLLM engine.
config = vLLMEngineProcessorConfig(
model_source="unsloth/Llama-3.1-8B-Instruct",
engine_kwargs={
"enable_chunked_prefill": True,
"max_num_batched_tokens": 4096,
"max_model_len": 16384,
},
concurrency=1, # set the number of parallel vLLM replicas
batch_size=64,
)
# Create a Processor object, which will be used to
# do batch inference on the dataset
vllm_processor = build_llm_processor(
config,
preprocess=lambda row: dict(
messages=[
{"role": "system", "content": "You are a bot that responds with haikus."},
{"role": "user", "content": row["text"]},
],
sampling_params=dict(
temperature=0.3,
max_tokens=250,
),
),
postprocess=lambda row: dict(
answer=row["generated_text"],
**row, # This will return all the original columns in the dataset.
),
)
ds = vllm_processor(ds)
# Peek first 10 results.
# NOTE: This is for local testing and debugging. For production use cases,
# write the full results out as shown below.
outputs = ds.take(limit=10)
for output in outputs:
prompt = output["prompt"]
generated_text = output["generated_text"]
print(f"Prompt: {prompt!r}")
print(f"Generated text: {generated_text!r}")
# Write inference output data out as Parquet files to S3.
# Multiple files would be written to the output destination,
# and each task would write one or more files separately.
#
# ds.write_parquet("s3://<your-output-bucket>")