Unverified Commit 25482edb authored by Yueyang Pan, committed by GitHub

Online serving benchmarks of real datasets for hierarchical KV caching (#3211)


Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
parent 62b362b1
@@ -22,4 +22,70 @@ python bench_multiturn.py --model-path Qwen/Qwen2.5-14B-Instruct
Note: The performance gain of hierarchical caching depends on the ratio of reusable tokens to GPU memory capacity. The more tokens to be reused, the larger the model, and the more constrained the GPU memory size, the greater the benefit one can expect from hierarchical caching.
# Benchmark with more datasets
## Download Dataset
```bash
./download.sh {sharegpt|ultragpt|loogle|nextqa|all}
```
This script automatically downloads the selected dataset(s) to the current working directory.
## Multiturn Benchmark
### Supported Datasets
- sharegpt
- ultrachat
- loogle
### Example Usage:
```bash
python3 bench_serving.py --model mistralai/Mistral-7B-Instruct-v0.3 --backend sglang \
--dataset-path longdep_qa.json --dataset-name loogle --request-rate 10 --num-prompts 10 \
--port 8001 --enable-multiturn --disable-shuffle
```
This runs the `mistralai/Mistral-7B-Instruct-v0.3` model with `sglang` as the backend on the
`longdep_qa.json` dataset. We send 10 conversations at 10 req/s to port 8001, with multiturn
chat enabled and shuffling disabled (i.e., conversations follow their original order in the
dataset file).
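To sweep several request rates with the same settings, the command can be driven from a short script; this is only a sketch reusing the flags shown above:
```python
import subprocess

# Re-run the benchmark above at a few (hypothetical) request rates.
for rate in (1, 5, 10):
    subprocess.run(
        [
            "python3", "bench_serving.py",
            "--model", "mistralai/Mistral-7B-Instruct-v0.3",
            "--backend", "sglang",
            "--dataset-path", "longdep_qa.json",
            "--dataset-name", "loogle",
            "--request-rate", str(rate),
            "--num-prompts", "10",
            "--port", "8001",
            "--enable-multiturn",
            "--disable-shuffle",
        ],
        check=True,
    )
```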
### Note:
Requests from multiple conversations are sent in a round-robin fashion.
For example, if we have 3 conversations A, B, C with `[2, 3, 4]` rounds respectively,
the multiturn benchmark sends requests to the backend in the order `[A1, B1, C1, A2, B2, C2, B3, C3, C4]`.
This has implications for cache reuse: the cache reuse distance is maximized under this
request pattern, so a prefix-aware local scheduler in the backend yields the most benefit
compared to a FIFO scheduler.
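For illustration, a minimal sketch of the round-robin interleaving described above (conversation names and round counts are hypothetical):
```python
from itertools import zip_longest

# Hypothetical conversations A, B, C with 2, 3, and 4 rounds respectively.
conversations = {"A": 2, "B": 3, "C": 4}

# Build the per-conversation request lists, then interleave them round by round.
rounds = [[f"{name}{i + 1}" for i in range(n)] for name, n in conversations.items()]
order = [req for group in zip_longest(*rounds) for req in group if req is not None]

print(order)  # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'B3', 'C3', 'C4']
```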
## Shared Prefix Benchmark
### Supported Datasets
- loogle
### Example Usage:
```bash
python3 bench_serving.py --model mistralai/Mistral-7B-Instruct-v0.3 --backend sglang \
--dataset-path longdep_qa.json --dataset-name loogle --request-rate 10 --num-prompts 10 \
--port 8001 --enable-shared-prefix --disable-shuffle
```
### Note:
The shared prefix benchmark sends all questions for the same prefix together. For example,
if we have 3 shared prefixes A, B, C with `[2, 3, 4]` questions respectively,
the benchmark sends requests to the backend in the following order:
`[A+Q1, A+Q2, B+Q1, B+Q2, B+Q3, C+Q1, C+Q2, C+Q3, C+Q4]`.
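For illustration, a minimal sketch of this grouping (the prefix names and question counts are hypothetical):
```python
# Hypothetical shared prefixes and their question counts from the example above.
prefixes = {"A": 2, "B": 3, "C": 4}

# All questions of one prefix are sent back to back, so the prefix KV cache
# can be reused immediately by the next request.
order = [f"{prefix}+Q{i + 1}" for prefix, n in prefixes.items() for i in range(n)]

print(order)
# ['A+Q1', 'A+Q2', 'B+Q1', 'B+Q2', 'B+Q3', 'C+Q1', 'C+Q2', 'C+Q3', 'C+Q4']
```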
## Multi Modality Benchmark (WIP)
### Supported Datasets:
- nextqa
### Example Usage:
```bash
# Server:
python3 -m sglang.launch_server --model-path lmms-lab/LLaVA-NeXT-Video-7B --tp 2 --dp 1 --port 8001 \
    --host 0.0.0.0 --mem-fraction-static 0.9 --tokenizer-path llava-hf/llava-1.5-7b-hf \
    --json-model-override-args "{\"architectures\": [\"LlavaVidForCausalLM\"], \"model_type\":\"llava\", \"mm_spatial_pool_stride\":2}"

# Client:
python3 bench_serving.py --model lmms-lab/LLaVA-NeXT-Video-7B --backend sglang --dataset-path NExTVideo \
    --dataset-name nextqa --request-rate 10 --num-prompts 1 --disable-shuffle --port 8001 \
    --enable-multiturn --max-frames 16 --tokenizer llava-hf/llava-1.5-7b-hf --fixed-output-len 2048
```
Note: for the server args, `--tokenizer-path` and the architecture override via `--json-model-override-args` are required.
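If the override string is awkward to escape by hand, it can be generated programmatically; this is only a convenience sketch using the Python standard library, mirroring the values in the server command above:
```python
import json
import shlex

# Architecture override for LLaVA-NeXT-Video, mirroring the server command above.
override = {
    "architectures": ["LlavaVidForCausalLM"],
    "model_type": "llava",
    "mm_spatial_pool_stride": 2,
}

# Print a shell-safe flag that can be pasted into the launch command.
print("--json-model-override-args", shlex.quote(json.dumps(override)))
```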
## Supported Backends
- sglang (OpenAI-compatible API)
- vllm (OpenAI-compatible API)
- lmdeploy (OpenAI-compatible API)

All backends are exercised through their OpenAI-compatible endpoints; a minimal request sketch follows below.
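A hedged sketch (not part of the benchmark scripts) of querying whichever backend is running on the port used in the examples above, assuming the local server does not enforce authentication:
```python
from openai import OpenAI

# Local OpenAI-compatible endpoint; port 8001 matches the examples above.
# The api_key is a placeholder, assuming the local server does not enforce auth.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```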
#!/usr/bin/bash
# The usage function
usage() {
echo "Usage: $0 {sharegpt|ultragpt|loogle|nextqa|all}"
exit 1
}
# The download function
download() {
case "$1" in
sharegpt)
echo $1
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
;;
ultragpt)
echo $1
# Questions about the world
wget https://cloud.tsinghua.edu.cn/seafhttp/files/be1d7b87-22ca-449e-a6a7-c61d1ea7e010/ultrachat_release_230407.json
# Writing and Creation
wget https://cloud.tsinghua.edu.cn/seafhttp/files/61742d2a-25e2-4d08-b2b9-15f47ae50ace/ultrachat_material_release_230417.json
wget https://cloud.tsinghua.edu.cn/seafhttp/files/f71f6aa6-d346-4b16-85b7-8502efa3d608/ultrachat_material_release_230412.json
# External materials
wget https://cloud.tsinghua.edu.cn/seafhttp/files/42d22e28-e899-4975-a70f-5eda163e265d/ultrachat_existent_material_release_230420.json.gz
gunzip ultrachat_existent_material_release_230420.json.gz
;;
loogle)
echo $1
git lfs install
git clone https://huggingface.co/datasets/bigainlco/LooGLE
unzip LooGLE/data.zip
;;
nextqa)
echo $1
git lfs install
git clone https://huggingface.co/datasets/lmms-lab/NExTQA
unzip NExTQA/videos.zip
;;
*)
usage
exit 1
;;
esac
}
# Arg check
if [ "$#" -ne 1 ]; then
usage
fi
# Invoke
case "$1" in
sharegpt|ultragpt|loogle|nextqa)
download "$1"
;;
all)
download sharegpt
download ultragpt
download loogle
download nextqa
;;
*)
usage
;;
esac
import os
import sys
from typing import List
import av
from datasets import load_dataset
def find_video_files(video_dir) -> List[str]:
    # A single file path is returned as a one-element list.
    if os.path.isfile(video_dir):
        return [video_dir]
    video_files = []
    # os.walk already recurses into subdirectories, so no manual recursion is needed.
    for root, dirs, files in os.walk(video_dir):
        for file in files:
            if file.endswith((".mp4", ".avi", ".mov")):
                video_files.append(os.path.join(root, file))
    return video_files
def video_frames(video_path, max_frames) -> int:
    # Read the frame count from the container metadata and cap it at max_frames.
    container = av.open(video_path)
    total_frames = container.streams.video[0].frames
    container.close()
    return min(total_frames, max_frames)
class Video:
def __init__(self, video_path, num_frames):
self.path = video_path
self.num_frames = num_frames
def __str__(self):
return f"Video({self.path}, {self.num_frames})"
def __iter__(self):
return iter((self.path, self.num_frames))
class VideoPrompt(Video):
def __init__(self, video_path, num_frames, prompt):
super().__init__(video_path, num_frames)
self.prompt = prompt
def __str__(self):
return f"VideoPrompt({self.path}, {self.num_frames}, {self.prompt})"
def __iter__(self):
return iter((self.path, self.num_frames, self.prompt))
class VideoLoader:
pass
class VideoFileLoader(VideoLoader):
"""
Load all the videos in a directory
"""
def __init__(self, video_dir, batch_size=1, max_frames=sys.maxsize):
super().__init__()
self.video_dir = video_dir
self.video_files = find_video_files(video_dir)
self.batch_size = batch_size
self.max_frames = max_frames
print(f"batch_size: {batch_size}, max_frames: {max_frames}")
    def __iter__(self):  # yields Video(path, number of frames)
        if self.batch_size == 1:
            for video_file in self.video_files:
                yield Video(video_file, video_frames(video_file, self.max_frames))
        else:
            batch = []
            for video_file in self.video_files:
                video = Video(video_file, video_frames(video_file, self.max_frames))
                batch.append(video)
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
            # Yield the trailing partial batch instead of silently dropping it.
            if batch:
                yield batch
class NExTQALoader(VideoLoader):
"""
    Load videos and prompts from the NExTQA dataset
    dset: train, test or validation
"""
def __init__(
self, video_dir, batch_size=1, max_frames=sys.maxsize, dset="test", task="OE"
):
"""
        task: 'MC' (multiple choice) or 'OE' (open-ended)
"""
super().__init__()
self.task = task
print(f"Loading the {dset} data of {task} from lmms-lab/NExTQA")
self.ds = load_dataset("lmms-lab/NExTQA", task)
self.ds = self.ds[dset]
# self.n = ds.num_rows
self.video_dir = video_dir
self.video_files = find_video_files(video_dir)
self.video_to_path = dict()
for video_file in self.video_files:
video_id = video_file.split("/")[-1].split(".")[0]
self.video_to_path[video_id] = video_file
self.batch_size = batch_size
self.max_frames = max_frames
def get_video_prompt(self, entry, max_frames) -> VideoPrompt:
# Get video
video_id = entry["video"]
video_path = self.video_to_path[video_id]
assert os.path.exists(video_path), f"Video not found: {video_path}"
num_frames = min(entry["frame_count"], max_frames)
video = Video(video_path, num_frames)
prompt = entry["question"] + "?"
if self.task == "MC": # add choices
prompt += f' a0: {entry["a0"]}, a1: {entry["a1"]}, a2: {entry["a2"]}, a3: {entry["a3"]}'
return VideoPrompt(video_path, num_frames, prompt)
    def __iter__(self):
        if self.batch_size == 1:
            for entry in self.ds:
                yield self.get_video_prompt(entry, self.max_frames)
        else:
            batch = []
            for entry in self.ds:
                video = self.get_video_prompt(entry, self.max_frames)
                batch.append(video)
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
            # Yield the trailing partial batch instead of silently dropping it.
            if batch:
                yield batch
# main
if __name__ == "__main__":
video_dir = "./videos"
# video_loader = VideoFileLoader(video_dir, batch_size=16)
# for batch in video_loader:
# print(f"Number of videos in batch: {len(batch)}")
# for video_file, num_frames in batch:
# print(f"Video: {video_file} number of frames: {num_frames}")
video_loader = NExTQALoader(video_dir, batch_size=16, dset="test", task="OE")
for batch in video_loader:
print(f"Number of videos in batch: {len(batch)}")
for video_file, num_frames, prompt in batch:
print(
f"Video: {video_file} number of frames: {num_frames}, prompt: {prompt}"
)
# break
# for video_file, prompt in batch:
# print(f"Video: {video_file} prompt: {prompt}")
# break
@@ -24,10 +24,14 @@ import requests
from IPython.display import HTML, display
from tqdm import tqdm
from sglang.srt.openai_api.protocol import ChatCompletionMessageContentPart
from sglang.srt.utils import kill_process_tree
logger = logging.getLogger(__name__)
# type of content fields, can be only prompts or with images/videos
MsgContent = Union[str, List[ChatCompletionMessageContentPart]]
def get_exception_traceback():
    etype, value, tb = sys.exc_info()
...