Our dataset can be downloaded from [here](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2Fdata&mode=list). The data is organized in the following JSON format:

* `source`: The original source of the dataset. This field is ['internvid'](https://huggingface.co/datasets/OpenGVLab/InternVid) in stage 2, and ['anet'](http://activity-net.org/download.html) or ['didemo'](https://drive.google.com/drive/u/0/folders/1_oyJ5rQiZboipbMl6tkhY8v0s9zDkvJc) in stage 3.
* `id`: The ID of the video in the original dataset.
* `conversations`: The conversation data used for training. We use special tokens to indicate timestamps; these placeholders need to be replaced during training using information from the `meta` field.
* `meta`:
  * `split`: The start and end timestamps of the segment extracted from the original video and used as input. If this field does not exist, the entire video is used as input.
  * `duration`: The length of the input video.
  * `token`: The timestamps of the special tokens appearing in `conversations`.

Here is an example:

```json
{
    "source": "internvid",
    "id": "3n3oCNerzV0",
    "conversations": [
        {
            "from": "human",
            "value": "