download_process_data_seedtts.md

# Benchmark Dataset Preparation Guide

This guide describes how to download and prepare the SeedTTS test dataset for benchmarking Qwen-Omni models.

## Prerequisites

- Python 3.8+
- `gdown` for downloading from Google Drive
- Access to the benchmark scripts

## Steps

### 1. Navigate to the Dataset Directory

```bash
cd benchmarks/build_dataset
```

### 2. Install Dependencies

```bash
pip install gdown
```

### 3. Download the SeedTTS Test Dataset

Download the dataset from Google Drive:

```bash
gdown --id 1GlSjVfSHkW3-leKKBlfrjuuTGqQ_xaLP
```

### 4. Extract the Dataset

```bash
tar -xf seedtts_testset.tar
```

### 5. Prepare the Metadata File

Copy the English metadata file to the working directory:

```bash
cp seedtts_testset/en/meta.lst meta.lst
```

### 6. Extract Prompts

Extract the first N prompts from the metadata file:

```bash
# Extract top 100 prompts (adjust -n for different amounts)
python extract_tts_prompts.py -i meta.lst -o top100.txt -n 100
```

**Options:**
- `-i, --input`: Input metadata file (default: `meta.lst`)
- `-o, --output`: Output prompts file (default: `prompts.txt`)
- `-n, --num_lines`: Number of prompts to extract (required)

### 7. Clean Up (Optional)

Remove temporary files to save disk space:

```bash
rm -rf seedtts_testset
rm seedtts_testset.tar
rm meta.lst
```

## Quick Start (All-in-One)

```bash
# Full setup and benchmark
cd benchmarks/build_dataset
pip install gdown
gdown --id 1GlSjVfSHkW3-leKKBlfrjuuTGqQ_xaLP
tar -xf seedtts_testset.tar
cp seedtts_testset/en/meta.lst meta.lst
python extract_tts_prompts.py -i meta.lst -o top100.txt -n 100
rm -rf seedtts_testset seedtts_testset.tar meta.lst
```