# Benchmark Dataset Preparation Guide

This guide describes how to download and prepare the SeedTTS test dataset for benchmarking Qwen-Omni models.

## Prerequisites

- Python 3.8+
- `gdown` for downloading from Google Drive
- The benchmark scripts in `benchmarks/build_dataset`, including `extract_tts_prompts.py`

## Steps

### 1. Navigate to the Dataset Directory

```bash
cd benchmarks/build_dataset
```

### 2. Install Dependencies

```bash
pip install gdown
```

### 3. Download the SeedTTS Test Dataset

Download the dataset archive from Google Drive. The later steps expect it to be saved as `seedtts_testset.tar`:

```bash
# Pass the file ID directly; the --id flag is deprecated in recent gdown releases
gdown 1GlSjVfSHkW3-leKKBlfrjuuTGqQ_xaLP
```

### 4. Extract the Dataset

```bash
tar -xf seedtts_testset.tar
```
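To confirm that the archive contains the metadata file used in the next step, the member list can be inspected without unpacking anything. The following is a minimal sketch using Python's standard `tarfile` module; the helper name `find_meta_files` is hypothetical and not part of the benchmark scripts.

```python
import tarfile
from typing import List


def find_meta_files(archive_path: str, suffix: str = "meta.lst") -> List[str]:
    """Return the names of regular files in the archive that end with `suffix`.

    Hypothetical helper for illustration; it only reads the archive index
    and does not extract any files.
    """
    with tarfile.open(archive_path, "r:*") as tar:  # "r:*" auto-detects compression
        return [
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(suffix)
        ]
```

Calling `find_meta_files("seedtts_testset.tar")` after the download should list `seedtts_testset/en/meta.lst` if the archive has the layout this guide assumes.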

### 5. Prepare the Metadata File

Copy the English metadata file to the working directory:

```bash
cp seedtts_testset/en/meta.lst meta.lst
```

### 6. Extract Prompts

Extract the first N prompts from the metadata file:

```bash
# Extract the first 100 prompts (adjust -n to change the count)
python extract_tts_prompts.py -i meta.lst -o top100.txt -n 100
```

**Options:**
- `-i, --input`: Input metadata file (default: `meta.lst`)
- `-o, --output`: Output prompts file (default: `prompts.txt`)
- `-n, --num_lines`: Number of prompts to extract (required)
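
For readers who want to see what this kind of extraction involves, here is a rough sketch. It is not the source of `extract_tts_prompts.py`. It assumes each line of `meta.lst` is pipe-separated with the text to synthesize in the fourth field (for example `utt_id|prompt_text|prompt_wav|target_text`), which should be checked against the actual file. The function name `first_n_prompts` and its `field` and `sep` parameters are hypothetical.

```python
from typing import List


def first_n_prompts(meta_path: str, n: int, field: int = 3, sep: str = "|") -> List[str]:
    """Collect the text in column `field` from the first `n` usable lines.

    Assumption: lines are `sep`-separated and column 3 holds the text to
    synthesize. Blank lines and lines with too few columns are skipped.
    """
    if n <= 0:
        return []
    prompts: List[str] = []
    with open(meta_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(sep)
            if field >= len(parts):
                continue  # blank or malformed line
            prompts.append(parts[field].strip())
            if len(prompts) == n:
                break
    return prompts
```

Under these assumptions, `first_n_prompts("meta.lst", 100)` returns up to 100 strings, which could then be written out one per line.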

### 7. Clean Up (Optional)

Once the prompts file has been written, remove the intermediate files to free disk space:

```bash
rm -rf seedtts_testset
rm seedtts_testset.tar
rm meta.lst
```
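
Because cleanup deletes `meta.lst`, it can be worth checking the prompts file before running the commands above. The sketch below counts non-empty lines. It assumes the output file holds one prompt per line, which this guide does not state explicitly, and the helper name `count_nonempty_lines` is hypothetical.

```python
def count_nonempty_lines(path: str) -> int:
    """Count lines that contain at least one non-whitespace character.

    Assumption: the prompts file stores one prompt per line.
    """
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

For the example in step 6, `count_nonempty_lines("top100.txt")` would be expected to return 100 if the metadata file has at least that many entries.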

## Quick Start (All-in-One)

```bash
# Download the dataset, extract the first 100 prompts, then clean up
cd benchmarks/build_dataset
pip install gdown
gdown 1GlSjVfSHkW3-leKKBlfrjuuTGqQ_xaLP
tar -xf seedtts_testset.tar
cp seedtts_testset/en/meta.lst meta.lst
python extract_tts_prompts.py -i meta.lst -o top100.txt -n 100
rm -rf seedtts_testset seedtts_testset.tar meta.lst
```