
## Introduction

This directory contains our TensorFlow (TF) implementation of Transformer-XL. Note that the state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster, and our GPU codebase currently does not support distributed training. Here we provide two sets of hyperparameters and scripts:
- `*large_tpu.sh` are for the SoTA setting on TPUs. These are exactly the commands we used to obtain our best results.
- `*base_gpu.sh` are for the base models, which can be run on a few GPUs.


## Prerequisites

- Python 2.7
- TensorFlow [1.12.0](https://github.com/tensorflow/tensorflow/releases/tag/v1.12.0)



## Obtain and evaluate pretrained SoTA models

#### 1. Download preprocessed data (vocab) & pretrained models

(a) Set your own `DATA_ROOT` in `sota/download.sh` (defaults to `./`), which will be the root directory for the downloaded models.

(b) Then, download the models & data by running `bash sota/download.sh`. After downloading, the expected directory structure is as follows:

```
pretrained_xl/
  tf_enwik8/
    data/
      cache.pkl
      corpus-info.json
    model/
      checkpoint
      model.ckpt*
  tf_wt103/
    ...
  ...
```

**Note**: We include the preprocessed data in the download to make sure the **same vocabulary** is used. Please see `tf/data_utils.py` to understand the data structure.
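
A quick sanity check after the download, assuming the default `DATA_ROOT=./` (adjust the path if you changed it):

```bash
# Assumes the default DATA_ROOT=./ in sota/download.sh; adjust the path otherwise.
ls ./pretrained_xl/tf_enwik8/data ./pretrained_xl/tf_enwik8/model
# expected: cache.pkl, corpus-info.json, checkpoint, and model.ckpt* files, as in the tree above
```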



#### 2. Run evaluation scripts to replicate SoTA results on GPUs

- **enwik8**: modify the script `sota/enwik8.sh` as described below (a configuration sketch follows this list)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 2 GPUs, which takes about 60 minutes
  - run the script: `bash sota/enwik8.sh`

- **lm1b**: modify the script `sota/lm1b.sh` accordingly
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 1 GPU, which takes less than 5 minutes
  - run the script: `bash sota/lm1b.sh`

- **wt103**: modify the script `sota/wt103.sh` accordingly
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 1 GPU, which takes less than 5 minutes
  - run the script: `bash sota/wt103.sh`

- **text8**: modify the script `sota/text8.sh` accordingly
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 2 GPUs, which takes about 60 minutes
  - run the script: `bash sota/text8.sh`
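
As a minimal sketch of the edits (the variable names are the ones referenced above; the values are placeholders), `sota/enwik8.sh` would be changed along these lines before running it:

```bash
# Inside sota/enwik8.sh -- the values below are illustrative placeholders.
DATA_ROOT=./        # must match the DATA_ROOT used in sota/download.sh
TEST_NUM_CORE=2     # number of GPUs used for evaluation

# then run: bash sota/enwik8.sh
```

The same two edits apply to `sota/lm1b.sh`, `sota/wt103.sh`, and `sota/text8.sh`.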


#### 3. Resources Needed for SoTA Model Training

We used 32, 32, 64, and 512 TPU cores for training our best models on enwik8, text8, wt103, and lm1b respectively. The training time for each model ranges from 2 to 5 days.



## Train Transformer-XL from scratch with GPUs or TPUs

### 1. Download raw data

`bash getdata.sh`



### 2. Preprocessing, training, and evaluation

For `dataset` in `[enwik8, lm1b, wt103, text8]`:

- check out `scripts/dataset_base_gpu.sh` for GPU training and evaluation (a concrete `enwik8` example follows this list)
- check out `scripts/dataset_large_tpu.sh` for TPU training and evaluation
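
For example, taking `dataset=enwik8`, the full GPU workflow covered by the steps below boils down to the following calls (`PATH_TO_CKPT` is a placeholder for a trained checkpoint):

```bash
# GPU workflow for dataset=enwik8 (each step is detailed in the sections below).
bash scripts/enwik8_base_gpu.sh train_data   # create training and validation tfrecords
bash scripts/enwik8_base_gpu.sh test_data    # create test tfrecords
bash scripts/enwik8_base_gpu.sh train        # train the base model
bash scripts/enwik8_base_gpu.sh eval --eval_ckpt_path PATH_TO_CKPT   # evaluate a checkpoint
```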



#### (1) Preprocess raw data and create tfrecords

**NOTE**: The preprocessing steps for GPU and TPU are different, so you have to run them separately.

GPU:

- create training and validation data: `bash scripts/dataset_base_gpu.sh train_data`
- create test data: `bash scripts/dataset_base_gpu.sh test_data`

TPU:

- Set the Google Storage URLs in `scripts/dataset_large_tpu.sh` (an example follows this list):
  - `GSDATA`: Google Storage URL for the data
  - `GSEXP`: Google Storage URL for the experiments
- create training and validation data: `bash scripts/dataset_large_tpu.sh train_data`
- create test data: `bash scripts/dataset_large_tpu.sh test_data`
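
As a minimal sketch (the bucket and path names are placeholders, not real resources), the two URLs inside `scripts/dataset_large_tpu.sh` might look like:

```bash
# Inside scripts/dataset_large_tpu.sh -- bucket/path names are placeholders.
GSDATA=gs://your-bucket/transformer-xl/data   # where the tfrecords are written and read
GSEXP=gs://your-bucket/transformer-xl/exp     # where experiment outputs are stored
```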



#### (2) Run training

Base models on GPUs:

- Modify the configurations in `scripts/dataset_base_gpu.sh` according to your needs.
- `bash scripts/dataset_base_gpu.sh train`
- If enough resources are available, increase the model size (e.g., `N_LAYER`, `D_MODEL`, `D_EMBED`, `D_HEAD`, `D_INNER`) so that it is closer to the values defined in `scripts/dataset_large_tpu.sh`; likewise, decrease the model size when resources are limited. It is recommended to ensure that `D_MODEL == D_EMBED` and `D_MODEL == N_HEAD x D_HEAD` (see the sketch after this list). When the model size increases, remember to increase `warmup_steps` accordingly to alleviate optimization difficulties.
- Adjust the `NUM_CORE` parameter to reflect the number of GPUs to use.
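
For example, a hypothetical base configuration in `scripts/dataset_base_gpu.sh` that satisfies the constraints above (the numbers are illustrative, not the repository defaults) could look like:

```bash
# Hypothetical values -- not the shipped defaults; scale up or down to fit your hardware.
N_LAYER=12
D_MODEL=512
D_EMBED=512     # keep D_EMBED == D_MODEL
N_HEAD=8
D_HEAD=64       # keep N_HEAD x D_HEAD == 8 x 64 == 512 == D_MODEL
D_INNER=2048
NUM_CORE=4      # number of GPUs used for training
# When scaling the model up, also increase the warmup_steps flag accordingly.
```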

Larger models on TPUs:

- Modify the configurations in `scripts/dataset_large_tpu.sh` according to your needs.
- `bash scripts/dataset_large_tpu.sh train`



#### (3) Run evaluation

Base models on GPUs:

- `bash scripts/dataset_base_gpu.sh eval --eval_ckpt_path PATH_TO_CKPT` (a concrete example follows below)

Larger models on TPUs:

- `bash scripts/dataset_large_tpu.sh eval --eval_ckpt_path PATH_TO_CKPT`
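
For instance, evaluating an enwik8 base model trained on GPUs with an explicit checkpoint could look like the following (the checkpoint path and step number are placeholders; point them at an actual `model.ckpt*` prefix produced by training or at a downloaded pretrained model):

```bash
# The checkpoint prefix below is a placeholder -- substitute a real one.
bash scripts/enwik8_base_gpu.sh eval \
    --eval_ckpt_path /path/to/model/model.ckpt-400000
```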