## Introduction

This directory contains our TF implementation of Transformer-XL. Note that our state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster, and our GPU codebase currently does not support distributed training. Here we provide two sets of hyperparameters and scripts:

- `*large_tpu.sh` are for the SoTA setting on TPUs. These are exactly the commands we used to obtain our best results.
- `*base_gpu.sh` are for the base models which can be run on a few GPUs.

## Prerequisites

- Python 2.7
- TensorFlow [1.12.0](https://github.com/tensorflow/tensorflow/releases/tag/v1.12.0)

## Obtain and evaluate pretrained SoTA models

#### 1. Download preprocessed data (vocab) & pretrained models

(a) Set your own `DATA_ROOT` in `sota/download.sh` (defaults to `./`), which will be the root directory of the downloaded models and data.

(b) Then, download the models & data with `bash sota/download.sh`.

After downloading, the expected directory structure is as follows:

```markdown
pretrained_xl
  tf_enwik8/
    data/
      cache.pkl
      corpus-info.json
    model/
      checkpoint
      model.ckpt*
  tf_wt103/
    ...
  ...
```

**Note**: We include the preprocessed data in the download files to make sure the **same vocabulary** is used. Please see `tf/data_utils.py` to understand the data structure.

#### 2. Run evaluation scripts to replicate SoTA results on GPUs

- **enwik8**: modify the script `sota/enwik8.sh` accordingly (see below)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 2 GPUs => about 60 mins
  - run the script: `bash sota/enwik8.sh`
- **lm1b**: modify the script `sota/lm1b.sh` accordingly (see below)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 1 GPU => less than 5 mins
  - run the script: `bash sota/lm1b.sh`
- **wt103**: modify the script `sota/wt103.sh` accordingly (see below)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 1 GPU => less than 5 mins
  - run the script: `bash sota/wt103.sh`
- **text8**: modify the script `sota/text8.sh` accordingly (see below)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 2 GPUs => about 60 mins
  - run the script: `bash sota/text8.sh`

## Train "Transformer-XL" from scratch with GPUs or TPUs

### 1. Download raw data

`bash getdata.sh`

### 2. Preprocessing, training, and evaluation

For `dataset` in `[enwik8, lm1b, wt103, text8]`:

- check out `scripts/dataset_base_gpu.sh` for GPU training and evaluation
- check out `scripts/dataset_large_tpu.sh` for TPU training and evaluation

#### (1) Preprocess raw data and create tfrecords

**NOTE**: The preprocessing for GPU and TPU is different, so you have to run it separately for each.

GPU:

- create training and validation data: `bash scripts/dataset_base_gpu.sh train_data`
- create test data: `bash scripts/dataset_base_gpu.sh test_data`

TPU:

- Set the Google Cloud Storage URLs in `scripts/dataset_large_tpu.sh` (an example is sketched after this list):
  - `GSDATA`: data URL
  - `GSEXP`: experiment URL
- create training and validation data: `bash scripts/dataset_large_tpu.sh train_data`
- create test data: `bash scripts/dataset_large_tpu.sh test_data`
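For illustration only, the two URLs might look like the following; the bucket and directory names are hypothetical placeholders, not values shipped with the scripts.

```bash
# Hypothetical GCS locations -- substitute a bucket and directories you own.
GSDATA=gs://your-bucket/transformer-xl/data  # where the tfrecords are stored
GSEXP=gs://your-bucket/transformer-xl/exp    # where checkpoints and logs are stored
```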
#### (2) Run training

Base models on GPUs:

- Modify the configurations in `scripts/dataset_base_gpu.sh` according to your needs.
- `bash scripts/dataset_base_gpu.sh train`
- If enough resources are available, increase the model sizes (e.g., `N_LAYER`, `D_MODEL`, `D_EMBED`, `D_HEAD`, `D_INNER`) so that they are closer to the values defined in `scripts/dataset_large_tpu.sh`. Likewise, when resources are limited, decrease the model sizes. It is recommended to ensure that `D_MODEL == D_EMBED` and `D_MODEL == N_HEAD x D_HEAD` (for example, with `N_HEAD = 8` and `D_HEAD = 64`, set `D_MODEL = D_EMBED = 512`). When the model sizes increase, remember to increase `warmup_steps` accordingly to alleviate optimization difficulties.

Larger models on TPUs:

- Modify the configurations in `scripts/dataset_large_tpu.sh` according to your needs.
- `bash scripts/dataset_large_tpu.sh train`

#### (3) Run evaluation

Base models on GPUs:

- `bash scripts/dataset_base_gpu.sh eval --eval_ckpt_path PATH_TO_CKPT`

Larger models on TPUs:

- `bash scripts/dataset_large_tpu.sh eval --eval_ckpt_path PATH_TO_CKPT`
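As a concrete but hypothetical example, evaluating an `enwik8` base model trained on GPUs might look like the following; the checkpoint path is a placeholder and depends on where your training run saved its checkpoints.

```bash
# Hypothetical example: evaluate an enwik8 base model trained on GPUs.
# Replace the checkpoint path with one produced by your own training run.
bash scripts/enwik8_base_gpu.sh eval \
    --eval_ckpt_path /path/to/your/experiment/model.ckpt-400000
```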