"docs/zh/vscode:/vscode.git/clone" did not exist on "fbc8d21d6af30e554b548cb292904313c0f01d4d"
README.md 2.62 KB
Newer Older
Zhilin Yang's avatar
Zhilin Yang committed
1
## Introduction

This directory contains our PyTorch implementation of Transformer-XL. Note that the state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster, and this PyTorch codebase currently does not support distributed training. Here we provide two sets of hyperparameters and scripts:
- `*large.sh` are for the SoTA setting with large models which might not be directly runnable on a local GPU machine.
- `*base.sh` are for the base models which can be run on a few GPUs.

In our preliminary experiments, the PyTorch implementation produces results similar to those of the TensorFlow codebase under the same settings.


## Prerequisite

- PyTorch 0.4: `conda install pytorch torchvision -c pytorch`
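
A quick way to confirm the environment before training (a minimal check; nothing here is specific to this codebase):

```bash
# Verify that PyTorch is importable and check its version.
# This codebase targets PyTorch 0.4, so expect a 0.4.x version string.
python -c "import torch; print(torch.__version__)"

# Training requires GPUs, so confirm CUDA is visible to PyTorch.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```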


## Data Preparation

`bash getdata.sh`
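
If the downloads succeed, the corpora should be unpacked under `./data`. A quick sanity check (the directory names below are our assumption from reading `getdata.sh`; adjust if the layout differs in your checkout):

```bash
# List the prepared datasets; expect subdirectories roughly like
# enwik8/  text8/  wikitext-103/  one-billion-words/
ls data/
```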

## Training and Evaluation

#### Replicate the "bpc = 1.06" result on `enwik8` with a 12-layer Transformer-XL

- Make sure the machine has **4 GPUs**, each with **at least 11GB of memory**

- Training

  `bash run_enwik8_base.sh train --work_dir PATH_TO_WORK_DIR`

- Evaluation

  `bash run_enwik8_base.sh eval --work_dir PATH_TO_WORK_DIR`
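
Putting the two steps together (the work directory below is an arbitrary example path; evaluation assumes it points at the same directory the training run wrote its checkpoint to):

```bash
# Train the 12-layer base model on enwik8, writing logs and
# checkpoints to an example work directory.
bash run_enwik8_base.sh train --work_dir ./enwik8-base-run

# Evaluate using the checkpoint saved under that same directory.
bash run_enwik8_base.sh eval --work_dir ./enwik8-base-run
```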



#### Replicate the "PPL = 24.03" result on `wikitext-103` with Transformer-XL

- Make sure the machine has **4 GPUs**, each with **at least 11GB of memory**

- Training

  `bash run_wt103_base.sh train --work_dir PATH_TO_WORK_DIR`

- Evaluation

  `bash run_wt103_base.sh eval --work_dir PATH_TO_WORK_DIR`



#### Other options:

- `--batch_chunk`: this option allows one to trade speed for memory. For `batch_chunk > 1`, the program splits each training batch into `batch_chunk` sub-batches and runs the forward and backward passes on each sub-batch sequentially, accumulating the gradients and dividing them by `batch_chunk`. Hence, memory usage drops roughly in proportion to `batch_chunk`, while computation time grows correspondingly (see the usage sketch after this list).
- `--div_val`: when using adaptive softmax and adaptive embedding, the embedding dimension is divided by `div_val` from bin $i$ to bin $i+1$. This reduces both GPU memory usage and the parameter count.
- `--fp16` and `--dynamic-loss-scale`: run in pseudo-fp16 mode (fp16 storage, fp32 math) with dynamic loss scaling.
  - Note: to use the `--fp16` option, please make sure the `apex` package (https://github.com/NVIDIA/apex/) is installed.
- To see performance without the recurrence mechanism, simply use `mem_len=0` in all your scripts.
- To see performance of a standard Transformer without relative positional encodings or recurrence mechanisms, use `attn_type=2` and `mem_len=0`.
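
A usage sketch for these options, assuming the run scripts forward any extra command-line flags to `train.py` in the usual `--name value` spelling (both the `batch_chunk` value and the work directory names below are illustrative, not recommendations):

```bash
# Trade speed for memory: split each batch into 2 sub-batches, and
# run in pseudo-fp16 with dynamic loss scaling (requires apex).
bash run_wt103_base.sh train \
    --work_dir ./wt103-run \
    --batch_chunk 2 \
    --fp16 --dynamic-loss-scale

# Ablation: a standard Transformer baseline, i.e. no recurrence
# (mem_len=0) and no relative positional encodings (attn_type=2).
bash run_wt103_base.sh train \
    --work_dir ./wt103-ablation \
    --attn_type 2 --mem_len 0
```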


#### Other datasets:

- `Text8` character-level language modeling: check out `run_text8_base.sh`
- `lm1b` word-level language modeling: check out `run_lm1b_base.sh`
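
Assuming these scripts expose the same `train`/`eval` interface as the `enwik8` and `wikitext-103` scripts above, the workflow looks like this (work directory names are illustrative):

```bash
# Character-level language modeling on Text8.
bash run_text8_base.sh train --work_dir ./text8-run
bash run_text8_base.sh eval --work_dir ./text8-run

# Word-level language modeling on One Billion Word (lm1b).
bash run_lm1b_base.sh train --work_dir ./lm1b-run
bash run_lm1b_base.sh eval --work_dir ./lm1b-run
```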