We used 32, 32, 64, and 512 TPU cores to train our best models on enwik8, text8, wt103, and lm1b, respectively. Training each model takes 2 to 5 days.
## Train "Transformer-XL" from scratch with GPUs or TPUs
## Train "Transformer-XL" from scratch with GPUs or TPUs