Commit c0f05c10 authored by hepj

Update transformer code

parent c056df78
### 1. Benchmark download address
transformer: https://github.com/pytorch/fairseq.git
The latest code on GitHub differs from the version used here. Prefer the local, older copy of the code; otherwise the HIP build of torch may not be recognized.
#### 2.1 Dataset preparation
```
Environment preparation:
pip3 install fastBPE sacremoses subword_nmt
Dataset download:
cd examples/translation/
bash prepare-wmt14en2de.sh --icml17
Data preprocessing:
DATA_DIR=~/data/wmt14_en_de_joined_dict
TEXT=`pwd`/examples/translation/wmt14_en_de
fairseq-preprocess --source-lang en --target-lang de --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test --destdir $DATA_DIR --nwordssrc 32768 --nwordstgt 32768 --joined-dictionary --workers 20
Parameter notes:
--source-lang source language
--target-lang target language
--trainpref train file prefix (also used to build dictionaries)
--validpref comma-separated valid file prefixes (words missing from the train set are replaced with <unk>)
--testpref comma-separated test file prefixes (words missing from the train set are replaced with <unk>)
--destdir destination dir, default: "data-bin"
Dataset path:
DATA_PATH=~/data/wmt14_en_de_joined_dict
DATA_PATH=~/data/wmt14_en_de_joined_dict
```
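As an optional sanity check (the file names below assume the standard fairseq-preprocess output layout), the destination directory should now contain the joined dictionaries and the binarized splits:
```
# optional sanity check of the binarized dataset (assumed default fairseq-preprocess layout)
ls $DATA_DIR
# expected: dict.en.txt dict.de.txt train.en-de.{en,de}.{bin,idx} valid.* test.* preprocess.log
wc -l $DATA_DIR/dict.en.txt $DATA_DIR/dict.de.txt   # both dictionaries should hold roughly 32768 entries
```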
#### 2.2 Environment setup
##### 2.2.1 Create a virtual environment for testing
```
virtualenv -p python3 venv
source venv/bin/activate
```
##### 2.2.2 Install the dependencies for the Python 3.7 environment
```
pip3 install --upgrade pip
pip3 install typing
pip3 install sacremoses
pip3 install numpy
pip3 install torch-1.10.0a0+gitd8cde89.atomic.dtk22042-cp37-cp37m-linux_x86_64.whl
pip3 install apex-0.1_dtk22.04-cp37-cp37m-linux_x86_64.whl
pip3 install setuptools==59.5.0
pip3 install protobuf==3.20.0
```
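Before building fairseq, it may be worth confirming that the wheels installed above really are the HIP/DTK builds (a quick optional check; `torch.version.hip` is None on non-HIP builds):
```
# optional: confirm the HIP/DTK torch build and that apex imports
python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
python3 -c "import apex; print('apex ok')"
```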
##### 2.2.3 Install fairseq
```
#git clone https://github.com/pytorch/fairseq.git
#cd fairseq
#the transformer directory here is a copy of fairseq from GitHub, so it can be installed directly
pip3 install --editable ./
Note: you can first comment out the torch and torchaudio requirements in setup.py
```
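To confirm that the editable install is picked up from this working copy (an optional check, not part of the original steps):
```
# optional: verify the editable fairseq install
python3 -c "import fairseq; print(fairseq.__version__, fairseq.__file__)"
fairseq-train --help | head -n 5
```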
##### 2.2.4 Set environment variables with env.sh
```
WORK_PATH=`pwd`
source env.sh
```
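The contents of env.sh are not reproduced in this document. Purely as an illustration, a DTK/ROCm environment script usually exports paths along the following lines; the variable names and install prefix here are assumptions and must be adapted to the local installation:
```
# hypothetical env.sh sketch -- adjust to the local DTK/ROCm install; this is not the repository's actual env.sh
export ROCM_PATH=/opt/rocm                      # assumed install prefix
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$WORK_PATH:$PYTHONPATH        # make the local fairseq checkout importable
```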
### 3. transformer tests (Kunshan)
#### 3.1 Single-card test (single precision)
##### 3.1.1 run.sh
```
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
export HIP_VISIBLE_DEVICES=0
export TOKEN=2560
export DATA_PATH=~/data/wmt14_en_de_joined_dict
export HIP_LAUNCH_BLOCKING=1
export ROCBLAS_ATOMICS_MOD=1
python3 train.py \
$DATA_PATH \
--arch transformer_wmt_en_de \
--share-decoder-input-output-embed \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler inverse_sqrt \
--warmup-updates 4000 \
--dropout 0.3 \
--weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens ${TOKEN} \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu \
--maximize-best-checkpoint-metric \
--max-epoch 1
```
##### 3.1.2 Run
```
./run.sh
```
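While train.py is running, DCU utilization can be watched from a second shell (this assumes the DTK stack provides the standard rocm-smi tool):
```
# monitor DCU utilization and memory every 5 seconds while training runs
watch -n 5 rocm-smi
```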
#### 3.2 Four-card test (single precision)
##### 3.2.1 single_process.sh
```
#!/bin/bash
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export NCCL_SOCKET_IFNAME=eno1
export HSA_USERPTR_FOR_PAGED_MEM=0
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
TOKENS=2560
DATA_PATH=~/fairseq/examples/translation/data-bin/wmt14_en_de_joined_dict
APP="python3 ~/fairseq/train.py $DATA_PATH --arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas (0.9,0.98) --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens $TOKENS --eval-bleu --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10} --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --distributed-rank ${comm_rank} --distributed-world-size ${comm_size} --device-id ${lrank} --local_rank ${lrank} --distributed-init-method tcp://${1}:34567 --distributed-no-spawn --max-epoch 1"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
```
##### 3.2.2 run4.sh
```
#!/usr/bin/env bash
#SBATCH -J distribute
#SBATCH -p wzhdtest
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:4
set -x
hostfile=./$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}
for i in `cat $hostfile`
do
echo ${i} slots=4 >> `pwd`/hostfile-$SLURM_JOB_ID
((num_node=${num_node}+1))
done
num_dcu=$((${num_node}*4))
echo $num_dcu
nodename=$(cat $hostfile |sed -n "1p")
echo $nodename
dist_url=`echo $nodename | awk '{print $1}'`
export HSA_USERPTR_FOR_PAGED_MEM=0
mpirun -np ${num_dcu} --hostfile hostfile-$SLURM_JOB_ID single_process.sh $dist_url
```
##### 3.2.3 Run
```
sbatch run4.sh
```
##### 3.2.4 Parameter notes
- In single_process.sh above, --max-tokens is the main parameter to watch;
- Use --arch to select the network to test, e.g. transformer_wmt_en_de;
- The mpirun command in run4.sh above launches training on 4 DCU accelerator cards (a non-Slurm launch sketch follows below).
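If Slurm is not available, the same four-card job can in principle be launched directly with mpirun on a single node. The following is an untested sketch that assumes Open MPI and four visible DCUs; the node IP is passed as the first argument, matching ${1} in single_process.sh:
```
# single-node launch without sbatch (sketch; assumes Open MPI and 4 DCUs on this node)
chmod +x single_process.sh
mpirun -np 4 ./single_process.sh $(hostname -i)   # replace $(hostname -i) with the node's reachable IP if needed
```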
#### 3.3 Single-card test (half precision)
##### 3.3.1 fp16_run_transformer.sh
```
export HSA_FORCE_FINE_GRAIN_PCIE=1
export MIOPEN_FIND_MODE=1
export HIP_VISIBLE_DEVICES=0
export DATA_PATH=~/data/wmt14_en_de_joined_dict
export TOKEN=2560
python3 train.py \
$DATA_PATH \
--save-dir module-fp16_2560 \
--arch transformer_wmt_en_de --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens ${TOKEN} \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric --max-epoch 1 --fp16
```
##### 3.3.2 Run
./fp16_run_transformer.sh
##### 3.3.3 Parameter notes
--max-tokens sets the batch size in terms of tokens
--fp16 enables half-precision training
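To compare the FP16 run against the FP32 one, the words-per-second figure printed in fairseq's training progress output can be extracted. This is a rough sketch; it assumes the output is captured to a log file and that the progress lines contain a wps= field, as in the default simple log format:
```
# capture the training output and pull the last few wps (words/second) readings
./fp16_run_transformer.sh 2>&1 | tee fp16_train.log
grep -o "wps=[0-9.]*" fp16_train.log | tail -n 5
```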
#### 3.4 Four-card test (half precision)
##### 3.4.1 fp16_single_process.sh
```
export HIP_VISIBLE_DEVICES=0
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export NCCL_SOCKET_IFNAME=eno1
export HSA_USERPTR_FOR_PAGED_MEM=0
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE
TOKENS=2560
DATA_PATH=~/data/wmt14_en_de_joined_dict
APP="python3 ~/fairseq/train.py $DATA_PATH --arch transformer_wmt_en_de --share-decoder-input-output-embed --optimizer adam --adam-betas '(0.9,0.98)' --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens $TOKENS --eval-bleu --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10} --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --distributed-rank ${comm_rank} --distributed-world-size ${comm_size} --device-id ${lrank} --local_rank ${lrank} --distributed-init-method tcp://${1}:34567 --distributed-no-spawn --max-epoch 1 --fp16"
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_PCI_BW=mlx5_0:50Gbs
echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_1:1
export UCX_IB_PCI_BW=mlx5_1:50Gbs
echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_PCI_BW=mlx5_2:50Gbs
echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=0,1,2,3
export UCX_NET_DEVICES=mlx5_3:1
export UCX_IB_PCI_BW=mlx5_3:50Gbs
echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
```
##### 3.4.2 fp16_run_transformer_4dcus.sh
```
#!/usr/bin/env bash
#SBATCH -J transformer
#SBATCH -p wzhdtest
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:4
set -x
hostfile=./$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}
for i in `cat $hostfile`
do
echo ${i} slots=4 >> `pwd`/hostfile-$SLURM_JOB_ID
((num_node=${num_node}+1))
done
num_dcu=$((${num_node}*4))
echo $num_dcu
nodename=$(cat $hostfile |sed -n "1p")
echo $nodename
dist_url=`echo $nodename | awk '{print $1}'`
export HSA_USERPTR_FOR_PAGED_MEM=0
mpirun -np ${num_dcu} --hostfile hostfile-$SLURM_JOB_ID fp16_single_process.sh $dist_url
```
##### 3.4.3 Run
sbatch fp16_run_transformer_4dcus.sh
##### 3.4.4 Parameter notes
- In fp16_single_process.sh above, --max-tokens is the main parameter to watch;
- Use --arch to select the network to test, e.g. transformer_wmt_en_de;
- The mpirun command in fp16_run_transformer_4dcus.sh above launches training on 4 DCU accelerator cards.
#### 3.5 Known issues
##### 3.5.1 Format error
The error message is:
```
File "~/virturlenv-test/venv/lib/python3.6/site-packages/sacrebleu/metrics/bleu.py", line 103, in __init__
self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:d} "
...
ValueError: Unknown format code 'd' for object of type 'float'
```
Fix: in the bleu.py file shown in the traceback, change the d format code on lines 103 and 104 to .0f
```
# after the change
self._verbose += f"ratio = {self.ratio:.3f} hyp_len = {self.sys_len:.0f} "
self._verbose += f"ref_len = {self.ref_len:.0f}"
```
##### 3.5.2 JSON parsing error
The error message is:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The value that fails to parse makes it clear that nested quoting is the problem.
Incorrect output:
```
'eval_bleu_args': "'{beam:5,max_len_a:1.2,max_len_b:10}'
```
Correct output:
```
'eval_bleu_args': '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'
```
Change the argument to:
```
" --eval-bleu-args {\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10}"
```
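Before launching a long run, the escaped string can be checked to confirm it parses as JSON (an optional verification):
```
# confirm the escaped --eval-bleu-args value is valid JSON
python3 -c 'import json; print(json.loads("{\"beam\":5,\"max_len_a\":1.2,\"max_len_b\":10}"))'
```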
<p align="center">
<img src="docs/fairseq_logo.png" width="150">
<br />
<br />
<a href="https://opensource.fb.com/support-ukraine"><img alt="Support Ukraine" src="https://img.shields.io/badge/Support-Ukraine-FFD500?style=flat&labelColor=005BBB" /></a>
<a href="https://github.com/pytorch/fairseq/blob/main/LICENSE"><img alt="MIT License" src="https://img.shields.io/badge/license-MIT-blue.svg" /></a>
<a href="https://github.com/pytorch/fairseq/releases"><img alt="Latest Release" src="https://img.shields.io/github/release/pytorch/fairseq.svg" /></a>
<a href="https://github.com/pytorch/fairseq/actions?query=workflow:build"><img alt="Build Status" src="https://github.com/pytorch/fairseq/workflows/build/badge.svg" /></a>
<a href="https://fairseq.readthedocs.io/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/fairseq/badge/?version=latest" /></a>
</p>
--------------------------------------------------------------------------------
Fairseq(-py) is a sequence modeling toolkit that allows researchers and
developers to train custom models for translation, summarization, language
modeling and other text generation tasks.
We provide reference implementations of various sequence modeling papers:
<details><summary>List of implemented papers</summary><p>
* **Convolutional Neural Networks (CNN)**
+ [Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)](examples/language_model/conv_lm/README.md)
+ [Convolutional Sequence to Sequence Learning (Gehring et al., 2017)](examples/conv_seq2seq/README.md)
+ [Classical Structured Prediction Losses for Sequence to Sequence Learning (Edunov et al., 2018)](https://github.com/pytorch/fairseq/tree/classic_seqlevel)
+ [Hierarchical Neural Story Generation (Fan et al., 2018)](examples/stories/README.md)
+ [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](examples/wav2vec/README.md)
* **LightConv and DynamicConv models**
+ [Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019)](examples/pay_less_attention_paper/README.md)
* **Long Short-Term Memory (LSTM) networks**
+ Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
* **Transformer (self-attention) networks**
+ Attention Is All You Need (Vaswani et al., 2017)
+ [Scaling Neural Machine Translation (Ott et al., 2018)](examples/scaling_nmt/README.md)
+ [Understanding Back-Translation at Scale (Edunov et al., 2018)](examples/backtranslation/README.md)
+ [Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)](examples/language_model/README.adaptive_inputs.md)
+ [Lexically constrained decoding with dynamic beam allocation (Post & Vilar, 2018)](examples/constrained_decoding/README.md)
+ [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)](examples/truncated_bptt/README.md)
+ [Adaptive Attention Span in Transformers (Sukhbaatar et al., 2019)](examples/adaptive_span/README.md)
+ [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
+ [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
+ [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
+ [Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)](examples/joint_alignment_translation/README.md )
+ [Multilingual Denoising Pre-training for Neural Machine Translation (Liu et al., 2020)](examples/mbart/README.md)
+ [Neural Machine Translation with Byte-Level Subwords (Wang et al., 2020)](examples/byte_level_bpe/README.md)
+ [Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)](examples/unsupervised_quality_estimation/README.md)
+ [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)](examples/wav2vec/README.md)
+ [Generating Medical Reports from Patient-Doctor Conversations Using Sequence-to-Sequence Models (Enarvi et al., 2020)](examples/pointer_generator/README.md)
+ [Linformer: Self-Attention with Linear Complexity (Wang et al., 2020)](examples/linformer/README.md)
+ [Cross-lingual Retrieval for Iterative Self-Supervised Training (Tran et al., 2020)](examples/criss/README.md)
+ [Deep Transformers with Latent Depth (Li et al., 2020)](examples/latent_depth/README.md)
+ [Unsupervised Cross-lingual Representation Learning for Speech Recognition (Conneau et al., 2020)](https://arxiv.org/abs/2006.13979)
+ [Self-training and Pre-training are Complementary for Speech Recognition (Xu et al., 2020)](https://arxiv.org/abs/2010.11430)
+ [Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training (Hsu, et al., 2021)](https://arxiv.org/abs/2104.01027)
+ [Unsupervised Speech Recognition (Baevski, et al., 2021)](https://arxiv.org/abs/2105.11084)
+ [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition (Xu et al., 2021)](https://arxiv.org/abs/2109.11680)
+ [VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding (Xu et al., 2021)](https://arxiv.org/pdf/2109.14084.pdf)
+ [VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding (Xu et al., 2021)](https://aclanthology.org/2021.findings-acl.370.pdf)
+ [NormFormer: Improved Transformer Pretraining with Extra Normalization (Shleifer et al., 2021)](examples/normformer/README.md)
* **Non-autoregressive Transformers**
+ Non-Autoregressive Neural Machine Translation (Gu et al., 2017)
+ Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee et al. 2018)
+ Insertion Transformer: Flexible Sequence Generation via Insertion Operations (Stern et al. 2019)
+ Mask-Predict: Parallel Decoding of Conditional Masked Language Models (Ghazvininejad et al., 2019)
+ [Levenshtein Transformer (Gu et al., 2019)](examples/nonautoregressive_translation/README.md)
* **Finetuning**
+ [Better Fine-Tuning by Reducing Representational Collapse (Aghajanyan et al. 2020)](examples/rxf/README.md)
</p></details>
### What's New:
* May 2022 [Integration with xFormers](https://github.com/facebookresearch/xformers)
* December 2021 [Released Direct speech-to-speech translation code](examples/speech_to_speech/README.md)
* October 2021 [Released VideoCLIP and VLM models](examples/MMPT/README.md)
* October 2021 [Released multilingual finetuned XLSR-53 model](examples/wav2vec/README.md)
* September 2021 [`master` branch renamed to `main`](https://github.com/github/renaming).
* July 2021 [Released DrNMT code](examples/discriminative_reranking_nmt/README.md)
* July 2021 [Released Robust wav2vec 2.0 model](examples/wav2vec/README.md)
* June 2021 [Released XLMR-XL and XLMR-XXL models](examples/xlmr/README.md)
* May 2021 [Released Unsupervised Speech Recognition code](examples/wav2vec/unsupervised/README.md)
* March 2021 [Added full parameter and optimizer state sharding + CPU offloading](examples/fully_sharded_data_parallel/README.md)
* February 2021 [Added LASER training code](examples/laser/README.md)
* December 2020: [Added Adaptive Attention Span code](examples/adaptive_span/README.md)
* December 2020: [GottBERT model and code released](examples/gottbert/README.md)
* November 2020: Adopted the [Hydra](https://github.com/facebookresearch/hydra) configuration framework
* [see documentation explaining how to use it for new and existing projects](docs/hydra_integration.md)
* November 2020: [fairseq 0.10.0 released](https://github.com/pytorch/fairseq/releases/tag/v0.10.0)
* October 2020: [Added R3F/R4F (Better Fine-Tuning) code](examples/rxf/README.md)
* October 2020: [Deep Transformer with Latent Depth code released](examples/latent_depth/README.md)
* October 2020: [Added CRISS models and code](examples/criss/README.md)
<details><summary>Previous updates</summary><p>
* September 2020: [Added Linformer code](examples/linformer/README.md)
* September 2020: [Added pointer-generator networks](examples/pointer_generator/README.md)
* August 2020: [Added lexically constrained decoding](examples/constrained_decoding/README.md)
* August 2020: [wav2vec2 models and code released](examples/wav2vec/README.md)
* July 2020: [Unsupervised Quality Estimation code released](examples/unsupervised_quality_estimation/README.md)
* May 2020: [Follow fairseq on Twitter](https://twitter.com/fairseq)
* April 2020: [Monotonic Multihead Attention code released](examples/simultaneous_translation/README.md)
* April 2020: [Quant-Noise code released](examples/quant_noise/README.md)
* April 2020: [Initial model parallel support and 11B parameters unidirectional LM released](examples/megatron_11b/README.md)
* March 2020: [Byte-level BPE code released](examples/byte_level_bpe/README.md)
* February 2020: [mBART model and code released](examples/mbart/README.md)
* February 2020: [Added tutorial for back-translation](https://github.com/pytorch/fairseq/tree/main/examples/backtranslation#training-your-own-model-wmt18-english-german)
* December 2019: [fairseq 0.9.0 released](https://github.com/pytorch/fairseq/releases/tag/v0.9.0)
* November 2019: [VizSeq released (a visual analysis toolkit for evaluating fairseq models)](https://facebookresearch.github.io/vizseq/docs/getting_started/fairseq_example)
* November 2019: [CamemBERT model and code released](examples/camembert/README.md)
* November 2019: [BART model and code released](examples/bart/README.md)
* November 2019: [XLM-R models and code released](examples/xlmr/README.md)
* September 2019: [Nonautoregressive translation code released](examples/nonautoregressive_translation/README.md)
* August 2019: [WMT'19 models released](examples/wmt19/README.md)
* July 2019: fairseq relicensed under MIT license
* July 2019: [RoBERTa models and code released](examples/roberta/README.md)
* June 2019: [wav2vec models and code released](examples/wav2vec/README.md)
</p></details>
### Features:
* multi-GPU training on one machine or across multiple machines (data and model parallel)
* fast generation on both CPU and GPU with multiple search algorithms implemented:
+ beam search
+ Diverse Beam Search ([Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424))
+ sampling (unconstrained, top-k and top-p/nucleus)
+ [lexically constrained decoding](examples/constrained_decoding/README.md) (Post & Vilar, 2018)
* [gradient accumulation](https://fairseq.readthedocs.io/en/latest/getting_started.html#large-mini-batch-training-with-delayed-updates) enables training with large mini-batches even on a single GPU
* [mixed precision training](https://fairseq.readthedocs.io/en/latest/getting_started.html#training-with-half-precision-floating-point-fp16) (trains faster with less GPU memory on [NVIDIA tensor cores](https://developer.nvidia.com/tensor-cores))
* [extensible](https://fairseq.readthedocs.io/en/latest/overview.html): easily register new models, criterions, tasks, optimizers and learning rate schedulers
* [flexible configuration](docs/hydra_integration.md) based on [Hydra](https://github.com/facebookresearch/hydra) allowing a combination of code, command-line and file based configuration
* [full parameter and optimizer state sharding](examples/fully_sharded_data_parallel/README.md)
* [offloading parameters to CPU](examples/fully_sharded_data_parallel/README.md)
We also provide [pre-trained models for translation and language modeling](#pre-trained-models-and-examples)
with a convenient `torch.hub` interface:
``` python
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model')
en2de.translate('Hello world', beam=5)
# 'Hallo Welt'
```
See the PyTorch Hub tutorials for [translation](https://pytorch.org/hub/pytorch_fairseq_translation/)
and [RoBERTa](https://pytorch.org/hub/pytorch_fairseq_roberta/) for more examples.
# Requirements and Installation
* [PyTorch](http://pytorch.org/) version >= 1.5.0
* Python version >= 3.6
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* **To install fairseq** and develop locally:
``` bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
# on MacOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
# to install the latest stable release (0.10.x)
# pip install fairseq
```
* **For faster training** install NVIDIA's [apex](https://github.com/NVIDIA/apex) library:
``` bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
```
* **For large datasets** install [PyArrow](https://arrow.apache.org/docs/python/install.html#using-pip): `pip install pyarrow`
* If you use Docker make sure to increase the shared memory size either with `--ipc=host` or `--shm-size`
as command line options to `nvidia-docker run` .
# Getting Started
The [full documentation](https://fairseq.readthedocs.io/) contains instructions
for getting started, training new models and extending fairseq with new model
types and tasks.
# Pre-trained models and examples
We provide pre-trained models and pre-processed, binarized test sets for several tasks listed below,
as well as example training and evaluation commands.
* [Translation](examples/translation/README.md): convolutional and transformer models are available
* [Language Modeling](examples/language_model/README.md): convolutional and transformer models are available
We also have more detailed READMEs to reproduce results from specific papers:
* [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale (Babu et al., 2021)](examples/wav2vec/xlsr/README.md)
* [Cross-lingual Retrieval for Iterative Self-Supervised Training (Tran et al., 2020)](examples/criss/README.md)
* [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)](examples/wav2vec/README.md)
* [Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)](examples/unsupervised_quality_estimation/README.md)
* [Training with Quantization Noise for Extreme Model Compression ({Fan*, Stock*} et al., 2020)](examples/quant_noise/README.md)
* [Neural Machine Translation with Byte-Level Subwords (Wang et al., 2020)](examples/byte_level_bpe/README.md)
* [Multilingual Denoising Pre-training for Neural Machine Translation (Liu et al., 2020)](examples/mbart/README.md)
* [Reducing Transformer Depth on Demand with Structured Dropout (Fan et al., 2019)](examples/layerdrop/README.md)
* [Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)](examples/joint_alignment_translation/README.md)
* [Levenshtein Transformer (Gu et al., 2019)](examples/nonautoregressive_translation/README.md)
* [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
* [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
* [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](examples/wav2vec/README.md)
* [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
* [Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019)](examples/pay_less_attention_paper/README.md)
* [Understanding Back-Translation at Scale (Edunov et al., 2018)](examples/backtranslation/README.md)
* [Classical Structured Prediction Losses for Sequence to Sequence Learning (Edunov et al., 2018)](https://github.com/pytorch/fairseq/tree/classic_seqlevel)
* [Hierarchical Neural Story Generation (Fan et al., 2018)](examples/stories/README.md)
* [Scaling Neural Machine Translation (Ott et al., 2018)](examples/scaling_nmt/README.md)
* [Convolutional Sequence to Sequence Learning (Gehring et al., 2017)](examples/conv_seq2seq/README.md)
* [Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)](examples/language_model/README.conv.md)
# Join the fairseq community
* Twitter: https://twitter.com/fairseq
* Facebook page: https://www.facebook.com/groups/fairseq.users
* Google group: https://groups.google.com/forum/#!forum/fairseq-users
# License
fairseq(-py) is MIT-licensed.
The license applies to the pre-trained models as well.
# Citation
Please cite as:
``` bibtex
@inproceedings{ott2019fairseq,
title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
year = {2019},
}
```
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = python -msphinx
SPHINXPROJ = fairseq
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.wy-table-responsive table td kbd {
white-space: nowrap;
}
.wy-table-responsive table td {
white-space: normal !important;
}
.wy-table-responsive {
overflow: visible !important;
}
.. _Command-line Tools:
Command-line Tools
==================
Fairseq provides several command-line tools for training and evaluating models:
- :ref:`fairseq-preprocess`: Data pre-processing: build vocabularies and binarize training data
- :ref:`fairseq-train`: Train a new model on one or multiple GPUs
- :ref:`fairseq-generate`: Translate pre-processed data with a trained model
- :ref:`fairseq-interactive`: Translate raw text with a trained model
- :ref:`fairseq-score`: BLEU scoring of generated translations against reference translations
- :ref:`fairseq-eval-lm`: Language model evaluation
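For example, BLEU scoring of a :ref:`fairseq-generate` run can be done by extracting the hypothesis and reference lines from its output and passing them to :ref:`fairseq-score` (a short illustration; ``gen.out`` is an arbitrary file name):

.. code-block:: console

    > grep ^H gen.out | cut -f3- > gen.out.sys
    > grep ^T gen.out | cut -f2- > gen.out.ref
    > fairseq-score --sys gen.out.sys --ref gen.out.ref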
.. _fairseq-preprocess:
fairseq-preprocess
~~~~~~~~~~~~~~~~~~
.. automodule:: fairseq_cli.preprocess
.. argparse::
:module: fairseq.options
:func: get_preprocessing_parser
:prog: fairseq-preprocess
.. _fairseq-train:
fairseq-train
~~~~~~~~~~~~~
.. automodule:: fairseq_cli.train
.. argparse::
:module: fairseq.options
:func: get_training_parser
:prog: fairseq-train
.. _fairseq-generate:
fairseq-generate
~~~~~~~~~~~~~~~~
.. automodule:: fairseq_cli.generate
.. argparse::
:module: fairseq.options
:func: get_generation_parser
:prog: fairseq-generate
.. _fairseq-interactive:
fairseq-interactive
~~~~~~~~~~~~~~~~~~~
.. automodule:: fairseq_cli.interactive
.. argparse::
:module: fairseq.options
:func: get_interactive_generation_parser
:prog: fairseq-interactive
.. _fairseq-score:
fairseq-score
~~~~~~~~~~~~~
.. automodule:: fairseq_cli.score
.. argparse::
:module: fairseq_cli.score
:func: get_parser
:prog: fairseq-score
.. _fairseq-eval-lm:
fairseq-eval-lm
~~~~~~~~~~~~~~~
.. automodule:: fairseq_cli.eval_lm
.. argparse::
:module: fairseq.options
:func: get_eval_lm_parser
:prog: fairseq-eval-lm
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# fairseq documentation build configuration file, created by
# sphinx-quickstart on Fri Aug 17 21:45:30 2018.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
from fairseq import __version__
# source code directory, relative to this file, for sphinx-autobuild
sys.path.insert(0, os.path.abspath(".."))
source_suffix = [".rst"]
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.intersphinx",
"sphinx.ext.viewcode",
"sphinx.ext.napoleon",
"sphinxarg.ext",
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The master toctree document.
master_doc = "index"
# General information about the project.
project = "fairseq"
copyright = "Facebook AI Research (FAIR)"
author = "Facebook AI Research (FAIR)"
github_doc_root = "https://github.com/pytorch/fairseq/tree/main/docs/"
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = __version__
# The full version, including alpha/beta/rc tags.
release = __version__
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = "sphinx"
highlight_language = "python"
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
html_context = {
"css_files": [
"_static/theme_overrides.css", # override wide tables in RTD theme
],
}
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# This is required for the alabaster theme
# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
# html_sidebars = {
# '**': [
# 'about.html',
# 'navigation.html',
# 'relations.html', # needs 'show_related': True theme option to display
# 'searchbox.html',
# 'donate.html',
# ]
# }
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {
"numpy": ("http://docs.scipy.org/doc/numpy/", None),
"python": ("https://docs.python.org/", None),
"torch": ("https://pytorch.org/docs/master/", None),
}
.. role:: hidden
:class: hidden-section
.. _Criterions:
Criterions
==========
Criterions compute the loss function given the model and batch, roughly::
loss = criterion(model, batch)
.. automodule:: fairseq.criterions
:members:
.. autoclass:: fairseq.criterions.FairseqCriterion
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.adaptive_loss.AdaptiveLoss
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.composite_loss.CompositeLoss
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.cross_entropy.CrossEntropyCriterion
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.label_smoothed_cross_entropy.LabelSmoothedCrossEntropyCriterion
:members:
:undoc-members:
.. role:: hidden
:class: hidden-section
.. module:: fairseq.data
Data Loading and Utilities
==========================
.. _datasets:
Datasets
--------
**Datasets** define the data format and provide helpers for creating
mini-batches.
.. autoclass:: fairseq.data.FairseqDataset
:members:
.. autoclass:: fairseq.data.LanguagePairDataset
:members:
.. autoclass:: fairseq.data.MonolingualDataset
:members:
**Helper Datasets**
These datasets wrap other :class:`fairseq.data.FairseqDataset` instances and
provide additional functionality:
.. autoclass:: fairseq.data.BacktranslationDataset
:members:
.. autoclass:: fairseq.data.ConcatDataset
:members:
.. autoclass:: fairseq.data.ResamplingDataset
:members:
.. autoclass:: fairseq.data.RoundRobinZipDatasets
:members:
.. autoclass:: fairseq.data.TransformEosDataset
:members:
Dictionary
----------
.. autoclass:: fairseq.data.Dictionary
:members:
Iterators
---------
.. autoclass:: fairseq.data.CountingIterator
:members:
.. autoclass:: fairseq.data.EpochBatchIterator
:members:
.. autoclass:: fairseq.data.GroupedIterator
:members:
.. autoclass:: fairseq.data.ShardedIterator
:members:
Evaluating Pre-trained Models
=============================
First, download a pre-trained model along with its vocabularies:
.. code-block:: console
> curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
This model uses a `Byte Pair Encoding (BPE)
vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
the encoding to the source text before it can be translated. This can be
done with the
`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py>`__
script using the ``wmt14.en-fr.fconv-cuda/bpecodes`` file. ``@@`` is
used as a continuation marker and the original text can be easily
recovered with e.g. ``sed s/@@ //g`` or by passing the ``--remove-bpe``
flag to :ref:`fairseq-generate`. Prior to BPE, input text needs to be tokenized
using ``tokenizer.perl`` from
`mosesdecoder <https://github.com/moses-smt/mosesdecoder>`__.
Let's use :ref:`fairseq-interactive` to generate translations interactively.
Here, we use a beam size of 5 and preprocess the input with the Moses
tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically
remove the BPE continuation markers and detokenize the output.
.. code-block:: console
> MODEL_DIR=wmt14.en-fr.fconv-py
> fairseq-interactive \
--path $MODEL_DIR/model.pt $MODEL_DIR \
--beam 5 --source-lang en --target-lang fr \
--tokenizer moses \
--bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
Why is it rare to discover new marine mammal species?
S-0 Why is it rare to discover new marine mam@@ mal species ?
H-0 -0.0643349438905716 Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0 -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
This generation script produces three types of outputs: a line prefixed
with *O* is a copy of the original source sentence; *H* is the
hypothesis along with an average log-likelihood; and *P* is the
positional score per token position, including the
end-of-sentence marker which is omitted from the text.
Other types of output lines you might see are *D*, the detokenized hypothesis,
*T*, the reference target, *A*, alignment info, *E* the history of generation steps.
See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
full list of pre-trained models available.
Training a New Model
====================
The following tutorial is for machine translation. For an example of how
to use Fairseq for other tasks, such as :ref:`language modeling`, please see the
``examples/`` directory.
Data Pre-processing
-------------------
Fairseq contains example pre-processing scripts for several translation
datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT
2014 (English-German). To pre-process and binarize the IWSLT dataset:
.. code-block:: console
> cd examples/translation/
> bash prepare-iwslt14.sh
> cd ../..
> TEXT=examples/translation/iwslt14.tokenized.de-en
> fairseq-preprocess --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/iwslt14.tokenized.de-en
This will write binarized data that can be used for model training to
``data-bin/iwslt14.tokenized.de-en``.
Training
--------
Use :ref:`fairseq-train` to train a new model. Here a few example settings that work
well for the IWSLT 2014 dataset:
.. code-block:: console
> mkdir -p checkpoints/fconv
> CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
--optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv
By default, :ref:`fairseq-train` will use all available GPUs on your machine. Use the
``CUDA_VISIBLE_DEVICES`` environment variable to select specific GPUs and/or to
change the number of GPU devices that will be used.
Also note that the batch size is specified in terms of the maximum
number of tokens per batch (``--max-tokens``). You may need to use a
smaller value depending on the available GPU memory on your system.
Generation
----------
Once your model is trained, you can generate translations using
:ref:`fairseq-generate` **(for binarized data)** or
:ref:`fairseq-interactive` **(for raw text)**:
.. code-block:: console
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/fconv/checkpoint_best.pt \
--batch-size 128 --beam 5
| [de] dictionary: 35475 types
| [en] dictionary: 24739 types
| data-bin/iwslt14.tokenized.de-en test 6750 examples
| model fconv
| loaded checkpoint trainings/fconv/checkpoint_best.pt
S-721 danke .
T-721 thank you .
...
To generate translations with only a CPU, use the ``--cpu`` flag. BPE
continuation markers can be removed with the ``--remove-bpe`` flag.
Advanced Training Options
=========================
Large mini-batch training with delayed updates
----------------------------------------------
The ``--update-freq`` option can be used to accumulate gradients from
multiple mini-batches and delay updating, creating a larger effective
batch size. Delayed updates can also improve training speed by reducing
inter-GPU communication costs and by saving idle time caused by variance
in workload across GPUs. See `Ott et al.
(2018) <https://arxiv.org/abs/1806.00187>`__ for more details.
To train on a single GPU with an effective batch size that is equivalent
to training on 8 GPUs:
.. code-block:: console
> CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
Training with half precision floating point (FP16)
--------------------------------------------------
.. note::
FP16 training requires a Volta GPU and CUDA 9.1 or greater
Recent GPUs enable efficient half precision floating point computation,
e.g., using `Nvidia Tensor Cores
<https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__.
Fairseq supports FP16 training with the ``--fp16`` flag:
.. code-block:: console
> fairseq-train --fp16 (...)
Distributed training
--------------------
Distributed training in fairseq is implemented on top of ``torch.distributed``.
The easiest way to launch jobs is with the `torch.distributed.launch
<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.
For example, to train a large English-German Transformer model on 2 nodes each
with 8 GPUs (in total 16 GPUs), run the following command on each node,
replacing ``node_rank=0`` with ``node_rank=1`` on the second node and making
sure to update ``--master_addr`` to the IP address of the first node:
.. code-block:: console
> python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
--master_port=12345 \
$(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--max-epoch 70 \
--fp16
On SLURM clusters, fairseq will automatically detect the number of nodes and
GPUs, but a port number must be provided:
.. code-block:: console
> salloc --gpus=16 --nodes 2 (...)
> srun fairseq-train --distributed-port 12345 (...).
Sharding very large datasets
----------------------------
It can be challenging to train over very large datasets, particularly if your
machine does not have much system RAM. Most tasks in fairseq support training
over "sharded" datasets, in which the original dataset has been preprocessed
into non-overlapping chunks (or "shards").
For example, instead of preprocessing all your data into a single "data-bin"
directory, you can split the data and create "data-bin1", "data-bin2", etc.
Then you can adapt your training command like so:
.. code-block:: console
> fairseq-train data-bin1:data-bin2:data-bin3 (...)
Training will now iterate over each shard, one by one, with each shard
corresponding to an "epoch", thus reducing system memory usage.
## Hydra
[Hydra](https://github.com/facebookresearch/hydra) is an open-source Python
framework that simplifies the development of research and other complex
applications. The key feature is the ability to dynamically create a
hierarchical configuration by composition and override it through config files
and the command line. The name Hydra comes from its ability to run multiple
similar jobs - much like a Hydra with multiple heads.
## Motivation
Until recently, all components in fairseq were configured through a shared
`args` namespace that was created at application startup. Components declared
their own `add_args` method to update the argparse parser, hoping that the names
would not clash with arguments from other components. While this model works for
smaller applications, as fairseq grew and became integrated into other
applications, this became problematic. In order to determine how to configure
each component, one needed to a) examine what args were added by this component,
and b) read the code to figure out what shared arguments it is using that were
added in other places. Reproducing models involved sharing commands that often
contained dozens of command line switches.
The model described above is still supported by fairseq for backward
compatibility, but will be deprecated some time in the future.
New components in fairseq should now create a dataclass that encapsulates all
parameters required to configure this component. The dataclass is registered
along with the component, and fairseq takes care of constructing and providing
this configuration object to the component's constructor. Note that sharing
parameters can optionally still work, but one has to explicitly point to the
"source of truth" (see inheritance example below). These changes make components
in fairseq more independent and re-usable by other applications: all that is
needed to create a component is to initialize its dataclass and overwrite some
of the defaults.
While configuring fairseq through command line (using either the legacy argparse
based or the new Hydra based entry points) is still fully supported, you can now
take advantage of configuring fairseq completely or piece-by-piece through
hierarchical YAML configuration files. These files can also be shipped as
examples that others can use to run an identically configured job.
Additionally, Hydra has a rich and growing [library of
plugins](https://github.com/facebookresearch/hydra/tree/master/plugins) that
provide functionality such as hyperparameter sweeping (including using bayesian
optimization through the [Ax](https://github.com/facebook/Ax) library), job
launching across various platforms, and more.
## Creating or migrating components
In general, each new (or updated) component should provide a companion
[dataclass](https://www.python.org/dev/peps/pep-0557/). These dataclass are
typically located in the same file as the component and are passed as arguments
to the `register_*()` functions. Top-level configs that should be present in
every fairseq application are placed in the
[global](fairseq/dataclass/configs.py) config file and added to the
`FairseqConfig` object.
Each dataclass is a plain-old-data object, similar to a `NamedTuple`. These
classes are decorated with a `@dataclass` decorator, and typically inherit from
`FairseqDataclass` (which adds some functionality for backward compatibility).
Each field must have a type, and generally has metadata (such as a help string)
and a default value. Only primitive types or other config objects are allowed as
data types for each field.
#### Example:
```python
from dataclasses import dataclass, field
from fairseq.dataclass import FairseqDataclass
@dataclass
class InteractiveConfig(FairseqDataclass):
buffer_size: int = field(
default=0,
metadata={
"help": "read this many sentences into a buffer before processing them"
},
)
input: str = field(
default="-",
metadata={"help": "file to read from; use - for stdin"},
)
```
### Inheriting values
Some components require sharing a value. For example, a learning rate scheduler
and an optimizer may both need to know the initial learning rate value. One can
declare a field that, by default, will inherit its value from another config
node in the same hierarchy:
```python
@dataclass
class FairseqAdamConfig(FairseqDataclass):
...
lr: List[float] = II("optimization.lr")
...
```
`II("optimization.lr")` is syntactic sugar for `"${optimization.lr}"`, which is
the value one can use in a YAML config file or through command line to achieve
the same effect. Note that this assumes that there is an "optimization" config
object in the root config and it has a field called "lr".
### Tasks and Models
Creating Tasks and Models works the same as before, except that legacy
implementations now inherit from `LegacyFairseq*` base classes, while new
components inherit from `FairseqTask` and `FairseqModel` and provide a dataclass
to the `register_*()` functions.
#### Task example:
```python
@dataclass
class LanguageModelingConfig(FairseqDataclass):
data: Optional[str] = field(
default=None, metadata={"help": "path to data directory"}
)
...
@register_task("language_modeling", dataclass=LanguageModelingConfig)
class LanguageModelingTask(FairseqTask):
...
@classmethod
def setup_task(cls, cfg: LanguageModelingConfig):
...
```
#### Model example:
```python
@dataclass
class TransformerLanguageModelConfig(FairseqDataclass):
activation_fn: ChoiceEnum(utils.get_available_activation_fns()) = field(
default="relu", metadata={"help": "activation function to use"}
)
dropout: float = field(default=0.1, metadata={"help": "dropout probability"})
...
@register_model("transformer_lm", dataclass=TransformerLanguageModelConfig)
class TransformerLanguageModel(FairseqLanguageModel):
...
@classmethod
def build_model(cls, cfg: TransformerLanguageModelConfig, task: FairseqTask):
...
```
### Other components
Other components work as before, but they now take their configuration dataclass
as the only constructor argument:
```python
@dataclass
class MosesTokenizerConfig(FairseqDataclass):
source_lang: str = field(default="en", metadata={"help": "source language"})
...
@register_tokenizer("moses", dataclass=MosesTokenizerConfig)
class MosesTokenizer(object):
def __init__(self, cfg: MosesTokenizerConfig):
...
```
Note that if you are adding a new registry for a new set of components, you need
to add it to the `FairseqConfig` object in `fairseq/dataclass/configs.py`:
```python
@dataclass
class FairseqConfig(object):
...
my_new_registry: Any = None
```
## Training with `fairseq-hydra-train`
To fully take advantage of configuration flexibility offered by Hydra, you may
want to train new models using the `fairseq-hydra-train` entry point. Legacy CLI
tools such as `fairseq-train` will remain supported for the foreseeable future
but will be deprecated eventually.
On startup, Hydra will create a configuration object that contains a hierarchy
of all the necessary dataclasses populated with their default values in the
code. The default values are overwritten by values found in YAML files in
`fairseq/config` directory (which currently sets minimal defaults) and then
further overwritten by values provided through command line arguments.
Some of the most common use cases are shown below:
### 1. Override default values through command line:
```shell script
$ fairseq-hydra-train \
distributed_training.distributed_world_size=1 \
dataset.batch_size=2 \
task.data=data-bin \
model=transformer_lm/transformer_lm_gpt \
task=language_modeling \
optimization.max_update=5000
```
Note that along with explicitly providing values for parameters such as
`dataset.batch_size`, this also tells Hydra to overlay configuration found in
`fairseq/config/model/transformer_lm/transformer_lm_gpt.yaml` over the default
values in the dataclass. If you want to train a model without specifying a
particular architecture you can simply specify `model=transformer_lm`. This only
works for migrated tasks and models.
### 2. Replace bundled configs with an external config:
```shell script
$ fairseq-hydra-train \
--config-dir /path/to/external/configs \
--config-name wiki103
```
where `/path/to/external/configs/wiki103.yaml` contains:
```yaml
# @package _group_
model:
_name: transformer_lm
distributed_training:
distributed_world_size: 1
dataset:
batch_size: 2
task:
_name: language_modeling
data: /path/to/data
add_bos_token: false
max_target_positions: 1024
optimization:
max_update: 50000
lr: [ 0.25 ]
criterion: cross_entropy
optimizer: adam
lr_scheduler:
_name: cosine
```
Note that here bundled configs from `fairseq/config` directory are not used,
however the defaults from each dataclass will still be used (unless overwritten
by your external config).
Additionally you can choose to break up your configs by creating a directory
structure in the same location as your main config file, with the names of the
top-level fields (such as "model", "dataset", etc), and placing config files
with meaningful names that would populate that specific section of your
top-level config file (for example, you might have
`model/small_transformer_lm.yaml`, `model/big_transformer_lm.yaml`, etc). You
can then specify the correct configuration via command line, defaults in the
main config, or even launch all of them as a sweep (see Hydra documentation on
how to do this).
### 3. Add an external config directory to Hydra search path:
This allows combining default configuration (including using any bundled config
files), while specifying your own config files for some parts of the
configuration.
```shell script
$ fairseq-hydra-train \
distributed_training.distributed_world_size=1 \
dataset.batch_size=2 \
task.data=/path/to/data/ \
model=transformer_lm/2_layers \
task=language_modeling \
optimization.max_update=5000 \
--config-dir /path/to/external/configs
```
where `/path/to/external/configs` has the following structure:
```
.
+-- model
| +-- transformer_lm
| | +-- 2_layers.yaml
```
and `2_layers.yaml` contains a copy of `transformer_lm_gpt.yaml` but with
`decoder_layers` set to 2. You can add other configs to configure other
components as well.
.. fairseq documentation master file, created by
sphinx-quickstart on Fri Aug 17 21:45:30 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
:github_url: https://github.com/pytorch/fairseq
fairseq documentation
=====================
Fairseq is a sequence modeling toolkit written in `PyTorch
<http://pytorch.org/>`_ that allows researchers and developers to
train custom models for translation, summarization, language modeling and other
text generation tasks.
.. toctree::
:maxdepth: 1
:caption: Getting Started
getting_started
command_line_tools
.. toctree::
:maxdepth: 1
:caption: Extending Fairseq
overview
tutorial_simple_lstm
tutorial_classifying_names
.. toctree::
:maxdepth: 2
:caption: Library Reference
tasks
models
criterions
optim
lr_scheduler
data
modules
Indices and tables
==================
* :ref:`genindex`
* :ref:`search`
.. role:: hidden
:class: hidden-section
.. _Learning Rate Schedulers:
Learning Rate Schedulers
========================
Learning Rate Schedulers update the learning rate over the course of training.
Learning rates can be updated after each update via :func:`step_update` or at
epoch boundaries via :func:`step`.
.. automodule:: fairseq.optim.lr_scheduler
:members:
.. autoclass:: fairseq.optim.lr_scheduler.FairseqLRScheduler
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.cosine_lr_scheduler.CosineSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.fixed_schedule.FixedSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.inverse_square_root_schedule.InverseSquareRootSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.reduce_lr_on_plateau.ReduceLROnPlateau
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.triangular_lr_scheduler.TriangularSchedule
:members:
:undoc-members:
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=python -msphinx
)
set SOURCEDIR=.
set BUILDDIR=_build
set SPHINXPROJ=fairseq
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The Sphinx module was not found. Make sure you have Sphinx installed,
echo.then set the SPHINXBUILD environment variable to point to the full
echo.path of the 'sphinx-build' executable. Alternatively you may add the
echo.Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd
.. role:: hidden
:class: hidden-section
.. module:: fairseq.models
.. _Models:
Models
======
A Model defines the neural network's ``forward()`` method and encapsulates all
of the learnable parameters in the network. Each model also provides a set of
named *architectures* that define the precise network configuration (e.g.,
embedding dimension, number of layers, etc.).
Both the model type and architecture are selected via the ``--arch``
command-line argument. Once selected, a model may expose additional command-line
arguments for further configuration.
.. note::
All fairseq Models extend :class:`BaseFairseqModel`, which in turn extends
:class:`torch.nn.Module`. Thus any fairseq Model can be used as a
stand-alone Module in other PyTorch code.
Convolutional Neural Networks (CNN)
-----------------------------------
.. module:: fairseq.models.fconv
.. autoclass:: fairseq.models.fconv.FConvModel
:members:
.. autoclass:: fairseq.models.fconv.FConvEncoder
:members:
:undoc-members:
.. autoclass:: fairseq.models.fconv.FConvDecoder
:members:
Long Short-Term Memory (LSTM) networks
--------------------------------------
.. module:: fairseq.models.lstm
.. autoclass:: fairseq.models.lstm.LSTMModel
:members:
.. autoclass:: fairseq.models.lstm.LSTMEncoder
:members:
.. autoclass:: fairseq.models.lstm.LSTMDecoder
:members:
Transformer (self-attention) networks
-------------------------------------
.. module:: fairseq.models.transformer
.. autoclass:: fairseq.models.transformer.TransformerModel
:members:
.. autoclass:: fairseq.models.transformer.TransformerEncoder
:members:
.. autoclass:: fairseq.models.transformer.TransformerEncoderLayer
:members:
.. autoclass:: fairseq.models.transformer.TransformerDecoder
:members:
.. autoclass:: fairseq.models.transformer.TransformerDecoderLayer
:members:
Adding new models
-----------------
.. currentmodule:: fairseq.models
.. autofunction:: fairseq.models.register_model
.. autofunction:: fairseq.models.register_model_architecture
.. autoclass:: fairseq.models.BaseFairseqModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoderDecoderModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoderModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqLanguageModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqMultiModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoder
:members:
.. autoclass:: fairseq.models.CompositeEncoder
:members:
.. autoclass:: fairseq.models.FairseqDecoder
:members:
.. _Incremental decoding:
Incremental decoding
--------------------
.. autoclass:: fairseq.models.FairseqIncrementalDecoder
:members:
:undoc-members:
Modules
=======
Fairseq provides several stand-alone :class:`torch.nn.Module` classes that may
be helpful when implementing a new :class:`~fairseq.models.BaseFairseqModel`.
.. automodule:: fairseq.modules
:members:
:undoc-members:
.. role:: hidden
:class: hidden-section
.. _optimizers:
Optimizers
==========
Optimizers update the Model parameters based on the gradients.
.. automodule:: fairseq.optim
:members:
.. autoclass:: fairseq.optim.FairseqOptimizer
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adadelta.Adadelta
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adagrad.Adagrad
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adafactor.FairseqAdafactor
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adam.FairseqAdam
:members:
:undoc-members:
.. autoclass:: fairseq.optim.fp16_optimizer.FP16Optimizer
:members:
:undoc-members:
.. autoclass:: fairseq.optim.nag.FairseqNAG
:members:
:undoc-members:
.. autoclass:: fairseq.optim.sgd.SGD
:members:
:undoc-members: