"python/pyproject.toml" did not exist on "b675ecaec40d11ef8c0980de748e1a0471fcf5c4"
Commit 208295df authored by Myle Ott, committed by Facebook Github Bot

Update README.md

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/899

Differential Revision: D16448602

Pulled By: myleott

fbshipit-source-id: afd1a1b713274b6328150cd85d7f8a81833597aa
parent af6b361c
@@ -15,23 +15,20 @@ The model is trained with online responsibility assignment and shared parameteri
 The following command will train a `hMoElp` model with `3` experts:
 ```
-$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_de \
+$ fairseq-train --ddp-backend='no_c10d' \
+    data-bin/wmt17_en_de \
     --max-update 100000 \
     --task translation_moe \
     --method hMoElp --mean-pool-gating-network \
     --num-experts 3 \
-    --arch transformer_vaswani_wmt_en_de --share-all-embeddings \
+    --arch transformer_wmt_en_de --share-all-embeddings \
     --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
     --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
     --lr 0.0007 --min-lr 1e-09 \
     --dropout 0.1 --weight-decay 0.0 --criterion cross_entropy \
-    --max-tokens 3584 \
-    --update-freq 8
+    --max-tokens 3584
 ```
-**Note**: the above command assumes 1 GPU, but accumulates gradients from 8 fwd/bwd passes to simulate training on 8 GPUs.
-You can accelerate training on up to 8 GPUs by adjusting the `CUDA_VISIBLE_DEVICES` and `--update-freq` options accordingly.
 ## Translate
 Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
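As a rough sketch of that step (the checkpoint path and the expert index `0` are illustrative placeholders, not taken from this diff), decoding with a single expert could look like:

```
# Generate translations from expert 0, reusing the MoE flags from the training command above.
# The --path value is a hypothetical checkpoint location.
$ fairseq-generate data-bin/wmt17_en_de \
    --path checkpoints/checkpoint_best.pt \
    --beam 1 --remove-bpe \
    --task translation_moe \
    --method hMoElp --mean-pool-gating-network \
    --num-experts 3 \
    --gen-expert 0
```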