* [Activation Checkpointing and Recomputation](#activation-checkpointing-and-recomputation)
* [Distributed Optimizer](#distributed-optimizer)
* [FlashAttention](#flashattention)
* [GPT-3 Example](#gpt-3-example)
* [Retro](#retro)
* [Evaluation and Tasks](#evaluation-and-tasks)
* [GPT Text Generation](#gpt-text-generation)
* [GPT Evaluation](#gpt-evaluation)
...
@@ -323,6 +325,19 @@ In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to config
With a full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds, resulting in 138 teraFLOP/s per GPU, which is 44% of the theoretical peak FLOP/s.
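The utilization figure above can be checked with a few lines of arithmetic. The sketch below assumes an A100 peak of 312 teraFLOP/s (dense 16-bit tensor-core throughput); that value is not stated in this README, and the variable names are purely illustrative.

```python
# Figures quoted in the paragraph above.
achieved_tflops_per_gpu = 138.0
num_gpus = 1024
seconds_per_iteration = 32.0

# Assumption (not from this README): A100 dense 16-bit tensor-core peak, in teraFLOP/s.
a100_peak_tflops = 312.0

# Fraction of the theoretical peak that each GPU sustains.
utilization = achieved_tflops_per_gpu / a100_peak_tflops

# Total floating-point operations performed across all GPUs in one iteration.
total_flops_per_iteration = (
    achieved_tflops_per_gpu * 1e12 * num_gpus * seconds_per_iteration
)

print(f"utilization: {utilization:.0%}")
print(f"FLOPs per iteration: {total_flops_per_iteration:.2e}")
```

Under that assumed peak, the ratio rounds to the 44% quoted above.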
## Retro
See:
- `tools/retro/README.md` for an overview.
- `tools/retro/examples/get_preprocess_cmd.sh` for an example of common preprocessing arguments.
- `tools/retro/examples/preprocess_data.sh` for an example of how to preprocess data.
- `tools/retro/examples/pretrain_model.sh` for an example of how to pretrain a model.
Retro is a retrieval-enhanced model based on GPT. As described in [Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426), Retro retrieves from a database of document chunks by performing a locality search using a sample's tokens. The retrieval database can be large -- often billions or even trillions of tokens -- and stores factual knowledge more efficiently than storing it implicitly within the network's parameters.
Using Retro requires two steps: 1) preprocessing the retrieval database and pretraining neighbors, and 2) pretraining a model using this data. Please see `tools/retro/README.md` for a detailed overview.
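To make the retrieval idea concrete, here is a toy sketch of chunk-level neighbor lookup. It is not the code under `tools/retro`: it substitutes a bag-of-tokens embedding and brute-force cosine similarity for the learned embeddings and approximate nearest-neighbor index a real pipeline would use, and every function name in it (`chunk_tokens`, `embed`, `similarity`, `retrieve`) is hypothetical.

```python
import math
from collections import Counter


def chunk_tokens(tokens, chunk_len):
    """Split a token list into consecutive fixed-length chunks (a ragged tail is dropped)."""
    num_chunks = len(tokens) // chunk_len
    return [tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]


def embed(chunk):
    """Toy embedding (assumption): L2-normalized token counts, stored sparsely."""
    counts = Counter(chunk)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}


def similarity(a, b):
    """Cosine similarity of two normalized sparse vectors."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def retrieve(query_chunk, db_chunks, k):
    """Return the indices of the k database chunks most similar to the query chunk."""
    query = embed(query_chunk)
    scores = [similarity(query, embed(chunk)) for chunk in db_chunks]
    ranked = sorted(range(len(db_chunks)), key=lambda i: -scores[i])
    return ranked[:k]


# A tiny "retrieval database" of token ids, split into chunks of 4 tokens.
database_tokens = [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 9, 1, 2, 2, 3]
db_chunks = chunk_tokens(database_tokens, chunk_len=4)

# Look up the two nearest database chunks for one chunk of a training sample.
neighbors = retrieve([1, 1, 2, 3], db_chunks, k=2)
print(neighbors)
```

In this toy setup the neighbor indices are computed once per sample chunk, which mirrors the idea of preparing neighbors during preprocessing rather than searching the database during training.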
<!--
## REALM Pipeline
We are working on implementing the [REALM](https://arxiv.org/pdf/2002.08909.pdf) system. The following sections (will) reflect the three stages of training it. For now it's just the ICT code.