* [Activation Checkpointing and Recomputation](#activation-checkpointing-and-recomputation)
* [Distributed Optimizer](#distributed-optimizer)
* [FlashAttention](#flashattention)
* [GPT-3 Example](#gpt-3-example)
* [Retro](#retro)
* [Evaluation and Tasks](#evaluation-and-tasks)
* [GPT Text Generation](#gpt-text-generation)
* [GPT Evaluation](#gpt-evaluation)
...
...
## GPT-3 Example

In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to configure Megatron to run GPT-3 with 175 billion parameters on 1024 GPUs.
With a full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds, resulting in 138 teraFLOPs per GPU, which is 44% of the theoretical peak FLOPs.
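For orientation, here is a minimal sketch of the kind of model and parallelism arguments such a script sets. The exact values live in `examples/pretrain_gpt3_175B.sh`; the 8-way tensor, 16-way pipeline, 8-way data split (8 x 16 x 8 = 1024 GPUs) is an assumption consistent with the GPU count above, not a quote from that script.

```bash
# Hedged sketch: core GPT-3 175B model and parallelism flags.
# The parallelism split below is assumed; see examples/pretrain_gpt3_175B.sh
# for the authoritative configuration.
python pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 16 \
    --num-layers 96 \
    --hidden-size 12288 \
    --num-attention-heads 96 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --micro-batch-size 1 \
    --global-batch-size 1536
# (data paths, tokenizer, optimizer, and learning-rate arguments omitted)
```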
## Retro
See:
- `tools/retro/README.md` for an overview.
- `tools/retro/examples/get_preprocess_cmd.sh` for an example of common preprocessing arguments.
- `tools/retro/examples/preprocess_data.sh` for an example of how to preprocess data.
- `tools/retro/examples/pretrain_model.sh` for an example of how to pretrain a model.
Retro is a retrieval-enhanced model that is based on GPT. As described in [Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426), Retro retrieves from a database of document chunks by performing locality search using a sample's tokens. The retrieval database can be large (often billions or even trillions of tokens) and provides a more efficient mechanism for storing factual knowledge than encoding that knowledge implicitly in the network's parameters.
Using Retro requires two steps: 1) preprocessing the retrieval database and precomputing the retrieval neighbors for the pretraining samples, and 2) pretraining a model using this data. Please see `tools/retro/README.md` for a detailed overview.
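A hedged sketch of that two-step workflow, using the example scripts listed above (the arguments each script expects are described in `tools/retro/examples/get_preprocess_cmd.sh` and `tools/retro/README.md`; whether the scripts run unmodified in your environment is an assumption):

```bash
# Step 1: build the retrieval database and precompute the retrieval
# neighbors for the pretraining corpus.
bash tools/retro/examples/preprocess_data.sh

# Step 2: pretrain a Retro model on the preprocessed data.
bash tools/retro/examples/pretrain_model.sh
```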
<!--
## REALM Pipeline
We are working on implementing the [REALM](https://arxiv.org/pdf/2002.08909.pdf) system. The following sections will reflect the three stages of training it; for now, only the ICT code is included.