Unverified Commit 5d8b9860 authored by Ngo Quang Huy, committed by GitHub

Fix deepspeed docs (#15346)

parent 96161ac4
@@ -31,7 +31,7 @@ won't be possible on a single GPU.
🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:
1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for you type
1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for your type
of integration - just supply your custom config file or use our template and you have nothing else to do. Most of
this document is focused on this feature.
2. If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed
@@ -97,7 +97,7 @@ TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--disable-pip-version-check 2>&1 | tee build.log
```
If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also
If you intend to use NVMe offload you will also need to include `DS_BUILD_AIO=1` in the instructions above (and also
install *libaio-dev* system-wide).
Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
@@ -134,7 +134,7 @@ You can check the archs pytorch was built with using:
python -c "import torch; print(torch.cuda.get_arch_list())"
```
Here is how to find out the arch for one of the installed GPU. For example, for GPU 0:
Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:
```bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
@@ -169,7 +169,7 @@ following:
2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as
documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you.
Therefore, if your original command line looked as following:
Therefore, if your original command line looked as follows:
```bash
python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
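# A hedged sketch of the resulting DeepSpeed launch, assuming the same 2-GPU run
# and a config file named ds_config.json (the `deepspeed` launcher replaces
# `python -m torch.distributed.launch`, and `--deepspeed ds_config.json` is the
# new argument passed to your program):
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json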
@@ -214,7 +214,7 @@ For some practical usage examples, please, see this [post](https://github.com/hu
### Deployment with one GPU
To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as following:
To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as follows:
```bash
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
@@ -560,7 +560,7 @@ Do note that some values, such as `scheduler.params.total_num_steps` are calcula
### ZeRO
[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It
support 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
supports 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
You will find more indepth information in the DeepSpeed documentation.
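Which stage is active is selected through the `stage` field of the `zero_optimization` block in the DeepSpeed config file. As a minimal sketch (the fuller, tuned examples follow in the ZeRO-2 and ZeRO-3 sections below), enabling stage 2 looks like this; replacing `2` with `3` switches to stage 3:
```json
{
    "zero_optimization": {
        "stage": 2
    }
}
```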
@@ -581,7 +581,7 @@ going to use.
#### ZeRO-2 Config
The following is an example configuration for ZeRO stage 2:
The following is an example of configuration for ZeRO stage 2:
```json
{
@@ -604,13 +604,13 @@ The following is an example configuration for ZeRO stage 2:
**Performance tuning:**
- enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`)
- `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
- `"overlap_comm": true` trade offs increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB
footprint (`5e8 x 2Bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do
the same on larger capacity GPU as well, if you're starting to hit OOM.
- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size,
the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size is,
the slower the communication gets, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
important, getting a slightly slower training time could be a good trade.
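To make the advice above concrete, here is a hedged sketch of just the `zero_optimization` fields discussed in this list, with both buckets reduced to `2e8` for a GPU with 8GB or less of RAM (the complete ZeRO-2 example is the config shown earlier in this section):
```json
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8
    }
}
```
With `2e8` the communication buffers need roughly `2e8 x 2Bytes x 2 x 4.5 ≈ 3.6GB`, matching the figure quoted above.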
@@ -619,7 +619,7 @@ The following is an example configuration for ZeRO stage 2:
#### ZeRO-3 Config
The following is an example configuration for ZeRO stage 3:
The following is an example of configuration for ZeRO stage 3:
```json
{
@@ -662,7 +662,7 @@ and its typically accessed much faster than normal CPU memory.
If hitting OOM reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact
on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by
`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so its not additive, its just 2GB total.
`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.
`stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given
time. "reuse distance" is a metric we are using to figure out when will a parameter be used again in the future, and we