# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## NEXT - TBD
### Fixed


### Added


## [0.4.0] - 2021-07-31
### Fixed
- FSDP: fixed the final backward callback in certain activation-checkpointed cases. Before this fix,
        if a model was activation checkpointed in a certain way, the final backward
        callback could fire incorrectly because of how autograd handles reentrant backward
        graphs. With this fix, the final callback is always registered on the outermost
        root FSDP instance (i.e. the outermost backward graph), so it fires reliably.
        This makes FSDP much more robust across different models and activation
        checkpointing setups; a minimal sketch of the affected layout follows. [#753]
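
A minimal sketch of the kind of model layout this affects: an activation-checkpointed block inside a nested FSDP instance, all under one outermost root FSDP instance. The single-rank `gloo` group is only there to keep the snippet self-contained; real training would typically use NCCL with one process per GPU.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

from fairscale.nn import checkpoint_wrapper
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Single-rank process group so the sketch stands alone (illustrative only).
dist.init_process_group("gloo", init_method="tcp://localhost:29500", rank=0, world_size=1)

# Checkpointed block wrapped by an inner (nested) FSDP instance...
block = checkpoint_wrapper(nn.Sequential(nn.Linear(16, 16), nn.ReLU()))
inner = FSDP(block)

# ...under an outermost root FSDP instance, where the final backward
# callback is now always registered.
model = FSDP(nn.Sequential(inner, nn.Linear(16, 4)))

loss = model(torch.randn(8, 16)).sum()
loss.backward()  # with #753 the final callback fires reliably on the root instance
```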

### Added
- FSDP: support gradient accumulation without the `no_sync` context. This is useful
        when training with a smaller number of GPUs while keeping the same overall batch size
        as a larger number of GPUs. Compared with the `no_sync` context, this mode consumes less
        GPU memory but uses more networking bandwidth; see the sketch below. [#752]
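
An illustrative comparison of the two accumulation modes, again on a single-rank `gloo` group purely to keep the sketch self-contained (the model and micro-batches are placeholders):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

dist.init_process_group("gloo", init_method="tcp://localhost:29501", rank=0, world_size=1)

model = FSDP(nn.Linear(16, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
micro_batches = [torch.randn(8, 16), torch.randn(8, 16)]

# (a) Accumulation with `no_sync`: gradients are kept locally and only
#     reduced on the last backward (less communication, more GPU memory).
with model.no_sync():
    model(micro_batches[0]).sum().backward()
model(micro_batches[1]).sum().backward()
opt.step()
opt.zero_grad()

# (b) Accumulation without `no_sync` (#752): simply call backward() per
#     micro-batch; gradients are reduced every time, so memory stays lower
#     at the cost of extra communication.
for mb in micro_batches:
    model(mb).sum().backward()
opt.step()
opt.zero_grad()
```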

## [0.3.9] - 2021-07-26
### Fixed
- FSDP: fixed metadata saving and shard consolidation for MoE cases. When a model has
        shared parameters or mixture-of-experts layers, the handling of state dict
        metadata was broken. This release fixes that. [#746]
- OSS: fixed buckets that would stay in fp16 when `broadcast_fp16` was required [#751]

### Added
- FSDP: better performance; use `_allgather_base` and `_reduce_scatter_base` when they are
        available (PyTorch nightly builds; slated for the 1.10 release) [#729]
- FSDP: prepared FSDP internals for supporting multiple groups of flattened parameters (to support more general optimizations) [#746]

## [0.3.8] - 2021-07-12
### Fixed
- checkpointing: Use dummy tensor to ensure backward pass is called. [#701]
- checkpointing: Ensure internal fwd counter is not incremented in eval mode. [#709]
- checkpointing: Use non-blocking CPU transfer to improve perf. [#719]
- FSDP: Fixed bug where buffers returned in `state_dict()` could still be half precision when `mixed_precision` is set to `True`. [#705]
- FSDP: Ensure `requires_grad` of `FlatParameter` is consistent with `requires_grad` of the original parameters. [#721]
- doc: Thoroughly improved the FSDP documentation. [#711]
- cleanup: Remove examples/ doc from the repo. [#712]
- cleanup: Future proof storage size test. [#735]
- cleanup: Migrate away from legacy torchtext iterators. [#713]
- chore: Updated torch 1.9 to release version. [#717]

### Added
- FSDP: supporting multiple flattened parameter groups [#708] [#711]
- chore: Add the latest numpy version to requirements-test.txt to prevent mypy errors on certain PR commits [#732]

## [0.3.7] - 2021-05-17
### Fixed
- setup.py: hide CUDA extensions behind `BUILD_CUDA_EXTENSIONS` envvar [#634]
- checkpointing: rename and move the `checkpoint_activations` wrapper [#654]
- FSDP: fix `local_state_dict` potentially calling the child class's `state_dict` [#574]
- FSDP: fix extra process groups being created by default; the old behavior could cause excessive GPU memory usage [#678] [#681]
- FSDP: fix forward pass not overlapping compute and allgather [#671]
- FSDP: improved frozen weight support [#657]
- FSDP: workaround AMP autocast cache issue with `clear_autocast_cache` flag [#650]
- FSDP: Rename API arg `cpu_offload` to `move_params_to_cpu` to better reflect functionality. We will deprecate `cpu_offload` in an upcoming release [#676]
- MoE: several fixes [#666] [#667] [#668]
- SDP: re-expose the module property [#647]
- wrap: support wrapping based on `wrapper_config` [#685]

### Added
- FSDP: added `force_input_to_fp32` flag for SyncBatchNorm [#659]
- FSDP: better memory usage for reduce bucket [#633]
- FSDP: added `local_metadata_dict` to save sharding-related information [#683]
- FSDP: added `consolidate_shard_weights` to reconstruct the consolidated (non-sharded) model weights from sharded weights and metadata saved on disk; see the sketch after this list [#683]
- Experimental SyncBatchNorm [#662] [#680]
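
A rough sketch of how `local_state_dict`, `local_metadata_dict`, and `consolidate_shard_weights` fit together: each rank saves its shard plus metadata, and the shards are later merged offline. The keyword names passed to `consolidate_shard_weights` are assumptions and may differ; the single-rank `gloo` group is only there to keep the sketch self-contained.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

dist.init_process_group("gloo", init_method="tcp://localhost:29502", rank=0, world_size=1)

model = FSDP(nn.Linear(16, 4))

# Each rank saves its own shard of the weights plus the sharding metadata.
rank = dist.get_rank()
torch.save(
    {"weights": model.local_state_dict(), "meta": model.local_metadata_dict()},
    f"shard_rank{rank}.pt",
)

# Later, on a single host, load every rank's file (in rank order) and rebuild
# the full, non-sharded state dict from the shards and their metadata.
saved = [torch.load(f"shard_rank{r}.pt") for r in range(dist.get_world_size())]
full_state_dict = FSDP.consolidate_shard_weights(
    shard_weights=[s["weights"] for s in saved],
    shard_metadata=[s["meta"] for s in saved],
)
```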

## [0.3.6] - 2021-04-26
### Added
- FSDP: Consolidate `cpu_adam` optimizer state dict ([#607](https://github.com/facebookresearch/fairscale/pull/607))

### Fixed
- FSDP: handle models with multiple forward passes and checkpointing ([#621](https://github.com/facebookresearch/fairscale/pull/621))
- FSDP & SDP: check before calling `_specify_ddp_gpu_num` ([#626](https://github.com/facebookresearch/fairscale/pull/626))
- FSDP: relaxed the root condition check ([#620](https://github.com/facebookresearch/fairscale/pull/620))
- SDP: removed an assert that did not always hold ([#625](https://github.com/facebookresearch/fairscale/pull/625))
- FSDP: changed FSDP init to bypass process group validation ([#619](https://github.com/facebookresearch/fairscale/pull/619))
- OSS: brought test coverage to 100% ([#618](https://github.com/facebookresearch/fairscale/pull/618))

## [0.3.5] - 2021-04-19
### Added
- [offload] Add API, tutorial, and smaller docstring changes. ([#576](https://github.com/facebookresearch/fairscale/pull/576))

### Fixed
- FSDP: fixed training with frozen weights ([#614](https://github.com/facebookresearch/fairscale/pull/614))
- SDP: privatizing all the things ([#611](https://github.com/facebookresearch/fairscale/pull/611))
- FSDP: Make `_get_default_cuda_device` more robust to modules without params ([#606](https://github.com/facebookresearch/fairscale/pull/606))
- OffloadModel: Restore the previous code path for using OffloadModel without activation checkpointing ([#608](https://github.com/facebookresearch/fairscale/pull/608))

## [0.3.4] - 2021-04-13
### Added
- FSDP: Add no broadcast optim state option ([#560](https://github.com/facebookresearch/fairscale/pull/560))

### Fixed
- ShardedDDP: Properly handle .eval() mode ([#587](https://github.com/facebookresearch/fairscale/pull/587))
- ShardedDDP: Handle model being moved back to CPU prior to state consolidation ([#573](https://github.com/facebookresearch/fairscale/pull/573))
- FSDP: much faster state consolidation ([#595](https://github.com/facebookresearch/fairscale/pull/595))
- FSDP: Add gradient pre-dedivide to prevent overflow with large world sizes ([#565](https://github.com/facebookresearch/fairscale/pull/565))
- Offload: (experimental) Fix activation offloading to CPU ([#588](https://github.com/facebookresearch/fairscale/pull/588))

## [0.3.3] - 2021-04-01
### Added
- FSDP: changed `auto_wrap_bn` utility function so that single FSDP group is optional ([#556](https://github.com/facebookresearch/fairscale/pull/556))
- FSDP: optimizer state load/save ([#537](https://github.com/facebookresearch/fairscale/pull/537))
- FSDP: fix weight init when using apply() ([#543](https://github.com/facebookresearch/fairscale/pull/543))
- Multiprocess Pipe: retired old implementation
- Experimental: xpipe

### Fixed
- ShardedDDP deferred init ([#558](https://github.com/facebookresearch/fairscale/pull/558))

## [0.3.2] - 2021-03-18
### Added
- Experimental: Add spectrain support ([#372](https://github.com/facebookresearch/fairscale/issues/372))
- FSDP: enabled PyTorch SyncBN (no longer asserting) ([#527](https://github.com/facebookresearch/fairscale/issues/527))
- FSDP: added `auto_wrap_bn` utility function ([#531](https://github.com/facebookresearch/fairscale/pull/531))

### Fixed
- OSS: fix a compatibility problem with Lightning regarding the optimizer state dict ([#510](https://github.com/facebookresearch/fairscale/issues/510))
- FSDP: fixed a bug when part of autograd graph is traversed multiple times in mixed precision mode ([#513](https://github.com/facebookresearch/fairscale/pull/513))

## [0.3.1] - 2021-03-09
### Added
- FSDP docs ([#455](https://github.com/facebookresearch/fairscale/issues/455))
- `enable_wrap` and `auto_wrap` APIs ([#446](https://github.com/facebookresearch/fairscale/issues/446))
- Added the `experimental.nn.OffloadModel` API for training large models on a single GPU. ([#432](https://github.com/facebookresearch/fairscale/issues/432))
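
A rough sketch of the OffloadModel usage pattern described above, loosely following the tutorial; the keyword arguments shown are assumptions and a CUDA device is required.

```python
import torch
import torch.nn as nn

from fairscale.experimental.nn import OffloadModel

# A purely sequential model, so it can be sliced into shards that are swapped
# between CPU and GPU during the forward and backward passes.
seq = nn.Sequential(*[nn.Linear(512, 512) for _ in range(8)])

model = OffloadModel(
    model=seq,
    device=torch.device("cuda"),         # device the active slice runs on
    offload_device=torch.device("cpu"),  # device idle slices are parked on
    num_slices=4,                        # how many shards the model is cut into
    checkpoint_activation=False,
    num_microbatches=1,
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(32, 512).to("cuda")).sum()
loss.backward()
optimizer.step()
```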

### Fixed
- OSS: fix a broken state dict when using non-contiguous param groups
- Several SDP fixes around performance and corner cases
- Many FSDP fixes
- AdaScale & SDP/FSDP test added but not officially supported

## [0.3.0] - 2021-02-22
### Added
- FullyShardedDataParallel (FSDP) ([#413](https://github.com/facebookresearch/fairscale/issues/413))
- ShardedDDP fp16 grad reduction option ([#402](https://github.com/facebookresearch/fairscale/issues/402))
- Expose experimental algorithms within the pip package ([#410](https://github.com/facebookresearch/fairscale/pull/410))

### Fixed
- Catch corner case when the model is too small with respect to the world size, and shards are empty ([#406](https://github.com/facebookresearch/fairscale/pull/406))
- Memory leak in `checkpoint_wrapper` ([#412](https://github.com/facebookresearch/fairscale/pull/412))

## [0.1.7] - 2021-02-19
### Fixed
- ShardedDDP and OSS handle model trainability changes during training ([#369](https://github.com/facebookresearch/fairscale/issues/369))
- ShardedDDP state dict load/save bug ([#386](https://github.com/facebookresearch/fairscale/issues/386))
- ShardedDDP handle train/eval modes ([#393](https://github.com/facebookresearch/fairscale/issues/393))
- AdaScale handling custom scaling factors ([#401](https://github.com/facebookresearch/fairscale/issues/401))

### Added
- ShardedDDP manual reduce option for checkpointing ([#389](https://github.com/facebookresearch/fairscale/issues/389))

## [0.1.6] - 2021-02-10
### Added
- Checkpointing model wrapper (#376)
- Faster OSS, flatbuffers (#371)
- Small speedup in OSS clipgradnorm (#363)

### Fixed
- Bug in ShardedDDP with 0.1.5 depending on the init (KeyError / OSS)
- Much refactoring in Pipe (#357, #358, #360, #362, #370, #373)
- Better pip integration / resident pytorch (#375)

## [0.1.5] - 2021-02-03
### Added
- Pytorch compatibility for OSS checkpoints (#310)
- Elastic checkpoints for OSS, world size can vary in between save and loads (#310)
- Tensor views for OSS bucketing, reduced CPU use (#300)
- Bucket calls in ShardedDDP, for faster inter node communications (#327)
- FlattenParamWrapper, which flattens module parameters into a single tensor seamlessly (#317)
- AMPnet experimental support (#304)

### Fixed
- ShardedDDP properly handles device changes via `.to()` (#353)
- Add a new interface for AdaScale, AdaScaleWrapper, which makes it compatible with OSS (#347)


## [0.1.4] - 2021-01-07
### Fixed
- Missing cu files in the pip package


## [0.1.3] - 2021-01-04
### Fixed
- Release numbering within python and from pypi

## [0.1.2] - 2021-01-04
### Added
- AdaScale:
  - Added gradient accumulation feature (#202)
  - Added support for `torch.lr_scheduler` (#229)
  - Added support for `add_param_groups` (#266)
  - Added support for `scale != world_size` (#266)
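
A rough usage sketch for the AdaScale additions listed above. The `num_gradients_to_accumulate` argument, the `gain()` call, and attaching the scheduler directly to the wrapper are assumptions based on the entries; exact names and call patterns may differ.

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

from fairscale.optim import AdaScale

model = nn.Linear(16, 2)

# Wrap the base optimizer; accumulate gradients over 2 micro-batches per step.
optim = AdaScale(SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=2)
scheduler = LambdaLR(optim, lr_lambda=lambda epoch: 0.95 ** epoch)

steps = 0.0
for _ in range(2):                  # two optimizer steps
    for _ in range(2):              # two accumulated micro-batches per step
        loss = model(torch.randn(8, 16)).sum()
        loss.backward()
    steps += optim.gain()           # AdaScale's estimate of effective steps taken
    optim.step()
    optim.zero_grad()
scheduler.step()
```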

### Fixed
- AdaScale: smoothing factor value fixed when using gradient accumulation (#235)
- Pipe: documentation on balancing functions (#243)
- ShardedDDP: handle typical NLP models
- ShardedDDP: better partitioning when finetuning


## [0.1.1] - 2020-12-01
### Fixed
- make sure pip package includes header files (#221)

## [0.1.0] - 2020-12-01
### Added
- ShardedDataParallel with autoreduce (#157)
- cpu support for Pipe (#188)
- ShardedOptim: Distributed Grad Scaler (for torch AMP)  (#182)
- OSS-aware clip grads, bridge sharded states (#167)
- oss: add `rank_local_state_dict` staticmethod (#174)
- support for PyTorch 1.7.0 (#171)
- Add implementation of AdaScale (#139)

### Fixed
- pip package install (#196, #200)

## [0.0.3] - 2020-10-14
### Added
- multi-process pipe

### Fixed
- multiple OSS fixes
- MegaTron+OSS DDP fix

## [0.0.2] - 2020-08-28
### Added
- add a DDP that works with OSS using `reduce()` instead of `all_reduce()` (#19)
- support for PyTorch v1.6
- add mixed precision Adam (#40)
- Adam optimizer state scaling (#44)

### Fixed
- properly restore a sharded optim state (#39)
- OSS restore state to proper device (#46)
- optim/oss: support optimizers with additional step kwargs (#53)
- optim/oss: fix state cast (#56)
- fix eval for `oss_ddp` (#55)
- optim/oss: work correctly with LRScheduler (#58)

## [0.0.1] - 2020-07-31
- Initial release.