# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## NEXT - TBD
### Fixed

### Added

## [0.3.9] - 2021-07-26
### Fixed
- FSDP: fixed metadata saving and shard consolidation for MoE cases. When a model has
        shared parameters or mixture-of-experts layers, the handling of state dict
        metadata was broken; this release fixes that. [#746]
- OSS: fixed buckets that would stay in fp16 when `broadcast fp16` was required (#751)

### Added
- FSDP: better performance; use `_allgather_base` and `_reduce_scatter_base` when they are
        available from the PyTorch nightly builds (they will be in the 1.10 release) [#729]
- FSDP: prepared FSDP internals for supporting multiple groups of flattened parameters (to support more general optimizations) [#746]

## [0.3.8] - 2021-07-12
### Fixed
- checkpointing: Use dummy tensor to ensure backward pass is called. [#701]
- checkpointing: Ensure internal fwd counter is not incremented in eval mode. [#709]
- checkpointing: Use non-blocking CPU transfer to improve perf. [#719]
- FSDP: Fixed bug where buffers returned in `state_dict()` could still be half precision when `mixed_precision` is set to `True`. [#705]
- FSDP: Ensure requires_grad of FlatParameter is consistent with requires_grad of the original parameters. [#721]
- doc: Thoroughly improved the doc for FSDP. [#711]
- cleanup: Remove examples/ doc from the repo. [#712]
- cleanup: Future proof storage size test. [#735]
- cleanup: Migrate away from legacy torchtext iterators. [#713]
- chore: Updated torch 1.9 to release version. [#717]

### Added
- FSDP: supporting multiple flattened parameter groups [#708] [#711]
- chore: Add the latest numpy version to requirements-test.txt to prevent mypy errors on certain PR commits [#732]

## [0.3.7] - 2021-05-17
### Fixed
- setup.py: hide CUDA extensions behind `BUILD_CUDA_EXTENSIONS` envvar [#634]
- checkpointing: rename and move the `checkpoint_activations` wrapper [#654]
- FSDP: fix `local_state_dict` potentially calling the child class's `state_dict` [#574]
- FSDP: fix extra process groups being created by default. Old behavior can cause excessive GPU memory usage [#678] [#681]
- FSDP: fix forward pass not overlapping compute and allgather [#671]
- FSDP: improved frozen weight support [#657]
- FSDP: workaround AMP autocast cache issue with `clear_autocast_cache` flag [#650]
- FSDP: Rename API arg `cpu_offload` to `move_params_to_cpu` to better reflect functionality. We will deprecate `cpu_offload` in an upcoming release [#676]
- MoE: several fixes [#666] [#667] [#668]
- SDP: re-expose the module property [#647]
- wrap: support wrapping based on `wrapper_config` [#685]

### Added
- FSDP: added `force_input_to_fp32` flag for SyncBatchNorm [#659]
- FSDP: better memory usage for reduce bucket [#633]
- FSDP: added `local_metadata_dict` to save sharding-related information [#683]
- FSDP: added `consolidate_shard_weights` to reconstruct the consolidated (non-sharded) model weights from sharded weights and metadata saved on disk (see the sketch after this list) [#683]
- Experimental SyncBatchNorm [#662] [#680]
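
For context, a minimal sketch of how the sharded-save and consolidation pieces above fit together. The file names, the `world_size` value and the exact argument order are illustrative assumptions, not taken verbatim from the fairscale docs:

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# During training, every rank saves its own shard plus the matching metadata
# (fsdp_model is the FSDP-wrapped model, rank the distributed rank):
#   torch.save(fsdp_model.local_state_dict(), f"shard_{rank}.pt")
#   torch.save(fsdp_model.local_metadata_dict(), f"metadata_{rank}.pt")

# Offline, a single process can then rebuild the full (non-sharded) state dict:
world_size = 4  # hypothetical number of ranks used during training
shard_weights = [torch.load(f"shard_{r}.pt") for r in range(world_size)]
shard_metadata = [torch.load(f"metadata_{r}.pt") for r in range(world_size)]
full_state_dict = FSDP.consolidate_shard_weights(shard_weights, shard_metadata)
torch.save(full_state_dict, "consolidated.pt")
```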

## [0.3.6] - 2021-04-26
### Added
- FSDP: Consolidate `cpu_adam` optimizer state dict ([#607](https://github.com/facebookresearch/fairscale/pull/607))

### Fixed
- FSDP: handle models with multiple forward passes and checkpointing ([#621](https://github.com/facebookresearch/fairscale/pull/621))
- FSDP & SDP: check before calling `_specify_ddp_gpu_num` ([#626](https://github.com/facebookresearch/fairscale/pull/626))
- FSDP: relax checking root condition ([#620](https://github.com/facebookresearch/fairscale/pull/620))
- SDP: removed an assert that was not always accurate ([#625](https://github.com/facebookresearch/fairscale/pull/625))
- FSDP: changed FSDP init to bypass process group validation ([#619](https://github.com/facebookresearch/fairscale/pull/619))
- OSS: brought test coverage to 100% ([#618](https://github.com/facebookresearch/fairscale/pull/618))

## [0.3.5] - 2021-04-19
### Added
- [offload] Add API, tutorial and smaller docstring changes. ([#576](https://github.com/facebookresearch/fairscale/pull/576))
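
A rough sketch of the OffloadModel API this tutorial introduces; the argument names shown (`device`, `offload_device`, `num_slices`, `checkpoint_activation`) follow the tutorial of this era and may have changed since:

```python
import torch
import torch.nn as nn
from fairscale.experimental.nn.offload import OffloadModel

# The model to offload has to be an nn.Sequential so it can be split into
# slices that are moved between CPU and GPU on demand.
seq = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

model = OffloadModel(
    model=seq,
    device=torch.device("cuda"),         # device that runs the active slice
    offload_device=torch.device("cpu"),  # device where idle slices are parked
    num_slices=3,                        # number of slices the model is cut into
    checkpoint_activation=True,          # combine offloading with activation checkpointing
)

out = model(torch.randn(8, 512, device="cuda"))
out.sum().backward()
```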

### Fixed
- FSDP: fixed training with frozen weights ([#614](https://github.com/facebookresearch/fairscale/pull/614))
- SDP: privatizing all the things ([#611](https://github.com/facebookresearch/fairscale/pull/611))
- FSDP: Make `_get_default_cuda_device` more robust to modules without params ([#606](https://github.com/facebookresearch/fairscale/pull/606))
- OffloadModel: Add prev codepath of using OffloadModel without activation checkpointing ([#608](https://github.com/facebookresearch/fairscale/pull/608))

## [0.3.4] - 2021-04-13
### Added
- FSDP: Add no broadcast optim state option ([#560](https://github.com/facebookresearch/fairscale/pull/560))

### Fixed
- ShardedDDP: Properly handle .eval() mode ([#587](https://github.com/facebookresearch/fairscale/pull/587))
- ShardedDDP: Handle model being moved back to CPU prior to state consolidation ([#573](https://github.com/facebookresearch/fairscale/pull/573))
- FSDP: much faster state consolidation ([#595](https://github.com/facebookresearch/fairscale/pull/595))
- FSDP: Add gradient pre-dedivide to prevent overflow with large world sizes ([#565](https://github.com/facebookresearch/fairscale/pull/565))
- Offload: (experimental) Fix activation offloading to CPU ([#588](https://github.com/facebookresearch/fairscale/pull/588))

## [0.3.3] - 2021-04-01
### Added
- FSDP: changed `auto_wrap_bn` utility function so that a single FSDP group is optional ([#556](https://github.com/facebookresearch/fairscale/pull/556))
- FSDP: optimizer state load/save ([#537](https://github.com/facebookresearch/fairscale/pull/537))
- FSDP: fix weight init when using apply() ([#543](https://github.com/facebookresearch/fairscale/pull/543))
- Multiprocess Pipe: retired old implementation
- Experimental: xpipe

### Fixed
- ShardedDDP deferred init ([#558](https://github.com/facebookresearch/fairscale/pull/558))

## [0.3.2] - 2021-03-18
### Added
- Experimental: Add spectrain support ([#372](https://github.com/facebookresearch/fairscale/issues/372))
- FSDP: enabled PyTorch SyncBN (no longer asserts) ([#527](https://github.com/facebookresearch/fairscale/issues/527))
- FSDP: added `auto_wrap_bn` utility function ([#531](https://github.com/facebookresearch/fairscale/pull/531))

### Fixed
- OSS: fix a compatibility problem with Lightning w.r.t. the optimizer state dict ([#510](https://github.com/facebookresearch/fairscale/issues/510))
- FSDP: fixed a bug when part of the autograd graph is traversed multiple times in mixed precision mode ([#513](https://github.com/facebookresearch/fairscale/pull/513))

## [0.3.1] - 2021-03-09
### Added
- FSDP docs ([#455](https://github.com/facebookresearch/fairscale/issues/455))
- `enable_wrap` and `auto_wrap` APIs ([#446](https://github.com/facebookresearch/fairscale/issues/446)); see the sketch after this list
- Added `experimental.nn.OffloadModel` API for training large models on a single GPU. ([#432](https://github.com/facebookresearch/fairscale/issues/432))
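
A hedged sketch of the `enable_wrap` / `auto_wrap` helpers added above. Keyword names such as `wrapper_cls` and `min_num_params` changed across releases, so treat them as illustrative rather than the exact 0.3.1 signatures:

```python
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import auto_wrap, enable_wrap

big_model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])

# Inside the context, auto_wrap recursively wraps large submodules with the
# configured wrapper class, forwarding the shared kwargs to each wrapper.
with enable_wrap(wrapper_cls=FSDP, flatten_parameters=True):
    sharded = auto_wrap(big_model, min_num_params=1_000_000)

sharded = FSDP(sharded)  # wrap the root module explicitly
```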

### Fixed
- OSS: fix a broken state dict when using non-contiguous param groups
- Several SDP fixes around performance and corner cases
- Many FSDP fixes
- AdaScale & SDP/FSDP test added but not officially supported

## [0.3.0] - 2021-02-22
### Added
- FullyShardedDataParallel (FSDP) ([#413](https://github.com/facebookresearch/fairscale/issues/413)); see the sketch after this list
- ShardedDDP fp16 grad reduction option ([#402](https://github.com/facebookresearch/fairscale/issues/402))
- Expose experimental algorithms within the pip package ([#410](https://github.com/facebookresearch/fairscale/pull/410))
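
For context, a minimal sketch of what using the new FSDP wrapper looks like. It assumes `torch.distributed` is already initialized (e.g. via `torchrun`) and one CUDA device per rank:

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = FSDP(nn.Linear(1024, 1024).cuda())  # parameters and grads are sharded across ranks
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()   # gradients are reduce-scattered inside backward
optimizer.step()  # each rank updates only the shard it owns
```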

### Fixed
- Catch corner case when the model is too small with respect to the world size, and shards are empty ([#406](https://github.com/facebookresearch/fairscale/pull/406))
- Memory leak in `checkpoint_wrapper` ([#412](https://github.com/facebookresearch/fairscale/pull/412))

## [0.1.7] - 2021-02-19
### Fixed
- ShardedDDP and OSS handle model trainability changes during training ([#369](https://github.com/facebookresearch/fairscale/issues/369))
- ShardedDDP state dict load/save bug ([#386](https://github.com/facebookresearch/fairscale/issues/386))
- ShardedDDP handle train/eval modes ([#393](https://github.com/facebookresearch/fairscale/issues/393))
- AdaScale handling custom scaling factors ([#401](https://github.com/facebookresearch/fairscale/issues/401))

### Added
- ShardedDDP manual reduce option for checkpointing ([#389](https://github.com/facebookresearch/fairscale/issues/389))

## [0.1.6] - 2021-02-10
### Added
- Checkpointing model wrapper (#376); see the sketch after this list
- Faster OSS, flatbuffers (#371)
- Small speedup in OSS `clip_grad_norm` (#363)
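
A small sketch of the checkpointing wrapper added above; note the import path has moved between releases (`fairscale.nn.misc` in early versions, `fairscale.nn` / `fairscale.nn.checkpoint` later), so adjust to your installed version:

```python
import torch
import torch.nn as nn
from fairscale.nn.misc import checkpoint_wrapper

# Wrapped blocks drop their activations in forward and recompute them in
# backward, trading compute for memory.
block = checkpoint_wrapper(
    nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
)

x = torch.randn(4, 1024, requires_grad=True)
block(x).sum().backward()
```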

### Fixed
- Bug in ShardedDDP with 0.1.5 depending on the init (KeyError / OSS)
- Much refactoring in Pipe (#357, #358, #360, #362, #370, #373)
- Better pip integration / resident pytorch (#375)

## [0.1.5] - 2021-02-03
### Added
- PyTorch compatibility for OSS checkpoints (#310)
- Elastic checkpoints for OSS, world size can vary between saves and loads (#310)
- Tensor views for OSS bucketing, reduced CPU use (#300)
- Bucket calls in ShardedDDP, for faster inter node communications (#327)
- FlattenParamWrapper, which flattens module parameters into a single tensor seamlessly (#317)
- AMPnet experimental support (#304)

### Fixed
- ShardedDDP properly handles device changes via `.to()` (#353)
- Add a new interface for AdaScale, AdaScaleWrapper, which makes it compatible with OSS (#347)


## [0.1.4] - 2021-01-07
### Fixed
- Missing .cu files in the pip package


## [0.1.3] - 2021-01-04
### Fixed
- Release numbering within python and from pypi

## [0.1.2] - 2021-01-04
### Added
- AdaScale:
  - Added gradient accumulation feature (#202)
  - Added support for `torch.lr_scheduler` (#229)
  - Added support for `add_param_groups` (#266)
  - Added support for `scale != world_size` (#266)
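
A single-process sketch of AdaScale with the gradient accumulation feature listed above. The `num_gradients_to_accumulate` keyword reflects this era of the API but treat the exact name as an assumption; in practice AdaScale is meant to be used with distributed data parallel training:

```python
import torch
import torch.nn as nn
from fairscale.optim import AdaScale

model = nn.Linear(32, 2)
optimizer = AdaScale(
    torch.optim.SGD(model.parameters(), lr=0.1),
    num_gradients_to_accumulate=4,  # four backward passes count as one effective batch
)

data = torch.randn(16, 8, 32)
for step, batch in enumerate(data):
    model(batch).sum().backward()
    if (step + 1) % 4 == 0:   # step only when the accumulation window is full
        optimizer.step()
        model.zero_grad()     # clear accumulated gradients before the next window
```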

### Fixed
- AdaScale: smoothing factor value fixed when using gradient accumulation (#235)
- Pipe: documentation on balancing functions (#243)
- ShardedDDP: handle typical NLP models
- ShardedDDP: better partitioning when finetuning


## [0.1.1] - 2020-12-01
### Fixed
- make sure pip package includes header files (#221)

## [0.1.0] - 2020-12-01
### Added
- ShardedDataParallel with autoreduce (#157); see the sketch after this list
- cpu support for Pipe (#188)
- ShardedOptim: Distributed Grad Scaler (for torch AMP) (#182)
- OSS-aware clip grads, bridge sharded states (#167)
- oss: add `rank_local_state_dict` staticmethod (#174)
- support for PyTorch 1.7.0 (#171)
- Add implementation of AdaScale (#139)
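
A hedged sketch pairing the OSS sharded optimizer with ShardedDataParallel; it assumes `torch.distributed` is already initialized (one process per GPU), and module paths and constructor arguments follow later releases, so treat them as illustrative:

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim.oss import OSS

model = nn.Linear(1024, 1024).cuda()

# OSS shards the optimizer state across ranks instead of replicating it.
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.01, momentum=0.9)

# ShardedDDP reduces each gradient only to the rank that owns its optimizer shard.
model = ShardedDDP(model, optimizer)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
```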

### Fixed
- pip package install (#196, #200)

## [0.0.3] - 2020-10-14
### Added
- multi-process pipe

### Fixed
- multiple OSS fixes
- MegaTron+OSS DDP fix

## [0.0.2] - 2020-08-28
### Added
- add a DDP that works with OSS, using `reduce()` rather than `all_reduce()` (#19)
- support for PyTorch v1.6
- add mixed precision Adam (#40)
- Adam optimizer state scaling (#44)

### Fixed
- properly restore a sharded optim state (#39)
- OSS restore state to proper device (#46)
- optim/oss: support optimizers with additional step kwargs (#53)
- optim/oss: fix state cast (#56)
- fix eval for `oss_ddp` (#55)
- optim/oss: work correctly with LRScheduler (#58)

## [0.0.1] - 2020-07-31
- Initial release.