# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.11] - TBD

- Cleaned up some old issues and fixed a few bugs in FSDP
- Removed SSD offload to simplify the FSDP code

## [0.4.8]/[0.4.9]/[0.4.10]

### Added

- FSDP: Add pickle/unpickle support for SsdTensorHandle (and derived classes).
  Verified that FSDP models with ssd_offload enabled can correctly call model.state_dict()
  and model.load_state_dict(...), and thus successfully checkpoint and recover parameters
  stored as SsdFlatParameters (see the sketch at the end of this section). [#964]
- FSDP SSD Offload: [#974]
   * Add a __setattr__ function to SsdTensorHandle/SsdFlatParameter to capture when .data is overridden
     and perform the necessary checks/updates before letting the tensor metadata be updated.
   * There was a bug in pytorch core that caused re-assigning a TensorSubclass's .data to
     disable the __torch_dispatch__ mechanism and effectively revert it back to a normal
     tensor. Since the ssd_offload feature uses __torch_dispatch__ extensively and FSDP overrides
     .data, ssd_offload can now only be used with pytorch version 1.12.0 or later
     (currently the pytorch-nightly release). The pytest tests are disabled, and trying to import
     ssd_offload or use FSDP with ssd_offload enabled will raise an ImportError.
   * Split the storage_state ON_CPU into ON_CPU_CLEAN and ON_CPU_DIRTY. ON_CPU_CLEAN
     indicates that the .tensor value and the value stored on disk are identical, so no
     write is needed when calling .to_file(). ON_CPU_DIRTY indicates that .tensor does not
     match the value stored on disk, and a write is necessary. This is disabled in FSDP, as
     FSDP does not currently flush variables to disk as soon as the optimizer modifies values.
   * Fix detection in SsdTensorHandle's __torch_dispatch__ of when the handle is used in an
     in-place operator or as an output, so that mark_dirty() is called.
   * Fix grad_fn on SsdFlatParameterViews so that each ssd_flat_param view's grad_fn points
     to a split and view of the ssd_flat_parameter; when X.backward() is called, ssd_flat_param.grad
     is properly updated in the backward pass.
   * Add unit tests to verify grad_fn and backward hooks are called appropriately on
     SsdFlatParameters/SsdFlatParameterViews.
   * Implement an SsdFlatParameter.from_tensor() direct_to_file option. This allows creating
     an SsdFlatParameter and writing it directly to disk, rather than creating a new tensor
     first. This prevents another copy of the tensors from being instantiated and can save memory when
     creating very large SsdFlatParameters.
   * Add SsdFlatParameterViewProperty for overriding a parameter in an nn.Module,
     similar to pytorch core's parameterization code path. This allows SsdFlatParameterViews
     to be updated as SsdFlatParameter.data is overridden, by replacing Layer.weight with a
     property whose getter returns ssd_flat_param.views[view_id].
   * Added an example, SsdFlatParameterViewParameterization, showing how the existing parameterization
     mechanism could be used instead of SsdFlatParameterViewProperty. This method, however, results
     in memory inefficiencies because parameterization always keeps a copy of the original
     parameter internally.
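
A minimal sketch (not from the release notes) of the checkpoint/restore flow described above. The `OffloadConfig` import location and the `offload_config`/`offload_type` keyword names are assumptions used for illustration, and, per the note above, ssd_offload additionally requires PyTorch 1.12.0 or later.

```python
# Hypothetical sketch: checkpoint an FSDP model with ssd_offload enabled.
# OffloadConfig's import location and the offload_config/offload_type keyword
# names are assumptions; the point is the state_dict()/load_state_dict() flow.
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.data_parallel import OffloadConfig  # assumed location

model = FSDP(
    nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 8)),
    flatten_parameters=True,
    offload_config=OffloadConfig(offload_type="ssd_offload", dir="/tmp/ssd_params"),
)

# Pickling support for SsdTensorHandle means the state dict round-trips.
torch.save(model.state_dict(), "fsdp_ssd_ckpt.pt")
model.load_state_dict(torch.load("fsdp_ssd_ckpt.pt"))
```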

### Fixed
- Fixed some bugs in FSDP related to supporting data2vec EMA modules.

## [0.4.6] - 2022-03-08

### Added
- CosFace's LMCL was added to MEVO. This is a loss function suitable
  for a large number of prediction target classes. It adds normalization,
  class separation margins, and feature vector scaling to the standard
  output projection + cross-entropy loss. MEVO supports this with its
  memory-saving techniques so that peak GPU memory is much reduced. [#916]
- FSDP: Added a skip_params_check_for_root flag to default_auto_wrap_policy which,
  if set, wraps the root module regardless of how many unwrapped params are left
  after its children are wrapped (see the sketch after this list). [#930]
- FSDP: Add support for saving optimizer state when using expert replicas with FSDP.
- OSS: Add a new arg "forced_broadcast_object" to OSS __init__ to apply "_broadcast_object"
  for rebuilding the sharded optimizer. [#937]
- FSDP: Add an arg disable_reshard_on_root for FSDP __init__ [#878]
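
A minimal sketch of how the new skip_params_check_for_root flag might be combined with default_auto_wrap_policy; the surrounding wrap utilities (enable_wrap, auto_wrap, min_num_params) are existing fairscale APIs, but treat the exact keyword combination as illustrative rather than canonical.

```python
# Illustrative: force-wrap the root module via skip_params_check_for_root.
import functools

import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import auto_wrap, default_auto_wrap_policy, enable_wrap

policy = functools.partial(
    default_auto_wrap_policy,
    min_num_params=1_000_000,         # wrap children above this size
    skip_params_check_for_root=True,  # new: always wrap the root module
)

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
with enable_wrap(wrapper_cls=FSDP):
    sharded_model = auto_wrap(model, auto_wrap_policy=policy)
```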

### Fixed
- FSDP: fixed handling of internal states in the state_dict and load_state_dict
  functions so that they don't change the lazy init state if training hasn't started. [#922]
- FSDP: added support for optimizer state handling when some of the parameters are
  not used. An example is a model with an EMA copy that doesn't get trained
  but still needs to be sharded. [#922]

## [0.4.5] - 2022-01-14

### Added
- Layer-wise Gradient Scaling [new feature][experimental]: Layer-wise gradient
scaling helps overcome gradient overflow issues. When used in conjunction with
mixed precision, it enables training larger models and makes the training
process more stable, especially in deep networks. [#879]
- FSDP: Added a process_group_reduce_scatter parameter to allow users to pass in the process group that is used for the reduce scatter operation. [#897]
- FSDP: Added a state_dict_on_rank_0_only flag that lets the user return the full state dict on rank 0 and an empty dict on all other ranks, to prevent OOM (see the sketch below). [#844]
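
A minimal sketch of the state_dict_on_rank_0_only flag, assuming it is passed at FSDP construction time (`my_module` is a placeholder):

```python
# Illustrative: avoid materializing the full state dict on every rank.
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = FSDP(my_module, state_dict_on_rank_0_only=True)  # my_module: placeholder

state = model.state_dict()  # full dict on rank 0, empty dict on other ranks
if dist.get_rank() == 0:
    torch.save(state, "checkpoint.pt")
```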

### Changed
- FSDP: Enabled the reduce_scatter operation to overlap with the all_gather and computation streams in backward propagation.
This change increases FSDP's throughput. To roll back this change, pass ProcessGroupName.default to the process_group_reduce_scatter parameter of FSDP. [#897]

### Fixed
- FSDP: Fixed an issue where the padding size of the input tensor of the reduce scatter was not equal to the reduce scatter process group size. [#907]

## [0.4.4] - 2021-12-21

### Fixed
- Inf/nan check on all gradient tensors in ShardedGradScaler [#890]
- Allow user to suppress warning on trainable_parameter in sharded_ddp.py [#886]
- KeyError in certain FSDP wrapping scenarios [#881]

### Added
- Support eval mode in MEVO [#884]
- A new benchmark for the MOE model [#866]

### Changed
- Fixed a corner case of FSDP init order and losing one of the flags [#880]
- FSDP: Added basic training support for SSD Offload; it currently only supports flattened parameters. Renamed OffloadConfig.ssd_filepath_dir to the more generic OffloadConfig.dir. SSD Offload remains an experimental feature. [#887]

## [0.4.3] - 2021-11-18

### Added
- Sharded Grad Scaler works with cpu offload in mixed and full precision (see the sketch after this list). [#831]
- API for specifying SSD offload for params with FSDP. You can use an OffloadConfig to specify the type of offload
  and the file path for storing params on SSD. Note: This is an experimental feature. [#855]
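
A minimal sketch of combining the Sharded Grad Scaler with FSDP cpu offload and mixed precision, as described in the first entry above; `my_module`, `loader`, and the optimizer settings are placeholders.

```python
# Illustrative: ShardedGradScaler mirrors torch.cuda.amp.GradScaler, but
# understands gradients sharded by FSDP.
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.optim.grad_scaler import ShardedGradScaler

model = FSDP(my_module, mixed_precision=True, move_params_to_cpu=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = ShardedGradScaler()

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```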

### Changed
- MEVO: fixed eval and checkpointing code paths [#851]
- Cleanup: Moving forward we will be testing all of our code with Python 3.9.7, CUDA 11.2 and the following three versions of PyTorch [#847]:
  - the most recent stable version
  - the most recent LTS version
  - a recent nightly build

## [0.4.2] - 2021-11-08
### Fixed
- FSDP: Fixed a pre-backward hook bug for certain types of models and FSDP configs. [#833]

### Added
- FSDP: Add support for SSD offload for eval workloads. This is a new experimental feature and should be
        used with caution.
- LayerwiseMemoryTracker [feature][experimental]: This is a new experimental tool to help track, visualize, and suggest fixes for memory issues occurring during the forward/backward pass of your models. [#808]
- FSDP: limited support for shared weights between FSDP wrappers. This allows large parameter
          and gradient memory to be sharded despite being needed from different layers due to
          weight sharing. [#836]
- OffloadModel: Fix node names to enable correct sharding in auto_shard.py [#830]
- OSS: Relaxed speed and memory constraints on OSS golden data due to a regression when we bumped up the
       PyTorch version to 1.9. [#828] [#825]
- Chore: Update PyTorch version that we run benchmarks with. [#823]
- Chore: Update PyTorch version that we run test with. [#809]
- OffloadModel: Extend auto_shard.py to allow dealing with conditionals automatically when tracing with
                torch.fx. This will work for most cases except when the conditional is part of the root instance. [#817]
- [MEVO]: a custom layer to help big-vocab trainings. Experimental. Docs are still TBD. [#840]
- SlowMoDistributedDataParallel [feature][experimental]: This is a distributed training wrapper that should be useful on clusters with slow network interconnects (e.g. Ethernet). It improves performance compared to Distributed Data Parallel in such clusters. [#378]

## [0.4.1] - 2021-09-17
### Fixed
- FSDP: We don't attach post-backward hooks for params that don't require grad. However, in the hook
        triggered after the post-backward hook, we assert on the POST_BACKWARD state, which can only
        be set in the post-backward hook. Modified the assert to account for the fact that the root
        FSDP module can have child modules with params that require grad while itself containing params
        that don't require grad, and hence could fail the previous assert. [#761]
- FSDP: Fixed a bug where, when multiple backward passes are called within an iteration, parameters'
        sharding state might be incorrect. [#775]
- activation checkpoint: Ensure outputs of checkpointed modules only require grad if either
                         the input requires grad or if the parameters require grad. [#787]

- OSS: fix the broadcast_fp16 option, broken after a refactor; this flag was doing nothing (bugfix). [#795]
- OSS: update the default device when refreshing the params, so that moving the model to GPU after
       the OSS wrap will not trigger warnings and slow down the jobs (ease of use). [#786]

### Added
- FSDP: Added support for returning the original names of parameters when `named_parameters` is called on
        the module. To retrieve the original names of the parameters along with the params, you need to
        call `named_parameters` under the `summon_full_params` context when using flattened params or original
        params. If you are using original params (i.e. flatten_params=False), calling `named_parameters` outside
        of the `summon_full_params` context will still return the original param names along with the local shards
        (see the sketch at the end of this section). [#755]
- FSDP: Ensure gradient reduction accumulates into the unsharded gradient tensor
        within a backwards pass. This matters when an FSDP module is called
        multiple times within a forward pass, and reduction is not deferred
        using activation checkpoint forward counters, bucketing or some other
        mechanism. [#784]
- activation checkpoint: Added a context manager to disable checkpoint in case the same wrapped module
                         needs to be checkpointed and not checkpointed in different parts of
                         the module forward pass. [#772]
- FSDP: Added a toggle via the environment variable ENABLE_NCCL_BASE_COLLECTIVES=[0,1] to allow users
        to enable/disable the new nccl base collectives. By default, the new nccl base collectives
        are enabled. [#801]
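
A minimal sketch of retrieving original parameter names with flattened params, per the `named_parameters` entry above (`my_module` is a placeholder):

```python
# Illustrative: original parameter names are available under summon_full_params.
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = FSDP(my_module, flatten_parameters=True)  # my_module: placeholder

with model.summon_full_params():
    for name, param in model.named_parameters():
        print(name, tuple(param.shape))
```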

## [0.4.0] - 2021-07-31
### Fixed
- FSDP: fixed the final backward callback in certain activation-checkpointed cases. Before this fix,
        if a model is activation checkpointed in a certain way, the final backward
        callback could fire incorrectly. That's due to autograd and reentrant backward
        graphs. With this fix, the final callback is always registered on the outer-most
        root FSDP instance (i.e. the outer-most backward graph), which results
        in it firing reliably. This makes FSDP much more robust with respect to different
        models and activation checkpoints. [#753]

### Added
- FSDP: support gradient accumulation without the `no_sync` context (see the sketch below). This is useful
        when training with a smaller number of GPUs at the same overall batch size as a larger
        number of GPUs. Compared with the `no_sync` context, this mode consumes less
        GPU memory but uses more network bandwidth. [#752]
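
A minimal sketch of gradient accumulation in this mode, without the `no_sync` context; `my_module`, `loader`, `loss_fn`, and `accumulation_steps` are placeholders.

```python
# Illustrative: gradients are reduce-scattered on every backward pass
# (less GPU memory, more network traffic than the no_sync approach).
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = FSDP(my_module)  # my_module: placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # no `with model.no_sync():` wrapper needed
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```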

## [0.3.9] - 2021-07-26
### Fixed
- FSDP: fixed metadata saving and shard consolidation for MoE cases. When a model has
        shared parameters or mixture of expert layers, the handling of state dict
        metadata was broken. This release fixes that. [#746]
- OSS: fixed the buckets which would stay in fp16 if `broadcast fp16` was required [#751]

### Added
- FSDP: better performance; use `_allgather_base` and `_reduce_scatter_base` when they are
        available from pytorch nightly version (will be in 1.10 releases) [#729]
- FSDP: prepared FSDP internals for supporting multiple groups of flatten parameters (to support more general optimization) [#746]

## [0.3.8] - 2021-07-12
### Fixed
- checkpointing: Use dummy tensor to ensure backward pass is called. [#701]
- checkpointing: Ensure internal fwd counter is not incremented in eval mode. [#709]
- checkpointing: Use non-blocking CPU transfer to improve perf. [#719]
- FSDP: Fixed bug where buffers returned in `state_dict()` could still be half precision when `mixed_precision` is set to `True`. [#705]
- FSDP: Ensure requires_grad of FlatParameter is consistent with requires_grad of the original parameters. [#721]
- doc: Thoroughly improved the doc for FSDP. [#711]
- cleanup: Remove examples/ doc from the repo. [#712]
- cleanup: Future proof storage size test. [#735]
- cleanup: Migrate away from legacy torchtext iterators. [#713]
- chore: Updated torch 1.9 to release version. [#717]

### Added
- FSDP: supporting multiple flatten parameter groups [#708] [#711]
- chore: Add the latest numpy version to requirements-test.txt to prevent mypy errors on certain PR commits [#732]

## [0.3.7] - 2021-05-17
### Fixed
- setup.py: hide CUDA extensions behind `BUILD_CUDA_EXTENSIONS` envvar [#634]
- checkpointing: rename and move the `checkpoint_activations` wrapper [#654]
- FSDP: fix `local_state_dict` potentially called child class's `state_dict` [#574]
- FSDP: fix extra process groups being created by default. Old behavior can cause excessive GPU memory usage [#678] [#681]
- FSDP: fix forward pass not overlapping compute and allgather [#671]
- FSDP: improved frozen weight support [#657]
- FSDP: workaround AMP autocast cache issue with `clear_autocast_cache` flag [#650]
- FSDP: Rename API arg `cpu_offload` to `move_params_to_cpu` to better reflect functionality. We will deprecate `cpu_offload` in an upcoming release [#676]
- MoE: several fixes [#666] [#667] [#668]
- SDP: re-expose the module property [#647]
- wrap: support wrapping based on `wrapper_config` [#685]

### Added
- FSDP: added `force_input_to_fp32` flag for SyncBatchNorm [#659]
- FSDP: better memory usage for reduce bucket [#633]
- FSDP: added `local_metadata_dict` to save sharding-related information [#683]
- FSDP: added `consolidate_shard_weights` to reconstruct the consolidated (non-sharded) model weights from saved sharded weights and metadata on disk (see the sketch after this list) [#683]
- Experimental SyncBatchNorm [#662] [#680]
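
A minimal sketch of the new consolidation helpers, assuming `consolidate_shard_weights` accepts per-rank lists of local state dicts and metadata dicts (`model`, `rank`, and `world_size` are placeholders):

```python
# Illustrative: save per-rank shards plus metadata, then rebuild the full
# (non-sharded) weights offline, without a process group.
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# On each rank: save the sharded weights and the sharding metadata.
torch.save(model.local_state_dict(), f"shard_{rank}.pt")
torch.save(model.local_metadata_dict(), f"metadata_{rank}.pt")

# Offline: reconstruct the consolidated weights from all shards.
shard_weights = [torch.load(f"shard_{r}.pt") for r in range(world_size)]
shard_metadata = [torch.load(f"metadata_{r}.pt") for r in range(world_size)]
full_state_dict = FSDP.consolidate_shard_weights(shard_weights, shard_metadata)
```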

## [0.3.6] - 2021-04-26
### Added
- FSDP: Consolidate cpu\_adam optimizer state dict ([#607](https://github.com/facebookresearch/fairscale/pull/607))

### Fixed
- FSDP: handle model with multiple forward pass and checkpoint ([#621](https://github.com/facebookresearch/fairscale/pull/621))
- FSDP & SDP: check before calling `_specify_ddp_gpu_num` ([#626](https://github.com/facebookresearch/fairscale/pull/626))
- FSDP: relax checking root condition ([#620](https://github.com/facebookresearch/fairscale/pull/620))
- SDP: removing an assert which does not seem always accurate ([#625](https://github.com/facebookresearch/fairscale/pull/625))
- FSDP: changing FSDP init to bypass pg validation ([#619](https://github.com/facebookresearch/fairscale/pull/619))
- OSS: to 100% coverage ([#618](https://github.com/facebookresearch/fairscale/pull/618))

## [0.3.5] - 2021-04-19
### Added
- [offload] Add API, tutorial and smaller doc string changes (see the sketch below). ([#576](https://github.com/facebookresearch/fairscale/pull/576))
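
A minimal sketch of the OffloadModel API referenced above; the keyword names (num_slices, checkpoint_activation, num_microbatches) follow the tutorial but should be treated as illustrative.

```python
# Illustrative OffloadModel usage: keep layers on CPU and stream slices to the GPU.
import torch
from fairscale.experimental.nn import OffloadModel

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(20)])
offload_model = OffloadModel(
    model=model,
    device=torch.device("cuda"),         # compute device
    offload_device=torch.device("cpu"),  # where inactive slices live
    num_slices=3,
    checkpoint_activation=True,
    num_microbatches=1,
)

out = offload_model(torch.randn(8, 1024).cuda())
out.sum().backward()
```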

### Fixed
- FSDP: fixing training with freezing weights ([#614](https://github.com/facebookresearch/fairscale/pull/614))
- SDP: privatizing all the things ([#611](https://github.com/facebookresearch/fairscale/pull/611))
- FSDP: Make `_get_default_cuda_device` more robust to modules without params ([#606](https://github.com/facebookresearch/fairscale/pull/606))
- OffloadModel: Add prev codepath of using OffloadModel without activation checkpointing ([#608](https://github.com/facebookresearch/fairscale/pull/608))

## [0.3.4] - 2021-04-13
### Added
- FSDP: Add no broadcast optim state option ([#560](https://github.com/facebookresearch/fairscale/pull/560))

### Fixed
- ShardedDDP: Properly handle .eval() mode ([#587](https://github.com/facebookresearch/fairscale/pull/587))
- ShardedDDP: Handle model being moved back to CPU prior to state consolidation ([#573](https://github.com/facebookresearch/fairscale/pull/573))
- FSDP: much faster state consolidation ([#595](https://github.com/facebookresearch/fairscale/pull/595))
- FSDP: Add gradient pre-dedivide to prevent overflow with large world sizes ([#565](https://github.com/facebookresearch/fairscale/pull/565))
- Offload: (experimental) Fix activation offloading to CPU ([#588](https://github.com/facebookresearch/fairscale/pull/588))

## [0.3.3] - 2021-04-01
### Added
- FSDP: changed `auto_wrap_bn` utility function so that single FSDP group is optional ([#556](https://github.com/facebookresearch/fairscale/pull/556))
- FSDP: optimizer state load/save ([#537](https://github.com/facebookresearch/fairscale/pull/537))
- FSDP: fix weight init when using apply() ([#543](https://github.com/facebookresearch/fairscale/pull/543))
- Multiprocess Pipe: retired old implementation
- Experimental: xpipe

### Fixed
- ShardedDDP deferred init ([#558](https://github.com/facebookresearch/fairscale/pull/558))

## [0.3.2] - 2021-03-18
### Added
- Experimental: Add spectrain support ([#372](https://github.com/facebookresearch/fairscale/issues/372))
- FSDP: enabled pytorch SyncBN (no asserting) ([#527](https://github.com/facebookresearch/fairscale/issues/527))
- FSDP: added `auto_wrap_bn` utility function ([#531](https://github.com/facebookresearch/fairscale/pull/531))

### Fixed
- OSS: fix a compatibility problem with lightning wrt optimizer state dict ([#510](https://github.com/facebookresearch/fairscale/issues/510))
- FSDP: fixed a bug when part of autograd graph is traversed multiple times in mixed precision mode ([#513](https://github.com/facebookresearch/fairscale/pull/513))

## [0.3.1] - 2021-03-09
### Added
- FSDP docs ([#455](https://github.com/facebookresearch/fairscale/issues/455))
- `enable_wrap` and `auto_wrap` APIs (see the sketch after this list) ([#446](https://github.com/facebookresearch/fairscale/issues/446))
- Added experimental.nn.OffloadModel API for training large models on a single GPU. ([#432](https://github.com/facebookresearch/fairscale/issues/432))
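
A minimal sketch of the `enable_wrap`/`auto_wrap` APIs added here; the `wrapper_cls` keyword follows later fairscale documentation and should be treated as illustrative.

```python
# Illustrative: enable_wrap supplies the wrapper class (and its kwargs) to any
# wrap()/auto_wrap() call made inside the context.
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import auto_wrap, enable_wrap, wrap

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

with enable_wrap(wrapper_cls=FSDP):
    sharded = auto_wrap(model)      # recursively wraps eligible children
    head = wrap(nn.Linear(10, 2))   # explicitly wrap a single module
```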

### Fixed
- OSS: fix a broken state dict when using non contiguous param groups
- Several SDP fixes around performance and corner cases
- Many FSDP fixes
- AdaScale & SDP/FSDP test added but not officially supported

## [0.3.0] - 2021-02-22
### Added
- FullyShardedDataParallel (FSDP) (see the sketch after this list) ([#413](https://github.com/facebookresearch/fairscale/issues/413))
- ShardedDDP fp16 grad reduction option ([#402](https://github.com/facebookresearch/fairscale/issues/402))
- Expose experimental algorithms within the pip package ([#410](https://github.com/facebookresearch/fairscale/pull/410))
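
A minimal sketch of basic FSDP usage as introduced in this release; `my_module` and `inputs` are placeholders, and the process group setup is reduced to the standard torch.distributed call.

```python
# Illustrative: shard params, grads and optimizer state across data-parallel ranks.
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

torch.distributed.init_process_group(backend="nccl")
model = FSDP(my_module.cuda())  # my_module: placeholder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(inputs).sum()      # inputs: placeholder
loss.backward()
optimizer.step()
```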

### Fixed
- Catch corner case when the model is too small with respect to the world size, and shards are empty ([#406](https://github.com/facebookresearch/fairscale/pull/406))
- Memory leak in `checkpoint_wrapper` ([#412](https://github.com/facebookresearch/fairscale/pull/412))

## [0.1.7] - 2021-02-19
### Fixed
- ShardedDDP and OSS handle model trainability changes during training ([#369](https://github.com/facebookresearch/fairscale/issues/369))
- ShardedDDP state dict load/save bug ([#386](https://github.com/facebookresearch/fairscale/issues/386))
- ShardedDDP handle train/eval modes ([#393](https://github.com/facebookresearch/fairscale/issues/393))
- AdaScale handling custom scaling factors ([#401](https://github.com/facebookresearch/fairscale/issues/401))

### Added
- ShardedDDP manual reduce option for checkpointing ([#389](https://github.com/facebookresearch/fairscale/issues/389))

## [0.1.6] - 2021-02-10
### Added
- Checkpointing model wrapper (#376)
- Faster OSS, flatbuffers (#371)
- Small speedup in OSS clipgradnorm (#363)

### Fixed
- Bug in ShardedDDP with 0.1.5 depending on the init (KeyError / OSS)
- Much refactoring in Pipe (#357, #358, #360, #362, #370, #373)
- Better pip integration / resident pytorch (#375)

## [0.1.5] - 2021-02-03
### Added
- Pytorch compatibility for OSS checkpoints (#310)
- Elastic checkpoints for OSS, world size can vary in between save and loads (#310)
- Tensor views for OSS bucketing, reduced CPU use (#300)
- Bucket calls in ShardedDDP, for faster inter node communications (#327)
- FlattenParamWrapper, which flattens module parameters into a single tensor seamlessly (#317)
- AMPnet experimental support (#304)

### Fixed
- ShardedDDP properly handles device changes via `.to()` (#353)
- Add a new interface for AdaScale, AdaScaleWrapper, which makes it compatible with OSS (#347)

## [0.1.4] - 2021-01-07
### Fixed
- Missing cu files in the pip package


## [0.1.3] - 2021-01-04
### Fixed
- Release numbering within python and from pypi

## [0.1.2] - 2021-01-04
### Added
- AdaScale (see the sketch after this list):
  . Added gradient accumulation feature (#202)
  . Added support of `torch.lr_scheduler` (#229)
  . Added support for `add_param_groups` (#266)
  . Added support for `scale != world_size` (#266)
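
A minimal sketch of AdaScale wrapping a base optimizer; the `scale` value and the placeholders (`model`, `loader`, `loss_fn`, `world_size`, `accum_steps`) are illustrative.

```python
# Illustrative: AdaScale wraps a base optimizer and adapts the learning rate to
# the effective batch size (data-parallel world size times accumulation steps).
from torch.optim import SGD

from fairscale.optim import AdaScale

optimizer = AdaScale(SGD(model.parameters(), lr=0.1), scale=world_size * accum_steps)

for inputs, targets in loader:
    loss_fn(model(inputs), targets).backward()
    optimizer.step()
    optimizer.zero_grad()
```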

### Fixed
- AdaScale: smoothing factor value fixed when using gradient accumulation (#235)
- Pipe: documentation on balancing functions (#243)
- ShardedDDP: handle typical NLP models
- ShardedDDP: better partitioning when finetuning

## [0.1.1] - 2020-12-01
### Fixed
- make sure pip package includes header files (#221)

## [0.1.0] - 2020-12-01
### Added
- ShardedDataParallel with autoreduce (see the sketch after this list) (#157)
- cpu support for Pipe (#188)
- ShardedOptim: Distributed Grad Scaler (for torch AMP)  (#182)
- OSS-aware clip grads, bridge sharded states (#167)
- oss: add `rank_local_state_dict` staticmethod (#174)
- support for PyTorch 1.7.0 (#171)
- Add implementation of AdaScale (#139)
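
A minimal sketch of pairing OSS with ShardedDataParallel; the constructor signatures shown follow later fairscale releases and are illustrative (`model` and `inputs` are placeholders).

```python
# Illustrative: OSS shards optimizer state across ranks; ShardedDataParallel
# reduces each gradient only to the rank owning the matching optimizer shard.
import torch
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim.oss import OSS

optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)
model = ShardedDDP(model, optimizer)

loss = model(inputs).sum()
loss.backward()
optimizer.step()
```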

### Fixed
- pip package install (#196, #200)

## [0.0.3] - 2020-10-14
### Added
- multi-process pipe

### Fixed
- multiple OSS fixes
- MegaTron+OSS DDP fix

## [0.0.2] - 2020-08-28
### Added
- add ddp that works with oss with `reduce()` not `all_reduce()` (#19)
- support for PyTorch v1.6
- add mixed precision Adam (#40)
- Adam optimizer state scaling (#44)

### Fixed
- properly restore a sharded optim state (#39)
- OSS restore state to proper device (#46)
- optim/oss: support optimizers with additional step kwargs (#53)
- optim/oss: fix state cast (#56)
- fix eval for `oss_ddp` (#55)
- optim/oss: work correctly with LRScheduler (#58)

## [0.0.1] - 2020-07-31
- Initial release.