## v1.1.0

### Performance

* FasterMoE's smart scheduling is updated with correct stream management, making it faster.

### Testing

* All unit tests have been reviewed and now run correctly.

### Adaptation

* Megatron-LM 3.2 supported.

### Documentation

* README is updated and several errors in it are fixed.
* A detailed [document on process groups](/doc/parallelism) is added.


## v1.0.1

### Compatibility

* PyTorch 2.0 supported.
* Megatron-LM 2.5 supported.

### Documentation

* A detailed [installation guide](/doc/installation-guide.md), thanks to @santurini.

### Performance

* FasterMoE's schedule is generalized to `n_expert > 1`, along with further bug fixes.
* Reduced synchronization, thanks to @Fragile-azalea.

## v1.0.0

### FasterMoE

* New performance-boosting features from the PPoPP'22 paper FasterMoE, detailed in the documentation; a usage sketch follows this list.
	* Expert Shadowing.
	* Smart Scheduling.
	* Topology-aware gate.
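
A minimal sketch of switching these features on, assuming the environment-variable switches described in the FasterMoE document (`FMOE_FASTER_SCHEDULE_ENABLE`, `FMOE_FASTER_SHADOW_ENABLE`); treat the exact variable names and the layer construction below as illustrative rather than authoritative.

```python
import os

# Assumed FasterMoE switches; set them before fmoe's CUDA extension is loaded.
os.environ["FMOE_FASTER_SCHEDULE_ENABLE"] = "1"  # smart scheduling
os.environ["FMOE_FASTER_SHADOW_ENABLE"] = "1"    # expert shadowing

import torch
from fmoe.transformer import FMoETransformerMLP

# One MoE FFN layer; num_expert is the number of local experts per worker.
layer = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=4096).cuda()
out = layer(torch.randn(8, 1024, device="cuda"))
```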

### Bug fixes

* Transformer-XL examples.
* Compatibility with PyTorch versions.
* Megatron-LM documentation.
* GShardGate.

## v0.3.0

### FMoE core

* The previous `mp_group` is renamed to `slice_group`, indicating that all workers in the group receive the same input batch and each processes a slice of it; see the sketch after this list. `mp_group` will be deprecated in the next release.
* ROCm supported.
* `FMoELinear` is moved to a stand-alone file.
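
A minimal sketch of the renamed argument, assuming `torch.distributed` has already been initialized and that the transformer MLP layer forwards extra keyword arguments to the core `FMoE` module; the group composition is illustrative.

```python
import torch.distributed as dist
from fmoe.transformer import FMoETransformerMLP

# Workers in this group receive the same input batch and each process a
# slice of it; this group used to be passed as mp_group.
# Assumes dist.init_process_group() has already been called.
slice_world = dist.new_group(ranks=[0, 1])

layer = FMoETransformerMLP(
    num_expert=4,                    # local experts per worker
    d_model=1024,
    d_hidden=4096,
    world_size=dist.get_world_size(),
    slice_group=slice_world,         # formerly mp_group
)
```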

### Grouped data parallel

* Support an arbitrary set of groups, each identified by its tag name.

### Load balancing

* A brand-new balancing strategy, SWIPE, contributed by the authors of a (currently unpublished) paper.
* A `has_loss` property is added to each gate to indicate whether its balance loss should be collected, as sketched below.
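
A minimal sketch of how a training loop might use the new property; the `get_loss()` accessor below is an assumption for illustration, so check the gate implementation for the exact name.

```python
def collect_balance_loss(model):
    """Sum the balance loss from every gate that reports one."""
    total = 0.0
    for module in model.modules():
        gate = getattr(module, "gate", None)
        # Only gates with has_loss set contribute a balance loss.
        if gate is not None and getattr(gate, "has_loss", False):
            total = total + gate.get_loss()  # accessor name assumed
    return total
```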

### Megatron-LM support

* Experts are partitioned by tensor model parallelism in `mp_group`, instead of expert parallelism.
* Support arbitrary customized gates in `MegatronMLP`.
* Move the patches to a stand-alone file.

### Tests

* Move util functions into `test_ddp.py`.

## v0.2.1

### Load balancing

* Fix gradient for balance loss.

### Misc

* Fix typos.
* Update the benchmark interface.
* Remove some redundant code to improve performance.
* Enable `USE_NCCL` by default.
* Compatibility with PyTorch `<1.8.0` and `>=1.8.0`.

### Megatron adaptation

* Patch for numerical correctness of gradient clipping.
* Support for pipeline parallelism.

## v0.2.0

### Load balancing

* A brand-new gate module with capacity-related utilities.
* GShard's and Switch Transformer's balance strategies are implemented as integrated gates; a usage sketch follows this list.
* Balance loss is enabled.
* Balance monitor is provided.
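
A minimal sketch of picking one of the integrated gates when building an MoE layer; the import path and the layer construction follow the current `fmoe.gates` layout and are meant as an illustration.

```python
from fmoe.gates import GShardGate, SwitchGate
from fmoe.transformer import FMoETransformerMLP

# Use a balance-aware gate class instead of the default naive gate.
layer = FMoETransformerMLP(
    num_expert=8,
    d_model=1024,
    d_hidden=4096,
    gate=GShardGate,  # or SwitchGate
)
```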

### Checkpointing

* MoE models can be loaded and saved by fmoe's checkpointing module.

### Performance

* FP16 training performance is improved.

### Misc

* The CUDA code directory is restructured.
* More tests are added.

## v0.1.2

### Compilation

- Remove dependency on the CUDA examples repository.

### Distributed

- Fix a bug related to PyTorch v1.8.0. FastMoE can now run on multiple GPUs
across multiple nodes with PyTorch v1.8.0.

### Misc

- Fix tons of typos.
- Format the code.

## v0.1.1

### Distributed

- Broadcast data-parallel parameters before training.

### Megatron adaptation

- Initialize `FMoELinear` parameters with a different seed on each model-parallel rank, even when Megatron uses the same random seed everywhere.
- Use the proper communicators for model parallelism and data parallelism.

### Transformer-XL example

- Improve scripts.

### Misc

- Add the logo and a Slack workspace link.
- Documentation in Chinese.
- Figures to explain how FastMoE works.

## v0.1.0

### Functions

- An easy-to-use, model-injection-style interface for Megatron-LM.
- Support data parallelism, model parallelism, and a hybrid of the two.
- Provide a customized DDP module that synchronizes across different communication groups.
- Support using a customized `nn.Module` as an expert, as sketched below.
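
A minimal sketch of plugging a customized expert into an `FMoE` layer, assuming `expert` accepts a callable that builds one expert from the model dimension; the optional second forward argument is kept because FastMoE may pass a per-expert token count.

```python
import torch
from fmoe import FMoE

class MyExpert(torch.nn.Module):
    """A toy expert: a two-layer feed-forward block."""
    def __init__(self, d_model):
        super().__init__()
        self.fc1 = torch.nn.Linear(d_model, 4 * d_model)
        self.fc2 = torch.nn.Linear(4 * d_model, d_model)

    def forward(self, x, fwd_expert_count=None):
        # fwd_expert_count is ignored here; it is kept for signature safety.
        return self.fc2(torch.relu(self.fc1(x)))

# expert is assumed to be a callable invoked once per local expert.
moe = FMoE(num_expert=4, d_model=512, expert=MyExpert)
```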

### Document and infrastructure

- Use PyTest.
- Set up PyLint.
- Installation and usage guide.
- Explanations of functions and the code structure in code comments.

### Performance

- A benchmark comparing FastMoE with the plain PyTorch implementation.