## v1.1.0

### Performance

* FasterMoE's smart scheduling is updated with correct stream management, making it faster.

### Testing

* All unit tests have been reviewed and now run correctly.

### Adaptation

* Megatron-LM 3.2 supported.

### Documentation

* README is updated and several errors in it are fixed.
* A detailed [document on process groups](/doc/parallelism) is added.


## v1.0.1

### Compatibility

* PyTorch 2.0 supported.
* Megatron-LM 2.5 supported.

### Documentation

* A detailed [installation guide](/doc/installation-guide.md), thanks to @santurini.

### Performance

* FasterMoE's schedule is generalized to `n_expert > 1`, along with further bug fixes.
* Reduced synchronization, thanks to @Fragile-azalea.

## v1.0.0

### FasterMoE

* New performance-boosting features from the PPoPP'22 paper FasterMoE, detailed in the documentation; a usage sketch follows this list.
	* Expert Shadowing.
	* Smart Scheduling.
	* Topology-aware gate.
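
A minimal sketch of switching these features on, assuming the environment-variable switches described in the FasterMoE document (`FMOE_FASTER_SCHEDULE_ENABLE`, `FMOE_FASTER_SHADOW_ENABLE`); treat the exact variable names and the layer construction below as illustrative rather than authoritative.

```python
import os

# Assumed FasterMoE switches; set them before fmoe's CUDA extension is loaded.
os.environ["FMOE_FASTER_SCHEDULE_ENABLE"] = "1"  # smart scheduling
os.environ["FMOE_FASTER_SHADOW_ENABLE"] = "1"    # expert shadowing

import torch
from fmoe.transformer import FMoETransformerMLP

# One MoE FFN layer; num_expert is the number of local experts per worker.
layer = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=4096).cuda()
out = layer(torch.randn(8, 1024, device="cuda"))
```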

### Bug fixes

* Transformer-XL examples.
* Compatibility with PyTorch versions.
* Megatron-LM documentation.
* GShardGate.

## v0.3.0

### FMoE core

* The previous `mp_group` is renamed to `slice_group`, indicating that all workers in the group receive the same input batch and each processes a slice of it; see the sketch after this list. `mp_group` will be deprecated in the next release.
* ROCm supported.
* `FMoELinear` is moved to a stand-alone file.
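
A minimal sketch of the renamed argument, assuming `torch.distributed` has already been initialized and that the transformer MLP layer forwards extra keyword arguments to the core `FMoE` module; the group composition is illustrative.

```python
import torch.distributed as dist
from fmoe.transformer import FMoETransformerMLP

# Workers in this group receive the same input batch and each process a
# slice of it; this group used to be passed as mp_group.
# Assumes dist.init_process_group() has already been called.
slice_world = dist.new_group(ranks=[0, 1])

layer = FMoETransformerMLP(
    num_expert=4,                    # local experts per worker
    d_model=1024,
    d_hidden=4096,
    world_size=dist.get_world_size(),
    slice_group=slice_world,         # formerly mp_group
)
```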

### Grouped data parallel

* Support an arbitrary set of groups, each identified by its tag name.

### Load balancing

* A brand-new balancing strategy, SWIPE, contributed by the authors of a (currently unpublished) paper.
* A `has_loss` property is added to each gate to indicate whether its balance loss should be collected, as sketched below.
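
A minimal sketch of how a training loop might use the new property; the `get_loss()` accessor below is an assumption for illustration, so check the gate implementation for the exact name.

```python
def collect_balance_loss(model):
    """Sum the balance loss from every gate that reports one."""
    total = 0.0
    for module in model.modules():
        gate = getattr(module, "gate", None)
        # Only gates with has_loss set contribute a balance loss.
        if gate is not None and getattr(gate, "has_loss", False):
            total = total + gate.get_loss()  # accessor name assumed
    return total
```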

### Megatron-LM support

* Experts are partitioned by tensor model parallelism in `mp_group`, instead of expert parallelism.
* Support arbitrary customized gates in `MegatronMLP`.
* Move the patches to a stand-alone file.

### Tests

* Move util functions into `test_ddp.py`.

## v0.2.1

### Load balancing

* Fix gradient for balance loss.

### Misc

* Fix typos.
* Update the benchmark interface.
* Remove some redundant code to improve performance.
* Enable `USE_NCCL` by default.
* Compatibility with PyTorch `<1.8.0` and `>=1.8.0`.

### Megatron adaptation

* Patch for numerical correctness of gradient clipping.
* Support for pipeline parallelism.

## v0.2.0

### Load balancing

* A brand-new gate module with capacity-related utilities.
* GShard's and Switch Transformer's balance strategies are implemented as integrated gates; a usage sketch follows this list.
* Balance loss is enabled.
* Balance monitor is provided.
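
A minimal sketch of picking one of the integrated gates when building an MoE layer; the import path and the layer construction follow the current `fmoe.gates` layout and are meant as an illustration.

```python
from fmoe.gates import GShardGate, SwitchGate
from fmoe.transformer import FMoETransformerMLP

# Use a balance-aware gate class instead of the default naive gate.
layer = FMoETransformerMLP(
    num_expert=8,
    d_model=1024,
    d_hidden=4096,
    gate=GShardGate,  # or SwitchGate
)
```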

### Checkpointing

* MoE models can be loaded and saved by fmoe's checkpointing module.

### Performance

* FP16 training performance is improved.

### Misc

* The CUDA code directory is restructured.
* More tests are added.

## v0.1.2

### Compilation

- Remove dependency on the CUDA examples repository.

### Distributed

- Fix a bug related to PyTorch v1.8.0. FastMoE can now run on multiple GPUs
across multiple nodes with PyTorch v1.8.0.

### Misc

- Fix tons of typos.
- Format the code.

## v0.1.1

### Distributed

- Broadcast data-parallel parameters before training.

### Megatron adaptation

- Initialize `FMoELinear` parameters with a different seed on each model-parallel rank, even when Megatron uses the same random seed everywhere.
- Use the proper communicators for model parallelism and data parallelism.

### Transformer-XL example

- Improve scripts.

### Misc

- Add the logo and a Slack workspace link.
- Documentation in Chinese.
- Figures to explain how FastMoE works.

## v0.1.0

### Functions

- An easy-to-use, model-injection-style interface for Megatron-LM.
- Support data parallelism, model parallelism, and a hybrid of the two.
- Provide a customized DDP module that synchronizes across different communication groups.
- Support using a customized `nn.Module` as an expert, as sketched below.
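
A minimal sketch of plugging a customized expert into an `FMoE` layer, assuming `expert` accepts a callable that builds one expert from the model dimension; the optional second forward argument is kept because FastMoE may pass a per-expert token count.

```python
import torch
from fmoe import FMoE

class MyExpert(torch.nn.Module):
    """A toy expert: a two-layer feed-forward block."""
    def __init__(self, d_model):
        super().__init__()
        self.fc1 = torch.nn.Linear(d_model, 4 * d_model)
        self.fc2 = torch.nn.Linear(4 * d_model, d_model)

    def forward(self, x, fwd_expert_count=None):
        # fwd_expert_count is ignored here; it is kept for signature safety.
        return self.fc2(torch.relu(self.fc1(x)))

# expert is assumed to be a callable invoked once per local expert.
moe = FMoE(num_expert=4, d_model=512, expert=MyExpert)
```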

### Document and infrastructure

- Use PyTest.
- Set up PyLint.
- Installation and usage guide.
- Explanations of functions and the code structure in code comments.

### Performance

- A benchmark comparing FastMoE with the plain PyTorch implementation.