- Devices with compute capability 3.0 (GTX 700s, K10) and 3.2 (Tegra K1, Jetson TK1) are now deprecated and support will be removed in 0.39.0.
- Support for CUDA 10.0 and 10.2 will be removed in bitsandbytes 0.39.0.
### 0.38.1
Features:
- Added Int8 SwitchBack layers
- Added Fake FP8 layers for research purposes (available under `bnb.research.nn. ...`)
### 0.39.0
Features:
- 4-bit matrix multiplication for Float4 and NormalFloat4 data types.
- Added 4-bit quantization routines
- Double quantization routines for 4-bit quantization
- Paged optimizers for Adam and Lion.
- bfloat16 gradient / weight support for Adam and Lion with 8 or 32-bit states.
Bug fixes:
- Fixed a bug where 8-bit models consumed twice as much memory as expected after serialization
Deprecated:
- Kepler binaries (GTX 700s and Tesla K40/K80) are no longer provided via pip and need to be compiled from source. Kepler support might be fully removed in the future.
1. Run `python speed_benchmark/speed_benchmark.py`, which times the operations and writes the timings to `speed_benchmark/info_a100_py2.jsonl` (change the jsonl file name to a different one for your own profiling run).
2. Run `python speed_benchmark/make_plot_with_jsonl.py`, which produces `speed_benchmark/plot_with_info.pdf`. Again, make sure you change the name of the jsonl file being processed.
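
The plotting script labels some curves as a "sum of parts", using keys that join operation names with `+`. As a minimal sketch of how such a key could be evaluated against one jsonl record, assuming each record stores per-operation timings under those names (the `dim` field, the example values, and the helper names `load_rows` and `sum_of_parts` are hypothetical and not taken from the benchmark scripts):

```python
import json
import tempfile
from pathlib import Path


def load_rows(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def sum_of_parts(row, key):
    """Add up the timings named in a '+'-joined key such as 'a+b+c'."""
    return sum(row[part] for part in key.split("+"))


# Hypothetical record: field values are made up for illustration only.
example = {"dim": 1024, "standard_fwd": 0.10, "standard_gx": 0.12, "standard_gw": 0.11}

with tempfile.TemporaryDirectory() as tmp:
    # Write a one-line jsonl file, then read it back.
    path = Path(tmp) / "info_example.jsonl"
    path.write_text(json.dumps(example) + "\n")
    rows = load_rows(path)

total = sum_of_parts(rows[0], "standard_gx+standard_gw+standard_fwd")
print(f"standard fp16 total: {total:.2f}")
```

To inspect your own run, point `load_rows` at the jsonl file you named in step 1 instead of the temporary file.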
# first plot the time occupied by different operations
for k, marker, ls, color, name in [
    ('standard_gx+standard_gw+standard_fwd', 's', '-', 'C2', 'Standard fp16 (sum of parts)'),
    ('x_quantize_rowwise+g_quantize_rowwise+w_quantize_global+w_quantize_global_transpose+standard_gw+global_fwd+global_bwd', 'o', '-', 'C4', 'SwitchBack int8 (sum of parts)'),