### 0.45.1

#### Improvements:

* Compatibility for `triton>=3.2.0`
* Moved package configuration to `pyproject.toml`
* Build system: initial support for NVIDIA Blackwell B100 GPUs, RTX 50 Blackwell series GPUs and Jetson Thor Blackwell.
  * Note: Binaries built for these platforms are not included in this release. They will be included in future releases upon the availability of the upcoming CUDA Toolkit 12.7 and 12.8.

#### Bug Fixes:
* Packaging: wheels will no longer include unit tests. (#1478)

#### Dependencies:
* Sets the minimum PyTorch version to 2.0.0.

### 0.45.0

This is a significant release, bringing support for LLM.int8() to NVIDIA Hopper GPUs such as the H100.

As part of these compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future compatibility and maintenance. We no longer use the col32 or other architecture-specific tensor layout formats, while maintaining backwards compatibility. We additionally bring performance improvements targeted at inference scenarios.

#### Performance Improvements
This release includes broad performance improvements for a wide variety of inference scenarios. See [this X thread](https://x.com/Tim_Dettmers/status/1864706051171287069) for a detailed explanation.

#### Breaking Changes
🤗[PEFT](https://github.com/huggingface/peft) users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0`.

#### Packaging Improvements
* The size of our wheel has been reduced by ~43.5% from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396MB to ~224MB.
* Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
* The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.


#### Deprecations
* A number of public API functions have been marked for deprecation and will emit `FutureWarning` when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.
* The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using `block_wise=False` is not recommended and support will be removed in a future release.
* As part of the refactoring process, we've implemented many new 8bit operations. These operations no longer use specialized data layouts.

#### Full Changelog

* refine docs for multi-backend alpha release by @Titus-von-Koeller in #1380
* README: Replace special Unicode text symbols with regular characters by @akx in #1385
* Update CI tools & fix typos by @akx in #1386
* Fix invalid escape sequence warning in Python 3.12 by @oshiteku in #1420
* [Build] Add CUDA 12.6.2 build; update 12.5.0 to 12.5.1 by @matthewdouglas in #1431
* LLM.int8() Refactoring: Part 1 by @matthewdouglas in #1401

### 0.44.1

#### Bug fixes:
* Fix optimizer support for Python <= 3.9 by @matthewdouglas in #1379

### 0.44.0

#### New: AdEMAMix Optimizer
The [AdEMAMix](https://hf.co/papers/2409.03137) optimizer is a modification to AdamW which proposes tracking two EMAs to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.

We've implemented 8bit and paged variations: `AdEMAMix`, `AdEMAMix8bit`, `PagedAdEMAMix`, and `PagedAdEMAMix8bit`. These can be used with a similar API to existing optimizers.
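
A minimal usage sketch (the toy model, learning rate, and choice of the 8-bit variant are illustrative assumptions, not recommendations):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# Drop-in replacement for the other bnb optimizers; the remaining variants
# (AdEMAMix, PagedAdEMAMix, PagedAdEMAMix8bit) are constructed the same way.
optimizer = bnb.optim.AdEMAMix8bit(model.parameters(), lr=1e-4)

for _ in range(10):
    x = torch.randn(16, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```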

#### Improvements:
* **8-bit Optimizers**: The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This is a change from the original implementation proposed in [the paper](https://hf.co/papers/2110.02861) and improves accuracy.
* **CUDA Graphs support**: A fix to enable [CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @jeejeelee!

#### Full Changelog:
* Embedding4bit and Embedding8bit implementation by @galqiwi in #1292
* Bugfix: Load correct nocublaslt library variant when BNB_CUDA_VERSION override is set by @matthewdouglas in #1318
* Enable certain CUDA kernels to accept specified cuda stream by @jeejeelee in #1330
* Initial support for ppc64le by @mgiessing in #1316
* Cuda source cleanup , refactor and fixes by @abhilash1910 in #1328
* Update for VS2022 17.11 compatibility with CUDA < 12.4 by @matthewdouglas in #1341
* Bump the minor-patch group with 3 updates by @dependabot in #1362
* Update matplotlib requirement from ~=3.9.1 to ~=3.9.2 in the major group by @dependabot in #1361
* docs: add internal reference to multi-backend guide by @Titus-von-Koeller in #1352
* Add move_to_device kwarg to the optimizer's load_state_dict by @koute in #1344
* Add AdEMAMix optimizer by @matthewdouglas in #1360
* Change 8bit optimizer blocksize 2048->256; additional bf16 support by @matthewdouglas in #1365

### 0.43.3

#### Improvements:

- FSDP: Enable loading prequantized weights with bf16/fp16/fp32 quant_storage
    - Background: This update, linked to [Transformers PR #32276](https://github.com/huggingface/transformers/pull/32276), allows loading prequantized weights with alternative storage formats (a usage sketch follows below). Metadata is tracked similarly to `Params4bit.__new__` post PR #970. It supports models exported with non-default `quant_storage`, such as [this NF4 model with BF16 storage](https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-BNB-NF4-BF16).
    - Special thanks to @winglian and @matthewdouglas for enabling FSDP+QLoRA finetuning of Llama 3.1 405B on a single 8xH100 or 8xA100 node with as little as 256GB system RAM.
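
A minimal sketch of the `quant_storage` option on a 4-bit layer (the shapes, dtype choices, and the surrounding FSDP/Transformers setup are illustrative assumptions):

```python
import torch
import bitsandbytes as bnb

# Store the packed 4-bit weights in bf16-typed storage so FSDP can shard them
# uniformly with the rest of a bf16 model.
layer = bnb.nn.Linear4bit(
    4096, 4096,
    quant_type="nf4",
    compute_dtype=torch.bfloat16,
    quant_storage=torch.bfloat16,
)
layer = layer.cuda()  # quantization happens when the layer is moved to the GPU

x = torch.randn(2, 4096, dtype=torch.bfloat16, device="cuda")
y = layer(x)
```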


### 0.43.2

This release is quite significant, as the QLoRA bug fix has big implications for higher `seqlen` and batch sizes.

For each sequence (i.e. a batch size increase of one) we expect memory savings of:
- 405B: 39GB for `seqlen=1024`, and 4888GB for `seqlen=128,000`
- 70B: 10.1GB for `seqlen=1024` and 1258GB for `seqlen=128,000`

This was due to activations being unnecessary for frozen parameters, yet the memory for them was still erroneously allocated due to the now-fixed bug.
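
The savings scale roughly linearly with sequence length, so other settings can be estimated from the `seqlen=1024` figures. A back-of-the-envelope sketch (the per-token numbers are derived from the figures above, not measured independently):

```python
# Approximate per-sequence activation-memory savings, extrapolated linearly
# from the seqlen=1024 figures quoted above.
savings_per_token_gb = {"405B": 39 / 1024, "70B": 10.1 / 1024}

def estimated_savings_gb(model: str, seqlen: int, batch_size: int = 1) -> float:
    return savings_per_token_gb[model] * seqlen * batch_size

print(estimated_savings_gb("405B", 128_000))            # ~4875 GB, in line with the figure above
print(estimated_savings_gb("70B", 4096, batch_size=8))  # ~323 GB
```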

#### Improvements:

- docs: FSDP+QLoRA and CPU install guide (#1211 #1227, thanks @stevhliu)
- Add CUDA 12.5 and update 12.4 builds (#1284)

#### Bug Fixes

- 4bit getstate and 8bit deepcopy (#1230 #1231, thanks @BenjaminBossan)
- missing optimizers in `str2optimizer32bit` (#1222, thanks @EtienneDosSantos)
- CUDA 12.5 build issue (#1273, thanks @HennerM)
- fix for min_8bit_size functionality in Optimizer base classes (#1286, thanks @Edenzzzz)
- QLoRA mem bug (#1270, thanks @Ther-nullptr)
- tests for cpu only platforms (#1259, thanks @galqiwi)
- restoration of quant_storage for CPU offloading (#1279)
- optim update error with non-contiguous grads/params (deepspeed) (#1187)

### 0.43.1

#### Improvements:

- Improved the serialization format for 8-bit weights; this change is fully backwards compatible. (#1164, thanks to @younesbelkada for the contributions and @akx for the review).
- Added CUDA 12.4 support to the Linux x86-64 build workflow, expanding the library's compatibility with the latest CUDA versions. (#1171, kudos to @matthewdouglas for this addition).
- Docs enhancement: Improved the instructions for installing the library from source. (#1149, special thanks to @stevhliu for the enhancements).

#### Bug Fixes

- Fix 4bit quantization with blocksize = 4096, where an illegal memory access was encountered. (#1160, thanks @matthewdouglas for fixing and @YLGH for reporting)

#### Internal Improvements:

- Tests: improve memory usage (#1147, thanks @matthewdouglas)
- Add CUDA 12.4 to docs/install helper (#1136, thanks @matthewdouglas)
- Minor type/doc fixes (#1128, thanks @akx)
- Reformat Python code with Ruff (#1081, thanks @akx)
- Rework of CUDA/native-library setup and diagnostics (#1041, thanks @akx)

### 0.43.0

#### Improvements and New Features:

- QLoRA + FSDP official support is now live! https://github.com/TimDettmers/bitsandbytes/pull/970 by @warner-benjamin and team - with FSDP you can train very large models (70b scale) on multiple 24GB consumer-type GPUs. See https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html for more details.
- Introduced improvements to the CI process for enhanced performance and efficiency during builds, specifically enabling more effective cross-compilation on Linux platforms. This was accomplished by deprecating Make and migrating to CMake, as well as implementing new corresponding workflows. Huge thanks go to @wkpark, @rickardp, @matthewdouglas and @younesbelkada; #1055, #1050, #1111.
- Windows should be officially supported in bitsandbytes if you install the library from source. See: https://huggingface.co/docs/bitsandbytes/main/en/index for more details
- Updated installation instructions to provide more comprehensive guidance for users. This includes clearer explanations and additional tips for various setup scenarios, making the library more accessible to a broader audience (@rickardp, #1047).
- Enhanced the library's compatibility and setup process, including fixes for CPU-only installations and improvements in CUDA setup error messaging. This effort aims to streamline the installation process and improve user experience across different platforms and setups (@wkpark, @akx, #1038, #996, #1012).
- Set up new documentation at https://huggingface.co/docs/bitsandbytes/main with extensive new sections and content to help users better understand and utilize the library. Especially notable are the new API docs (big thanks to @stevhliu and @mishig25 from HuggingFace, #1012). The API docs have also been addressed in #1075.

#### Bug Fixes:

- Addressed a race condition in kEstimateQuantiles, enhancing the reliability of quantile estimation in concurrent environments (@pnunna93, #1061).
- Fixed various minor issues, including typos in code comments and documentation, to improve code clarity and prevent potential confusion (@Brian Vaughan, #1063).

#### Backwards Compatibility

- After upgrading from `v0.42` to `v0.43`, when using 4bit quantization, models may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the code. For anyone interested in the details, [see this comment](https://github.com/TimDettmers/bitsandbytes/discussions/1094#discussioncomment-8984069).

#### Internal and Build System Enhancements:

- Implemented several enhancements to the internal and build systems, including adjustments to the CI workflows, portability improvements, and build artifact management. These changes contribute to a more robust and flexible development process, ensuring the library's ongoing quality and maintainability (@rickardp, @akx, @wkpark, @matthewdouglas; #949, #1053, #1045, #1037).

#### Contributors:

This release is made possible thanks to the many active contributors that submitted PRs and many others who contributed to discussions, reviews, and testing. Your efforts greatly enhance the library's quality and user experience. It's truly inspiring to work with such a dedicated and competent group of volunteers and professionals!

We give a special thanks to @TimDettmers for managing to find a little bit of time for valuable consultations on critical topics, despite preparing for and touring the states applying for professor positions. We wish him the utmost success!

We also extend our gratitude to the broader community for your continued support, feedback, and engagement, which play a crucial role in driving the library's development forward.

### 0.42.0

Features:

- 4-bit serialization now supported. This enables 4-bit load/store. Thank you @poedator #753
- the bitsandbytes library now has a version attribute: `bitsandbytes.__version__` @rasbt #710

Bug fixes:

- Fixed bugs in dynamic exponent data type creation. Thank you @RossM, @KohakuBlueleaf, @ArrowM #659 #227 #262 #152
- Fixed an issue where 4-bit serialization would fail for layers without double quantization #868. Thank you, @poedator
- Fixed an issue where calling .to() or .cuda() on a 4-bit layer twice would result in an error #867. Thank you, @jph00
- Fixed a bug where a missing access permission in a path searched for CUDA would lead to an error @osma #677
- Fixed a bug where the GOOGLE_VM_CONFIG_LOCK_FILE variable could cause errors in colab environments @akrentsel @xaptronic #715 #883 #622
- Fixed a bug where kgetColRowStats (LLM.int8()) would fail for certain dimensions @LucQueen #905
- Fixed a bug where the adjusted regular Embedding layer was not available via bnb.nn.Embedding @neel04 #563
- Added missing scipy requirement @dulalbert #525

### 0.41.3

Bug fixes:

- Fixed an issue where 4-bit serialization would fail for layers without double quantization #868. Thank you, @poedator
- Fixed an issue where calling .to() or .cuda() on a 4-bit layer twice would result in an error #867. Thank you, @jph00

### 0.41.2

Feature:

- 4-bit serialization now supported. This enables 4-bit load/store. Thank you @poedator #753

### 0.41.1

Bug fixes:

- Fixed bugs in dynamic exponent data type creation. Thank you @RossM, @KohakuBlueleaf, @ArrowM #659 #227 #262 #152

### 0.41.0

Features:

- Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
- CUDA SETUP now no longer looks for libcuda and libcudart and instead relies on the PyTorch CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

- Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
- Fixed a missing scipy dependency in requirements.txt. #544
- Fixed a bug where a view operation could cause an error in 8-bit layers.
- Fixed a bug where CPU bitsandbytes would fail during import. #593 Thank you @bilelomrani
- Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
- Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk
- Fixed a bug where read permission was assumed for a file. #497
- Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro

Documentation:

- Improved documentation for GPUs that do not support 8-bit matmul. #529
- Added description and pointers for the NF4 data type. #543

User experience:

- Improved handling of default compute_dtype for Linear4bit Layers, so that compute_dtype = input_dtype if the input data type is stable enough (float32, bfloat16, but not float16).
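
A small sketch of the behavior (layer sizes are illustrative; setting `compute_dtype` explicitly remains the way to opt into fp16 compute):

```python
import torch
import bitsandbytes as bnb

# No compute_dtype given: a float32 (or bfloat16) input now also selects
# float32 (or bfloat16) compute, since those dtypes are stable enough.
layer = bnb.nn.Linear4bit(128, 128, quant_type="nf4").cuda()
y = layer(torch.randn(4, 128, device="cuda", dtype=torch.float32))

# float16 compute is only used when requested explicitly.
layer_fp16 = bnb.nn.Linear4bit(128, 128, quant_type="nf4", compute_dtype=torch.float16).cuda()
y16 = layer_fp16(torch.randn(4, 128, device="cuda", dtype=torch.float16))
```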

Performance:

- improved 4-bit inference performance for A100 GPUs. This degraded performance for A40/RTX3090 and RTX 4090 GPUs slightly.

### 0.40.2

Bug fixes:

- Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
- Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk
- Fixed a bug where read permission was assumed for a file. #497
- Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro

### 0.40.1

Features:

- Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
- CUDA SETUP now no longer looks for libcuda and libcudart and instead relies on the PyTorch CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

- Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
- Fixed a missing scipy dependency in requirements.txt. #544
- Fixed a bug where a view operation could cause an error in 8-bit layers.
- Fixed a bug where CPU bitsandbytes would fail during import. #593 Thank you @bilelomrani

Documentation:

- Improved documentation for GPUs that do not support 8-bit matmul. #529
- Added description and pointers for the NF4 data type. #543

### 0.40.0

Features:

- Added 4-bit inference kernels for batch size=1. Currently supported are the NF4 and FP4 data types.
- Added support for quantizations of bfloat16 input data.

Bug fixes:

- Added `device` variable for bitsandbytes layers to be compatible with PyTorch layers.

Deprecated:

- Binaries for CUDA 11.2, 11.6 no longer ship with `pip install bitsandbytes` and need to be compiled from source.

### 0.39.0

Features:

- 4-bit matrix multiplication for Float4 and NormalFloat4 data types.
- Added 4-bit quantization routines
- Double quantization routines for 4-bit quantization
- Paged optimizers for Adam and Lion.
- bfloat16 gradient / weight support for Adam and Lion with 8 or 32-bit states.
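
A minimal sketch of the new 4-bit quantization routines (function names as exposed in `bnb.functional`; the tensor shape is illustrative and exact signatures may have evolved since this release):

```python
import torch
import bitsandbytes.functional as F

A = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Blockwise 4-bit quantization to NormalFloat4; returns the packed tensor plus the
# quantization state (absmax values etc.) needed for dequantization.
qA, state = F.quantize_4bit(A, quant_type="nf4")

A_restored = F.dequantize_4bit(qA, quant_state=state, quant_type="nf4")
print((A - A_restored).abs().mean())  # small quantization error
```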

Bug fixes:

- Fixed a bug where 8-bit models consumed twice the expected memory after serialization

Deprecated:

- Kepler binaries (GTX 700s and Tesla K40/K80) are no longer provided via pip and need to be compiled from source. Kepler support might be fully removed in the future.

### 0.38.1

Features:

- Added Int8 SwitchBack layers
- Added Fake FP8 layers for research purposes (available under `bnb.research.nn. ...`)

### 0.38.0

#### 8-bit Lion, Load/Store 8-bit Models directly from/to HF Hub

Features:

- Support for 32-bit and 8-bit Lion has been added. Thank you @lucidrains
- Support for serialization of Linear8bitLt layers (LLM.int8()). This allows storing and loading 8-bit weights directly from the HuggingFace Hub (a usage sketch follows below). Thank you @myrab
- New bug report feature: `python -m bitsandbytes` now gives extensive debugging details to help debug CUDA setup failures.
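
A hedged sketch of the Hub round-trip via the 🤗 Transformers integration (the model id and repo name are placeholders, and the exact Transformers API of that era may differ):

```python
from transformers import AutoModelForCausalLM

# Load with LLM.int8() quantization (Linear8bitLt layers) ...
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, device_map="auto")

# ... and push the serialized 8-bit weights straight to the Hub.
model.push_to_hub("my-org/opt-350m-8bit")

# Later, the 8-bit checkpoint can be loaded directly from the Hub.
model_8bit = AutoModelForCausalLM.from_pretrained("my-org/opt-350m-8bit", device_map="auto")
```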

Bug fixes:

- Fixed a bug where some bitsandbytes methods failed in a model-parallel setup on multiple GPUs. Thank you @tonylins
- Fixed a bug where cudart.so libraries could not be found in newer PyTorch releases.

Improvements:

- Improved the CUDA Setup procedure by doing a more extensive search for CUDA libraries

Deprecated:

- Devices with compute capability 3.0 (GTX 700s, K10) and 3.2 (Tegra K1, Jetson TK1) are now deprecated and support will be removed in 0.39.0.
- Support for CUDA 10.0 and 10.2 will be removed in bitsandbytes 0.39.0

### 0.37.0

#### Int8 Matmul + backward support for all GPUs

Features:

- Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov
- Int8 now supported on all GPUs. On devices with compute capability \< 7.5, the Int weights are cast to 16/32-bit for the matrix multiplication. Contributed by @borzunov

Improvements:

- Improved logging for the CUDA detection mechanism.

### 0.36.0

#### Improvements, Ada/Hopper support, fake k-bit quantization.

Features:

- CUDA 11.8 and 12.0 support added
- support for Ada and Hopper GPUs added (compute capability 8.9 and 9.0)
- support for fake k-bit block-wise quantization for Int, Float, quantile quantization, and dynamic exponent data types added
- Added CUDA instruction generator to fix some installations.
- Added additional block sizes for quantization {64, 128, 256, 512, 1024}
- Added SRAM Quantile algorithm to quickly estimate less than 256 quantiles
- Added option to suppress the bitsandbytes welcome message (@Cyberes)

Regression:

- Compute capability 3.0 removed: GTX 600s and 700s series is no longer supported (except GTX 780 and GTX 780 Ti)

Bug fixes:

- fixed a bug where too long directory names would crash the CUDA SETUP #35 (@tomaarsen)
- fixed a bug where CPU installations on Colab would run into an error  #34 (@tomaarsen)
- fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52
- fixed a bug where the CUDA setup failed due to a wrong function call.
- fixed a bug in the CUDA Setup which led to an incomprehensible error if no GPU was detected.
- fixed a bug where the CUDA setup failed when the CUDA runtime was found, but not the CUDA library.
- fixed a bug where not finding the CUDA runtime led to an incomprehensible error.
- fixed a bug where, with CUDA missing, the default was an error instead of loading the CPU library
- fixed a bug where the CC version of the GPU was not detected appropriately (@BlackHC)
- fixed a bug in CPU quantization which led to errors when the input buffer exceeded 2^31 elements

Improvements:

- multiple improvements in formatting, removal of unused imports, and slight performance improvements (@tomaarsen)
- StableEmbedding layer now has device and dtype parameters to make it 1:1 replaceable with regular Embedding layers (@lostmsu)
- runtime performance of block-wise quantization slightly improved
- added an error message for the case where multiple libcudart.so files are installed and bitsandbytes picks the wrong one

### 0.35.4

Bug fixes:

- Fixed a bug where the CUDA setup failed when the CUDA runtime was found, but not the CUDA library.
- Fixed a bug where not finding the CUDA runtime led to an incomprehensible error.

### 0.35.3

Bug fixes:

- Fixed a bug in the CUDA Setup which led to an incomprehensible error if no GPU was detected.

### 0.35.2

Bug fixes:

- Fixed a bug where the CUDA setup failed due to a wrong function call.

### 0.35.1

Features:

- Added CUDA instruction generator to fix some installations.

Bug fixes:

- Fixed a problem where warning messages would be displayed even though everything worked correctly.

### 0.35.0

#### CUDA 11.8 support and bug fixes

Features:

- CUDA 11.8 support added and binaries added to the PyPI release.

Bug fixes:

- fixed a bug where too long directory names would crash the CUDA SETUP #35 (thank you @tomaarsen)
- fixed a bug where CPU installations on Colab would run into an error  #34 (thank you @tomaarsen)
- fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52

### 0.34.0

#### Bug fixes and memory efficient backprop

Features:

- Linear8bitLt layer now supports `memory_efficient_backward=True` which enables backprop of gradients through frozen weights.
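
A minimal sketch of the new flag (layer sizes and the outlier threshold are illustrative; the flag matters when gradients must flow through a frozen 8-bit layer, e.g. when training preceding layers or adapters):

```python
import torch
import bitsandbytes as bnb

adapter = torch.nn.Linear(1024, 1024).half().cuda()  # trainable fp16 layer

# Frozen int8 layer that still lets gradients flow back to `adapter`.
frozen = bnb.nn.Linear8bitLt(
    1024, 1024,
    has_fp16_weights=False,
    threshold=6.0,
    memory_efficient_backward=True,
).cuda()

x = torch.randn(8, 1024, dtype=torch.float16, device="cuda")
out = frozen(adapter(x))
out.float().pow(2).mean().backward()  # backprop through the frozen 8-bit weights
```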

Bug fixes:

- fixed an issue where too many threads were created in blockwise quantization on the CPU for large tensors

### 0.33.0

#### Various bug fixes

Features:

- CPU quantization now supports a variable `blocksize` to trade off quantization speed against precision (see the sketch below).
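
A minimal sketch (the blocksize value is illustrative, and the functional API's exact signature has varied across versions):

```python
import torch
import bitsandbytes.functional as F

A = torch.randn(1024, 1024)  # CPU tensor

# Smaller blocks mean more absmax values and higher precision, at some speed cost.
qA, state = F.quantize_blockwise(A, blocksize=2048)
A_restored = F.dequantize_blockwise(qA, quant_state=state)

print((A - A_restored).abs().max())
```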

Bug fixes:

- fixed an issue in CPU quantization where tensors with more than 2^31 elements would fail 19a7adca7a6c9bf7061a384d7e9d9b13676a1a88
- fixed a bug where cpu binaries would fail if no GPU would be detected eab4d8232d558f2e6bd7f7cc3d00e2e6e94f4e80
- fixed an issue where cpu binaries cause additional stdout messages 92a3363096e10ad6a5c4e944af898bd1186d806a
- fixed an import of bnb.utils 2e630b55f51d454f3bd723dffda68a07ef93190c

We thank @mryab, @mbrukman, @chessgecko, @dbaranchuk for pull requests with bug fixes and new features.

### 0.32.0

#### 8-bit Inference Performance Enhancements

We added performance enhancements for small models. This makes small models about 2x faster for LLM.int8() inference.

Features:

- Int32 dequantization now supports fused biases.
- Linear8bitLt now uses a fused bias implementation.
- Change `.data.storage().data_ptr()` to `.data.data_ptr()` to enhance inference performance.

Bug fixes:

- Now throws an error if LLM.int8() is used on a GPU that is not supported.
- Enhances error messaging if CUDA SETUP fails.

### 0.31.0

#### 8-bit Inference and Packaging Update

Features:

- added direct outlier extraction. This enables outlier extraction without fp16 weights and without performance degradation.
- Added automatic CUDA SETUP procedure and packaging all binaries into a single bitsandbytes package.

### 0.30.0

#### 8-bit Inference Update

Features:

- Added 8-bit matrix multiplication from cuBLAS and cuBLASLt, as well as multiple GEMM kernels (GEMM, GEMMEx, GEMMLt)
- Added 8-bit Linear layers with 8-bit Params that perform memory-efficient inference with an option for 8-bit mixed-precision matrix decomposition for inference without performance degradation
- Added quantization methods for "fake" quantization, optimized kernels for vector-wise quantization and equalization, and optimized cuBLASLt transformations
- CPU only build now available (Thank you, @mryab)

Deprecated:

- Pre-compiled release for CUDA 9.2, 10.0, 10.2 no longer available

### 0.26.0:

Features:

- Added Adagrad (without grad clipping) as 32-bit and 8-bit block-wise optimizer.
- Added AdamW (copy of Adam with a default weight decay of 1e-2). #10
- Introduced ModuleConfig overrides which can seamlessly be used at module initialization time.
- Added `bnb.nn.Embedding` layer which runs at 32-bit but without the layer norm. This works well if you need to fine-tune pretrained models that do not have an embedding layer norm. #19

Bug fixes:

- Fixed a bug where weight decay was incorrectly applied to 32-bit Adam. #13
- Fixed an unsafe use of eval. #8
- Fixed a bug where the StableEmbedding layer 32-bit optimizer override would not work without registering the whole model first (`bnb.optim.GlobalOptimManager.get_instance().register_parameters(model.parameters())`); the override pattern is sketched below. #13 #15
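
For reference, a sketch of the override pattern referenced in that fix (the tiny model is illustrative, and the names follow current `bnb.optim` releases):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Sequential(bnb.nn.StableEmbedding(1000, 64), torch.nn.Linear(64, 2))

# Register parameters while they are still on the CPU, then force 32-bit optimizer
# state for the embedding weights while everything else stays 8-bit.
manager = bnb.optim.GlobalOptimManager.get_instance()
manager.register_parameters(model.parameters())
manager.override_config(model[0].weight, "optim_bits", 32)

model = model.cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
```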

Docs:

- Added instructions on how to solve "\_\_fatbinwrap\_" errors.

### 0.0.25:

Features:

- Added `skip_zeros` for block-wise and 32-bit optimizers. This ensures correct updates for sparse gradients and sparse models.
- Added support for Kepler GPUs. (#4)
- Added Analysis Adam to track 8-bit vs 32-bit quantization errors over time.
- Made compilation more user-friendly.

Bug fixes:

- fixed "undefined symbol: \_\_fatbinwrap_38" error for P100 GPUs on CUDA 10.1 (#5)

Docs:

- Added docs with instructions to compile from source.

### 0.0.24:

- Fixed a bug where a float/half conversion led to a compilation error for CUDA 11.1 on Turing GPUs.
- removed Apex dependency for bnb LAMB

### 0.0.23:

Bugs:

- Unified quantization API: each quantization function now returns `Q, S` where `Q` is the quantized tensor and `S` the quantization state which may hold absolute max values, a quantization map or more. For dequantization all functions now accept the inputs `Q, S` so that `Q` is dequantized with the quantization state `S`.
- Fixed an issue where the CUDA 11.1 binary was not compiled with the right headers

API changes:

- Block-wise quantization for optimizers now enabled by default

Features:

- Block-wise quantization routines now support CPU Tensors.

### 0.0.22:

- Fixed an error where a `reset_parameters()` call on the `StableEmbedding` would lead to an error in older PyTorch versions (from 1.7.0).

### 0.0.21

- Ampere, RTX 30 series GPUs now compatible with the library.