Commits · 74f7aa06f31ee8d453f18290537b69359fd69fb2 · OpenDAS / bitsandbytes

23 Oct, 2025 1 commit
- fix code, compiled and tested successfully · 74f7aa06
  limm authored Oct 23, 2025
  
13 Jun, 2025 1 commit
- Apply clang-format rules (#1678) · 4955d136
  Matthew Douglas authored Jun 13, 2025
  
04 Jun, 2025 1 commit

Matthew Douglas authored Jun 04, 2025

* Deprecation cleanup: remove histogram_scatter_add_2d

* Deprecation cleanup: vectorwise_mm_dequant

* Deprecation cleanup: vectorwise_quant

* Remove unused test

* Optimizer test cleanup

* Deprecations: remove estimate_quantiles, create_quantile_map

* Move deprecated test

849d9449

25 Mar, 2025 1 commit

- PyTorch Custom Operator Integration (#1544) · e82f72b3
  Matthew Douglas authored Mar 25, 2025



* Sketch out first custom op registration

* Add note

* Initial int8 op registration

* Cleanup some deprecated functions.

* Int8 ops updates; tests

* Implement 4bit quant/dequant ops

* Fix nested quant

* cleanup

* Test improvements

* Clean up and improve tests

* Add higher level custom op for int8 matmul + dequant + bias

* Add gemv 4bit custom op

* Cleanup

* Implement out kwarg overloads for custom ops

* Update PyTorch minimum to 2.1

* Deprecation updates

* Deprecation updates

* Cleanup; rename int8_linear_dequant -> int8_scaled_mm

* Bump min pytorch to 2.2

* cleanup

* Test reorganization

* Remove deprecated supports_igemmlt

* More cleanup

* Cleanup obsolete C++/CUDA code

* Cleanup

* Create 'default' backend for fallback op implementations; initial CPU nf4 work

* Stub out for multi-platform

* Fix serialization tests for torch>=2.6.0

* Add example for torch.compile e2e inference

* Test update

---------
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>


14 Jan, 2025 1 commit
- cleanup: remove unused kernels/C++ code (#1458) · 58922237
  Matthew Douglas authored Jan 14, 2025
```
* (chore) Remove unused dotfiles

* cleanup: remove unused kernels/C++ code
```
05 Dec, 2024 1 commit

- LLM.int8() Refactoring: Part 1 (#1401) · 81e6345d
  Matthew Douglas authored Dec 05, 2024



* Start of int8 refactor: remove col32/col_ampere/col_turing transforms in new igemmlt implementation

* Fix unintended change

* New naive mm_dequant kernel for row-major; cleanup

* fix

* int8 refactor: initial sparse decomp, cleanup

* Int8 refactoring: remove separate NO_CUBLASLT build; more cleanup

* int8: inference optimizations, some cleanup

* int8: more tests passing, cleanup

* int8 - more cleanup, most tests passing

* int8: specify CUDA stream for int8 ops

* perf: reduce overhead from getting cudaStream ptr

* Mark some functions for deprecation.

* int8 sparse decomp: small perf improvement

* update setup.py

* Update bitsandbytes/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/research/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* int8 - perf improvement for sparse decomposition inference; deprecate get_tensor_stream() in favor of new private fn

* int8 cleanup

* Ignore ruff rule ISC001 (incompatible with formatter)

* add comment

* int8 more cleanup

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* int8: rename / deprecate old fn signatures

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* type annotation

* format update

* Update bitsandbytes/research/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* cleanup

* Add comment to explain division optimization

* more cleanup

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* cleanup

* Type annotations, cleanup

* remove unused kernels; improved type annotations

* small perf optimization for single-GPU systems

* small perf optimization for single-GPU systems

* update docstrings

* Improve docs and tests

* Update docstring

* Update test

* add benchmarking script

* test cleanup: add deprecated marker, move benchmarks out

* Add int8 dequant function; misc improvements

* int8 matmul fallback for inner dims not divisible by 4

* improve register usage of kInt8VectorQuant - especially for A100/H100

* disable fail-fast for package build

* maxwell compat

* ptxas verbose

* docs update

* doc update

* backward fix

* Bugfix sparse decomp

* Int8 fix for PEFT OLoRA init

* Fix test for deprecated spmm_coo

* test improvement

* doc update

* typo

* doc cleanup

* docs

* add inference benchmark script

* Add benchmarks, doc update

---------
Co-authored-by: Aarni Koskela <akx@iki.fi>


20 Sep, 2024 2 commits

- Change 8bit optimizer blocksize 2048->256; additional bf16 support (#1365) · aa57bd89
  Matthew Douglas authored Sep 20, 2024
```
* Change 8bit optimizer blocksize 2048->256; additional bf16 support
* Update tolerances for 8bit optimizer tests
```

- Add AdEMAMix optimizer (#1360) · d9645465
  Matthew Douglas authored Sep 20, 2024

* Add AdEMAMix optimizer

* Add PagedAdEMAMix32bit, AdEMAMix32bit

* Add PagedAdEMAMix32bit, AdEMAMix32bit

* AdEMAMix: add support for alpha/beta3 scheduling

* Update paged AdEMAMix


26 Aug, 2024 1 commit

- CUDA source cleanup, refactor and fixes (#1328) · 6bef412a
  Abhilash Majumder authored Aug 26, 2024

* remove kcompress

* fix initial template call

* fix function name

* remove vector load

* cleanup reduce & rearrange

* format


22 Aug, 2024 1 commit

- Enable certain CUDA kernels to accept specified CUDA stream (#1330) · a685654b
  Jee Jee Li authored Aug 22, 2024

* Done

* fix format

* fix format

* fix format

* fix format

* Address format error and fix default arg bug

* Refine stream argument passing mechanism

* Fix bug

* Delete unused code


29 Mar, 2024 1 commit
- Fix 4bit quantization with blocksize=4096 · c17fb8eb
  Matthew Douglas authored Mar 29, 2024
  
30 Jan, 2024 1 commit
- Don't crash Python interpreter via assert(false) (#998) · 29a637bc
  Aarni Koskela authored Jan 30, 2024
  
11 Jul, 2023 1 commit
- Added more extensive gemv tests; blocksize guard for gemv. · ba51d95d
  Tim Dettmers authored Jul 11, 2023
  
10 Jul, 2023 1 commit
- Added fp32 compute type for gemv_4bit. · 5fab6734
  Tim Dettmers authored Jul 09, 2023
  
09 Jul, 2023 1 commit
- Added arbitrary data types; fixed a bug for small matrices. · 4b88d69d
  Tim Dettmers authored Jul 09, 2023
  
05 Jul, 2023 1 commit
- Added bfloat16 quantizations and tests. · 02fd80cb
  Tim Dettmers authored Jul 04, 2023
  
04 Jul, 2023 2 commits
- Vectorized loads, conflict free NF4; 52 vs 172. · dfe6900b
  Tim Dettmers authored Jul 04, 2023
  
- Initial 4-bit naive batch size 1, 81 vs 185. · f89ff93e
  Tim Dettmers authored Jul 03, 2023
  
24 May, 2023 1 commit
- Added PagedLion and bf16 Lion. · 1b8772a8
  Tim Dettmers authored May 23, 2023
  
06 May, 2023 1 commit
- Added paging. · ec38ba95
  Tim Dettmers authored May 06, 2023
  
02 May, 2023 3 commits
- 4-bit draft; 128 vector load 240. · 264a9485
  Tim Dettmers authored May 02, 2023
  
- Shared memory efficient 240. · 77f15fdc
  Tim Dettmers authored May 02, 2023
  
- Baseline for debugging. · f9bfea8f
  Tim Dettmers authored May 02, 2023
  
01 May, 2023 6 commits
- 8x32 240 6 warps. · 7bfa09d0
  Tim Dettmers authored May 01, 2023
  
- 16x16 240. · 3d4a2ead
  Tim Dettmers authored May 01, 2023
  
- Warp specialization 362. · 7cc8ff47
  Tim Dettmers authored May 01, 2023
  
- 64 threads, high smem, 434. · 30d03e02
  Tim Dettmers authored Apr 30, 2023
  
- Slow non-vector 530. · 604bb3fb
  Tim Dettmers authored Apr 30, 2023
  
- Slow tensor core solution. · ad07d254
  Tim Dettmers authored Apr 30, 2023
  
30 Apr, 2023 1 commit
- 4-bit draft. · 21723f79
  Tim Dettmers authored Apr 29, 2023
  
29 Apr, 2023 5 commits
- Added bit template. · cad83994
  Tim Dettmers authored Apr 28, 2023
  
- New implementation for batch size 1. · f3e97ccb
  Tim Dettmers authored Apr 28, 2023
  
- Added fp16 and thread/item template. · f6df4aef
  Tim Dettmers authored Apr 28, 2023
  
- Added template refactor. · 3aef7834
  Tim Dettmers authored Apr 28, 2023
  
- First baseline kernel. · c1bfb210
  Tim Dettmers authored Apr 28, 2023
  
27 Apr, 2023 3 commits
- Added pipeline draft. · 9cab14a3
  Tim Dettmers authored Apr 27, 2023
  
- Added non-cutlass template. · d1c4c205
  Tim Dettmers authored Apr 27, 2023
  
- Best attempt at cutlass3. · 0afc8e9e
  Tim Dettmers authored Apr 26, 2023
  
26 Apr, 2023 1 commit
- CUTLASS compiles. · 84964db9
  Tim Dettmers authored Apr 25, 2023
  
25 Apr, 2023 1 commit
- Added cutlass example. · 6e2544da
  Tim Dettmers authored Apr 25, 2023
  