Commits · b1f80b8acc6f8cfa7932dece460f6b600466dd34 · OpenDAS / bitsandbytes

18 Sep, 2025 1 commit

[CUDA] Branchless NF4/FP4 kDequantizeBlockwise kernel for faster dequantization (#1746) · b1f80b8a

Mohamed Hisham authored Sep 18, 2025

* Added branchless LUT-based dequantization for FP4 and NF4

* Added extra command line options to control reproducibility

* Restore FP4 quantization/dequantization order


02 Aug, 2025 1 commit
- Fixing quantization uint8 packing bug for NF4 and FP4 · 639f8c05
  Mohamed Hisham authored Aug 02, 2025
  
13 Jun, 2025 1 commit
- Apply clang-format rules (#1678) · 4955d136
  Matthew Douglas authored Jun 13, 2025
  
04 Jun, 2025 1 commit

Deprecation cleanup (#1669) · 849d9449

Matthew Douglas authored Jun 04, 2025

* Deprecation cleanup: remove histogram_scatter_add_2d

* Deprecation cleanup: vectorwise_mm_dequant

* Deprecation cleanup: vectorwise_quant

* Remove unused test

* Optimizer test cleanup

* Deprecations: remove estimate_quantiles, create_quantile_map

* Move deprecated test


25 Mar, 2025 1 commit

PyTorch Custom Operator Integration (#1544) · e82f72b3

Matthew Douglas authored Mar 25, 2025



* Sketch out first custom op registration

* Add note

* Initial int8 op registration

* Cleanup some deprecated functions.

* Int8 ops updates; tests

* Implement 4bit quant/dequant ops

* Fix nested quant

* cleanup

* Test improvements

* Clean up and improve tests

* Add higher level custom op for int8 matmul + dequant + bias

* Add gemv 4bit custom op

* Cleanup

* Implement out kwarg overloads for custom ops

* Update PyTorch minimum to 2.1

* Deprecation updates

* Deprecation updates

* Cleanup; rename int8_linear_dequant -> int8_scaled_mm

* Bump min pytorch to 2.2

* cleanup

* Test reorganization

* Remove deprecated supports_igemmlt

* More cleanup

* Cleanup obsolete C++/CUDA code

* Cleanup

* Create 'default' backend for fallback op implementations; initial CPU nf4 work

* Stub out for multi-platform

* Fix serialization tests for torch>=2.6.0

* Add example for torch.compile e2e inference

* Test update

---------
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>


14 Jan, 2025 1 commit
- cleanup: remove unused kernels/C++ code (#1458) · 58922237
  Matthew Douglas authored Jan 14, 2025
```
* (chore) Remove unused dotfiles

* cleanup: remove unused kernels/C++ code
```
05 Dec, 2024 1 commit

LLM.int8() Refactoring: Part 1 (#1401) · 81e6345d

Matthew Douglas authored Dec 05, 2024



* Start of int8 refactor: remove col32/col_ampere/col_turing transforms in new igemmlt implementation

* Fix unintended change

* New naive mm_dequant kernel for row-major; cleanup

* fix

* int8 refactor: initial sparse decomp, cleanup

* Int8 refactoring: remove separate NO_CUBLASLT build; more cleanup

* int8: inference optimizations, some cleanup

* int8: more tests passing, cleanup

* int8 - more cleanup, most tests passing

* int8: specify CUDA stream for int8 ops

* perf: reduce overhead from getting cudaStream ptr

* Mark some functions for deprecation.

* int8 sparse decomp: small perf improvement

* update setup.py

* Update bitsandbytes/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/research/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* int8 - perf improvement for sparse decomposition inference; deprecate get_tensor_stream() in favor of new private fn

* int8 cleanup

* Ignore ruff rule ISC001 (incompatible with formatter)

* add comment

* int8 more cleanup

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* int8: rename / deprecate old fn signatures

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* type annotation

* format update

* Update bitsandbytes/research/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* cleanup

* Add comment to explain division optimization

* more cleanup

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* cleanup

* Type annotations, cleanup

* remove unused kernels; improved type annotations

* small perf optimization for single-GPU systems

* small perf optimization for single-GPU systems

* update docstrings

* Improve docs and tests

* Update docstring

* Update test

* add benchmarking script

* test cleanup: add deprecated marker, move benchmarks out

* Add int8 dequant function; misc improvements

* int8 matmul fallback for inner dims not divisible by 4

* improve register usage of kInt8VectorQuant - especially for A100/H100

* disable fail-fast for package build

* maxwell compat

* ptxas verbose

* docs update

* doc update

* backward fix

* Bugfix sparse decomp

* Int8 fix for PEFT OLoRA init

* Fix test for deprecated spmm_coo

* test improvement

* doc update

* typo

* doc cleanup

* docs

* add inference benchmark script

* Add benchmarks, doc update

---------
Co-authored-by: Aarni Koskela <akx@iki.fi>


23 Oct, 2024 1 commit
- Update CI tools & fix typos (#1386) · 9568735b
  Aarni Koskela authored Oct 23, 2024
```
* Update pre-commit tools

* Fix typos
```
20 Sep, 2024 2 commits

Change 8bit optimizer blocksize 2048->256; additional bf16 support (#1365) · aa57bd89
Matthew Douglas authored Sep 20, 2024
```
* Change 8bit optimizer blocksize 2048->256; additional bf16 support
* Update tolerances for 8bit optimizer tests
```

Add AdEMAMix optimizer (#1360) · d9645465

Matthew Douglas authored Sep 20, 2024

* Add AdEMAMix optimizer

* Add PagedAdEMAMix32bit, AdEMAMix32bit

* Add PagedAdEMAMix32bit, AdEMAMix32bit

* AdEMAMix: add support for alpha/beta3 scheduling

* Update paged AdEMAMix


26 Aug, 2024 1 commit

CUDA source cleanup, refactor, and fixes (#1328) · 6bef412a

Abhilash Majumder authored Aug 26, 2024

* remove kcompress

* fix initial template call

* fix function name

* remove vector load

* cleanup reduce & rearrange

* format


12 Jul, 2024 1 commit

Fix CUDA 12.5 build issue (#1273) · 85e01276

Markus Hennerbichler authored Jul 12, 2024

pythonInterface.cpp depends on ops.cuh
which in turn depends on some thrust headers.
It is defined as a C++ compilation unit
which is problematic because thrust doesn't guarantee
compatibility with a host compiler.

This is starting to cause issues with CUDA 12.5.
There is no dependency on the thrust headers,
which means they can be removed without other consequences.


23 Feb, 2024 1 commit
- fix newly found typo due to upgraded typos pkg · 5d6dfe6f
  Titus von Koeller authored Feb 23, 2024
  
14 Feb, 2024 1 commit
- Fix race condition in kEstimateQuantiles (#1061) · 5b28fd3f
  pnunna93 authored Feb 14, 2024
  
05 Feb, 2024 1 commit
- Enable crate-ci/typos lint; fix typos (#1005) · 8c507d92
  Aarni Koskela authored Feb 05, 2024
```
Co-authored-by: Titus von Koeller <titus@vonkoeller.com>

fix erroneous correction
```
01 Feb, 2024 1 commit
- Enable line-ending and other hygiene lints (#1006) · 6974920b
  Aarni Koskela authored Feb 01, 2024
  
31 Jan, 2024 1 commit

minimal fix to support Windows · fd319d51

James Wyatt authored Sep 25, 2023



based on @Jamezo97 and @acpopescu work

manually cherry-picked from PR #788 and PR #229 and cleanup by wkpark
Signed-off-by: Won-Kyu Park <wkpark@gmail.com>


09 Dec, 2023 1 commit
- fix errors about array index out of bounds in kgetColRowStats · 51d30913
  修艺 authored Dec 09, 2023
  
19 Jul, 2023 1 commit
- Increased occupancy. · c82f51c0
  Tim Dettmers authored Jul 19, 2023
  
11 Jul, 2023 1 commit
- Added more extensive gemv tests; blocksize guard for gemv. · ba51d95d
  Tim Dettmers authored Jul 11, 2023
  
10 Jul, 2023 5 commits
- Removed debugging statement. · a26a321e
  Tim Dettmers authored Jul 10, 2023
  
- Fixed accidental deletion of limits in kernel. · 306f6b23
  Tim Dettmers authored Jul 10, 2023
  
- Fixed potential memory leak. · 2221f4ce
  Tim Dettmers authored Jul 10, 2023
  
- Added ARCH guard for bfloat16 computations. · 1c774ece
  Tim Dettmers authored Jul 10, 2023
  
- Added fp32 compute type for gemv_4bit. · 5fab6734
  Tim Dettmers authored Jul 09, 2023
  
09 Jul, 2023 2 commits
- Added FP4 fast inference support. · 94168d79
  Tim Dettmers authored Jul 09, 2023
  
- Added arbitrary data types; fixed a bug for small matrices. · 4b88d69d
  Tim Dettmers authored Jul 09, 2023
  
08 Jul, 2023 2 commits
- Tuning optimization (float accumulation). 185 vs 50. · eefbf602
  Tim Dettmers authored Jul 08, 2023
  
- Added warp_shuffle indexing 185 vs 54. · 7e49b5b9
  Tim Dettmers authored Jul 08, 2023
  
05 Jul, 2023 1 commit
- Added bfloat16 quantizations and tests. · 02fd80cb
  Tim Dettmers authored Jul 04, 2023
  
04 Jul, 2023 2 commits
- Vectorized loads, conflict free NF4; 52 vs 172. · dfe6900b
  Tim Dettmers authored Jul 04, 2023
  
- Initial 4-bit naive batch size 1, 81 vs 185. · f89ff93e
  Tim Dettmers authored Jul 03, 2023
  
31 May, 2023 3 commits
- Added debugging functions. · e54d2730
  Tim Dettmers authored May 30, 2023
  
- Added lookup table. · b7f04e2a
  Tim Dettmers authored May 30, 2023
  
- Added changes for deployment. · ac5550a0
  Tim Dettmers authored May 30, 2023
  
24 May, 2023 1 commit
- Added PagedLion and bf16 Lion. · 1b8772a8
  Tim Dettmers authored May 23, 2023
  
06 May, 2023 1 commit
- Added paging. · ec38ba95
  Tim Dettmers authored May 06, 2023
  
02 May, 2023 3 commits
- 4-bit draft; 128 vector load 240. · 264a9485
  Tim Dettmers authored May 02, 2023
  
- Warp multi-specialization 240. · 869b7e83
  Tim Dettmers authored May 02, 2023
  
- Shared memory efficient 240. · 77f15fdc
  Tim Dettmers authored May 02, 2023
  