Commits · 6aa71bec51306331f422643af4c111a0cd1d0339 · OpenDAS / bitsandbytes

23 Oct, 2025 1 commit
- fix code, compiled and tested successfully · 74f7aa06
  limm authored Oct 23, 2025
  
24 Sep, 2025 1 commit

Fix for warpSize deprecation in ROCm 7.0 (#1762) · b72b766e

pnunna93 authored Sep 24, 2025



* Port ROCm changes from multi-backend-refactor branch

* Update ops.py

* Update functional.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update functional.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update functional.py

* Update functional.py

* Update functional.py

* Update functional.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update functional.py

* Update functional.py

* Update functional.py

* Update test_ops.py

* Update test_functional.py

* Update test_ops.py

* Update test_functional.py

* Update test_functional.py

* Update functional.py

* Update functional.py

* Update ops.py

* Update ops.py

* Update test_functional.py

* Update test_functional.py

* Update cextension.py

* Update cuda_specs.py

* Update cuda_specs.py

* Update test_functional.py

* Update test_linear4bit.py

* Update test_cuda_setup_evaluator.py

* Update test_functional.py

* Update modules.py

* Update modules.py

* Update ops.py

* Update test_linear4bit.py

* Update ops.py

* Update ops.py

* Update test_linear4bit.py

* Update test_linear4bit.py

* Update python-package.yml

* Update python-package.yml

* Update python-package.yml

* Update python-package.yml

* Create build-rocm.sh

* Update cuda_specs.py

* Fix trailing whitespace

* Remove conflicts.diff

* update for hipblasVersionMajor >=3

* Update test_functional.py

* Update test_linear4bit.py

* Update test_ops.py

* Update main.py

* Update test_functional.py

* Update test_linear4bit.py

* Update test_ops.py

* Update test_linear4bit.py

* Lint

* Lint

* Update helpers.py

* Update test_functional.py

* Update test_linear4bit.py

* Update test_ops.py

* Lint

* Update pythonInterface.cpp

* lint fix

* lint

* Update pythonInterface.cpp

* revert permissions change

* Fix indentation

* Update kernels_hip.cuh

* Update kernels.hip

* Update ops.hip

* Update ops_hip.cuh

* Update kernels_hip.cuh

* Update kernels.hip

* Update kernels.hip

* Update ops.hip

* Update ops_hip.cuh

* Update ops.hip

* Update CMakeLists.txt

* Update functional.py

* Update cextension.py

* Update cextension.py

* warpSize is being made non-constexpr in ROCm 7.0

* Merge pull request #90 from ROCm/IFU-rocm_enabled-09-23-2025

Ifu rocm enabled 09 23 2025

* Fix typo

* unskip test_4bit_quant

---------
Co-authored-by: MISHANMAURYA <118961433+MISHANMAURYA@users.noreply.github.com>
Co-authored-by: MISHANMAUYRA <mishanmaurya31081@gmail.com>
Co-authored-by: amcamd <andrew.chapman@amd.com>
Co-authored-by: Prasanth Nunna <root@banff-cyxtera-s78-1.amd.com>
Co-authored-by: sstamenk <strahinja.stamenkovic@amd.com>


23 Sep, 2025 1 commit

Add CUDA 13.0 Support (#1761) · bdb8b2b7

Matthew Douglas authored Sep 23, 2025

* CUDA 13 build enablement

* Try to fix Windows build workflow

* Add torch 2.9+cu130 to tests

* Fix python version

* Update test workflow

* Don't test CPU on torch 2.9 yet

* Update doc


18 Sep, 2025 1 commit

[CUDA] Branchless NF4/FP4 kDequantizeBlockwise kernel for faster dequantization (#1746) · b1f80b8a

Mohamed Hisham authored Sep 18, 2025

* Added branchless LUT-based dequantization for FP4 and NF4

* Added extra command line options to control reproducibility

* Restore FP4 quantization/dequantization order
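
A rough illustration of what lookup-table (LUT) dequantization looks like, written in Python rather than CUDA. The 16 code values are the ones published for the NF4 data type; the function name `dequantize_nf4`, the high-nibble-first packing order, and the single `absmax` scale per call are assumptions made for this sketch, not a description of the actual kernel.

```python
# NF4 code values, indexed by the 4-bit code (0..15).
NF4_LUT = (
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
)


def dequantize_nf4(packed: bytes, absmax: float) -> list:
    """Unpack two 4-bit codes per byte and scale them by absmax.

    Each code is used directly as a table index, so no chain of
    comparisons is needed to decode a value.
    Assumption: the high nibble holds the first value of each pair.
    """
    out = []
    for byte in packed:
        out.append(NF4_LUT[byte >> 4] * absmax)    # first value of the pair
        out.append(NF4_LUT[byte & 0x0F] * absmax)  # second value of the pair
    return out


if __name__ == "__main__":
    print(dequantize_nf4(bytes([0x0F, 0x77]), absmax=2.0))
```

On a GPU the same idea replaces a per-value decision tree with a single indexed load, which keeps all threads of a warp on the same instruction path.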


15 Sep, 2025 1 commit

Add SYCL Kernels for XPU backend (#1679) · 1813b058

Liu Xiaoli authored Sep 15, 2025



* Add SYCL Kernels for XPU backend

* fix transpose
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix log and format
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* revert cpu changes
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* clean ipex_xpu
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* clean ipex import
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix ipex cpu import
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix typo
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix comments
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* refine gemv_4bit kernel

* enable FP4 for dequant_4bit and gemv_4bit

* refine FP4 dequantization performance

* remove check for better performance
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix doc
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* clean code

* fix tests
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* rm comments
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix memory issue

* fix ut failure

* adjust threshold
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix xpu check
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* change test_functional check
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix test_module
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix device check
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix tests
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* Enable Windows build and refine code

* fix xpu log
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* remove ipex entirely
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix cpu int8 CB
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix lint
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix logs (#12)

* fix logs
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* Fix sycl lint error and tests (#13)

* fix sycl nd
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix tests
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* skip typo check for xpu kernel codes (#14)

* skip test for xpu ops
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix lint
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* skip typo for xpu
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* skip
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* skip
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* register triton kernel for quantization (#15)
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* Fix version comparison issue (#18)

# Description

The version comparison expression applies the .release property to only one of the two version objects. This leads to a comparison between a tuple and a Version object.

# Error message
```
The 8-bit optimizer is not available on your device, only available on CUDA for now.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Traceback (most recent call last):
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/unsloth_validation/run.py", line 1, in <module>
    import unsloth
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/v/lib/python3.10/site-packages/unsloth/__init__.py", line 235, in <module>
    from .models import *
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/v/lib/python3.10/site-packages/unsloth/models/__init__.py", line 15, in <module>
    from .llama     import FastLlamaModel
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/v/lib/python3.10/site-packages/unsloth/models/llama.py", line 23, in <module>
    from ._utils import *
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/v/lib/python3.10/site-packages/unsloth/models/_utils.py", line 89, in <module>
    from unsloth_zoo.patching_utils import (
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/v/lib/python3.10/site-packages/unsloth_zoo/patching_utils.py", line 629, in <module>
    import transformers.integrations.bitsandbytes
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/v/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 20, in <module>
    import bitsandbytes as bnb
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/bitsandbytes/bitsandbytes/__init__.py", line 39, in <module>
    from .backends.xpu import ops as xpu_ops
  File "/home/erxin/jenkins/workspace/Unsloth_Benchmark/bitsandbytes/bitsandbytes/backends/xpu/ops.py", line 17, in <module>
    if version.parse(torch.__version__).release >= version.parse("2.9"):
TypeError: '>=' not supported between instances of 'tuple' and 'Version'
```
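
The traceback above comes from comparing a tuple of integers with a `Version` object. One way to avoid that class of error is to bring both sides to the same type before comparing. The sketch below uses only the standard library; `release_tuple` and `at_least` are hypothetical helper names, not part of bitsandbytes or `packaging`, and pre-release or dev suffixes are deliberately ignored.

```python
import re
from typing import Tuple


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '2.9.0+cu130' -> (2, 9, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"no numeric release in {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(installed: str, minimum: str) -> bool:
    """Compare two version strings as equal-length tuples of ints."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that (2, 9) and (2, 9, 0) compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    print(at_least("2.9.0+cu130", "2.9"))
    print(at_least("2.8.1", "2.9"))
```

Comparing integer tuples also avoids the string-ordering trap where "2.10" sorts before "2.9".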

---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Er-Xin (Edwin) Shang <shangerxin@hotmail.com>


02 Aug, 2025 1 commit
- Fixing quantization uint8 packing bug for NF4 and FP4 · 639f8c05
  Mohamed Hisham authored Aug 02, 2025
  
20 Jun, 2025 1 commit

Enable ROCm backend with custom ops integration (#1683) · 888788d7

pnunna93 authored Jun 20, 2025



* Port ROCm changes from multi-backend-refactor branch

* Update ops.py

* Update functional.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update functional.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update functional.py

* Update functional.py

* Update functional.py

* Update functional.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update ops.py

* Update functional.py

* Update functional.py

* Update functional.py

* Update test_ops.py

* Update test_functional.py

* Update test_ops.py

* Update test_functional.py

* Update test_functional.py

* Update functional.py

* Update functional.py

* Update ops.py

* Update ops.py

* Update test_functional.py

* Update test_functional.py

* Update cextension.py

* Update cuda_specs.py

* Update cuda_specs.py

* Update test_functional.py

* Update test_linear4bit.py

* Update test_cuda_setup_evaluator.py

* Update test_functional.py

* Update modules.py

* Update modules.py

* Update ops.py

* Update test_linear4bit.py

* Update ops.py

* Update ops.py

* Update test_linear4bit.py

* Update test_linear4bit.py

* Update python-package.yml

* Update python-package.yml

* Update python-package.yml

* Update python-package.yml

* Create build-rocm.sh

* Update cuda_specs.py

* Fix trailing whitespace

* Remove conflicts.diff

* update for hipblasVersionMajor >=3

* Update test_functional.py

* Update test_linear4bit.py

* Update test_ops.py

* Update main.py

* Update test_functional.py

* Update test_linear4bit.py

* Update test_ops.py

* Update test_linear4bit.py

* Lint

* Lint

* Update helpers.py

* Update test_functional.py

* Update test_linear4bit.py

* Update test_ops.py

* Lint

* Update pythonInterface.cpp

* lint fix

* lint

* Update pythonInterface.cpp

* revert permissions change

* Fix indentation

* Update kernels_hip.cuh

* Update kernels.hip

* Update ops.hip

* Update ops_hip.cuh

* Update kernels_hip.cuh

* Update kernels.hip

* Update kernels.hip

* Update ops.hip

* Update ops_hip.cuh

* Update ops.hip

* Update CMakeLists.txt

* Update functional.py

* Update cextension.py

* Update cextension.py

---------
Co-authored-by: MISHANMAURYA <118961433+MISHANMAURYA@users.noreply.github.com>
Co-authored-by: MISHANMAUYRA <mishanmaurya31081@gmail.com>
Co-authored-by: amcamd <andrew.chapman@amd.com>
Co-authored-by: Prasanth Nunna <root@banff-cyxtera-s78-1.amd.com>


13 Jun, 2025 1 commit
- Apply clang-format rules (#1678) · 4955d136
  Matthew Douglas authored Jun 13, 2025
  
04 Jun, 2025 1 commit

Deprecation cleanup (#1669) · 849d9449

Matthew Douglas authored Jun 04, 2025

* Deprecation cleanup: remove histogram_scatter_add_2d

* Deprecation cleanup: vectorwise_mm_dequant

* Deprecation cleanup: vectorwise_quant

* Remove unused test

* Optimizer test cleanup

* Deprecations: remove estimate_quantiles, create_quantile_map

* Move deprecated test


25 Mar, 2025 1 commit

PyTorch Custom Operator Integration (#1544) · e82f72b3

Matthew Douglas authored Mar 25, 2025



* Sketch out first custom op registration

* Add note

* Initial int8 op registration

* Cleanup some deprecated functions.

* Int8 ops updates; tests

* Implement 4bit quant/dequant ops

* Fix nested quant

* cleanup

* Test improvements

* Clean up and improve tests

* Add higher level custom op for int8 matmul + dequant + bias

* Add gemv 4bit custom op

* Cleanup

* Implement out kwarg overloads for custom ops

* Update PyTorch minimum to 2.1

* Deprecation updates

* Deprecation updates

* Cleanup; rename int8_linear_dequant -> int8_scaled_mm

* Bump min pytorch to 2.2

* cleanup

* Test reorganization

* Remove deprecated supports_igemmlt

* More cleanup

* Cleanup obsolete C++/CUDA code

* Cleanup

* Create 'default' backend for fallback op implementations; initial CPU nf4 work

* Stub out for multi-platform

* Fix serialization tests for torch>=2.6.0

* Add example for torch.compile e2e inference

* Test update

---------
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>


14 Jan, 2025 1 commit
- cleanup: remove unused kernels/C++ code (#1458) · 58922237
  Matthew Douglas authored Jan 14, 2025
```
* (chore) Remove unused dotfiles

* cleanup: remove unused kernels/C++ code
```
05 Dec, 2024 1 commit

LLM.int8() Refactoring: Part 1 (#1401) · 81e6345d

Matthew Douglas authored Dec 05, 2024



* Start of int8 refactor: remove col32/col_ampere/col_turing transforms in new igemmlt implementation

* Fix unintended change

* New naive mm_dequant kernel for row-major; cleanup

* fix

* int8 refactor: initial sparse decomp, cleanup

* Int8 refactoring: remove separate NO_CUBLASLT build; more cleanup

* int8: inference optimizations, some cleanup

* int8: more tests passing, cleanup

* int8 - more cleanup, most tests passing

* int8: specify CUDA stream for int8 ops

* perf: reduce overhead from getting cudaStream ptr

* Mark some functions for deprecation.

* int8 sparse decomp: small perf improvement

* update setup.py

* Update bitsandbytes/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/research/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* int8 - perf improvement for sparse decomposition inference; deprecate get_tensor_stream() in favor of new private fn

* int8 cleanup

* Ignore ruff rule ISC001 (incompatible with formatter)

* add comment

* int8 more cleanup

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* int8: rename / deprecate old fn signatures

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* type annotation

* format update

* Update bitsandbytes/research/autograd/_functions.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* cleanup

* Add comment to explain division optimization

* more cleanup

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update bitsandbytes/functional.py
Co-authored-by: Aarni Koskela <akx@iki.fi>

* cleanup

* Type annotations, cleanup

* remove unused kernels; improved type annotations

* small perf optimization for single-GPU systems

* small perf optimization for single-GPU systems

* update docstrings

* Improve docs and tests

* Update docstring

* Update test

* add benchmarking script

* test cleanup: add deprecated marker, move benchmarks out

* Add int8 dequant function; misc improvements

* int8 matmul fallback for inner dims not divisible by 4

* improve register usage of kInt8VectorQuant - especially for A100/H100

* disable fail-fast for package build

* maxwell compat

* ptxas verbose

* docs update

* doc update

* backward fix

* Bugfix sparse decomp

* Int8 fix for PEFT OLoRA init

* Fix test for deprecated spmm_coo

* test improvement

* doc update

* typo

* doc cleanup

* docs

* add inference benchmark script

* Add benchmarks, doc update

---------
Co-authored-by: Aarni Koskela <akx@iki.fi>


23 Oct, 2024 1 commit
- Update CI tools & fix typos (#1386) · 9568735b
  Aarni Koskela authored Oct 23, 2024
```
* Update pre-commit tools

* Fix typos
```
20 Sep, 2024 2 commits

Change 8bit optimizer blocksize 2048->256; additional bf16 support (#1365) · aa57bd89
Matthew Douglas authored Sep 20, 2024
```
* Change 8bit optimizer blocksize 2048->256; additional bf16 support
* Update tolerances for 8bit optimizer tests
```

Add AdEMAMix optimizer (#1360) · d9645465

Matthew Douglas authored Sep 20, 2024

* Add AdEMAMix optimizer

* Add PagedAdEMAMix32bit, AdEMAMix32bit

* Add PagedAdEMAMix32bit, AdEMAMix32bit

* AdEMAMix: add support for alpha/beta3 scheduling

* Update paged AdEMAMix


26 Aug, 2024 1 commit

CUDA source cleanup, refactor and fixes (#1328) · 6bef412a

Abhilash Majumder authored Aug 26, 2024

* remove kcompress

* fix initial template call

* fix function name

* remove vector load

* cleanup reduce & rearrange

* format


22 Aug, 2024 1 commit

Enable certain CUDA kernels to accept specified cuda stream (#1330) · a685654b

Jee Jee Li authored Aug 22, 2024

* Done

* fix format

* fix format

* fix format

* fix format

* Address format error and fix default arg bug

* Refine stream argument passing mechanism

* Fix bug

* Delete unused code


12 Jul, 2024 1 commit

Fix CUDA 12.5 build issue (#1273) · 85e01276

Markus Hennerbichler authored Jul 12, 2024

pythonInterface.cpp depends on ops.cuh,
which in turn depends on some thrust headers.
It is defined as a C++ compilation unit,
which is problematic because thrust doesn't guarantee
compatibility with a host compiler.

This is starting to cause issues with CUDA 12.5.
There is no actual dependency on the thrust headers,
so the includes can be removed without other consequences.


29 Mar, 2024 1 commit
- Fix 4bit quantization with blocksize=4096 · c17fb8eb
  Matthew Douglas authored Mar 29, 2024
  
23 Feb, 2024 1 commit
- fix newly found typo due to upgraded typos pkg · 5d6dfe6f
  Titus von Koeller authored Feb 23, 2024
  
14 Feb, 2024 1 commit
- Fix race condition in kEstimateQuantiles (#1061) · 5b28fd3f
  pnunna93 authored Feb 14, 2024
  
05 Feb, 2024 3 commits

Make native code portable and add GitHub workflow for building (#949) · 73d3e7b6

Rickard authored Feb 05, 2024



* Make native code portable and add GitHub workflow for building

* Removed deprecated Python versions

* Update python-package.yml
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update python-package.yml
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update python-package.yml
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update python-package.yml
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update python-package.yml
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update python-package.yml
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update python-package.yml
Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update python-package.yml

* Do not test on Python 3.13 until released

* Update python-package.yml

* Update python-package.yml

* Update python-package.yml

* Update python-package.yml

* Refactor build stage

* Fixed breaking actions change

* Slim down Windows cuda

* Create dependabot.yml

* Bespoke local dev requirements.txt

* Enable VS integration

* Group Dependabot updates

* Cleanup

* Update python-package.yml

* Reinstate file that was wrongly merged

* Fixed regression caused by new version of download-artifact

* Update python-package.yml

* Update python-package.yml

* Fix matrix

* Update python-package.yml

* Merge

* Pipeline

* Fixed conflict

* Fixed conflict

* Update CMakeLists.txt

* Fixed merge error

* cleanup

* cleanup

* Find CUDA

* Fix

* Fixing merge error from latest merge from main

* Fix setup.py

* Fixed typo in artifact name

* Remove linker flags

* Build nocublaslt versions

* Fixed formatting

* Fixed VS Code format on save

* Ran format on save from VScode

* Re-saved the json files using the new settings

* Re-saved CMakeLists.txt to get formatting right

* Add path filter

* Formatting

---------
Co-authored-by: Aarni Koskela <akx@iki.fi>


quantize_block C->C++, use std::thread everywhere (#1024) · 332530ba
Rickard authored Feb 05, 2024

Enable crate-ci/typos lint; fix typos (#1005) · 8c507d92
Aarni Koskela authored Feb 05, 2024
```
Co-authored-by: Titus von Koeller <titus@vonkoeller.com>

fix erroneous correction
```

01 Feb, 2024 1 commit
- Enable line-ending and other hygiene lints (#1006) · 6974920b
  Aarni Koskela authored Feb 01, 2024
  
31 Jan, 2024 1 commit

minimal fix to support Windows · fd319d51

James Wyatt authored Sep 25, 2023



Based on work by @Jamezo97 and @acpopescu.

Manually cherry-picked from PR #788 and PR #229, with cleanup by wkpark.
Signed-off-by: Won-Kyu Park <wkpark@gmail.com>


30 Jan, 2024 1 commit
- Don't crash Python interpreter via assert(false) (#998) · 29a637bc
  Aarni Koskela authored Jan 30, 2024
  
09 Dec, 2023 1 commit
- fix errors about array index out of bounds in kgetColRowStats · 51d30913
  修艺 authored Dec 09, 2023
  
19 Jul, 2023 1 commit
- Increased occupancy. · c82f51c0
  Tim Dettmers authored Jul 19, 2023
  
17 Jul, 2023 1 commit
- Guard for prefetchAsync GPU capability. #470 #451 #477 · 7be5f2c7
  Tim Dettmers authored Jul 16, 2023
  
11 Jul, 2023 1 commit
- Added more extensive gemv tests; blocksize guard for gemv. · ba51d95d
  Tim Dettmers authored Jul 11, 2023
  
10 Jul, 2023 5 commits
- Removed debugging statement. · a26a321e
  Tim Dettmers authored Jul 10, 2023
  
- Fixed accidental deletion of limits in kernel. · 306f6b23
  Tim Dettmers authored Jul 10, 2023
  
- Fixed potential memory leak. · 2221f4ce
  Tim Dettmers authored Jul 10, 2023
  
- Added ARCH guard for bfloat16 computations. · 1c774ece
  Tim Dettmers authored Jul 10, 2023
  
- Added fp32 compute type for gemv_4bit. · 5fab6734
  Tim Dettmers authored Jul 09, 2023
  
09 Jul, 2023 2 commits
- Added FP4 fast inference support. · 94168d79
  Tim Dettmers authored Jul 09, 2023
  
- Added arbitrary data types; fixed a bug for small matrices. · 4b88d69d
  Tim Dettmers authored Jul 09, 2023
  
08 Jul, 2023 2 commits
- Tuning optimization (float accumulation). 185 vs 50. · eefbf602
  Tim Dettmers authored Jul 08, 2023
  
- Added warp_shuffle indexing 185 vs 54. · 7e49b5b9
  Tim Dettmers authored Jul 08, 2023
  