Added bitsandbytes

Commit 144fd688 authored Jun 08, 2023 by zhaoying1
20 changed files
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
+### 0.0.21
+- Ampere, RTX 30 series GPUs now compatible with the library.
+### 0.0.22:
+- Fixed an error where a `reset_parameters()` call on the `StableEmbedding` would lead to an error in older PyTorch versions (from 1.7.0).
+### 0.0.23:
+Bugs:
+ - Unified quantization API: each quantization function now returns `Q, S` where `Q` is the quantized tensor and `S` the quantization state which may hold absolute max values, a quantization map or more. For dequantization all functions now accept the inputs `Q, S` so that `Q` is dequantized with the quantization state `S`.
+ - Fixed an issue where the CUDA 11.1 binary was not compiled with the right headers
+API changes:
+ - Block-wise quantization for optimizers now enabled by default
+Features:
+ - Block-wise quantization routines now support CPU Tensors.
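The `Q, S` convention described for 0.0.23 can be illustrated with a minimal pure-Python sketch. It assumes plain linear absmax quantization per block, which is simpler than the quantization maps the library may use; the names `absmax_quantize` and `absmax_dequantize` are illustrative and are not part of bitsandbytes.

```python
from typing import Dict, List, Tuple


def absmax_quantize(values: List[float], blocksize: int = 4) -> Tuple[List[int], Dict]:
    """Quantize to int8 per block and return (Q, S), mirroring the Q, S convention."""
    q: List[int] = []
    absmax: List[float] = []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        # One scale per block; fall back to 1.0 for an all-zero block.
        scale = max(abs(v) for v in block) or 1.0
        absmax.append(scale)
        q.extend(round(v / scale * 127) for v in block)
    # S carries everything needed to undo the quantization.
    return q, {"absmax": absmax, "blocksize": blocksize}


def absmax_dequantize(q: List[int], state: Dict) -> List[float]:
    """Dequantize Q using the quantization state S."""
    bs = state["blocksize"]
    return [qi / 127 * state["absmax"][i // bs] for i, qi in enumerate(q)]


x = [0.5, -1.0, 0.25, 2.0, 0.0, 0.0]
Q, S = absmax_quantize(x, blocksize=4)
x_hat = absmax_dequantize(Q, S)
```

Because each block has its own scale, a large value in one block does not reduce the resolution available to the other blocks.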
+### 0.0.24:
+- Fixed a bug where a float/half conversion led to a compilation error for CUDA 11.1 on Turing GPUs.
+- removed Apex dependency for bnb LAMB
+### 0.0.25:
+Features:
+ - Added `skip_zeros` for block-wise and 32-bit optimizers. This ensures correct updates for sparse gradients and sparse models.
+ - Added support for Kepler GPUs. (#4)
+ - Added Analysis Adam to track 8-bit vs 32-bit quantization errors over time.
+ - Make compilation more user friendly.
+Bug fixes:
+ - fixed "undefined symbol: \_\_fatbinwrap_38" error for P100 GPUs on CUDA 10.1 (#5)
+Docs:
+ - Added docs with instructions to compile from source.
+### 0.26.0:
+Features:
+ - Added Adagrad (without grad clipping) as 32-bit and 8-bit block-wise optimizer.
+ - Added AdamW (copy of Adam with weight decay init 1e-2). #10
+ - Introduced ModuleConfig overrides which can seamlessly be used at initialization time of a module.
+ - Added `bnb.nn.Embedding` layer which runs at 32-bit but without the layernorm. This works well if you need to fine-tune pretrained models that do not have an embedding layer norm. #19
+Bug fixes:
+ - Fixed a bug where weight decay was incorrectly applied to 32-bit Adam. #13
+ - Fixed an unsafe use of eval. #8
+ - Fixed a bug where the StableEmbedding layer 32-bit optimizer override would not work without registering the whole model first (`bnb.optim.GlobalOptimManager.get_instance().register_parameters(model.parameters())`).  #13 #15 
+Docs:
+ - Added instructions on how to solve "\_\_fatbinwrap_" errors.
+### 0.30.0
+#### 8-bit Inference Update
+Features:
+ - Added 8-bit matrix multiplication from cuBLAS and cuBLASLt as well as multiple GEMM kernels (GEMM, GEMMEx, GEMMLt)
+ - Added 8-bit Linear layers with 8-bit Params that perform memory efficient inference with an option for 8-bit mixed precision matrix decomposition for inference without performance degradation
+ - Added quantization methods for "fake" quantization, optimized kernels for vector-wise quantization and equalization, and optimized cuBLASLt transformations
+ - CPU only build now available (Thank you, @mryab)
+Deprecated:
+ - Pre-compiled release for CUDA 9.2, 10.0, 10.2 no longer available
+### 0.31.0
+#### 8-bit Inference and Packaging Update
+Features:
+ - added direct outlier extraction. This enables outlier extraction without fp16 weights without performance degradation.
+ - Added automatic CUDA SETUP procedure and packaging all binaries into a single bitsandbytes package.
+### 0.32.0
+#### 8-bit Inference Performance Enhancements
+We added performance enhancements for small models. This makes small models about 2x faster for LLM.int8() inference.
+Features:
+ - Int32 dequantization now supports fused biases.
+ - Linear8bitLt now uses a fused bias implementation.
+ - Change `.data.storage().data_ptr()` to `.data.data_ptr()` to enhance inference performance.
+Bug fixes:
+ - Now throws an error if LLM.int8() is used on a GPU that is not supported.
+ - Enhances error messaging if CUDA SETUP fails.
+### 0.33.0
+#### Various bug fixes
+Features:
+ - CPU quantization now supports a variable `blocksize` to enhance quantization speed or precision.
+Bug fixes:
+ - fixed an issue in CPU quantization where tensors with more than 2^31 elements would fail 19a7adca7a6c9bf7061a384d7e9d9b13676a1a88
+ - fixed a bug where cpu binaries would fail if no GPU would be detected eab4d8232d558f2e6bd7f7cc3d00e2e6e94f4e80
+ - fixed an issue where cpu binaries cause additional stdout messages 92a3363096e10ad6a5c4e944af898bd1186d806a
+ - fixed an import of bnb.utils 2e630b55f51d454f3bd723dffda68a07ef93190c
+We thank @mryab, @mbrukman, @chessgecko, @dbaranchuk for pull requests with bug fixes and new features.
+### 0.34.0
+#### Bug fixes and memory efficient backprop
+Features:
+ - Linear8bitLt layer now supports `memory_efficient_backward=True` which enables backprop of gradients through frozen weights.
+Bug fixes:
+ - fixed an issue where too many threads were created in blockwise quantization on the CPU for large tensors
+### 0.35.0
+#### CUDA 11.8 support and bug fixes
+Features:
+ - CUDA 11.8 support added and binaries added to the PyPI release.
+Bug fixes:
+ - fixed a bug where too long directory names would crash the CUDA SETUP #35 (thank you @tomaarsen)
+ - fixed a bug where CPU installations on Colab would run into an error  #34 (thank you @tomaarsen)
+ - fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52
+### 0.35.1
+Features:
+ - Added CUDA instruction generator to fix some installations.
+Bug fixes:
+ - Fixed a problem where warning messages would be displayed even though everything worked correctly.
+### 0.35.2
+Bug fixes:
+ - Fixed a bug where the CUDA setup failed due to a wrong function call.
+### 0.35.3
+Bug fixes:
+ - Fixed a bug in the CUDA Setup which led to an incomprehensible error if no GPU was detected.
+### 0.35.4
+Bug fixes:
+ - Fixed a bug where the CUDA Setup failed when the CUDA runtime was found, but not the CUDA library.
+ - Fixed a bug where not finding the cuda runtime led to an incomprehensible error.
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
+# Code of Conduct
+## Our Pledge
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to make participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+## Our Standards
+Examples of behavior that contributes to creating a positive environment
+include:
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+Examples of unacceptable behavior by participants include:
+* The use of sexualized language or imagery and unwelcome sexual attention or
+  advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+  address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+## Our Responsibilities
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+## Scope
+This Code of Conduct applies within all project spaces, and it also applies when
+an individual is representing the project or its community in public spaces.
+Examples of representing a project or community include using an official
+project e-mail address, posting via an official social media account, or acting
+as an appointed representative at an online or offline event. Representation of
+a project may be further defined and clarified by project maintainers.
+This Code of Conduct also applies outside the project spaces when there is a
+reasonable belief that an individual's behavior may have a negative impact on
+the project or its community.
+## Enforcement
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at <opensource-conduct@fb.com>. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+## Attribution
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+[homepage]: https://www.contributor-covenant.org
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
+# Contributing to bitsandbytes
+We want to make contributing to this project as easy and transparent as
+possible.
+## Pull Requests
+We actively welcome your pull requests.
+1. Fork the repo and create your branch from `main`.
+2. If you've added code that should be tested, add tests.
+3. If you've changed APIs, update the documentation.
+4. Ensure the test suite passes.
+5. Make sure your code lints.
+6. If you haven't already, complete the Contributor License Agreement ("CLA").
+## Contributor License Agreement ("CLA")
+In order to accept your pull request, we need you to submit a CLA. You only need
+to do this once to work on any of Facebook's open source projects.
+Complete your CLA here: <https://code.facebook.com/cla>
+## Issues
+We use GitHub issues to track public bugs. Please ensure your description is
+clear and has sufficient instructions to be able to reproduce the issue.
+Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
+disclosure of security bugs. In those cases, please go through the process
+outlined on that page and do not file a public issue.
+## License
+By contributing to bitsandbytes, you agree that your contributions will be licensed
+under the LICENSE file in the root directory of this source tree.
\ No newline at end of file
--- a/LICENSE
+++ b/LICENSE
+MIT License
+Copyright (c) Facebook, Inc. and its affiliates.
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/Makefile
+++ b/Makefile
+MKFILE_PATH := $(abspath $(lastword $(MAKEFILE_LIST)))
+ROOT_DIR := $(patsubst %/,%,$(dir $(MKFILE_PATH)))
+DTK_PATH := /opt/dtk
+GPP:= /usr/bin/g++
+ifeq ($(CUDA_HOME),)
+	CUDA_HOME:= $(shell which nvcc | rev | cut -d'/' -f3- | rev)
+endif
+ifndef CUDA_VERSION
+$(warning WARNING: CUDA_VERSION not set. Call make with CUDA string, for example: make cuda11x CUDA_VERSION=115 or make cpuonly CUDA_VERSION=CPU)
+CUDA_VERSION:=
+endif
+NVCC := $(CUDA_HOME)/bin/nvcc
+###########################################
+CSRC := $(ROOT_DIR)/csrc
+BUILD_DIR:= $(ROOT_DIR)/build
+FILES_CUDA := $(CSRC)/ops.cu $(CSRC)/kernels.cu
+# FILES_HIP := $(CSRC)/ops.cu $(CSRC)/kernels.cu
+FILES_CPP := $(CSRC)/common.cpp $(CSRC)/cpu_ops.cpp $(CSRC)/pythonInterface.c
+INCLUDE :=  -I $(CUDA_HOME)/include -I $(ROOT_DIR)/csrc -I $(CONDA_PREFIX)/include -I $(ROOT_DIR)/include
+LIB := -L $(CUDA_HOME)/lib64 -lcudart -lcublas -lcublasLt -lcurand -lcusparse -L $(CONDA_PREFIX)/lib
+# NVIDIA NVCC compilation flags
+COMPUTE_CAPABILITY := -gencode arch=compute_35,code=sm_35 # Kepler 
+COMPUTE_CAPABILITY += -gencode arch=compute_37,code=sm_37 # Kepler 
+COMPUTE_CAPABILITY += -gencode arch=compute_50,code=sm_50 # Maxwell
+COMPUTE_CAPABILITY += -gencode arch=compute_52,code=sm_52 # Maxwell
+COMPUTE_CAPABILITY += -gencode arch=compute_60,code=sm_60 # Pascal
+COMPUTE_CAPABILITY += -gencode arch=compute_61,code=sm_61 # Pascal
+COMPUTE_CAPABILITY += -gencode arch=compute_70,code=sm_70 # Volta
+COMPUTE_CAPABILITY += -gencode arch=compute_72,code=sm_72 # Volta 
+# CUDA 9.2 supports CC 3.0, but CUDA >= 11.0 does not
+CC_CUDA92 := -gencode arch=compute_30,code=sm_30
+# Later versions of CUDA support the new architectures
+CC_CUDA10x := -gencode arch=compute_30,code=sm_30
+CC_CUDA10x += -gencode arch=compute_75,code=sm_75
+CC_CUDA110 := -gencode arch=compute_75,code=sm_75
+CC_CUDA110 += -gencode arch=compute_80,code=sm_80
+# CC_CUDA11x := -gencode arch=compute_52,code=sm_52
+CC_CUDA11x := -gencode arch=compute_75,code=sm_75
+CC_CUDA11x += -gencode arch=compute_80,code=sm_80
+CC_CUDA11x += -gencode arch=compute_86,code=sm_86
+CC_cublasLt110 := -gencode arch=compute_75,code=sm_75
+CC_cublasLt110 += -gencode arch=compute_80,code=sm_80
+# CC_cublasLt111 := -gencode arch=compute_52,code=sm_52
+CC_cublasLt111 := -gencode arch=compute_75,code=sm_75
+CC_cublasLt111 += -gencode arch=compute_80,code=sm_80
+CC_cublasLt111 += -gencode arch=compute_86,code=sm_86
+all: $(BUILD_DIR) env
+	$(NVCC) $(COMPUTE_CAPABILITY) -Xcompiler '-fPIC' --use_fast_math -Xptxas=-v -dc $(FILES_CUDA) $(INCLUDE) $(LIB) --output-directory $(BUILD_DIR) 
+	$(NVCC) $(COMPUTE_CAPABILITY) -Xcompiler '-fPIC' -dlink $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o -o $(BUILD_DIR)/link.o 
+	$(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION).so $(LIB)
+cuda92: $(BUILD_DIR) env
+	$(NVCC) $(COMPUTE_CAPABILITY) $(CC_CUDA92) -Xcompiler '-fPIC' --use_fast_math -Xptxas=-v -dc $(FILES_CUDA) $(INCLUDE) $(LIB) --output-directory $(BUILD_DIR) -D NO_CUBLASLT
+	$(NVCC) $(COMPUTE_CAPABILITY) $(CC_CUDA92) -Xcompiler '-fPIC' -dlink $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o -o $(BUILD_DIR)/link.o 
+	$(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION)_nocublaslt.so $(LIB)
+cuda10x_nomatmul: $(BUILD_DIR) env
+	$(NVCC) $(COMPUTE_CAPABILITY) $(CC_CUDA10x) -Xcompiler '-fPIC' --use_fast_math -Xptxas=-v -dc $(FILES_CUDA) $(INCLUDE) $(LIB) --output-directory $(BUILD_DIR) -D NO_CUBLASLT
+	$(NVCC) $(COMPUTE_CAPABILITY) $(CC_CUDA10x) -Xcompiler '-fPIC' -dlink $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o -o $(BUILD_DIR)/link.o 
+	$(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION)_nocublaslt.so $(LIB)
+cuda110_nomatmul: $(BUILD_DIR) env
+	$(NVCC) $(COMPUTE_CAPABILITY) $(CC_CUDA110) -Xcompiler '-fPIC' --use_fast_math -Xptxas=-v -dc $(FILES_CUDA) $(INCLUDE) $(LIB) --output-directory $(BUILD_DIR) -D NO_CUBLASLT
+	$(NVCC) $(COMPUTE_CAPABILITY) $(CC_CUDA110) -Xcompiler '-fPIC' -dlink $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o -o $(BUILD_DIR)/link.o 
+	$(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION)_nocublaslt.so $(LIB)
+cuda11x_nomatmul: $(BUILD_DIR) env
+	$(NVCC) $(COMPUTE_CAPABILITY) $(CC_CUDA11x) -Xcompiler '-fPIC' --use_fast_math -Xptxas=-v -dc $(FILES_CUDA) $(INCLUDE) $(LIB) --output-directory $(BUILD_DIR) -D NO_CUBLASLT
+	$(NVCC) $(COMPUTE_CAPABILITY) $(CC_CUDA11x) -Xcompiler '-fPIC' -dlink $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o -o $(BUILD_DIR)/link.o 
+	$(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION)_nocublaslt.so $(LIB)
+cuda110: $(BUILD_DIR) env
+	$(NVCC) $(CC_cublasLt110) -Xcompiler '-fPIC' --use_fast_math -Xptxas=-v -dc $(FILES_CUDA) $(INCLUDE) $(LIB) --output-directory $(BUILD_DIR)
+	$(NVCC) $(CC_cublasLt110) -Xcompiler '-fPIC' -dlink $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o -o $(BUILD_DIR)/link.o 
+	$(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION).so $(LIB)
+cuda11x: $(BUILD_DIR) env
+	$(NVCC) $(CC_cublasLt111) -Xcompiler '-fPIC' --use_fast_math -Xptxas=-v -dc $(FILES_CUDA) $(INCLUDE) $(LIB) --output-directory $(BUILD_DIR)
+	$(NVCC) $(CC_cublasLt111) -Xcompiler '-fPIC' -dlink $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o -o $(BUILD_DIR)/link.o 
+	$(GPP) -std=c++14 -DBUILD_CUDA -shared -fPIC $(INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(BUILD_DIR)/link.o $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cuda$(CUDA_VERSION).so $(LIB)
+cpuonly: $(BUILD_DIR) env
+	$(GPP) -std=c++14 -shared -fPIC -I $(ROOT_DIR)/csrc -I $(ROOT_DIR)/include $(FILES_CPP) -o ./bitsandbytes/libbitsandbytes_cpu.so
+HIP_INCLUDE := -I $(ROOT_DIR)/csrc -I $(ROOT_DIR)/include 
+# -I /opt/rocm-5.3.0/hipcub/include
+HIP_LIB := -L$(DTK_PATH)/lib -L$(DTK_PATH)/llvm/bin/../lib/clang/14.0.0/lib/linux -L/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7 -L/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -lgcc_s -lgcc -lpthread -lm -lrt -lamdhip64 -lhipblas -lhipsparse -lclang_rt.builtins-x86_64 -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc
+hip: $(BUILD_DIR)
+	$(DTK_PATH)/bin/hipcc -std=c++14 -c -fPIC --amdgpu-target=gfx906 --gpu-max-threads-per-block=1024 --offload-arch=gfx906 $(HIP_INCLUDE) -o $(BUILD_DIR)/ops.o -D NO_CUBLASLT $(CSRC)/ops.cu
+	$(DTK_PATH)/bin/hipcc -std=c++14 -c -fPIC --amdgpu-target=gfx906 --gpu-max-threads-per-block=1024 --offload-arch=gfx906 $(HIP_INCLUDE) -o $(BUILD_DIR)/kernels.o -D NO_CUBLASLT $(CSRC)/kernels.cu
+	# /usr/bin/hipcc -fPIC -static $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o -o $(BUILD_DIR)/link.so 
+	$(GPP) -std=c++14 -D__HIP_PLATFORM_AMD__ -DBUILD_CUDA -shared -fPIC -I $(DTK_PATH)/include $(HIP_INCLUDE) $(BUILD_DIR)/ops.o $(BUILD_DIR)/kernels.o $(FILES_CPP) $(HIP_LIB) -o ./bitsandbytes/libbitsandbytes_hip_nocublaslt.so 
+env:
+	@echo "ENVIRONMENT"
+	@echo "============================"
+	@echo "CUDA_VERSION: $(CUDA_VERSION)"
+	@echo "============================"
+	@echo "NVCC path: $(NVCC)"
+	@echo "GPP path: $(GPP) VERSION: `$(GPP) --version | head -n 1`"
+	@echo "CUDA_HOME: $(CUDA_HOME)"
+	@echo "CONDA_PREFIX: $(CONDA_PREFIX)"
+	@echo "PATH: $(PATH)"
+	@echo "LD_LIBRARY_PATH: $(LD_LIBRARY_PATH)"
+	@echo "============================"
+$(BUILD_DIR):
+	mkdir -p build
+	mkdir -p dependencies
+# $(ROOT_DIR)/dependencies/cub:
+# 	git clone https://github.com/NVlabs/cub $(ROOT_DIR)/dependencies/cub
+# 	cd dependencies/cub; git checkout 1.11.0
+clean:
+	rm build/* 
+cleaneggs:
+	rm -rf *.egg*
+cleanlibs:
+	rm ./bitsandbytes/libbitsandbytes*.so
--- a/NOTICE.md
+++ b/NOTICE.md
+The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license.
+We thank Fabio Cannizzo for his work on FastBinarySearch which is included in this project.
--- a/README.md
+++ b/README.md
 # bitsandbytes
+bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
+Resources:
+- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) --  [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)
+- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)
+## Installation
+**Note: This repository is still under development and currently supports only optimizer-related features. Other features (such as `bnb.nn.Linear8bitLt(...)`) cannot be used yet.**
+**Pre-Requisites**
+- An AMD GPU capable of supporting ROCm and an appropriate amdgpu driver
+- Assumes your ROCm tools are installed in /opt/rocm/ and rocminfo is in your path (often found in /opt/rocm/bin/). If not, edit the Makefile to match your distro and/or update your PATH before running.
+Linux distribution (Ubuntu, MacOS, etc.) + CUDA >= 10.0. LLM.int8() requires Turing or Ampere GPUs.
+**Installation**:
+``pip install bitsandbytes``
+**Compiling**
+```sh
+# activate your VENV, if using this within a VENV
+git clone http://developer.hpccube.com/codes/aicomponent/bitsandbytes.git
+export CUDA_VERSION=gfx906
+make hip
+python setup.py install
+python3 -m bitsandbytes # to validate it works
+```
+## TL;DR
+**Using 8-bit optimizer**:
+1. Comment out optimizer: ``#torch.optim.Adam(....)``
+2. Add 8-bit optimizer of your choice ``bnb.optim.Adam8bit(....)`` (arguments stay the same)
+3. Replace embedding layer if necessary: ``torch.nn.Embedding(..) -> bnb.nn.Embedding(..)``
+**Using 8-bit Inference**:
+1. Comment out torch.nn.Linear: ``#linear = torch.nn.Linear(...)``
+2. Add bnb 8-bit linear light module: ``linear = bnb.nn.Linear8bitLt(...)`` (base arguments stay the same)
+3. There are two modes:
+   - Mixed 8-bit training with 16-bit main weights. Pass the argument ``has_fp16_weights=True`` (default)
+   - Int8 inference. Pass the argument ``has_fp16_weights=False``
+4. To use the full LLM.int8() method, use the ``threshold=k`` argument. We recommend ``k=6.0``.
+```python
+# LLM.int8()
+linear = bnb.nn.Linear8bitLt(dim1, dim2, bias=True, has_fp16_weights=False, threshold=6.0)
+# inputs need to be fp16
+out = linear(x.to(torch.float16))
+```
+## Features
+- 8-bit Matrix multiplication with mixed precision decomposition
+- LLM.int8() inference
+- 8-bit Optimizers: Adam, AdamW, RMSProp, LARS, LAMB (saves 75% memory)
+- Stable Embedding Layer: Improved stability through better initialization, and normalization
+- 8-bit quantization: Quantile, Linear, and Dynamic quantization
+- Fast quantile estimation: Up to 100x faster than other algorithms
+## Requirements & Installation
+Requirements: anaconda, cudatoolkit, pytorch
+Hardware requirements: 
+ - LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
+ - 8-bit optimizers and quantization: NVIDIA Maxwell GPU or newer (>=GTX 9XX).
+Supported CUDA versions: 10.2 - 11.7
+The bitsandbytes library is currently only supported on Linux distributions. Windows is not supported at the moment.
+The requirements can best be fulfilled by installing pytorch via anaconda. You can install PyTorch by following the ["Get Started"](https://pytorch.org/get-started/locally/) instructions on the official website.
+## Using bitsandbytes
+### Using Int8 Matrix Multiplication
+For straight Int8 matrix multiplication with mixed precision decomposition you can use ``bnb.matmul(...)``. To enable mixed precision decomposition, use the threshold parameter:
+```python
+bnb.matmul(..., threshold=6.0)
+```
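The mixed precision decomposition controlled by `threshold` can be sketched in pure Python. This is a toy illustration under stated assumptions, not the library's CUDA implementation: feature dimensions of the input whose magnitude reaches the threshold are multiplied in full precision, and the remaining dimensions go through vector-wise int8 quantization. The function name `matmul_mixed` and the `>=` comparison are assumptions made for this example.

```python
def matmul_mixed(X, W, threshold=6.0):
    """Toy mixed-precision matmul: outlier dims in float, the rest in int8."""
    n, k, m = len(X), len(W), len(W[0])
    # A feature dimension is an outlier if any row of X reaches the threshold there.
    outlier = [any(abs(X[i][j]) >= threshold for i in range(n)) for j in range(k)]
    regular = [j for j in range(k) if not outlier[j]]
    # Vector-wise scales: one per row of X and one per column of W.
    sx = [max((abs(X[i][j]) for j in regular), default=0.0) or 1.0 for i in range(n)]
    sw = [max((abs(W[j][c]) for j in regular), default=0.0) or 1.0 for c in range(m)]
    qX = [[round(X[i][j] / sx[i] * 127) for j in regular] for i in range(n)]
    qW = [[round(W[j][c] / sw[c] * 127) for c in range(m)] for j in regular]
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for c in range(m):
            # Integer accumulation over the regular dimensions.
            acc = sum(qX[i][t] * qW[t][c] for t in range(len(regular)))
            # Full-precision product over the outlier dimensions.
            hi = sum(X[i][j] * W[j][c] for j in range(k) if outlier[j])
            out[i][c] = acc * sx[i] * sw[c] / (127 * 127) + hi
    return out


X = [[0.5, -8.0, 1.0], [0.25, 0.1, -1.0]]
W = [[1.0, -2.0], [0.5, 0.25], [-1.5, 1.0]]
approx = matmul_mixed(X, W, threshold=6.0)
exact = [[sum(X[i][j] * W[j][c] for j in range(3)) for c in range(2)] for i in range(2)]
```

Keeping the `-8.0` column out of the int8 path means the row scale is set by the ordinary values, so those values keep more of the 8-bit resolution.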
+For instructions on how to use LLM.int8() inference layers in your own code, see the TL;DR above. For extended instructions, see [this blog post](https://github.com/huggingface/transformers).
+### Using the 8-bit Optimizers
+With bitsandbytes, 8-bit optimizers can be used by changing a single line of code in your codebase. For NLP models we also recommend using the StableEmbedding layer (see below), which improves results and helps with stable 8-bit optimization. To get started with 8-bit optimizers, it is sufficient to replace your old optimizer with the 8-bit optimizer in the following way:
+```python
+import bitsandbytes as bnb
+# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
+adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
+adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent
+# torch.nn.Embedding(...) -> bnb.nn.StableEmbedding(...)  # recommended for NLP models
+```
+Note that by default all parameter tensors with fewer than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done since such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:
+```
+# parameter tensors with less than 16384 values are optimized in 32-bit
+# it is recommended to use multiples of 4096
+adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384) 
+```
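As a rough illustration of what `min_8bit_size` trades off, the following sketch estimates the memory used by Adam's two state buffers for a list of tensor sizes. It assumes 4 bytes per value for 32-bit state and 1 byte per value for 8-bit state, and it ignores the small overhead of the quantization state; `adam_state_bytes` is a hypothetical helper, not a bitsandbytes function.

```python
def adam_state_bytes(tensor_sizes, min_8bit_size=4096):
    """Estimate bytes for Adam's two state buffers across all tensors."""
    total = 0
    for numel in tensor_sizes:
        # Tensors below the cutoff keep 32-bit state (4 bytes per value).
        bytes_per_value = 1 if numel >= min_8bit_size else 4
        total += 2 * numel * bytes_per_value
    return total


sizes = [1_000_000, 1024]  # one large weight matrix, one small bias
mixed = adam_state_bytes(sizes, min_8bit_size=4096)
all_32bit = adam_state_bytes(sizes, min_8bit_size=10**9)
```

In this example the small tensor contributes only a few kilobytes either way, which is why keeping it at 32-bit costs little memory.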
+### Change Bits and other Hyperparameters for Individual Parameters
+If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things: (1) register the parameters while they are still on the CPU, (2) override the config with the new desired hyperparameters (anytime, anywhere). See our [guide](howto_config_override.md) for more details.
+### Fairseq Users
+To use the Stable Embedding Layer, override the respective `build_embedding(...)` function of your model. Make sure to also use the `--no-scale-embedding` flag to disable scaling of the word embedding layer (now replaced with layer norm). You can use the optimizers by replacing the optimizer in the respective file (`adam.py` etc.).
+## Release and Feature History
+For upcoming features and changes and full history see [Patch Notes](CHANGELOG.md).
+## Errors
+1. RuntimeError: CUDA error: no kernel image is available for execution on the device. [Solution](errors_and_solutions.md#No-kernel-image-available)
+2. __fatbinwrap_.. [Solution](errors_and_solutions.md#fatbinwrap_)
+## Compile from source
+To compile from source, please follow the [compile_from_source.md](compile_from_source.md) instructions.
+## License
+The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license.
+We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.
+## How to cite us
+If you found this library and LLM.int8() useful, please consider citing our work:
+```bibtex
+@article{dettmers2022llmint8,
+  title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
+  author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
+  journal={arXiv preprint arXiv:2208.07339},
+  year={2022}
+}
+```
+For 8-bit optimizers or quantization routines, please consider citing the following work:
+```bibtex
+@article{dettmers2022optimizers,
+  title={8-bit Optimizers via Block-wise Quantization},
+  author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
+  journal={9th International Conference on Learning Representations, ICLR},
+  year={2022}
+}
+```
--- a/bitsandbytes/__init__.py
+++ b/bitsandbytes/__init__.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+from .autograd._functions import (
+    MatmulLtState,
+    bmm_cublas,
+    matmul,
+    matmul_cublas,
+    mm_cublas,
+)
+from .cextension import COMPILED_WITH_CUDA
+from .nn import modules
+if COMPILED_WITH_CUDA:
+    from .optim import adam
+__pdoc__ = {
+    "libbitsandbytes": False,
+    "optim.optimizer.Optimizer8bit": False,
+    "optim.optimizer.MockArgs": False,
+}
+PACKAGE_GITHUB_URL = "https://github.com/TimDettmers/bitsandbytes"
--- a/bitsandbytes/__main__.py
+++ b/bitsandbytes/__main__.py
+# from bitsandbytes.debug_cli import cli
+# cli()
+import os
+import sys
+from warnings import warn
+import torch
+HEADER_WIDTH = 60
+def print_header(
+    txt: str, width: int = HEADER_WIDTH, filler: str = "+"
+) -> None:
+    txt = f" {txt} " if txt else ""
+    print(txt.center(width, filler))
+def print_debug_info() -> None:
+    print(
+        "\nAbove we output some debug information. Please provide this info when "
+        f"creating an issue via {PACKAGE_GITHUB_URL}/issues/new/choose ...\n"
+    )
+print_header("")
+print_header("DEBUG INFORMATION")
+print_header("")
+print()
+from . import COMPILED_WITH_CUDA, PACKAGE_GITHUB_URL
+print_header("")
+print_header("DEBUG INFO END")
+print_header("")
+print(
+    """
+Running a quick check that:
+    + library is importable
+    + CUDA function is callable
+"""
+)
+try:
+    from bitsandbytes.optim import Adam
+    p = torch.nn.Parameter(torch.rand(10, 10).cuda())
+    a = torch.rand(10, 10).cuda()
+    p1 = p.data.sum().item()
+    adam = Adam([p])
+    out = a * p
+    loss = out.sum()
+    loss.backward()
+    adam.step()
+    p2 = p.data.sum().item()
+    assert p1 != p2
+    print("SUCCESS!")
+    print("Installation was successful!")
+    sys.exit(0)
+except ImportError:
+    print()
+    warn(
+        f"WARNING: {__package__} is currently running as CPU-only!\n"
+        "Therefore, 8-bit optimizers and GPU quantization are unavailable.\n\n"
+        "If you think this is a mistake,\nplease report an issue!"
+    )
+    print_debug_info()
+    sys.exit(0)
+except Exception as e:
+    print(e)
+    print_debug_info()
+    sys.exit(1)
--- a/bitsandbytes/autograd/__init__.py
+++ b/bitsandbytes/autograd/__init__.py
--- a/bitsandbytes/autograd/_functions.py
+++ b/bitsandbytes/autograd/_functions.py
+import operator
+import warnings
+import torch
+import bitsandbytes.functional as F
+from dataclasses import dataclass
+from functools import reduce  # Required in Python 3
+# math.prod not compatible with python < 3.8
+def prod(iterable):
+    return reduce(operator.mul, iterable, 1)
+tensor = torch.Tensor
+"""
+    This class pools outlier dimensions across layers.
+    This is particularly important for small models where outlier features 
+    are less systematic and occur with low frequency.
+"""
+class GlobalOutlierPooler(object):
+    _instance = None
+    def __init__(self):
+        raise RuntimeError("Call get_instance() instead")
+    def initialize(self):
+        self.outliers = set()
+        self.model_dim = None
+    @classmethod
+    def get_instance(cls):
+        if cls._instance is None:
+            cls._instance = cls.__new__(cls)
+            cls._instance.initialize()
+        return cls._instance
+    def add_outliers(self, outlier_idx, feature_dim):
+        if self.model_dim is None:
+            self.model_dim = feature_dim
+        if feature_dim != self.model_dim:
+            return  # we do not encode outliers for the 2nd FFN layer
+        self.outliers.update(outlier_idx.tolist())
+    def get_current_outlier_idx(self):
+        return torch.Tensor(list(self.outliers)).to(torch.int64)
+class MatMul8bit(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, A, B, out=None, quant_type="vector", precision=(8, 8, 8)):
+        if precision[0] != 8:
+            with torch.no_grad():
+                output = torch.matmul(A, B)
+        else:
+            if len(B.shape) == 2:
+                dim = 0
+            else:
+                dim = 1
+            qA, SA = F.vectorwise_quant(A, dim=-1, quant_type=quant_type)
+            qB, SB = F.vectorwise_quant(B, dim=dim, quant_type=quant_type)
+            iout = F.igemm(qA, qB)
+            output = F.vectorwise_mm_dequant(iout, SA, SB, A.dtype, quant_type)
+        if A.requires_grad or B.requires_grad:
+            ctx.save_for_backward(A, B)
+        ctx.quant_type = quant_type
+        ctx.precision = precision
+        return output
+    @staticmethod
+    def backward(ctx, grad_output):
+        A, B = ctx.saved_tensors
+        quant_type = ctx.quant_type
+        precision = ctx.precision
+        grad_A = grad_B = None
+        if B.requires_grad:
+            if len(A.shape) == 3:
+                dims = [0, 1]
+                # bsi -> ibs
+                permute_dim = [0, 2, 1]
+            else:
+                dims = [0]
+                # bs -> sb
+                permute_dim = [1, 0]
+            if precision[1] != 8:
+                with torch.no_grad():
+                    grad_B = torch.matmul(A.permute(permute_dim), grad_output)
+            else:
+                if len(B.shape) == 2 and len(A.shape) == 3:
+                    # .contiguous() returns the tensor itself when it is already contiguous
+                    grad_output = grad_output.contiguous()
+                    qgrad_output, S1 = F.vectorwise_quant(
+                        grad_output.view(-1, grad_output.shape[2]),
+                        dim=0,
+                        quant_type=quant_type,
+                    )
+                    if not A.is_contiguous():
+                        A = A.contiguous()
+                    qA, S2 = F.vectorwise_quant(
+                        A.view(-1, A.shape[2]), dim=0, quant_type=quant_type
+                    )
+                    igrad_B = F.igemm(qA.t(), qgrad_output)
+                    grad_B = F.vectorwise_mm_dequant(
+                        igrad_B, S2.t(), S1, grad_output.dtype, quant_type
+                    )
+                else:
+                    qgrad_output, S1 = F.vectorwise_quant(
+                        grad_output, dim=dims, quant_type=quant_type
+                    )
+                    qA, S2 = F.vectorwise_quant(
+                        A, dim=dims, quant_type=quant_type
+                    )
+                    igrad_B = F.igemm(qA.permute(permute_dim), qgrad_output)
+                    grad_B = F.vectorwise_mm_dequant(
+                        igrad_B,
+                        S2.permute(permute_dim),
+                        S1,
+                        grad_output.dtype,
+                        quant_type,
+                    )
+        if A.requires_grad:
+            if len(grad_output.shape) == 3:
+                dims = [2]
+            else:
+                dims = [1]
+            if len(B.shape) == 3:
+                # bio -> boi
+                permute_dim = [0, 2, 1]
+                dim_B = dims
+            else:
+                # io -> oi
+                permute_dim = [1, 0]
+                dim_B = [1]
+            if precision[2] != 8:
+                with torch.no_grad():
+                    grad_A = torch.matmul(grad_output, B.permute(permute_dim))
+            else:
+                qgrad_output, S1 = F.vectorwise_quant(
+                    grad_output, dim=dims, quant_type=quant_type
+                )
+                qB, S3 = F.vectorwise_quant(B, dim=dim_B, quant_type=quant_type)
+                igrad_A = F.igemm(qgrad_output, qB.permute(permute_dim))
+                grad_A = F.vectorwise_mm_dequant(
+                    igrad_A,
+                    S1,
+                    S3.permute(permute_dim),
+                    grad_output.dtype,
+                    quant_type,
+                )
+        return grad_A, grad_B, None, None, None
+mm_cublas = MatMul8bit.apply
+bmm_cublas = MatMul8bit.apply
+matmul_cublas = MatMul8bit.apply
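The forward pass of `MatMul8bit` above quantizes `A` and `B` vector-wise, multiplies the integer tensors, and dequantizes the result. A minimal plain-Python sketch of that idea follows. It assumes that vector-wise quantization means absmax scaling per row of `A` and per column of `B`; the helper names (`absmax_quantize`, `int8_matmul`, `float_matmul`) are hypothetical and are not the `bitsandbytes.functional` API.

```python
# Illustrative sketch only: plain-Python stand-ins, not the bitsandbytes kernels.


def absmax_quantize(vec):
    """Scale a vector so its largest magnitude maps to 127, then round to ints."""
    scale = max(abs(v) for v in vec) or 1.0  # fall back to 1.0 for all-zero vectors
    return [round(v / scale * 127) for v in vec], scale


def int8_matmul(A, B):
    """Approximate A @ B with per-row (A) and per-column (B) absmax scales."""
    qA, sA = zip(*(absmax_quantize(row) for row in A))
    qB, sB = zip(*(absmax_quantize(col) for col in zip(*B)))
    out = []
    for qa, sa in zip(qA, sA):
        row = []
        for qb, sb in zip(qB, sB):
            acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulation
            row.append(acc * sa * sb / (127 * 127))   # dequantize with both scales
        out.append(row)
    return out


def float_matmul(A, B):
    """Reference full-precision matmul used for comparison."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
```

Comparing `int8_matmul(A, B)` against `float_matmul(A, B)` on small inputs gives a feel for the rounding error this scheme introduces.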
+@dataclass
+class MatmulLtState:
+    CB = None
+    CxB = None
+    SB = None
+    SCB = None
+    CxBt = None
+    SBt = None
+    CBt = None
+    subB = None
+    outlier_pool = None
+    has_accumulated_gradients = False
+    threshold = 0.0
+    idx = None
+    is_training = True
+    has_fp16_weights = True
+    memory_efficient_backward = False
+    use_pool = False
+    formatB = F.get_special_format_str()
+    def reset_grads(self):
+        self.CB = None
+        self.CxB = None
+        self.SB = None
+        self.SCB = None
+        self.CxBt = None
+        self.SBt = None
+        self.CBt = None
+class MatMul8bitLt(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, A, B, out=None, bias=None, state=MatmulLtState()):
+        # default to pytorch behavior if inputs are empty
+        ctx.is_empty = False
+        if prod(A.shape) == 0:
+            ctx.is_empty = True
+            ctx.A = A
+            ctx.B = B
+            ctx.bias = bias
+            if A.shape[-1] == B.shape[0]:
+                return torch.empty(A.shape[:-1]+B.shape[1:], dtype=A.dtype, device=A.device)
+            else:
+                return torch.empty(A.shape[:-1]+B.shape[:1], dtype=A.dtype, device=A.device)
+        # 1. Quantize A
+        # 2. Quantize B
+        # 3. Matmul
+        # 4. Mixed-precision decomposition matmul
+        # 5. Save state
+        formatB = state.formatB
+        input_shape = A.shape
+        if state.outlier_pool is None:
+            state.outlier_pool = GlobalOutlierPooler.get_instance()
+        # Cast A to fp16
+        if A.dtype != torch.float16:
+            warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
+        # 1. Quantize A
+        if len(A.shape) == 3:
+            A = A.view(-1, A.shape[-1]).contiguous()
+        CA, CAt, SCA, SCAt, coo_tensorA = F.double_quant(
+            A.to(torch.float16), threshold=state.threshold
+        )
+        if state.threshold > 0.0 and coo_tensorA is not None:
+            if state.has_fp16_weights:
+                idx = torch.unique(coo_tensorA.colidx).long()
+                CA[:, idx] = 0
+                CAt[:, idx] = 0
+                subA = A[:, idx]
+                state.subB = B[:, idx].t().contiguous()
+                state.idx = idx
+            else:
+                if state.CxB is None:
+                    # B is in 8-bit row-major; we can transform it back to 16-bit to extract outlier dimensions
+                    # we also need to convert it to the turing/ampere format
+                    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
+        else:
+            if not state.has_fp16_weights and state.CxB is None:
+                state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
+            subA = None
+        # 2. Quantize B
+        if state.has_fp16_weights:
+            has_grad = getattr(B, "grad", None) is not None
+            is_transposed = not B.is_contiguous() and B.shape[0] == B.stride(1)
+            if is_transposed:
+                B = B.contiguous()
+            if (state.is_training and not has_grad) or state.CxB is None:
+                state.reset_grads()
+                (
+                    CB,
+                    state.CBt,
+                    state.SCB,
+                    state.SCBt,
+                    coo_tensorB,
+                ) = F.double_quant(B.to(torch.float16))
+                state.CxB, state.SB = F.transform(CB, to_order=formatB)
+        else:
+            has_grad = False
+        if coo_tensorA is not None and not state.has_fp16_weights:
+            # extract outliers
+            outlier_idx = torch.unique(coo_tensorA.colidx)
+            state.idx = outlier_idx
+            # state.outlier_pool.add_outliers(outlier_idx, A.shape[-1])
+            # if state.use_pool and state.outlier_pool.model_dim == A.shape[-1]:
+            #    # do not use pool for 2nd FFN layer
+            #    state.idx = state.outlier_pool.get_current_outlier_idx().to(A.device)
+            # else:
+            #    state.idx = outlier_idx
+            outliers = F.extract_outliers(state.CxB, state.SB, state.idx.int())
+            state.subB = (
+                (outliers * state.SCB.view(-1, 1) / 127.0)
+                .t()
+                .contiguous()
+                .to(A.dtype)
+            )
+            CA[:, state.idx.long()] = 0
+            CAt[:, state.idx.long()] = 0
+            subA = A[:, state.idx.long()]
+        shapeB = state.SB[0]
+        if len(input_shape) == 3:
+            output_shape = (input_shape[0], input_shape[1], shapeB[0])
+        else:
+            output_shape = (input_shape[0], shapeB[0])
+        # 3. Matmul
+        C32A, SA = F.transform(CA, "col32")
+        out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
+        # we apply the fused bias here
+        if bias is None or bias.dtype == torch.float16:
+            output = F.mm_dequant(out32, Sout32, SCA, state.SCB, bias=bias)
+            output = output.to(A.dtype)
+        else:  # apply bias separately
+            output = F.mm_dequant(out32, Sout32, SCA, state.SCB, bias=None)
+            output = output.to(A.dtype).add_(bias)
+        # 4. Mixed-precision decomposition matmul
+        if coo_tensorA is not None and subA is not None:
+            output += torch.matmul(subA, state.subB)
+        # 5. Save state
+        ctx.state = state
+        ctx.formatB = formatB
+        ctx.grad_shape = input_shape
+        ctx.dtype_A, ctx.dtype_B, ctx.dtype_bias = A.dtype, B.dtype, None if bias is None else bias.dtype
+        if any(ctx.needs_input_grad[:2]):
+            ctx.tensors = (CAt, subA)
+            ctx.tensor_states = (SCAt, state.idx)
+        else:
+            ctx.tensors = [None, None]
+            ctx.tensor_states = (None, None)
+            ctx.save_for_backward(None, None)
+        clone_func = torch.clone if len(output_shape) == 3 else lambda x : x
+        return clone_func(output.view(output_shape))
+    @staticmethod
+    def backward(ctx, grad_output):
+        if ctx.is_empty:
+            bias_grad = (None if ctx.bias is None else torch.zeros_like(ctx.bias))
+            return torch.zeros_like(ctx.A), torch.zeros_like(ctx.B), None, bias_grad, None
+        req_gradA, req_gradB, _, req_gradBias, _ = ctx.needs_input_grad
+        CAt, subA = ctx.tensors
+        SCAt, idx = ctx.tensor_states
+        formatB = ctx.formatB
+        state = ctx.state
+        grad_A = grad_B = grad_bias = None
+        if req_gradBias:
+            # compute grad_bias first before changing grad_output dtype
+            grad_bias = grad_output.sum(0, dtype=ctx.dtype_bias)
+        # Cast grad_output to fp16
+        if len(grad_output.shape) == 3:
+            grad_output = grad_output.reshape(
+                -1, grad_output.shape[-1]
+            ).contiguous()
+        Cgrad, Cgradt, SCgrad, SCgradt, coo_tensor = F.double_quant(grad_output.to(torch.float16))
+        if req_gradB:
+            CxAt, SAt = F.transform(CAt, formatB, transpose=True)
+            C32grad, Sgrad = F.transform(Cgradt, "col32", transpose=True)
+            gradB32, SgradB32 = F.igemmlt(C32grad, CxAt, Sgrad, SAt)
+            grad_B = F.mm_dequant(gradB32, SgradB32, SCgradt, SCAt)
+            if state.threshold > 0.0 and subA is not None:
+                grad_B[:, idx] += torch.matmul(grad_output.t(), subA)
+        if req_gradA:
+            if state.CBt is not None:
+                C32grad, Sgrad = F.transform(Cgrad, "col32")
+                if state.CxBt is None:
+                    state.CxBt, state.SBt = F.transform(
+                        state.CBt, to_order=formatB, transpose=True
+                    )
+                gradA32, SgradA32 = F.igemmlt(C32grad, state.CxBt, Sgrad, state.SBt)
+                grad_A = F.mm_dequant(gradA32, SgradA32, SCgrad, state.SCBt).view(ctx.grad_shape).to(ctx.dtype_A)
+            elif state.CB is not None:
+                CB = state.CB.to(ctx.dtype_A, copy=True).mul_(state.SCB.unsqueeze(1).mul(1. / 127.0))
+                grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
+            else:
+                raise Exception('State must contain either CBt or CB matrix for backward')
+        return grad_A, grad_B, None, grad_bias, None
+def matmul(
+    A: tensor,
+    B: tensor,
+    out: tensor = None,
+    state: MatmulLtState = None,
+    threshold=0.0,
+    bias=None
+):
+    state = state or MatmulLtState()
+    if threshold > 0.0:
+        state.threshold = threshold
+    return MatMul8bitLt.apply(A, B, out, bias, state)
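Step 4 of `MatMul8bitLt.forward` ("Mixed-precision decomposition matmul") zeroes the outlier columns of the quantized input and adds a separate higher-precision product for those columns. The plain-Python sketch below shows why the two partial products sum to the full result. It is not the library's implementation: the function names are hypothetical, `W` is laid out as `(in_features, out_features)` for readability, and the `>=` comparison against the threshold is an assumption.

```python
# Illustrative sketch only: both partial products are computed in floating point
# here, whereas the real kernel would run the "regular" part in int8.


def plain_matmul(A, W):
    """Reference matmul for lists of lists; W has shape (in, out)."""
    return [[sum(a * w for a, w in zip(row, col)) for col in zip(*W)] for row in A]


def outlier_columns(A, threshold):
    """Indices of columns of A that hold any value with magnitude >= threshold."""
    return [j for j in range(len(A[0])) if any(abs(row[j]) >= threshold for row in A)]


def decomposed_matmul(A, W, threshold):
    """Compute A @ W as (A with outlier columns zeroed) @ W + A[:, idx] @ W[idx, :]."""
    idx = outlier_columns(A, threshold)
    A_regular = [[0.0 if j in idx else v for j, v in enumerate(row)] for row in A]
    regular = plain_matmul(A_regular, W)
    if not idx:
        return regular  # nothing to decompose
    A_outlier = [[row[j] for j in idx] for row in A]
    W_outlier = [W[j] for j in idx]
    outlier = plain_matmul(A_outlier, W_outlier)
    return [[r + o for r, o in zip(rr, oo)] for rr, oo in zip(regular, outlier)]
```

Because the outlier columns contribute nothing to the first product and everything else contributes nothing to the second, the sum matches the undecomposed matmul.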
--- a/bitsandbytes/cextension.py
+++ b/bitsandbytes/cextension.py
+import ctypes as ct
+import torch
+from pathlib import Path
+from warnings import warn
+class CUDASetup(object):
+    _instance = None
+    def __init__(self):
+        raise RuntimeError("Call get_instance() instead")
+    def generate_instructions(self):
+        self.add_log_entry('CUDA SETUP: Something unexpected happened. Please compile from source:')
+        self.add_log_entry('git clone git@github.com:TimDettmers/bitsandbytes.git')
+        self.add_log_entry('cd bitsandbytes')
+        self.add_log_entry("<make_cmd here, commented out>")
+        self.add_log_entry('python setup.py install')
+    def initialize(self):
+        self.has_printed = False
+        self.lib = None
+        self.run_cuda_setup()
+    def run_cuda_setup(self):
+        self.initialized = True
+        self.cuda_setup_log = []
+        binary_name = "libbitsandbytes_hip_nocublaslt.so"
+        package_dir = Path(__file__).parent
+        binary_path = package_dir / binary_name
+        try:
+            if not binary_path.exists():
+                raise Exception('CUDA SETUP: Setup Failed!')
+            else:
+                self.add_log_entry(f"CUDA SETUP: Loading binary {binary_path}...")
+                self.lib = ct.cdll.LoadLibrary(binary_path)
+        except Exception as ex:
+            self.add_log_entry(str(ex))
+            self.print_log_stack()
+    def add_log_entry(self, msg, is_warning=False):
+        self.cuda_setup_log.append((msg, is_warning))
+    def print_log_stack(self):
+        for msg, is_warning in self.cuda_setup_log:
+            if is_warning:
+                warn(msg)
+            else:
+                print(msg)
+    @classmethod
+    def get_instance(cls):
+        if cls._instance is None:
+            cls._instance = cls.__new__(cls)
+            cls._instance.initialize()
+        return cls._instance
+lib = CUDASetup.get_instance().lib
+try:
+    if lib is None and torch.cuda.is_available():
+        CUDASetup.get_instance().generate_instructions()
+        CUDASetup.get_instance().print_log_stack()
+        raise RuntimeError('''
+        CUDA Setup failed despite GPU being available. Inspect the CUDA SETUP outputs above to fix your environment!
+        If you cannot find any issues and suspect a bug, please open an issue with details about your environment:
+        https://github.com/TimDettmers/bitsandbytes/issues''')
+    lib.cadam32bit_g32
+    lib.get_context.restype = ct.c_void_p
+    lib.get_cusparse.restype = ct.c_void_p
+    COMPILED_WITH_CUDA = True
+except AttributeError:
+    warn("The installed version of bitsandbytes was compiled without GPU support. "
+        "8-bit optimizers and GPU quantization are unavailable.")
+    COMPILED_WITH_CUDA = False
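`CUDASetup` (like `GlobalOutlierPooler` in `_functions.py`) uses a singleton pattern in which `__init__` raises and `get_instance()` builds the object through `cls.__new__`, which skips `__init__`. A stripped-down sketch of the pattern, with a hypothetical class name and a placeholder `initialize()` body:

```python
class LazySingleton:
    """Minimal stand-in for the get_instance() pattern; not part of bitsandbytes."""

    _instance = None

    def __init__(self):
        # Direct construction is blocked so that all callers share one instance.
        raise RuntimeError("Call get_instance() instead")

    def initialize(self):
        # Placeholder for the real setup work (e.g. loading a shared library).
        self.log = []

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            # __new__ allocates the object without running __init__.
            cls._instance = cls.__new__(cls)
            cls._instance.initialize()
        return cls._instance
```

Every call to `LazySingleton.get_instance()` returns the same object, so state written by one caller (here, `log`) is visible to all others.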
--- a/bitsandbytes/debug_cli.py
+++ b/bitsandbytes/debug_cli.py
+import typer
+cli = typer.Typer()
+@cli.callback()
+def callback():
+    """
+    Awesome Portal Gun
+    """
+@cli.command()
+def shoot():
+    """
+    Shoot the portal gun
+    """
+    typer.echo("Shooting portal gun")
+@cli.command()
+def load():
+    """
+    Load the portal gun
+    """
+    typer.echo("Loading portal gun")
--- a/bitsandbytes/functional.py
+++ b/bitsandbytes/functional.py
--- a/bitsandbytes/nn/__init__.py
+++ b/bitsandbytes/nn/__init__.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+from .modules import Int8Params, Linear8bitLt, StableEmbedding
--- a/bitsandbytes/nn/modules.py
+++ b/bitsandbytes/nn/modules.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+from typing import (
+    Any,
+    Callable,
+    Dict,
+    Iterator,
+    Mapping,
+    Optional,
+    Set,
+    Tuple,
+    TypeVar,
+    Union,
+    overload,
+)
+import torch
+import torch.nn.functional as F
+from torch import Tensor, device, dtype, nn
+from torch.nn.parameter import Parameter
+import bitsandbytes as bnb
+from bitsandbytes.optim import GlobalOptimManager
+T = TypeVar("T", bound="torch.nn.Module")
+class StableEmbedding(torch.nn.Embedding):
+    def __init__(
+        self,
+        num_embeddings: int,
+        embedding_dim: int,
+        padding_idx: Optional[int] = None,
+        max_norm: Optional[float] = None,
+        norm_type: float = 2.0,
+        scale_grad_by_freq: bool = False,
+        sparse: bool = False,
+        _weight: Optional[Tensor] = None,
+    ) -> None:
+        super(StableEmbedding, self).__init__(
+            num_embeddings,
+            embedding_dim,
+            padding_idx,
+            max_norm,
+            norm_type,
+            scale_grad_by_freq,
+            sparse,
+            _weight,
+        )
+        self.norm = torch.nn.LayerNorm(embedding_dim)
+        GlobalOptimManager.get_instance().register_module_override(
+            self, "weight", {"optim_bits": 32}
+        )
+    def reset_parameters(self) -> None:
+        torch.nn.init.xavier_uniform_(self.weight)
+        self._fill_padding_idx_with_zero()
+    """ !!! This is a redefinition of _fill_padding_idx_with_zero in torch.nn.Embedding
+        to make the Layer compatible with PyTorch < 1.9.
+        This means that if this changes in future PyTorch releases this needs to change too
+        which is cumbersome. However, with this we can ensure compatibility with previous
+        PyTorch releases.
+    """
+    def _fill_padding_idx_with_zero(self) -> None:
+        if self.padding_idx is not None:
+            with torch.no_grad():
+                self.weight[self.padding_idx].fill_(0)
+    def forward(self, input: Tensor) -> Tensor:
+        emb = F.embedding(
+            input,
+            self.weight,
+            self.padding_idx,
+            self.max_norm,
+            self.norm_type,
+            self.scale_grad_by_freq,
+            self.sparse,
+        )
+        return self.norm(emb)
+class Embedding(torch.nn.Embedding):
+    def __init__(
+        self,
+        num_embeddings: int,
+        embedding_dim: int,
+        padding_idx: Optional[int] = None,
+        max_norm: Optional[float] = None,
+        norm_type: float = 2.0,
+        scale_grad_by_freq: bool = False,
+        sparse: bool = False,
+        _weight: Optional[Tensor] = None,
+    ) -> None:
+        super(Embedding, self).__init__(
+            num_embeddings,
+            embedding_dim,
+            padding_idx,
+            max_norm,
+            norm_type,
+            scale_grad_by_freq,
+            sparse,
+            _weight,
+        )
+        GlobalOptimManager.get_instance().register_module_override(
+            self, "weight", {"optim_bits": 32}
+        )
+    def reset_parameters(self) -> None:
+        torch.nn.init.xavier_uniform_(self.weight)
+        self._fill_padding_idx_with_zero()
+    """ !!! This is a redefinition of _fill_padding_idx_with_zero in torch.nn.Embedding
+        to make the Layer compatible with PyTorch < 1.9.
+        This means that if this changes in future PyTorch releases this needs to change too
+        which is cumbersome. However, with this we can ensure compatibility with previous
+        PyTorch releases.
+    """
+    def _fill_padding_idx_with_zero(self) -> None:
+        if self.padding_idx is not None:
+            with torch.no_grad():
+                self.weight[self.padding_idx].fill_(0)
+    def forward(self, input: Tensor) -> Tensor:
+        emb = F.embedding(
+            input,
+            self.weight,
+            self.padding_idx,
+            self.max_norm,
+            self.norm_type,
+            self.scale_grad_by_freq,
+            self.sparse,
+        )
+        return emb
+class Int8Params(torch.nn.Parameter):
+    def __new__(
+        cls,
+        data=None,
+        requires_grad=True,
+        has_fp16_weights=False,
+        CB=None,
+        SCB=None,
+    ):
+        cls.has_fp16_weights = has_fp16_weights
+        cls.CB = None
+        cls.SCB = None
+        if data is None:
+            data = torch.empty(0)
+        return torch.Tensor._make_subclass(cls, data, requires_grad)
+    def cuda(self, device):
+        if self.has_fp16_weights:
+            return super().cuda(device)
+        else:
+            # we store the 8-bit row-major weight
+            # we convert this weight to the turing/ampere format during the first inference pass
+            B = self.data.contiguous().half().cuda(device)
+            CB, CBt, SCB, SCBt, coo_tensorB = bnb.functional.double_quant(B)
+            del CBt
+            del SCBt
+            self.data = CB
+            setattr(self, "CB", CB)
+            setattr(self, "SCB", SCB)
+        return self
+    @overload
+    def to(
+        self: T,
+        device: Optional[Union[int, device]] = ...,
+        dtype: Optional[Union[dtype, str]] = ...,
+        non_blocking: bool = ...,
+    ) -> T:
+        ...
+    @overload
+    def to(self: T, dtype: Union[dtype, str], non_blocking: bool = ...) -> T:
+        ...
+    @overload
+    def to(self: T, tensor: Tensor, non_blocking: bool = ...) -> T:
+        ...
+    def to(self, *args, **kwargs):
+        device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(
+            *args, **kwargs
+        )
+        if (
+            device is not None
+            and device.type == "cuda"
+            and self.data.device.type == "cpu"
+        ):
+            return self.cuda(device)
+        else:
+            new_param = Int8Params(
+                super().to(
+                    device=device, dtype=dtype, non_blocking=non_blocking
+                ),
+                requires_grad=self.requires_grad,
+                has_fp16_weights=self.has_fp16_weights,
+            )
+            new_param.CB = self.CB
+            new_param.SCB = self.SCB
+            return new_param
+class Linear8bitLt(nn.Linear):
+    def __init__(
+        self,
+        input_features,
+        output_features,
+        bias=True,
+        has_fp16_weights=True,
+        memory_efficient_backward=False,
+        threshold=0.0,
+        index=None,
+    ):
+        super(Linear8bitLt, self).__init__(
+            input_features, output_features, bias
+        )
+        self.state = bnb.MatmulLtState()
+        self.index = index
+        self.state.threshold = threshold
+        self.state.has_fp16_weights = has_fp16_weights
+        self.state.memory_efficient_backward = memory_efficient_backward
+        if threshold > 0.0 and not has_fp16_weights:
+            self.state.use_pool = True
+        self.weight = Int8Params(
+            self.weight.data, has_fp16_weights=has_fp16_weights, requires_grad=has_fp16_weights
+        )
+    def init_8bit_state(self):
+        self.state.CB = self.weight.CB
+        self.state.SCB = self.weight.SCB
+        self.weight.CB = None
+        self.weight.SCB = None
+    def forward(self, x):
+        self.state.is_training = self.training
+        if self.weight.CB is not None:
+            self.init_8bit_state()
+        # weights are cast automatically as Int8Params, but the bias has to be cast manually
+        if self.bias is not None and self.bias.dtype != torch.float16:
+            self.bias.data = self.bias.data.half()
+        out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
+        if not self.state.has_fp16_weights:
+            if not self.state.memory_efficient_backward and self.state.CB is not None:
+                # we converted 8-bit row major to turing/ampere format in the first inference pass
+                # we no longer need the row-major weight
+                del self.state.CB
+                self.weight.data = self.state.CxB
+            elif self.state.memory_efficient_backward and self.state.CxB is not None:
+                # For memory efficient backward, we convert 8-bit row major to turing/ampere format at each inference pass.
+                # Thus, we delete CxB from the state. 
+                del self.state.CxB
+        return out
--- a/bitsandbytes/optim/__init__.py
+++ b/bitsandbytes/optim/__init__.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+from bitsandbytes.cextension import COMPILED_WITH_CUDA
+from .adam import Adam, Adam8bit, Adam32bit
+from .adamw import AdamW, AdamW8bit, AdamW32bit
+from .sgd import SGD, SGD8bit, SGD32bit
+from .lars import LARS, LARS8bit, LARS32bit, PytorchLARS
+from .lamb import LAMB, LAMB8bit, LAMB32bit
+from .rmsprop import RMSprop, RMSprop8bit, RMSprop32bit
+from .adagrad import Adagrad, Adagrad8bit, Adagrad32bit
+from .optimizer import GlobalOptimManager
--- a/bitsandbytes/optim/adagrad.py
+++ b/bitsandbytes/optim/adagrad.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+from bitsandbytes.optim.optimizer import Optimizer1State
+class Adagrad(Optimizer1State):
+    def __init__(
+        self,
+        params,
+        lr=1e-2,
+        lr_decay=0,
+        weight_decay=0,
+        initial_accumulator_value=0,
+        eps=1e-10,
+        optim_bits=32,
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        if not 0.0 <= lr:
+            raise ValueError("Invalid learning rate: {}".format(lr))
+        if not 0.0 <= weight_decay:
+            raise ValueError(
+                "Invalid weight_decay value: {}".format(weight_decay)
+            )
+        if not 0.0 <= eps:
+            raise ValueError("Invalid epsilon value: {}".format(eps))
+        if initial_accumulator_value != 0.0:
+            raise ValueError("Initial accumulator value != 0.0 not supported!")
+        if lr_decay != 0.0:
+            raise ValueError("Lr Decay != 0.0 not supported!")
+        super(Adagrad, self).__init__(
+            "adagrad",
+            params,
+            lr,
+            (0.0, 0.0),
+            eps,
+            weight_decay,
+            optim_bits,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )
+class Adagrad8bit(Optimizer1State):
+    def __init__(
+        self,
+        params,
+        lr=1e-2,
+        lr_decay=0,
+        weight_decay=0,
+        initial_accumulator_value=0,
+        eps=1e-10,
+        optim_bits=8,
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        if not 0.0 <= lr:
+            raise ValueError("Invalid learning rate: {}".format(lr))
+        if not 0.0 <= weight_decay:
+            raise ValueError(
+                "Invalid weight_decay value: {}".format(weight_decay)
+            )
+        if not 0.0 <= eps:
+            raise ValueError("Invalid epsilon value: {}".format(eps))
+        if initial_accumulator_value != 0.0:
+            raise ValueError("Initial accumulator value != 0.0 not supported!")
+        if lr_decay != 0.0:
+            raise ValueError("Lr Decay != 0.0 not supported!")
+        assert block_wise
+        super(Adagrad8bit, self).__init__(
+            "adagrad",
+            params,
+            lr,
+            (0.0, 0.0),
+            eps,
+            weight_decay,
+            8,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )
+class Adagrad32bit(Optimizer1State):
+    def __init__(
+        self,
+        params,
+        lr=1e-2,
+        lr_decay=0,
+        weight_decay=0,
+        initial_accumulator_value=0,
+        eps=1e-10,
+        optim_bits=32,
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        if not 0.0 <= lr:
+            raise ValueError("Invalid learning rate: {}".format(lr))
+        if not 0.0 <= weight_decay:
+            raise ValueError(
+                "Invalid weight_decay value: {}".format(weight_decay)
+            )
+        if not 0.0 <= eps:
+            raise ValueError("Invalid epsilon value: {}".format(eps))
+        if initial_accumulator_value != 0.0:
+            raise ValueError("Initial accumulator value != 0.0 not supported!")
+        if lr_decay != 0.0:
+            raise ValueError("Lr Decay != 0.0 not supported!")
+        super(Adagrad32bit, self).__init__(
+            "adagrad",
+            params,
+            lr,
+            (0.0, 0.0),
+            eps,
+            weight_decay,
+            32,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )
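The three Adagrad classes above reject `lr_decay` and a non-zero `initial_accumulator_value`, which leaves the basic Adagrad rule: accumulate squared gradients and divide the step by their square root. The sketch below is the textbook update on Python lists, assuming L2-style weight decay folded into the gradient. It is not the bitsandbytes kernel and does not model 8-bit state or block-wise quantization; `adagrad_step` is a hypothetical name.

```python
import math


def adagrad_step(params, grads, accum, lr=1e-2, eps=1e-10, weight_decay=0.0):
    """Apply one textbook Adagrad update in place and return (params, accum)."""
    for i, (p, g) in enumerate(zip(params, grads)):
        if weight_decay != 0.0:
            g = g + weight_decay * p          # L2 penalty folded into the gradient
        accum[i] += g * g                     # running sum of squared gradients
        params[i] = p - lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum
```

With the accumulator starting at zero, the first step has magnitude close to `lr` regardless of gradient scale, and later steps shrink as the accumulator grows.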
--- a/bitsandbytes/optim/adam.py
+++ b/bitsandbytes/optim/adam.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+import math
+import os
+import torch
+import torch.distributed as dist
+import bitsandbytes.functional as F
+from bitsandbytes.optim.optimizer import Optimizer2State
+class Adam(Optimizer2State):
+    def __init__(
+        self,
+        params,
+        lr=1e-3,
+        betas=(0.9, 0.999),
+        eps=1e-8,
+        weight_decay=0,
+        amsgrad=False,
+        optim_bits=32,
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        super(Adam, self).__init__(
+            "adam",
+            params,
+            lr,
+            betas,
+            eps,
+            weight_decay,
+            optim_bits,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )
+class Adam8bit(Optimizer2State):
+    def __init__(
+        self,
+        params,
+        lr=1e-3,
+        betas=(0.9, 0.999),
+        eps=1e-8,
+        weight_decay=0,
+        amsgrad=False,
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        super(Adam8bit, self).__init__(
+            "adam",
+            params,
+            lr,
+            betas,
+            eps,
+            weight_decay,
+            8,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )
+class Adam32bit(Optimizer2State):
+    def __init__(
+        self,
+        params,
+        lr=1e-3,
+        betas=(0.9, 0.999),
+        eps=1e-8,
+        weight_decay=0,
+        amsgrad=False,  # unused: not forwarded to Optimizer2State
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        super(Adam32bit, self).__init__(
+            "adam",
+            params,
+            lr,
+            betas,
+            eps,
+            weight_decay,
+            32,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )
+class AnalysisAdam(torch.optim.Optimizer):
+    """Adam that performs 8-bit vs 32-bit error analysis.
+    This implementation is modified from torch.optim.Adam and applies the
+    decoupled weight decay from `Decoupled Weight Decay Regularization`
+    (see https://arxiv.org/abs/1711.05101).
+    Adam was proposed in `Adam: A Method for Stochastic Optimization`_.
+    Arguments:
+        params (iterable): iterable of parameters to optimize or dicts defining
+            parameter groups
+        lr (float, optional): learning rate (default: 1e-3)
+        betas (Tuple[float, float], optional): coefficients used for computing
+            running averages of gradient and its square (default: (0.9, 0.999))
+        eps (float, optional): term added to the denominator to improve
+            numerical stability (default: 1e-8)
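+        weight_decay (float, optional): decoupled weight decay coefficient,
+            applied to the parameters scaled by lr (default: 0)
+        amsgrad (boolean, optional): whether to use the AMSGrad variant of this
+            algorithm from the paper `On the Convergence of Adam and Beyond`_;
+            not supported here, step() asserts that it is False (default: False)
+        bnb_analysis (str, optional): quantization scheme to analyze, one of
+            "dynamic-blockwise", "dynamic", "linear" or "quantile"
+            (default: "dynamic-blockwise")
+        savedir (str, optional): directory where the error histograms are
+            saved every 100 steps (default: None)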
+    .. _Adam: A Method for Stochastic Optimization:
+        https://arxiv.org/abs/1412.6980
+    .. _On the Convergence of Adam and Beyond:
+        https://openreview.net/forum?id=ryQu7f-RZ
+    """
+    def __init__(
+        self,
+        params,
+        lr=1e-3,
+        betas=(0.9, 0.999),
+        eps=1e-8,
+        weight_decay=0,
+        amsgrad=False,
+        bnb_analysis="dynamic-blockwise",
+        savedir=None,
+    ):
+        defaults = dict(
+            lr=lr,
+            betas=betas,
+            eps=eps,
+            weight_decay=weight_decay,
+            amsgrad=amsgrad,
+        )
+        super(AnalysisAdam, self).__init__(params, defaults)
+        self.analysis = bnb_analysis
+        self.savedir = savedir
+    @property
+    def supports_memory_efficient_fp16(self):
+        return True
+    @property
+    def supports_flat_params(self):
+        return True
+    def step(self, closure=None):
+        """Performs a single optimization step.
+        Arguments:
+            closure (callable, optional): A closure that reevaluates the model
+                and returns the loss.
+        """
+        loss = None
+        if closure is not None:
+            loss = closure()
+        for group in self.param_groups:
+            for p_id, p in enumerate(group["params"]):
+                if p.grad is None:
+                    continue
+                grad = p.grad.data
+                if grad.dtype in {torch.float16, torch.bfloat16}:
+                    grad = grad.float()
+                if grad.is_sparse:
+                    raise RuntimeError(
+                        "Adam does not support sparse gradients, please consider SparseAdam instead"
+                    )
+                amsgrad = group.get("amsgrad", False)
+                assert not amsgrad
+                p_data_fp32 = p.data
+                if p.data.dtype in {torch.float16, torch.bfloat16}:
+                    p_data_fp32 = p_data_fp32.float()
+                state = self.state[p]
+                # State initialization
+                if len(state) == 0:
+                    state["step"] = 0
+                    # Exponential moving average of gradient values
+                    state["exp_avg"] = torch.zeros_like(p_data_fp32)
+                    # Exponential moving average of squared gradient values
+                    state["exp_avg_sq"] = torch.zeros_like(p_data_fp32)
+                    state["abserrors"] = torch.zeros(
+                        (256, 256), device=p_data_fp32.device
+                    )
+                    state["relerrors"] = torch.zeros(
+                        (256, 256), device=p_data_fp32.device
+                    )
+                    state["counts"] = torch.zeros(
+                        (256, 256), device=p_data_fp32.device
+                    )
+                    if amsgrad:
+                        # Maintains max of all exp. moving avg. of sq. grad. values
+                        state["max_exp_avg_sq"] = torch.zeros_like(p_data_fp32)
+                else:
+                    state["exp_avg"] = state["exp_avg"].to(p_data_fp32)
+                    state["exp_avg_sq"] = state["exp_avg_sq"].to(p_data_fp32)
+                    if amsgrad:
+                        state["max_exp_avg_sq"] = state["max_exp_avg_sq"].to(
+                            p_data_fp32
+                        )
+                state["step"] += 1
+                beta1, beta2 = group["betas"]
+                bias_correction1 = 1 - beta1 ** state["step"]
+                bias_correction2 = 1 - beta2 ** state["step"]
+                step_size = (
+                    group["lr"] * math.sqrt(bias_correction2) / bias_correction1
+                )
+                e = state["abserrors"]
+                rele = state["relerrors"]
+                counts = state["counts"]
+                if group["weight_decay"] != 0:
+                    p_data_fp32.add_(
+                        p_data_fp32, alpha=-group["weight_decay"] * group["lr"]
+                    )
+                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
+                if amsgrad:
+                    max_exp_avg_sq = state["max_exp_avg_sq"]
+                # Decay the first and second moment running average coefficient
+                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
+                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
+                denom = exp_avg_sq.sqrt().add_(group["eps"])
+                update_fp32 = exp_avg / denom
+                if (
+                    p_data_fp32.numel() <= 8192
+                    or p_data_fp32.numel() > 50000 * 1000
+                ):
+                    # too small (<= 8192 elements) or very large, e.g. an
+                    # embedding layer: skip the error analysis
+                    p_data_fp32 += -step_size * update_fp32
+                else:
+                    if self.analysis == "dynamic-blockwise":
+                        code1 = F.create_dynamic_map(signed=True).to(p.device)
+                        code2 = F.create_dynamic_map(signed=False).to(p.device)
+                        C1, S1 = F.quantize_blockwise(exp_avg, code=code1)
+                        state1 = F.dequantize_blockwise(C1, S1)
+                        C2, S2 = F.quantize_blockwise(exp_avg_sq, code=code2)
+                        state2 = F.dequantize_blockwise(C2, S2)
+                    elif self.analysis == "dynamic":
+                        code1 = F.create_dynamic_map(signed=True).to(p.device)
+                        code2 = F.create_dynamic_map(signed=False).to(p.device)
+                        C1, S1 = F.quantize(exp_avg, code=code1)
+                        state1 = F.dequantize(C1, S1)
+                        C2, S2 = F.quantize(exp_avg_sq, code=code2)
+                        state2 = F.dequantize(C2, S2)
+                    elif self.analysis == "linear":
+                        code1 = F.create_linear_map(signed=True).to(p.device)
+                        code2 = F.create_linear_map(signed=False).to(p.device)
+                        C1, S1 = F.quantize(exp_avg, code=code1)
+                        state1 = F.dequantize(C1, S1)
+                        C2, S2 = F.quantize(exp_avg_sq, code=code2)
+                        state2 = F.dequantize(C2, S2)
+                    elif self.analysis == "quantile":
+                        code1 = F.estimate_quantiles(exp_avg)
+                        code2 = F.estimate_quantiles(exp_avg_sq)
+                        C1 = F.quantize_no_absmax(exp_avg, code=code1)
+                        state1 = F.dequantize_no_absmax(C1, code1)
+                        C2 = F.quantize_no_absmax(exp_avg_sq, code=code2)
+                        state2 = F.dequantize_no_absmax(C2, code2)
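+                    elif self.analysis == "my-quantization-routine":
+                        # Placeholder for a custom routine, which must:
+                        # 1. get code
+                        # 2. quantize (set C1, C2)
+                        # 3. dequantize (set state1, state2)
+                        # Error will be calculated automatically!
+                        raise NotImplementedError(
+                            "my-quantization-routine is a placeholder; implement it first."
+                        )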
+                    else:
+                        raise ValueError(
+                            f"Invalid analysis value: {self.analysis}!"
+                        )
+                    denom = state2.sqrt().add_(group["eps"])
+                    update_8bit = state1 / denom
+                    abserr = torch.abs(update_8bit - update_fp32)
+                    relerr = abserr / (torch.abs(update_fp32) + 1e-6)
+                    C1, C2 = C1.int(), C2.int()
+                    F.histogram_scatter_add_2d(e, C1, C2, abserr)
+                    F.histogram_scatter_add_2d(rele, C1, C2, relerr)
+                    F.histogram_scatter_add_2d(
+                        counts, C1, C2, torch.ones_like(abserr)
+                    )
+                    p_data_fp32 += -step_size * update_fp32
+                    if not dist.is_initialized() or dist.get_rank() == 0:
+                        if self.savedir and state["step"] % 100 == 0:
+                            os.makedirs(self.savedir, exist_ok=True)
+                            shapestr = "_".join(
+                                [str(dim) for dim in p_data_fp32.shape]
+                            )
+                            pathe = os.path.join(
+                                self.savedir, f"{p_id}_{shapestr}_abserr.pkl"
+                            )
+                            pathrele = os.path.join(
+                                self.savedir, f"{p_id}_{shapestr}_relerr.pkl"
+                            )
+                            pathcounts = os.path.join(
+                                self.savedir, f"{p_id}_{shapestr}_counts.pkl"
+                            )
+                            torch.save(e, pathe)
+                            torch.save(rele, pathrele)
+                            torch.save(counts, pathcounts)
+                if p.data.dtype in {torch.float16, torch.bfloat16}:
+                    p.data.copy_(p_data_fp32)
+        return loss
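Note on the `AnalysisAdam.step` arithmetic above: the following is a minimal pure-Python sketch, assuming scalar values in place of tensors. The helper names `adam_step_scalar`, `update_errors` and `scatter_add_2d` are hypothetical and not part of bitsandbytes. `scatter_add_2d` only illustrates the accumulation that `F.histogram_scatter_add_2d` is assumed to perform, namely adding each value into the histogram cell selected by its two quantization codes.

```python
import math


def adam_step_scalar(p, grad, exp_avg, exp_avg_sq, step,
                     lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One scalar update in the same order of operations as AnalysisAdam.step."""
    beta1, beta2 = betas
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    step_size = lr * math.sqrt(bias_correction2) / bias_correction1
    if weight_decay != 0:
        # Decay is applied directly to the parameter, scaled by lr.
        p += -weight_decay * lr * p
    # Exponential moving averages of the gradient and the squared gradient.
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    update = exp_avg / (math.sqrt(exp_avg_sq) + eps)
    return p - step_size * update, exp_avg, exp_avg_sq, update


def update_errors(update_quantized, update_fp32, tiny=1e-6):
    """Absolute and relative error of a quantized update against the fp32 one.

    The small constant is added after taking the absolute value here, so the
    denominator can never be zero.
    """
    abserr = abs(update_quantized - update_fp32)
    relerr = abserr / (abs(update_fp32) + tiny)
    return abserr, relerr


def scatter_add_2d(hist, idx1, idx2, values):
    """Assumed semantics: hist[idx1[k]][idx2[k]] += values[k] for every k."""
    for i, j, v in zip(idx1, idx2, values):
        hist[i][j] += v


if __name__ == "__main__":
    p, m, v, u = adam_step_scalar(1.0, 1.0, 0.0, 0.0, step=1, lr=0.1)
    print(p, m, v, u)
    hist = [[0.0] * 256 for _ in range(256)]
    # Two elements share the code pair (3, 7), so their errors accumulate.
    scatter_add_2d(hist, [3, 3], [7, 7], [0.5, 0.25])
    print(hist[3][7])
```

Dividing the accumulated error histogram by the counts histogram cell by cell would then give the mean error per pair of quantization codes, which is presumably how the saved `*_abserr.pkl`, `*_relerr.pkl` and `*_counts.pkl` files are meant to be combined.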
--- a/bitsandbytes/optim/adamw.py
+++ b/bitsandbytes/optim/adamw.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+from bitsandbytes.optim.optimizer import Optimizer2State
+class AdamW(Optimizer2State):
+    def __init__(
+        self,
+        params,
+        lr=1e-3,
+        betas=(0.9, 0.999),
+        eps=1e-8,
+        weight_decay=1e-2,
+        amsgrad=False,  # unused: not forwarded to Optimizer2State
+        optim_bits=32,
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        super(AdamW, self).__init__(
+            "adam",
+            params,
+            lr,
+            betas,
+            eps,
+            weight_decay,
+            optim_bits,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )
+class AdamW8bit(Optimizer2State):
+    def __init__(
+        self,
+        params,
+        lr=1e-3,
+        betas=(0.9, 0.999),
+        eps=1e-8,
+        weight_decay=1e-2,
+        amsgrad=False,  # unused: not forwarded to Optimizer2State
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        super(AdamW8bit, self).__init__(
+            "adam",
+            params,
+            lr,
+            betas,
+            eps,
+            weight_decay,
+            8,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )
+class AdamW32bit(Optimizer2State):
+    def __init__(
+        self,
+        params,
+        lr=1e-3,
+        betas=(0.9, 0.999),
+        eps=1e-8,
+        weight_decay=1e-2,
+        amsgrad=False,  # unused: not forwarded to Optimizer2State
+        args=None,
+        min_8bit_size=4096,
+        percentile_clipping=100,
+        block_wise=True,
+    ):
+        super(AdamW32bit, self).__init__(
+            "adam",
+            params,
+            lr,
+            betas,
+            eps,
+            weight_decay,
+            32,
+            args,
+            min_8bit_size,
+            percentile_clipping,
+            block_wise,
+        )