"...git@developer.sourcefind.cn:chenpangpang/open-webui.git" did not exist on "2660a6e5b88c5afe50bc370751e0f1d17bbdc4bb"
Unverified commit b8112edd authored by Tom Aarsen, committed by GitHub

Replace unsupported facebookresearch/bitsandbytes (#20093)

With https://github.com/TimDettmers/bitsandbytes, which is by the same author and is still being updated
parent 4ab6e9e2
@@ -217,7 +217,7 @@ Let's look at the details.
**Optimizer States:**
- 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/facebookresearch/bitsandbytes)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)
**Gradients**
@@ -425,7 +425,7 @@ $ python examples/pytorch/translation/run_translation.py -h | grep "\-optim"
For example, if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed, `--optim adamw_apex_fused` will give you the fastest training experience among all supported AdamW optimizers.
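The same choice can also be made in code. Below is a minimal sketch (assuming NVIDIA/apex is installed and a `transformers` version that exposes the `optim` argument on [`TrainingArguments`]; the output directory is a placeholder):

```python
from transformers import TrainingArguments

# Sketch: select the apex fused AdamW implementation via the `optim` argument,
# which corresponds to passing --optim adamw_apex_fused on the command line.
training_args = TrainingArguments(
    output_dir="output",       # placeholder path
    optim="adamw_apex_fused",  # requires NVIDIA/apex to be installed
)
```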
On the other hand, the [8bit BNB optimizer](https://github.com/facebookresearch/bitsandbytes) can save 3/4 of the memory normally used by a typical AdamW optimizer if it is configured to quantize all optimizer states, but in some situations only some optimizer states are quantized and then more memory is used.
On the other hand, the [8bit BNB optimizer](https://github.com/TimDettmers/bitsandbytes) can save 3/4 of the memory normally used by a typical AdamW optimizer if it is configured to quantize all optimizer states, but in some situations only some optimizer states are quantized and then more memory is used.
Let's get a feel for the numbers and use, for example, a 3B-parameter model, like `t5-3b`. Note that since a Gigabyte corresponds to a billion bytes, we can simply multiply the parameters (in billions) by the number of necessary bytes per parameter to get Gigabytes of GPU memory usage:
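As a rough sketch of that arithmetic (the byte counts per parameter are the ones listed above; the model size is the only input):

```python
# Back-of-the-envelope estimate of optimizer-state memory for a
# 3B-parameter model such as t5-3b (billions of params * bytes/param ~ GB).
params_in_billions = 3

bytes_per_param = {
    "AdamW (2 fp32 states)": 8,
    "8-bit AdamW (bitsandbytes)": 2,
    "SGD with momentum (1 state)": 4,
}

for name, nbytes in bytes_per_param.items():
    print(f"{name}: ~{params_in_billions * nbytes} GB of optimizer states")
```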
@@ -484,7 +484,7 @@ We went from 15 GB memory usage to 5 GB - a 3x improvement while maintaining the
Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind FP16 training where using variables with lower precision saves memory.
In contrast to the previous approaches, this one is not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the GitHub [repo](https://github.com/facebookresearch/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.
In contrast to the previous approaches, this one is not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the GitHub [repo](https://github.com/TimDettmers/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.
Once installed, we just need to initialize the optimizer. Although this looks like a considerable amount of work, it actually just involves two steps: first, we group the model's parameters into two groups, one to which we apply weight decay and one to which we don't. Usually, biases and layer norm parameters are not weight decayed. Then, in a second step, we just do some argument housekeeping to use the same hyperparameters as the previously used AdamW optimizer.
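A minimal sketch of those two steps (assuming `model` and `training_args`, a [`TrainingArguments`] instance, are already defined; the grouping helper shown here is the one shipped with `transformers`):

```python
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names

# Step 1: split parameters into a weight-decay group and a no-decay group
# (biases and layer norm weights are typically excluded from weight decay).
decay_parameters = get_parameter_names(model, [nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if n in decay_parameters],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]

# Step 2: instantiate the 8-bit Adam optimizer with the same hyperparameters
# the AdamW optimizer would have used.
adam_bnb_optim = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate,
)
```

The resulting optimizer can then be handed to the [`Trainer`] via its `optimizers` argument, e.g. `Trainer(model=model, args=training_args, optimizers=(adam_bnb_optim, None))`, letting the trainer create the learning-rate scheduler as usual.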
@@ -703,7 +703,7 @@ We are very excited to be hosting 2 days of talks from Kensho-Technologies, Mozi
- Memory efficient training:
In case you are getting out-of-memory errors on your GPU, we recommend using
[bitsandbytes](https://github.com/facebookresearch/bitsandbytes) to replace the
[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) to replace the
native memory-intensive Adam optimizer with the one from `bitsandbytes`. You
can simply run the script `./run_speech_recognition_ctc_bnb.py` provided in this
folder, which makes use of `bitsandbytes` instead of the official one.