"git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "9147c4c954fd82aeed5146c7a4c304de03e646b2"
Unverified Commit cf1ca728 authored by Sayak Paul, committed by GitHub

fix title for compile + offload quantized models (#12233)



* up

* up

* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 144e6e25
@@ -77,7 +77,7 @@
   - local: optimization/memory
     title: Reduce memory usage
   - local: optimization/speed-memory-optims
-    title: Compile and offloading quantized models
+    title: Compiling and offloading quantized models
   - title: Community optimizations
     sections:
     - local: optimization/pruna
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Compile and offloading quantized models
+# Compiling and offloading quantized models

 Optimizing models often involves trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it also increases memory consumption since it needs to store the outputs of intermediate attention layers. A more balanced optimization strategy combines quantizing a model, [torch.compile](./fp16#torchcompile) and various [offloading methods](./memory#offloading).
@@ -28,7 +28,8 @@ The table below provides a comparison of optimization strategy combinations and
 | quantization | 32.602 | 14.9453 |
 | quantization, torch.compile | 25.847 | 14.9448 |
 | quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |
-<small>These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) if you're interested in evaluating your own model.</small>
+
+<small>These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the <a href="https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d">benchmarking script</a> if you're interested in evaluating your own model.</small>

 This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.
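For context, the combination the table singles out as the best balance of speed and memory (quantization, torch.compile, and model CPU offloading) can be sketched roughly as below. This is a minimal illustration, not the guide's exact code: the FLUX.1-dev checkpoint, the 4-bit NF4 settings, and the prompt are assumptions, and quantizing only the transformer is a simplification (the benchmark also quantizes the text encoder).

```py
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Quantize the memory-heavy transformer to 4-bit with bitsandbytes.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Model CPU offloading keeps only the component currently in use on the GPU.
pipeline.enable_model_cpu_offload()

# Compile the transformer; fullgraph=True surfaces graph breaks early.
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

image = pipeline("a photo of a cat", num_inference_steps=28).images[0]
image.save("cat.png")
```

Per the table above, adding model CPU offloading to quantization + torch.compile trades some inference speed for a noticeably lower peak memory footprint.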