Commit 2726b309 authored by helloyongyang's avatar helloyongyang
Browse files

update doc

parent f37f15b2
# 特征缓存
# Feature Caching
xxx
## Cache Acceleration Algorithm
- Cache reuse is an important acceleration algorithm in the inference process of diffusion models.
- Its core idea is to skip redundant computations at certain time steps by reusing historical cache results to improve inference efficiency.
- The key to the algorithm is how to decide at which time steps to perform cache reuse, usually based on dynamic judgment of model state changes or error thresholds.
- During inference, key content such as intermediate features, residuals, and attention outputs need to be cached. When entering reusable time steps, directly use the cached content and reconstruct the current output through approximation methods like Taylor expansion, thereby reducing repetitive calculations and achieving efficient inference.
## TeaCache
The core idea of `TeaCache` is to accumulate the **relative L1** distance between adjacent time step inputs, and when the cumulative distance reaches a set threshold, determine that the current time step can perform cache reuse.
- Specifically, the algorithm calculates the relative L1 distance between the current input and the previous step input at each inference step, and accumulates it.
- When the cumulative distance exceeds the threshold, indicating that the model state has changed sufficiently, it directly reuses the most recently cached content, skipping some redundant computations. This can significantly reduce the number of forward computations of the model and improve inference speed.
In actual effect, TeaCache achieves significant acceleration while ensuring generation quality. The video comparison before and after acceleration is as follows:
| Before Acceleration | After Acceleration |
|:------:|:------:|
| Single H200 inference time: 58s | Single H200 inference time: 17.9s |
| ![Effect before acceleration](../../../../assets/gifs/1.gif) | ![Effect after acceleration](../../../../assets/gifs/2.gif) |
- Speedup ratio: **3.24**
- Reference paper: [https://arxiv.org/abs/2411.19108](https://arxiv.org/abs/2411.19108)
## TaylorSeer Cache
The core of `TaylorSeer Cache` lies in using Taylor formula to recalculate cached content as residual compensation for cache reuse time steps. The specific approach is that at cache reuse time steps, not only simply reuse historical cache, but also approximate reconstruction of current output through Taylor expansion. This can further improve output accuracy while reducing computational load. Taylor expansion can effectively capture subtle changes in model state, compensating for errors brought by cache reuse, thereby ensuring generation quality while accelerating. `TaylorSeer Cache` is suitable for scenarios with high requirements for output precision, and can further improve model inference performance on the basis of cache reuse.
| Before Acceleration | After Acceleration |
|:------:|:------:|
| Single H200 inference time: 57.7s | Single H200 inference time: 41.3s |
| ![Effect before acceleration](../../../../assets/gifs/3.gif) | ![Effect after acceleration](../../../../assets/gifs/4.gif) |
- Speedup ratio: **1.39**
- Reference paper: [https://arxiv.org/abs/2503.06923](https://arxiv.org/abs/2503.06923)
## AdaCache
The core idea of `AdaCache` is to dynamically adjust the step size of cache reuse based on partial cached content in specified block chunks.
- The algorithm analyzes feature differences between two adjacent time steps within specific blocks, and adaptively decides the next cache reuse time step interval based on the difference magnitude.
- When model state changes are small, the step size automatically increases, reducing cache update frequency; when state changes are large, the step size decreases to ensure output quality.
This allows flexible adjustment of cache strategies based on dynamic changes in the actual inference process, achieving more efficient acceleration and better generation effects. AdaCache is suitable for application scenarios with high requirements for both inference speed and generation quality.
| Before Acceleration | After Acceleration |
|:------:|:------:|
| Single H200 inference time: 227s | Single H200 inference time: 83s |
| ![Effect before acceleration](../../../../assets/gifs/5.gif) | ![Effect after acceleration](../../../../assets/gifs/6.gif) |
- Speedup ratio: **2.73**
- Reference paper: [https://arxiv.org/abs/2411.02397](https://arxiv.org/abs/2411.02397)
## CustomCache
`CustomCache` combines the advantages of `TeaCache` and `TaylorSeer Cache`.
- It combines the real-time and rationality of `TeaCache` in cache decision-making, determining when to perform cache reuse through dynamic thresholds.
- At the same time, it utilizes `TaylorSeer`'s Taylor expansion method to make use of cached content.
This not only efficiently determines the timing of cache reuse, but also maximizes the utilization of cached content, improving output accuracy and generation quality. Actual tests show that `CustomCache` generates video quality superior to using `TeaCache`, `TaylorSeer Cache`, or `AdaCache` alone across multiple content generation tasks, making it one of the currently optimal comprehensive performance cache acceleration algorithms.
| Before Acceleration | After Acceleration |
|:------:|:------:|
| Single H200 inference time: 57.9s | Single H200 inference time: 16.6s |
| ![Effect before acceleration](../../../../assets/gifs/7.gif) | ![Effect after acceleration](../../../../assets/gifs/8.gif) |
- Speedup ratio: **3.49**
# 特征缓存
xxx
## 缓存加速算法
- 在扩散模型的推理过程中,缓存复用是一种重要的加速算法。
- 其核心思想是在部分时间步跳过冗余计算,通过复用历史缓存结果提升推理效率。
- 算法的关键在于如何决策在哪些时间步进行缓存复用,通常基于模型状态变化或误差阈值动态判断。
- 在推理过程中,需要缓存如中间特征、残差、注意力输出等关键内容。当进入可复用时间步时,直接利用已缓存的内容,通过泰勒展开等近似方法重构当前输出,从而减少重复计算,实现高效推理。
## TeaCache
`TeaCache`的核心思想是通过对相邻时间步输入的**相对L1**距离进行累加,当累计距离达到设定阈值时,判定当前时间步可以进行缓存复用。
- 具体来说,算法在每一步推理时计算当前输入与上一步输入的相对L1距离,并将其累加。
- 当累计距离超过阈值,说明模型状态发生了足够的变化,则直接复用最近一次缓存的内容,跳过部分冗余计算。这样可以显著减少模型的前向计算次数,提高推理速度。
实际效果上,TeaCache 在保证生成质量的前提下,实现了明显的加速。加速前后的视频对比如下:  
| 加速前 | 加速后 |
|:------:|:------:|
| 单卡H200推理耗时:58s | 单卡H200推理耗时:17.9s |
| ![加速前效果](../../../../assets/gifs/1.gif) | ![加速后效果](../../../../assets/gifs/2.gif) |
- 加速比为:**3.24**
- 参考论文:[https://arxiv.org/abs/2411.19108](https://arxiv.org/abs/2411.19108)
## TaylorSeer Cache
`TaylorSeer Cache`的核心在于利用泰勒公式对缓存内容进行再次计算,作为缓存复用时间步的残差补偿。具体做法是在缓存复用的时间步,不仅简单地复用历史缓存,还通过泰勒展开对当前输出进行近似重构。这样可以在减少计算量的同时,进一步提升输出的准确性。泰勒展开能够有效捕捉模型状态的微小变化,使得缓存复用带来的误差得到补偿,从而在加速的同时保证生成质量。`TaylorSeer Cache`适用于对输出精度要求较高的场景,能够在缓存复用的基础上进一步提升模型推理的表现。
| 加速前 | 加速后 |
|:------:|:------:|
| 单卡H200推理耗时:57.7s | 单卡H200推理耗时:41.3s |
| ![加速前效果](../../../../assets/gifs/3.gif) | ![加速后效果](../../../../assets/gifs/4.gif) |
- 加速比为:**1.39**
- 参考论文:[https://arxiv.org/abs/2503.06923](https://arxiv.org/abs/2503.06923)
## AdaCache
`AdaCache`的核心思想是根据指定block块中的部分缓存内容,动态调整缓存复用的步长。
- 算法会分析相邻两个时间步在特定 block 内的特征差异,根据差异大小自适应地决定下一个缓存复用的时间步间隔。
- 当模型状态变化较小时,步长自动加大,减少缓存更新频率;当状态变化较大时,步长缩小,保证输出质量。
这样可以根据实际推理过程中的动态变化,灵活调整缓存策略,实现更高效的加速和更优的生成效果。AdaCache 适合对推理速度和生成质量都有较高要求的应用场景。
| 加速前 | 加速后 |
|:------:|:------:|
| 单卡H200推理耗时:227s | 单卡H200推理耗时:83s |
| ![加速前效果](../../../../assets/gifs/5.gif) | ![加速后效果](../../../../assets/gifs/6.gif) |
- 加速比为:**2.73**
- 参考论文:[https://arxiv.org/abs/2411.02397](https://arxiv.org/abs/2411.02397)
## CustomCache
`CustomCache`综合了`TeaCache``TaylorSeer Cache`的优势。
- 它结合了`TeaCache`在缓存决策上的实时性和合理性,通过动态阈值判断何时进行缓存复用.
- 同时利用`TaylorSeer`的泰勒展开方法对已缓存内容进行利用。
这样不仅能够高效地决定缓存复用的时机,还能最大程度地利用缓存内容,提升输出的准确性和生成质量。实际测试表明,`CustomCache`在多个内容生成任务上,生成的视频质量优于单独使用`TeaCache、TaylorSeer Cache``AdaCache`的方案,是目前综合性能最优的缓存加速算法之一。
| 加速前 | 加速后 |
|:------:|:------:|
| 单卡H200推理耗时:57.9s | 单卡H200推理耗时:16.6s |
| ![加速前效果](../../../../assets/gifs/7.gif) | ![加速后效果](../../../../assets/gifs/8.gif) |
- 加速比为:**3.49**
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment