    [Bugfix] Updated autotune usage in the examples to align with the latest changes (#309) · 66c7f6a1
    Lei Wang authored
    * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
    
    - Conditionally include "gemm_sm89.h" when targeting CUDA architecture 8.9 or above.
    - This lets the GEMM template pick up sm89-specific optimizations on compatible GPUs; a capability-check sketch follows below.
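
    For illustration, a minimal Python check (assuming PyTorch is installed) for whether the running GPU reports compute capability 8.9 or newer; the actual header selection happens at compile time inside the C++ GEMM template, not in Python.

    ```python
    import torch

    def supports_sm89() -> bool:
        # True when the current GPU reports compute capability 8.9 (Ada) or newer,
        # i.e. the level at which the sm89-specific GEMM path applies.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    print("SM 8.9+ GPU detected:", supports_sm89())
    ```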
    
    * lintfix
    
    * [Refactor] Clean up includes in gemm_sm89.h
    
    - Removed the duplicate inclusion of "common.h" and added "cuda_fp8.h".
    - Headers are now included only once and in a logical order, which keeps the file easier to maintain.
    
    * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
    
    - Added an in-memory cache to the KernelCache class so repeated lookups avoid hitting the disk.
    - Updated the __new__ method to initialize the memory cache and to check it before loading from disk.
    - Expanded docstrings across multiple methods with clearer descriptions of parameters and return values.
    - Added a clear_cache method that clears both the in-memory and on-disk caches; a sketch of the pattern follows below.
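
    A minimal sketch of the two-level caching pattern described above, using a simplified stand-in for KernelCache; the attribute names, the pickle-per-key disk layout, and the get/put/clear_cache signatures here are illustrative, not the repository's actual API.

    ```python
    import os
    import pickle
    import tempfile


    class KernelCache:
        """Singleton cache that consults an in-memory dict before the disk cache."""

        _instance = None

        def __new__(cls):
            # Create the singleton once and initialize the in-memory cache.
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._memory_cache = {}
                cls._instance.cache_dir = os.path.join(tempfile.gettempdir(), "kernel_cache")
                os.makedirs(cls._instance.cache_dir, exist_ok=True)
            return cls._instance

        def _disk_path(self, key: str) -> str:
            # Keys are assumed to be filename-safe strings (e.g. hashes).
            return os.path.join(self.cache_dir, f"{key}.pkl")

        def get(self, key: str):
            """Return a cached entry, preferring memory over disk; None on a miss."""
            if key in self._memory_cache:
                return self._memory_cache[key]
            path = self._disk_path(key)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    value = pickle.load(f)
                self._memory_cache[key] = value  # promote to the in-memory cache
                return value
            return None

        def put(self, key: str, value) -> None:
            """Store an entry in both the in-memory and on-disk caches."""
            self._memory_cache[key] = value
            with open(self._disk_path(key), "wb") as f:
                pickle.dump(value, f)

        def clear_cache(self) -> None:
            """Clear both caches, removing on-disk entries as well."""
            self._memory_cache.clear()
            for name in os.listdir(self.cache_dir):
                os.remove(os.path.join(self.cache_dir, name))
    ```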
    
    * lint fix
    
    * typofix
    
    * [Refactor] Update matmul and flashattn function calls to return structured results
    
    - matmul and flashattn now return a single result object carrying the measured latency, the chosen configuration, and the reference latency, rather than several separate return values.
    - Updated every affected call site in the benchmark and example scripts to the new return structure, keeping usage consistent across the codebase (illustrated below).
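
    An illustrative consumer of the new return shape, assuming the result exposes latency, config, and ref_latency fields as described above; the AutotuneResult dataclass and the matmul stub below are stand-ins, not the repository's actual definitions.

    ```python
    from dataclasses import dataclass


    @dataclass
    class AutotuneResult:
        latency: float      # best measured latency of the tuned kernel (ms)
        config: dict        # configuration that produced the best latency
        ref_latency: float  # latency of the reference implementation (ms)


    def matmul(M: int, N: int, K: int) -> AutotuneResult:
        # Stand-in for an autotuned entry point: it returns one structured
        # result instead of the old (latency, config, ref_latency) tuple.
        return AutotuneResult(
            latency=0.42,
            config={"block_M": 128, "block_N": 128},
            ref_latency=0.95,
        )


    result = matmul(4096, 4096, 4096)
    print(f"Best latency:      {result.latency:.3f} ms")
    print(f"Best config:       {result.config}")
    print(f"Reference latency: {result.ref_latency:.3f} ms")
    print(f"Speedup over ref:  {result.ref_latency / result.latency:.2f}x")
    ```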
    
    * lint fix