    [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757) · 8eab7755
    Lei Wang authored
    
    
    * [Enhancement] Introduce finalize_reducer operator and layout reducer support
    
    - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
    - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
    - Updated `setup.py` to include logging for build directory paths, improving build process visibility.
    - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
    - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
    - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.
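
    For orientation, the split between intra-warp and inter-warp reduction named in the commit title can be pictured with a small host-side simulation. This is only a sketch under stated assumptions, not TileLang's implementation: `WARP_SIZE`, `intra_warp_reduce`, and `two_stage_reduce` are hypothetical names, and the list comprehension stands in for the shuffle-style exchange a GPU would perform between lanes.

```python
from functools import reduce
from typing import Callable, List

WARP_SIZE = 32  # assumed warp width (hypothetical constant for this sketch)


def intra_warp_reduce(lanes: List[int], op: Callable[[int, int], int]) -> int:
    """Butterfly reduction over one warp's lanes (length must be a power of two)."""
    vals = list(lanes)
    offset = len(vals) // 2
    while offset >= 1:
        # Each lane combines its value with its XOR partner `offset` lanes away.
        vals = [op(vals[i], vals[i ^ offset]) for i in range(len(vals))]
        offset //= 2
    return vals[0]  # after the last step every lane holds the same result


def two_stage_reduce(thread_vals: List[int], op: Callable[[int, int], int]) -> int:
    """Reduce within each warp first, then combine one partial per warp."""
    assert len(thread_vals) % WARP_SIZE == 0
    warp_partials = [
        intra_warp_reduce(thread_vals[w:w + WARP_SIZE], op)
        for w in range(0, len(thread_vals), WARP_SIZE)
    ]
    # Inter-warp stage: on a GPU this would typically go through shared memory.
    return reduce(op, warp_partials)


if __name__ == "__main__":
    data = list(range(128))  # 4 simulated warps of 32 threads each
    print(two_stage_reduce(data, lambda a, b: a + b))
    print(two_stage_reduce(data, max))
```

    Keeping the two stages separate means the per-thread accumulation can stay in registers, and the cross-warp combination only has to run once, at finalization time.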
    
    * Refactor code formatting and improve readability in multiple files
    
    - Cleaned up whitespace in `setup.py` to enhance logging clarity.
    - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
    - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
    - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.
    
    * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.
    
    * [Enhancement] Disable reuse of small arrays in shared memory allocation
    
    - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.
    
    * Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.
    
    * Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.
    
    * bug fix
    
    * Add thread checks workaround for replicated cases
    
    * Remove the is_one check
    
    * fix lint error
    
    * lint fix
    
    * Update autotune tests to use smaller matrix sizes for improved performance and reliability
    
    * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods
    
    - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
    - Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
    - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
    - Adjusted header inclusions for improved organization and clarity across multiple files.
    
    * [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py
    
    - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
    - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.
    
    * [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution
    
    - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
    - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.
    
    ---------
    Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
    Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>
parallel.h 4.23 KB