1. 20 Oct, 2025 1 commit
  2. 13 Sep, 2025 1 commit
  3. 31 Aug, 2025 2 commits
    • coderabbitai[bot]'s avatar
      📝 Add docstrings to `reducer_0825` (#772) · 9a869396
      coderabbitai[bot] authored
      * 📝 Add docstrings to `reducer_0825`
      
      Docstrings generation was requested by @LeiWang1999.
      
      * https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118
      
      
      
      The following files were modified:
      
      * `setup.py`
      * `src/op/builtin.h`
      * `src/op/finalize_reducer.cc`
      * `src/op/finalize_reducer.h`
      * `src/op/parallel.cc`
      * `src/op/parallel.h`
      * `src/op/reduce.cc`
      * `src/target/codegen_cuda.cc`
      * `src/tl_templates/cuda/common.h`
      * `src/transform/layout_inference.cc`
      * `src/transform/layout_reducer.cc`
      * `src/transform/layout_reducer.h`
      * `src/transform/merge_shared_memory_allocations.cc`
      * `src/transform/storage_access.cc`
      * `src/transform/warp_specialized_rewriter.cc`
      * `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
      * `tilelang/engine/phase.py`
      * `tilelang/language/customize.py`
      * `tilelang/language/reduce.py`
      * `tilelang/transform/__init__.py`
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarcoderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      9a869396
    • Lei Wang's avatar
      [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757) · 8eab7755
      Lei Wang authored
      
      
      * [Enhancement] Introduce finalize_reducer operator and layout reducer support
      
      - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
      - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
      - Updated `setup.py` to include logging for build directory paths, improving build process visibility.
      - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
      - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
      - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.
      
      * Refactor code formatting and improve readability in multiple files
      
      - Cleaned up whitespace in `setup.py` to enhance logging clarity.
      - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
      - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
      - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.
      
      * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.
      
      * [Enhancement] Disable reuse of small arrays in shared memory allocation
      
      - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.
      
      * Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.
      
      * Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.
      
      * bug fix
      
      * Add thread checks workaround for replicated cases
      
      * Remove the is_one check
      
      * fix lint error
      
      * lint fix
      
      * Update autotune tests to use smaller matrix sizes for improved performance and reliability
      
      * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods
      
      - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
      - Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
      - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
      - Adjusted header inclusions for improved organization and clarity across multiple files.
      
      * [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py
      
      - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
      - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.
      
      * [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution
      
      - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
      - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.
      
      ---------
      Co-authored-by: default avatarHuanqi Cao <caohuanqi@deepseek.com>
      Co-authored-by: default avatarFreebase6912 <amid-gauze-racing@duck.com>
      8eab7755
  4. 21 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Refactor barrier management (#744) · cb37bfef
      Lei Wang authored
      * Introduce Barrier
      
      * Enhance CUDA kernel with new barrier management and post-processing support
      
      - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
      - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
      - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
      - Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
      - Enhanced the overall structure and readability of the codebase.
      
      * Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.
      
      * Enhance barrier management in TileLang
      
      - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
      - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
      - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
      - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
      - Removed deprecated memory scope handling code to enhance clarity and maintainability.
      
      * lint fix
      
      * lint fix
      
      * Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.
      
      * Refactor logging in JITKernel to improve kernel compilation tracking
      
      - Removed unused import of `torch.backends` in the example file.
      - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
      - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.
      
      * Refactor dequantization tests and update barrier function
      
      - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
      - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.
      
      * Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.
      
      * Fix typos in rasterization parameters and update import path for cached module
      
      - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
      - Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
      - Added `StridedTensor` import in the language module for enhanced tensor functionality.
      
      * Update ci.yml
      cb37bfef
  5. 13 Jul, 2025 1 commit
    • Lei Wang's avatar
      [AutoTune] Support `with set_autotune_inputs` to set auto tuning input tensors (#632) · eec47592
      Lei Wang authored
      * [Refactor] Simplify and modularize autotuner implementation
      
      - Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability.
      - Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure.
      - Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic.
      - Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes.
      
      * [Refactor] Clean up and enhance capture and tuner modules
      
      - Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`.
      - Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning.
      - Streamlined the `CaptureStack` class for better thread-local context management.
      
      * lint fix
      
      * [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example
      
      - Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity.
      - Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning.
      - Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability.
      - Deleted obsolete test file for autotuning decorator, cleaning up the codebase.
      
      * [Refactor] Improve code formatting and readability in autotune test file
      
      - Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation.
      - Fixed a typo in the `enable_rasteration` parameter name to ensure consistency.
      - Cleaned up unnecessary blank lines to enhance overall code clarity.
      
      * Update example_blocksparse_gemm.py
      
      * Update capture.py
      eec47592