- 31 Aug, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Introduce finalize_reducer operator and layout reducer support - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations. - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management. - Updated `setup.py` to include logging for build directory paths, improving build process visibility. - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations. - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts. - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance. * Refactor code formatting and improve readability in multiple files - Cleaned up whitespace in `setup.py` to enhance logging clarity. - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability. - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability. - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity. * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability. * [Enhancement] Disable reuse of small arrays in shared memory allocation - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management. * Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability. * Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability. * bug fix * Add thread checks workaround for replicated cases * Remove the is_one check * fix lint error * lint fix * Update autotune tests to use smaller matrix sizes for improved performance and reliability * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure. - Enhanced layout inference and cloning methods in FinalizeReducerOpNode. - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main. - Adjusted header inclusions for improved organization and clarity across multiple files. * [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently. - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization. * [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations. - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus. --------- Co-authored-by:
Huanqi Cao <caohuanqi@deepseek.com> Co-authored-by:
Freebase6912 <amid-gauze-racing@duck.com>
-
- 21 Aug, 2025 1 commit
-
-
Lei Wang authored
* Introduce Barrier * Enhance CUDA kernel with new barrier management and post-processing support - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers. - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure. - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency. - Introduced additional print statements for debugging in the lowering phase of the TileLang engine. - Enhanced the overall structure and readability of the codebase. * Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic. * Enhance barrier management in TileLang - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework. - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory. - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code. - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine. - Removed deprecated memory scope handling code to enhance clarity and maintainability. * lint fix * lint fix * Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability. * Refactor logging in JITKernel to improve kernel compilation tracking - Removed unused import of `torch.backends` in the example file. - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging. - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function. * Refactor dequantization tests and update barrier function - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite. - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management. * Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed. * Fix typos in rasterization parameters and update import path for cached module - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage. - Updated the import statement for the `cached` module to reflect the new path in the cache submodule. - Added `StridedTensor` import in the language module for enhanced tensor functionality. * Update ci.yml
-
- 13 Jul, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Simplify and modularize autotuner implementation - Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability. - Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure. - Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic. - Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes. * [Refactor] Clean up and enhance capture and tuner modules - Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`. - Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning. - Streamlined the `CaptureStack` class for better thread-local context management. * lint fix * [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example - Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity. - Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning. - Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability. - Deleted obsolete test file for autotuning decorator, cleaning up the codebase. * [Refactor] Improve code formatting and readability in autotune test file - Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation. - Fixed a typo in the `enable_rasteration` parameter name to ensure consistency. - Cleaned up unnecessary blank lines to enhance overall code clarity. * Update example_blocksparse_gemm.py * Update capture.py
-