1. 10 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Improve layout equality checks and error messaging (#471) · c2480907
      Lei Wang authored
      * [Refactor] Simplify buffer_region_to_tile_region function in copy.py
      
      * Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
      * Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.
      
      * [Refactor] Improve layout equality checks and error messaging
      
      * Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges.
      * Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging.
      * Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy.
      
      * lintfix
      
      * [Refactor] Rename SetThreadRange to BindThreadRange for clarity
      
      * Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity.
      * Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior.
      * Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method.
      * Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API.
      
      * [Refactor] Update BindThreadRange usage across layout inference methods
      
      * Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality.
      * Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation.
      * Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity.
      
      * lint fix
      c2480907
  2. 08 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Update barrier functions and add new example for GEMM with warp specialization (#456) · a91bc2a9
      Lei Wang authored
      * Add example for warp specialization with flash attention
      
      * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
      * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
      * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
      * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.
      
      * Add memory barrier functions to builtin.py
      
      * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
      * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
      * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
      * Improved code organization and readability by adding blank lines for better separation of logical sections.
      
      * Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py
      
      * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
      * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.
      
      * [Refactor] Update barrier functions and add new example for GEMM with warp specialization
      
      * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
      * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
      * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
      * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
      * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.
      
      * lint fix
      
      * Update loop variable checks in SIMT loop and buffer region validation
      
      * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
      * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.
      
      * lint fix
      
      * test fix
      a91bc2a9
  3. 06 May, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Add new examples for warp specialization and TMA integration (#448) · b5faf25a
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      
      * [Feature] Add examples for warp specialization and TMA barrier integration
      
      * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
      * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
      * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
      * Updated the `phase.py` to include TMA barrier injection in the optimization process.
      * Improved documentation and comments for better clarity on usage and functionality.
      
      * [Feature] Add example for warp specialization in GEMM with TMA barriers
      
      * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
      * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
      * Enhanced documentation and comments for clarity on usage and functionality.
      
      * lint fix
      
      * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
      
      * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
      * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
      * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
      * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
      
      * lint fix
      
      * [Feature] Add new examples for warp specialization and TMA integration
      
      * Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
      * Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
      * Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
      * Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.
      
      * lint fix
      
      * Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
      b5faf25a
  4. 03 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Separate warp specialize rewriter and tma barrier injector pass (#447) · fce16b00
      Lei Wang authored
      * [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
      
      * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
      * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
      
      * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
      
      * Updated function names from CamelCase to snake_case for better consistency across the codebase.
      * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
      * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
      
      * [Refactor] Rename operations to snake_case for consistency
      
      * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
      * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
      * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
      
      * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
      
      * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
      * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
      * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
      * Enhanced the TileLang API with new methods for retrieving block and thread extents.
      * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
      * Improved layout inference and kernel launch logic for better performance and clarity.
      
      * [Refactor] Clean up code formatting and improve readability
      
      * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
      * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
      * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
      * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
      
      * lint fix
      
      * [Refactor] Update mbarrier functions for improved clarity and consistency
      
      * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
      * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
      * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
      * Added detailed docstrings to clarify usage examples for memory barrier functions.
      
      * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
      
      * [Feature] Add examples for warp specialization and TMA barrier integration
      
      * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
      * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
      * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
      * Updated the `phase.py` to include TMA barrier injection in the optimization process.
      * Improved documentation and comments for better clarity on usage and functionality.
      
      * [Feature] Add example for warp specialization in GEMM with TMA barriers
      
      * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
      * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
      * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
      * Enhanced documentation and comments for clarity on usage and functionality.
      
      * lint fix
      
      * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
      
      * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
      * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
      * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
      * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
      
      * lint fix
      fce16b00
  5. 22 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint (#417) · 73a6cb8b
      Lei Wang authored
      * [Enhancement] Introduce thread range management in layout and operation handling
      
      * Added `SetThreadRange` method to `FragmentNode` for managing thread ranges.
      * Updated `LayoutNode::Inverse` to provide more informative error messages.
      * Refactored layout inference and operation lowering to utilize `thread_bounds` instead of `block_size`, enhancing flexibility for thread management.
      * Introduced new tests for tilelang operations to validate thread range functionality and ensure correctness in parallel execution scenarios.
      
      * lint fix
      
      * [Refactor] Improve thread variable handling in layout inference and operation lowering
      
      * Removed workaround for undefined thread_var in layout inference, ensuring proper handling of thread bounds.
      * Updated logic to define thread bounds based on the presence of thread_var, enhancing robustness in thread management.
      * Refactored thread_var initialization in lower_tile_op to maintain consistency across the codebase.
      
      * [Refactor] Update thread variable handling in layout inference and operation lowering
      
      * Refactored thread variable checks to ensure bounds are only accessed when defined, improving safety and clarity.
      * Initialized thread_var with a default range to prevent undefined behavior.
      * Updated logic in lower_tile_op to align with new thread variable handling, enhancing consistency across the codebase.
      73a6cb8b
  6. 16 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Introduce a smarter warp partition strategy (#396) · ca730c0a
      Lei Wang authored
      * make it python 3.8- happy
      
      * [Enhancement] Improve loop partitioning and vectorization logic in layout inference and loop vectorization
      
      - Enhanced the VisitStmt_ method to support local buffer handling in parallel loops, allowing for register usage without explicit thread binding.
      - Updated loop vectorization logic to simplify expressions and ensure accurate vector size calculations, improving performance and clarity in the vectorization process.
      
      * lint fix
      
      * [Refactor] Update warp size checks and enhance warp partitioning logic in GEMM
      
      - Changed warp_n size check from 16 to 8 in gemm_layouts.cc to improve compatibility with specific configurations.
      - Refactored warp partitioning logic in gemm.cc to prioritize N dimension for better performance based on aspect ratio.
      - Introduced a new CompileArgs dataclass in autotuner to streamline compile argument management and improve code clarity.
      
      * lint fix
      
      * [Enhancement] Initialize jit_compile in AutoTuner class
      
      - Added initialization for jit_compile attribute in the AutoTuner class to ensure it is set to None by default.
      - Updated the assignment logic for jit_compile to prevent overwriting an existing compile function, enhancing the flexibility of the AutoTuner's compilation process.
      ca730c0a
  7. 09 Apr, 2025 2 commits
    • Lei Wang's avatar
      [Bugfix] Fix compilation issues for amd cdna element size check (#364) · d627fd58
      Lei Wang authored
      * [Refactor] Update AutoTuner run method and timeout handling
      
      - Modified the `run` method to reduce the default timeout from 100 to 30 seconds for improved responsiveness.
      - Changed the `get_input_tensors_supply` call to disable output generation, enhancing performance during tensor supply retrieval.
      - Refactored the latency measurement to streamline the benchmarking process, ensuring proper timeout handling with `ThreadPoolExecutor`.
      - Added logging for timeout occurrences to aid in debugging and performance analysis.
      
      * bug fix
      
      * lint fix
      d627fd58
    • Lei Wang's avatar
      [AMD] Implement Deepseek MLA for AMD (#363) · e3065f0b
      Lei Wang authored
      * [Bugfix] Correct dynamic shared memory size error handling in HIP wrapper
      
      - Updated the error handling logic in `PREDEF_ATTRIBUTE_SET_DYNAMIC_MEMORY_HIP` to check if the dynamic shared memory size exceeds the maximum limit of 65536.
      - Improved error message clarity by specifying the function name and the attempted size, ensuring better debugging information.
      - Ensured the function returns 0 upon successful setting of the dynamic shared memory size.
      
      * [Add] Implement example for MLA decoding with AMD support
      
      - Introduced a new example script `example_mla_decode_amd.py` demonstrating the use of the flash attention mechanism with AMD hardware.
      - Implemented functions for attention calculation, including support for split processing and combining outputs.
      - Added command-line argument parsing for customizable input parameters such as batch size, number of heads, and dimensions.
      - Included a reference implementation for validation against the Tile-AI output, ensuring correctness of the implementation.
      - Enhanced performance profiling and output comparison for debugging and optimization purposes.
      
      * lint fix
      e3065f0b
  8. 07 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Fix Transposed Fragment Layout for amd GEMM_RS matrix core (#346) · 0acb8586
      Lei Wang authored
      * [Refactor] Update GEMM Fragment Layout and Improve Matrix Multiplication Functionality
      
      - Adjusted the layout configuration in `gemm_layouts.cc` to correct the repetition parameters for warp and block layouts, enhancing the efficiency of the GEMM fragment generation.
      - Refactored the `matmul_rs` function in `test_tilelang_test_amd.py` to improve readability by restructuring the function signature and ensuring consistent formatting.
      - Updated the test execution call to run the new `test_gemm_rs_f16f32f32_nt` function, enhancing test coverage for the GEMM functionality.
      
      * lint fix
      
      * bugfix
      0acb8586
  9. 06 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support index bit width configuration (#343) · 70546adc
      Lei Wang authored
      
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation
      
      - Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
      - Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
      - Integrated the new transformation into the optimization pipeline in `phase.py`.
      - Documented the new transformation in `__init__.py` for clarity.
      
      * lint fix
      
      * Add configuration option for index bitwidth and remove ExpandIndexDataType transformation
      
      - Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
      - Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
      - Documented the new configuration option in the JIT compilation function's parameters for clarity.
      - Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.
      
      * lint fix
      
      * Refactor index bitwidth configuration handling
      
      - Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
      - Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <wyatuestc@gmail.com>
      70546adc
  10. 24 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Improve flash attention example and layout comparison logic (#270) · 5f5bf53c
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      5f5bf53c
  11. 20 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
      f2e99180
  12. 18 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Refactor for Better Layout Conflict Handling (#240) · 2a286ae6
      Lei Wang authored
      * [Feature] Add reduce_max functionality and corresponding tests
      
      * Introduced a new test file for the reduce_max operation in the tilelang language module.
      * Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying.
      * Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation.
      * Enhanced profiling assertions to validate the output against reference implementations.
      
      * Fix whitespace issues in reduce_max test file for improved readability
      
      * [Refactor] Update DebugOutput methods to return strings instead of void
      
      * Modified DebugOutput methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to return std::string instead of void, enhancing usability for logging and debugging.
      * Updated corresponding header files to reflect the new return types.
      * Improved layout inference error messages by incorporating DebugOutput for better clarity in layout conflicts.
      
      * lint fix
      
      * Fix typo in matmul function: changed loop from T.Parallel to T.grid for correct parallel execution in webgpu code generation tests.
      
      * [Enhancement] Improve layout inference conflict handling in ParallelOp
      
      * Updated the layout inference logic in ParallelOp to better handle conflicts for local.fragment buffers.
      * Added checks to ensure that layout conflicts are reported only when both source and destination buffers are defined, improving clarity in error messages.
      * Enhanced the overall robustness of the layout inference process by addressing specific cases where conflicts may arise.
      
      * [Feature] Add IsEqual methods for layout comparison
      
      * Introduced IsEqual methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to facilitate structural equality checks, allowing for optional index comparison.
      * Enhanced layout inference logic in Copy and ParallelOp to utilize the new IsEqual methods for better conflict detection in local.fragment layouts.
      * Improved error messages for layout conflicts to provide clearer guidance on potential issues.houm
      
      * [Refactor] Update profiler usage in benchmark_nsa_fwd.py and improve layout inference in elem.cc and parallel.cc
      
      * Modified the profiler call in benchmark_nsa_fwd.py to streamline latency measurement.
      * Updated layout inference logic in elem.cc and parallel.cc to use const pointers for FragmentNode, enhancing type safety and clarity.
      * Improved error messages in layout conflict checks to provide better guidance on potential issues.
      
      * [Refactor] Clean up pointer formatting in layout inference files
      
      * Standardized pointer formatting for FragmentNode in elem.cc and parallel.cc to improve code readability.
      * Minor adjustments to error message formatting in layout conflict checks for better clarity.
      2a286ae6
  13. 16 Mar, 2025 1 commit
  14. 27 Feb, 2025 1 commit
    • Lei Wang's avatar
      [JIT] Enhance cython/ctypes wrapper for tma descriptor (#126) · 7b74bb01
      Lei Wang authored
      
      
      * refactor code
      
      * enhance tutorial
      
      * Enhance error handling and code generation in CUDA and TileLang components
      
      This commit introduces several improvements across multiple files:
      - Added more informative error messages in GEMM layout checks
      - Updated CUDA codegen to support more flexible function signature generation
      - Improved TMA descriptor initialization and kernel dispatch logic
      - Refined library generation and source code parsing utilities
      - Enhanced error handling in various adapter and wrapper classes
      
      * Add thread tag validation for warp specialization
      
      Introduce a ThreadTagChecker to validate that a PrimFunc only uses threadIdx.x before applying warp specialization. This prevents unintended transformations on kernels with complex thread binding and provides a clear warning to users about potential issues with warp specialization.
      
      * Update TileLang Profiling and Compilation in Flash Decoding Examples
      
      Refactor the profiling and compilation workflow in two flash decoding example scripts:
      - Replace `tilelang.lower()` and `tilelang.Profiler()` with `tilelang.compile()`
      - Simplify profiler initialization using `get_profiler()`
      - Update method calls to use the new profiler and compiled kernel objects
      - Maintain existing performance benchmarking and validation logic
      
      * Refactor and clean up code formatting in TileLang testing and adapter modules
      
      This commit includes several code style and formatting improvements:
      - Adjust whitespace and line breaks in test files
      - Improve code formatting in CUDA source wrapper and adapter utilities
      - Enhance readability of function calls and argument handling
      - Remove unnecessary whitespace and standardize indentation
      - Simplify function signatures and argument parsing
      
      * Refactor CUDA codegen and improve code formatting
      
      This commit includes several improvements to CUDA code generation and formatting:
      - Enhance function signature generation in CodeGenTileLangCUDA
      - Improve code formatting and readability in CUDA-related files
      - Simplify parameter handling and type annotations
      - Clean up whitespace and line breaks in codegen and layout files
      
      ---------
      Co-authored-by: default avatarUbuntu <dlisuser@h100testl730RPS.xu5snccwrbtejcqqalluoku5hb.xx.internal.cloudapp.net>
      7b74bb01
  15. 09 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Tools] Introduce `plot_layout` to visualize the fragment layout (#68) · f9b6a92e
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      f9b6a92e
  16. 25 Jan, 2025 1 commit
  17. 11 Jan, 2025 2 commits
    • Lei Wang's avatar
      [Lint] Overall Typo and Linting Fixes (#13) · fa511857
      Lei Wang authored
      * README.md fixed
      
      * update test ci
      
      * Lint and Typo Fix
      
      * Clang Format Lint Fix
      fa511857
    • Lei Wang's avatar
      [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
      
      * annotate lower transform global func with `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resloved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build scirpt
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: default avatarmicrosoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: default avatarMicrosoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: default avatarYu Cheng <yu.cheng@pku.edu.cn>
      57ab687c