1. 16 Jun, 2025 2 commits
    • [CI] Modify test requirements and CI workflow (#578) · 4c24a69e
      Yu Cheng authored
      * [CI] Modify test requirements and CI workflow
      
      - Replaced `flash-attn` with `packaging` and `wheel` in `requirements-test.txt` to ensure proper package management.
      - Updated the CI workflow to install `flash-attn` without build isolation, improving the installation process.
      
      * [CI] remove redundant packages
      
      * [CI] Update test requirements and CI workflow
      
      - Added `flash-attn` to `requirements-test.txt` to ensure it is included in the testing environment.
      - Modified the CI workflow to install packages from `requirements-test.txt` with `PIP_NO_BUILD_ISOLATION=1`, improving the installation process.
    • [Refactor] Phaseout tf32 Casting from GEMM Templates (#573) · 9ba8b480
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * [Refactor] Remove tf32 Casting Logic from GEMM Templates
      
      - Eliminated the `cast_float_to_tf32` function from `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` templates to streamline the code.
      - Removed conditional casting logic for float32 to tfloat32 conversion, enhancing clarity and maintainability.
      - Updated relevant sections in GEMM operations to reflect the removal of casting, ensuring consistent behavior across templates.
      - Adjusted tensor view handling to improve performance and accuracy in matrix operations.
      
      * Update bulk_copy.cc
      
      * Fix profiler initialization in GEMM test by removing TensorSupplyType argument for improved flexibility.
  2. 13 Jun, 2025 4 commits
  3. 11 Jun, 2025 4 commits
    • [Feature] Implement Swizzle 32B (#566) · ae9668a8
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * Update bulk_copy.cc
      
      * Update __init__.py
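
      The XOR trick behind a 32-byte ("quarter bank") swizzle can be sketched in a few lines of Python. This is an illustration of the idea only; the actual `makeQuarterBankSwizzleLayout` mapping lives in the C++ layout code and may differ in detail:

      ```python
      def swizzle_32b(row: int, col: int, elem_bytes: int = 2, row_bytes: int = 128) -> int:
          chunk_elems = 32 // elem_bytes        # elements per 32-byte chunk
          chunks_per_row = row_bytes // 32      # 4 chunks in a 128-byte row
          chunk, offset = divmod(col, chunk_elems)
          # XOR with the low row bits permutes chunks per row (a bijection),
          # spreading a warp's accesses across different bank groups.
          chunk ^= row % chunks_per_row
          return chunk * chunk_elems + offset
      ```
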
    • [Bugfix] Add `__tune_params` into key hash for autotuning (#565) · ae386a7b
      Lei Wang authored
      * [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation
      
      - Modified AutoTuner to store cache in a dedicated "autotuner" directory.
      - Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing.
      - Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation.
      
      * lint fix
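
      Folding `__tune_params` into the key means hashing them alongside the kernel source and compile arguments, roughly like the sketch below (names are illustrative, not tilelang's actual internals):

      ```python
      import hashlib
      import json

      def autotune_cache_key(kernel_source: str, tune_params: dict, compile_args: dict) -> str:
          # Serialize deterministically (sorted keys) so equal inputs hash equally;
          # runs tuned over different config spaces now get different keys.
          payload = json.dumps(
              {"src": kernel_source, "tune": tune_params, "compile": compile_args},
              sort_keys=True,
              default=str,  # fall back to str() for non-JSON values such as dtypes
          )
          return hashlib.sha256(payload.encode("utf-8")).hexdigest()
      ```
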
    • [Refactor] Improve dtype handling in KernelParam class (#564) · 59172ff6
      Yu Cheng authored
      - Updated the dtype handling logic in the KernelParam class to enhance clarity and maintainability. The dtype string is now modified only if it starts with "torch.", simplifying the return statement for boolean type checks.
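
      The described logic amounts to something like this sketch (illustrative, not the exact KernelParam code):

      ```python
      def normalize_dtype(dtype) -> str:
          s = str(dtype)              # e.g. torch.float16 -> "torch.float16"
          if s.startswith("torch."):  # strip the prefix only when it is present
              s = s[len("torch."):]
          return s

      def is_boolean(dtype) -> bool:
          return normalize_dtype(dtype) == "bool"
      ```
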
    • [Feature] Introduce Persistent Loop and Update GEMM Example (#563) · e7b97be2
      Yu Cheng authored
      * [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation
      
      - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid.
      - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code.
      - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
      - Updated relevant documentation and comments to reflect the addition of new features and usage instructions.
      
      * [Example] Add MLA Decode With Persistent Threadblock Example
      
      * [Feature] Introduce Persistent Loop and Update GEMM Example
      
      - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
      - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
      - Introduced a `loop_break` intrinsic for better control flow within persistent loops.
      - Updated relevant files to support the new features, including changes in code generation and language interface.
      
      * lint fix
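
      The persistent pattern introduced here can be summarized with a small scheduling sketch (conceptual Python, not the TIR primitive itself): instead of launching one thread block per tile, one resident block per SM loops over many tiles, with `sync_grid` providing the grid-wide barrier between dependent steps.

      ```python
      def persistent_tiles(num_tiles: int, num_blocks: int, block_id: int):
          # Each resident block strides through the tile space.
          for tile in range(block_id, num_tiles, num_blocks):
              yield tile

      # With 132 persistent blocks, block 0 processes tiles 0, 132, 264, ...
      assert list(persistent_tiles(400, 132, 0)) == [0, 132, 264, 396]
      ```
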
  4. 09 Jun, 2025 1 commit
    • [Enhancement] Optimize debug info for auto tuning (#560) · e5e36dbf
      Lei Wang authored
      * [Enhancement] Update AutoTuner and JIT compilation arguments
      
      * Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities.
      * Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling.
      * Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes.
      
      * Update tilelang/autotuner/param.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * remove redundant comments
      
      * Refactor kernel source retrieval and logging levels
      
      * Updated `AutotuneResult` to use `kernel.get_kernel_source()` instead of `kernel.adapter.get_kernel_source()`.
      * Changed logging level in `KernelCache` from `ERROR` to `DEBUG` for improved verbosity during kernel caching operations.
      * Removed unnecessary verbose logging in JIT compilation process to streamline output.
      
      * Merge branch 'main' of https://github.com/tile-ai/tilelang into bugfix_autotune_0604
      
      * lint fix
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  5. 07 Jun, 2025 2 commits
    • [Feature] Support persistent kernels and add persistent GEMM examples (#559) · 225aca61
      Yu Cheng authored
      * [Enhancement] Fix multi-version buffer index in nested-loop
      
      * [Feature] Support persistent kernels and add persistent GEMM example
      
      * lint fix
      
      * lint fix
      
      * [CI] Remove test_tilelang_transform_annotate_device_regions.py
    • [Bugfix] Add tf32 casting to GEMM templates (#556) · 8cc8db52
      Lei Wang authored
      * Add tf32 casting functionality to GEMM templates
      
      - Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across gemm_sm80, gemm_sm89, and gemm_sm90 templates.
      - Implemented conditional casting in relevant sections of the GEMM operations to ensure compatibility with tfloat32 types.
      - Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations.
      
      * lint fix
      
      * Refactor tfloat32 casting logic in GEMM templates
      
      - Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32.
      - Updated relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic, enhancing compatibility with tfloat32 types.
      - Ensured consistent application of casting across tensor views, improving performance and correctness in matrix operations.
      
      * Refactor GEMM template functions for improved readability
      
      - Simplified the function signature of `body_rs` in both `gemm_sm80` and `gemm_sm90` templates for better clarity.
      - Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views, enhancing performance and maintainability.
      
      * Enhance tf32 casting logic in GEMM templates
      
      - Updated the `cast_float_to_tf32` function in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to conditionally apply the casting only if the input is finite, improving robustness.
      - Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required, enhancing code readability and maintainability.
      
      * Refactor GEMM template functions and layout inference logic
      
      - Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated the `body_sr` function to streamline the casting process for tensor views, enhancing code clarity and maintainability.
      - Improved layout inference in `layout_inference.cc` by adding checks for the layout map's definition, ensuring robustness in handling layout annotations.
      - Simplified the handling of layout maps in the `annotate_layout` function, allowing for more flexible layout definitions and error handling.
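
      For reference, tf32 keeps fp32's 8-bit exponent but only 10 mantissa bits, so a `cast_float_to_tf32` helper rounds away the low 13 mantissa bits. A bit-level Python sketch, including the finite-input guard mentioned above (the CUDA templates use a single PTX conversion instead of bit twiddling):

      ```python
      import math
      import struct

      def cast_float_to_tf32(x: float) -> float:
          if not math.isfinite(x):       # leave NaN/Inf untouched, as the guard does
              return x
          bits = struct.unpack("<I", struct.pack("<f", x))[0]
          bits += 1 << 12                # round to nearest, ties away from zero
          bits &= ~((1 << 13) - 1)       # clear the 13 mantissa bits tf32 drops
          return struct.unpack("<f", struct.pack("<I", bits))[0]
      ```
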
  6. 06 Jun, 2025 1 commit
  7. 05 Jun, 2025 3 commits
    • [Enhancement] Add nvrtc execution backend (#461) · 17f7394f
      Gabriel Wu authored
      * [wip] feat: add nvrtc backend
      
      * [wip] fix: handle out_idx
      
      * [wip] refactor: move lib logic to libgen
      
      * feat: cache for nvrtc backend
      
      * fmt: run format
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: get kernel source
      
      * refactor: speedup pyimport
      
      * Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.
      
      * Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
      
      * Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.
      
      * fix tl_templates
      
      * ensure CUDA context
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
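
      The import-error handling described above follows a common pattern: probe for the optional dependency once, then fail loudly at backend construction. A sketch (module path per the `cuda-python` package; message text illustrative):

      ```python
      try:
          from cuda import nvrtc  # noqa: F401 -- provided by the cuda-python package
          _NVRTC_AVAILABLE = True
      except ImportError:
          _NVRTC_AVAILABLE = False

      class NVRTCBackend:
          def __init__(self) -> None:
              # Raise with actionable instructions instead of logging a warning.
              if not _NVRTC_AVAILABLE:
                  raise ImportError(
                      "The nvrtc execution backend requires cuda-python; "
                      "install it with `pip install cuda-python`.")
      ```
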
    • [CI] Add FusedMoE example (#555) · 88c622c9
      Zhengju Tang authored
      * [CI] Add FusedMoE example
      
      * Lint
      
      * Fix import bug
      
      * Fix comment bug
      
      * Update example_fusedmoe_torch.py
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Release] Bump Version to 0.1.5 (#551) · cc5e9f7c
      Lei Wang authored
      * Update VERSION to 0.1.5
      
      * Add DEBUG_MODE support in setup.py and update CMake build type; enhance pypi.Dockerfile with git installation
  8. 04 Jun, 2025 6 commits
    • [CI]Add norm and layout_plot (#534) · c9e503be
      alex_xiao authored
      * [CI]Add norm and layout_plot
      
      * fix lint
      
      * Remove obsolete test files for RMS normalization and plot layout, streamlining the testing suite.
      
      * Add make_mma_load_base_layout function to create MMA result layouts
      
      - Introduced a new function `make_mma_load_base_layout` for generating layout functions for storing MMA results in fragment buffers.
      - Added detailed docstring explaining parameters, return values, and potential exceptions.
      - Implemented logic for handling different data types and matrix configurations, including assertions for input validation.
      - Defined internal functions for mapping fragment indices to threads and local indices, enhancing the layout functionality.
      
      * Enhance MMA load test with additional imports and functionality
      
      - Added imports for `tilelang.language`, `Literal`, `Callable`, `DataType`, `IndexMap`, and `get_mma_micro_size` to support extended functionality.
      - Improved the `make_mma_load_base_layout` function by ensuring it can handle various data types and configurations.
      - Updated the test function `test_mma_load_base_layout` to validate the layout for float16 matrix A.
      
      * Fix formatting in test_fragment_mma_load_a.py by adding a blank line for improved readability.
      
      * Add RMS normalization functions to test_rms_norm.py
      
      - Introduced `rms_norm` and `rms_norm_splitk` functions for RMS normalization, enhancing the testing capabilities.
      - Implemented kernel functions with shared memory allocation and parallel processing for improved performance.
      - Updated the test function to validate the new RMS normalization implementations.
      
      * Add reference program for RMS normalization in test_rms_norm.py
      
      - Introduced `ref_program` function to provide a reference implementation for RMS normalization.
      - This addition enhances the testing framework by allowing comparisons against a known reference output.
      
      * Enhance RMS normalization tests with additional imports and formatting
      
      - Added import for `tilelang.language` to support extended functionality in `test_rms_norm.py`.
      - Improved code readability by adding blank lines for better separation of code sections.
      
      * Update RMS normalization test parameters and enhance layout plotting
      
      - Increased matrix dimensions in `test_rms_norm` to 8192 for improved performance testing.
      - Removed obsolete test functions in `test_fragment_mma_load_a.py` to streamline the test suite.
      - Enhanced layout plotting functionality by ensuring proper visualization of base, warp, and block layouts in `test_fragment_mma_load_a.py`.
      
      * Refactor RMS normalization test parameters and improve layout plotting readability
      
      - Simplified the parameters in `test_rms_norm` by removing `blk_k` for clarity.
      - Enhanced code readability in `test_fragment_mma_load_a.py` by adjusting the formatting of the `block_layout` definition and removing the unused `warp_cols` variable.
      
      * Enhance RMS normalization with split-k implementation and additional profiling
      
      - Added a new function `test_rms_norm_splitk` to test the split-k variant of RMS normalization.
      - Updated the main RMS normalization script to include profiling for the split-k implementation.
      - Ensured all checks pass with appropriate latency measurements for both reference and tile-lang implementations.
      
      * Remove obsolete test file `test_fragment_mma_load_a.py` to streamline the test suite.
      
      * Refactor `rms_norm.py` to streamline benchmarking output and remove redundant code. Comment out the `plot_layout` call in `fragment_mma_load_a.py` for clarity.
      
      * Refactor `test_rms_norm.py` by removing redundant test function `test_rms_norm_splitk` to streamline the test suite and improve clarity.
      
      ---------
      Co-authored-by: Your Name <you@example.com>
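
      For context, a reference RMS normalization like the `ref_program` added in this commit is a near one-liner in PyTorch (sketch; the test's epsilon and split-k variant may differ):

      ```python
      import torch

      def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
          # Scale by the reciprocal root-mean-square over the last dimension.
          return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
      ```
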
    • [CI] Add linear attention examples to CI (#552) · eec07578
      Tong WU authored
      * Add linear attention examples.
      
      * Add license
      
      * Remove comments
      
      * Run yapf and ruff
    • [Autotune] Remove the out_idx argument from the autotune cache (#553) · 5fbfb80b
      Lei Wang authored
      * [Enhancement] Update AutoTuner and JIT compilation arguments
      
      * Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities.
      * Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling.
      * Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes.
      
      * Update tilelang/autotuner/param.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * remove redundant comments
      
      * Update tilelang/jit/__init__.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
    • [Bugfix] Enhance layout inference pass for flexibility (#550) · 444b7c4e
      Lei Wang authored
      * Enhance Layout
      
      * strict update
      
      * lint fix
      
      * Refactor layout inference by removing unnecessary logging statements in `parallel.cc` and `layout_inference.cc`. This cleanup enhances code readability and reduces log clutter during layout inference steps.
      
      * lint fix
      
      * Refactor file copying logic in setup.py to simplify directory creation and file copying process. Removed unnecessary existence check before copying source files to the target directory.
    • [AMD][Enhancement] Add support for Vectorized FP8 DataPacking (#542) · 319bc6b1
      Lei Wang authored
      * [Enhancement] Add support for new FP8 types in HIP code generation
      
      * Updated `PrintConst` function in `codegen_hip.cc` to handle `float8_e4m3fnuz` type.
      * Introduced new functions in `hip_fp8.h` for creating FP8 types, including `make_fp8_e4_4_t` and `make_fp8_e4_8_t`, enhancing type handling for FP8 data structures.
      * Improved overall compatibility and performance for FP8 data types in HIP.
      
      * workaround for competition
      
      * enhance autotune
      
      * autotune cache fix
      
      * Implement validation for unused keys in AutoTuner configuration
      
      * Added a check in the AutoTuner class to raise a ValueError if there are unused keys in the configuration, enhancing error handling and ensuring configuration integrity.
      
      * lint fix
      
      * revert changes of threads
      
      * Update pipelining in `example_mla_decode.py` to improve performance
      
      * Changed the number of stages in the pipelined loop from 0 to 2, enhancing the efficiency of the attention mechanism in the decoding process.
      
      * Enhance Cython kernel validation by adding tensor attribute checks
      
      * Updated the `CythonKernelWrapper` to include dedicated methods for validating tensor device, dtype, and static shape.
      * Modified the `forward` method to utilize these new validation methods, improving error handling and ensuring input integrity.
      * Updated the `lambda_forward` function in `CythonKernelAdapter` to reflect changes in validation parameters.
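
      The dedicated validation methods described above reduce to checks of this shape (illustrative Python; the real wrapper is Cython):

      ```python
      import torch

      def validate_tensor(t: torch.Tensor, device: torch.device,
                          dtype: torch.dtype, shape: tuple) -> None:
          if t.device != device:
              raise ValueError(f"expected device {device}, got {t.device}")
          if t.dtype != dtype:
              raise ValueError(f"expected dtype {dtype}, got {t.dtype}")
          if tuple(t.shape) != shape:
              raise ValueError(f"expected static shape {shape}, got {tuple(t.shape)}")
      ```
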
    • [Refactor] Include several examples into ci (#531) · 3ca3a8af
      Lei Wang authored
      * Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.
      
      * lint fix
      
      * fix split k example
      
      * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation
      
      * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
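
      The `cdiv` helper referenced here is ceiling division for block counts; in Python it is simply:

      ```python
      def cdiv(a: int, b: int) -> int:
          return -(a // -b)  # ceil(a / b) using only integer arithmetic

      assert cdiv(1000, 128) == 8  # 8 blocks of 128 cover 1000 elements
      ```
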
  9. 03 Jun, 2025 2 commits
  10. 02 Jun, 2025 1 commit
  11. 01 Jun, 2025 1 commit
    • [AMD] Support float8 matrix core (#537) · 5872e647
      Lei Wang authored
      * [Enhancement] Add support for FP8 types in CUDA and HIP code generation
      
      * Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
      * Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
      * Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types.
      * Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
      * Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types.
      
      * lint fix
      
      * Update src/target/codegen_hip.cc
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * Update tilelang/intrinsics/mfma_macro_generator.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * workaround
      
      * fix
      
      * Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c
      
      * bug fix
      
      * Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality.
      
      * Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency.
      
      * Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly.
      
      * lint fix
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
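
      For reference, OCP e4m3 packs 1 sign, 4 exponent, and 3 mantissa bits with bias 7; a decoding sketch is below (the `float8_e4m3fnuz` variant used on AMD differs: bias 8, no negative zero, a single NaN encoding):

      ```python
      def decode_fp8_e4m3(byte: int) -> float:
          sign = -1.0 if byte & 0x80 else 1.0
          exp = (byte >> 3) & 0xF
          man = byte & 0x7
          if exp == 0:                   # subnormal: no implicit leading one
              return sign * (man / 8.0) * 2.0 ** -6
          if exp == 0xF and man == 0x7:  # e4m3fn's only NaN code (no infinities)
              return float("nan")
          return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)
      ```
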
  12. 31 May, 2025 1 commit
  13. 29 May, 2025 1 commit
    • [Language] Support `T.annotate_l2_hit_ratio` via `cudaStreamSetAttribute` (#539) · a65f481e
      Lei Wang authored
      * Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code
      
      * Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process.
      * Cleaned up unnecessary whitespace in the function for improved readability.
      * Enhanced the overall clarity and maintainability of the code.
      
      * Refactor AllocateNode handling in vectorize_loop.cc
      
      * Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic.
      * Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability.
      * Improved overall readability by eliminating unnecessary comments and code related to extent handling.
      
      * Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing.
      
      * Add buffer allocation planning step in OptimizeForTarget function
      
      * Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process.
      * This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management.
      
      * Update submodule TVM to latest commit db50d4e, ensuring alignment with upstream changes.
      
      * Add L2 persistent annotation support and related functionality
      
      * Introduced a new file `lower_l2_persistent_annotation.cc` to handle the lowering of L2 persistent annotations.
      * Added functions to annotate L2 hit ratios for buffers, ensuring compatibility with global buffer requirements.
      * Updated the `LowerAndLegalize` function to include the new L2 persistent map lowering step.
      * Enhanced CUDA driver with a function to retrieve the maximum size of the persisting L2 cache.
      * Modified the `TLCUDASourceWrapper` class to integrate L2 persistent map handling during kernel launches.
      
      These changes improve the framework's ability to manage L2 cache optimizations, enhancing performance for CUDA applications.
      
      * lint fix
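
      Conceptually, `T.annotate_l2_hit_ratio` lowers to CUDA's access-policy window, set on the stream at launch. A sketch of the fields involved (plain data here; the generated code passes the equivalent struct to `cudaStreamSetAttribute`):

      ```python
      def l2_persistent_window(base_ptr: int, num_bytes: int, hit_ratio: float) -> dict:
          # Mirrors cudaAccessPolicyWindow, applied via
          # cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, ...).
          assert 0.0 <= hit_ratio <= 1.0
          return {
              "base_ptr": base_ptr,    # start of the global buffer to keep resident
              "num_bytes": num_bytes,  # window size, clamped to the max persisting L2 size
              "hitRatio": hit_ratio,   # fraction of accesses given the persisting property
              "hitProp": "cudaAccessPropertyPersisting",
              "missProp": "cudaAccessPropertyStreaming",
          }
      ```
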
  14. 28 May, 2025 4 commits
    • [Refactor] Disable legacy vectorization for buffer allocation (#535) · e71c7a17
      Lei Wang authored
      * Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code
      
      * Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process.
      * Cleaned up unnecessary whitespace in the function for improved readability.
      * Enhanced the overall clarity and maintainability of the code.
      
      * Refactor AllocateNode handling in vectorize_loop.cc
      
      * Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic.
      * Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability.
      * Improved overall readability by eliminating unnecessary comments and code related to extent handling.
      
      * Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing.
      
      * Add buffer allocation planning step in OptimizeForTarget function
      
      * Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process.
      * This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management.
    • yyttt6 · 8af5eb77
    • [Autotune] Introduce cache mechanism for auto tuner (#527) · 7171aff6
      Lei Wang authored
      * [Enhancement] Add commit ID to versioning and improve logging initialization
      
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
      
      * [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching
      
      * Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
      * Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
      * Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
      * Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
      * Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
      * Minor adjustments in related modules to support the new caching and profiling features.
      
      * [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules
      
      * Consolidated import statements and removed unnecessary line breaks for better readability.
      * Standardized function argument formatting across the AutoTuner and CompileArgs classes.
      * Enhanced consistency in the use of whitespace and indentation throughout the codebase.
      * Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
      * Ensured that all changes adhere to the project's coding style guidelines.
      
      * [Refactor] Remove redundant type hints in AutoTuner modules
      
      * Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
      * Improved code readability and maintainability by streamlining type imports across the AutoTuner module.
      
      * [Refactor] Update AutoTuner configuration for improved profiling and target detection
      
      * Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
      * Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
      * Removed redundant target specifications in certain instances to streamline the configuration process.
      * Improved overall clarity and maintainability of the autotuning logic in various example scripts.
      
      * [Refactor] Simplify code formatting and improve readability in example scripts
      
      * Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
      * Removed unnecessary line breaks and standardized argument placement across multiple files.
      * Enhanced overall code readability and maintainability in autotuning examples and performance scripts.
      
      * [Refactor] Update JIT decorator usage across multiple files
      
      * Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
      * Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
      * Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.
      
      * [Refactor] Standardize JIT decorator formatting across benchmark and example scripts
      
      * Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
      * Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
      * Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
    • [Refactor] Refactor convolution example to streamline configuration and remove unused code (#530) · 09581e4e
      Lei Wang authored
      * Refactor convolution example to streamline configuration and remove unused code
      
      * Updated the `check_hopper` function to properly check for CUDA availability and compute capability.
      * Removed the `get_configs` and `get_best_config` functions, simplifying the example by eliminating unused autotuning logic.
      * Adjusted argument parsing in the `main` function to directly compile the convolution kernel without autotuning options.
      * Cleaned up the code for better readability and maintainability.
      
      * Update examples/convolution/example_convolution.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
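
      The corrected `check_hopper` boils down to a compute-capability test, roughly (sketch):

      ```python
      import torch

      def check_hopper() -> bool:
          # Hopper GPUs (e.g. H100) report compute capability 9.x.
          if not torch.cuda.is_available():
              return False
          major, _minor = torch.cuda.get_device_capability()
          return major == 9
      ```
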
  15. 27 May, 2025 2 commits
    • [CI] Add gemm and gemm_fp8 example to CI (#516) · ee4e708d
      Leslin authored
      * [CI] Add gemm and gemm_fp8 example to CI
      
      * Fix lint via format.sh
      
      * Resolved the issues with profiler API usage and parse_args
    • [Enhancement] Add warp specialization attribute handling in IR and rewriter (#518) · 41bc15cb
      Yu Cheng authored
      * Introduced an `AttrFrame` for warp specialization in the IR, enhancing the handling of warp-specific optimizations.
      * Refactored the `VisitStmt_` method in `warp_specialized_rewriter.cc` to check for the new warp specialization attribute, improving the detection of warp specialization conditions.
      * Removed outdated code related to condition checks in `IfThenElseNode`, streamlining the specialization logic.
  16. 26 May, 2025 4 commits
    • [Enhancement] Add commit ID to versioning and improve logging initialization (#524) · 62a8d7f0
      Lei Wang authored
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
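
      A `get_git_commit_id` helper of the kind described is typically a thin subprocess wrapper (sketch; the real implementation may handle errors differently):

      ```python
      import subprocess
      from typing import Optional

      def get_git_commit_id() -> Optional[str]:
          try:
              return subprocess.check_output(
                  ["git", "rev-parse", "HEAD"], text=True,
                  stderr=subprocess.DEVNULL).strip()
          except (OSError, subprocess.CalledProcessError):
              return None  # not a git checkout, or git is unavailable
      ```
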
    • [Refactor] Reorganize Thread Synchronization Steps to make sure global synchronization can be correctly lowered (#521) · 41c51d07
      Lei Wang authored
      * [Refactor] Reorganize Thread Synchronization Steps in OptimizeForTarget Function
      
      * Removed redundant thread synchronization steps for "global" and "shared" memory, streamlining the optimization process.
      * Reintroduced necessary synchronization for "shared" and "shared.dyn" after the injection of PTX async copy, ensuring correct memory access patterns.
      * Enhanced overall clarity and maintainability of the OptimizeForTarget function by restructuring the order of operations.
      
      * [Refactor] Reorder Thread Synchronization and PTX Async Copy in OptimizeForTarget Function
      
      * Removed redundant global thread synchronization step and adjusted the order of operations for shared memory synchronization.
      * Ensured that the PTX async copy injection occurs after the global thread sync, improving memory access validity.
      * Enhanced clarity and maintainability of the OptimizeForTarget function by restructuring synchronization steps.
    • [Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 (#522) · 46798f25
      Lei Wang authored
      * [Enhancement] Add atomic addition functions for FLOAT16x2 and FLOAT16x4 in CUDA
      
      * Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic addition operations on double-width float types in CUDA.
      * Updated `customize.py` to include the new `atomic_addx4` function for external calls.
      * Modified `__init__.py` to export the new atomic addition function, ensuring accessibility in the module.
      
      * lint fix
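
      `AtomicAddx2` performs one atomic read-modify-write on a 32-bit word holding two fp16 lanes. The lane-wise arithmetic it applies looks like this (Python sketch of the math only, not the CUDA intrinsic):

      ```python
      import struct

      def half2_add(packed_a: int, packed_b: int) -> int:
          # Unpack two little-endian fp16 lanes from each 32-bit word ("e" = half).
          a0, a1 = struct.unpack("<2e", packed_a.to_bytes(4, "little"))
          b0, b1 = struct.unpack("<2e", packed_b.to_bytes(4, "little"))
          out = struct.pack("<2e", a0 + b0, a1 + b1)  # round each lane back to fp16
          return int.from_bytes(out, "little")
      ```
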
    • [Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) · 6addc509
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      
      * Enhance CUDA FP8 type handling and debug printing
      
      * Updated `cuda_fp8.h` to replace NVidia's FP8 types with Cute's FP8 types for better compatibility and structure.
      * Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
      * Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.
      
      * Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state
      
      * Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
      * Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.
      
      * Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting
      
      * Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
      * This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.
      
      * lint fix
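
      The `get_num_sms` helper can be approximated through PyTorch (sketch; the real version in `cuda_driver.py` may query the driver API directly):

      ```python
      import torch

      def get_num_sms(device: int = 0) -> int:
          # Count of streaming multiprocessors on the given CUDA device.
          return torch.cuda.get_device_properties(device).multi_processor_count
      ```
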
  17. 25 May, 2025 1 commit
    • [Enhancement] Support auto synchronization for global memory access (#519) · 623edf4c
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.