1. 23 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) · 6b125028
      Lei Wang authored
      * Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.
      
      * Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.
      
      * Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.
      
      * Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.
      
      * typo fix
      
      * Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.
      
      * lint fix
      
      * test fix
      
      * Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.
      
      * use ninja instead of make
      
      * Add CMake configuration step for Ninja build system in setup.py
      
      * Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.
      
      * Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.
      
      * Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.
      
      * Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.
      
      * Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.
      
      * Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.
      
      * bugfix
      
      * Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.
      
      * Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.
      
      * Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.
      6b125028
  2. 20 Jun, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] align shared memory allocations (#583) · fecc8336
      Lei Wang authored
      * [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper
      
      - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
      - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
      - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
      
      * [Feature] Implement dynamic shared memory allocation alignment
      
      - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
      - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
      - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
      
      * [Enhancement] Update dtype and argument defaults in GEMM autotuning example
      
      - Changed data type from `float16` to `bfloat16` for improved precision in computations.
      - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process.
      
      * [Enhancement] Improve thread range computation in storage access
      
      - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
      - Updated `AccessEntry` structure to include `thread_range`.
      - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
      - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency.
      
      * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment
      
      - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
      - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
      - Added a blank line for improved readability in `thread_storage_sync.cc`.
      
      * [Refactor] Enhance storage access analysis and thread range computation
      
      - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
      - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
      - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.
      
      * [Enhancement] Update default matrix dimensions in GEMM autotune example
      
      - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.
      
      * typo fix
      
      * enhancement
      
      * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync
      
      - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
      fecc8336
  3. 11 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Language] Introduce `T.any_of` and `T.all_of` to reduce a bool arrary (#371) · c4638d65
      Lei Wang authored
      
      
      * [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks
      
      - Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements.
      - Implemented corresponding intrinsic calls for CUDA, enhancing the functionality of the TileLang framework.
      - Updated the `allocate.py` to handle boolean types correctly in shared memory allocations.
      - Introduced tests for the new logical operations to ensure correctness and performance.
      Co-authored-by: default avatarZhiwen Mo <zhiwen.mo25@ic.ac.uk>
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarZhiwen Mo <zhiwen.mo25@ic.ac.uk>
      c4638d65
  4. 14 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm (#85) · ec84188f
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
      ec84188f