1. 20 Jun, 2025 1 commit
    • [Enhancement] align shared memory allocations (#583) · fecc8336
      Lei Wang authored
      * [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper
      
      - Enhanced `pythonic_expr` to render type casts as `(type)value`, making printed expressions easier to read.
      - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
      - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
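A minimal sketch of the cast formatting described above, assuming a toy expression tree. The `Var`, `Cast`, and `Add` classes and the `format_expr` function are hypothetical stand-ins, since the real `pythonic_expr` works on TVM expression objects. Compound operands are parenthesized so the cast visibly applies to the whole expression.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical minimal expression nodes (not the TVM classes).
@dataclass
class Var:
    name: str


@dataclass
class Cast:
    dtype: str
    value: "Expr"


@dataclass
class Add:
    a: "Expr"
    b: "Expr"


Expr = Union[Var, Cast, Add, int]


def format_expr(e: Expr) -> str:
    """Render an expression, printing casts as `(type)value`."""
    if isinstance(e, int):
        return str(e)
    if isinstance(e, Var):
        return e.name
    if isinstance(e, Cast):
        inner = format_expr(e.value)
        # Parenthesize compound operands so the cast binds to the whole expression.
        if isinstance(e.value, Add):
            inner = f"({inner})"
        return f"({e.dtype}){inner}"
    if isinstance(e, Add):
        return f"{format_expr(e.a)} + {format_expr(e.b)}"
    raise TypeError(f"unsupported node: {e!r}")


print(format_expr(Cast("int64", Var("n"))))            # (int64)n
print(format_expr(Cast("float32", Add(Var("i"), 1))))  # (float32)(i + 1)
```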
      
      * [Feature] Implement dynamic shared memory allocation alignment
      
      - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
      - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
      - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
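The alignment arithmetic behind such a pass can be sketched as follows. This is an illustration under stated assumptions, not the pass itself: `align_up` and `plan_dynamic_smem` are hypothetical helpers, buffers are described only by their byte sizes, and the 128-byte boundary in the usage line is an arbitrary choice for the example.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round `offset` up to the next multiple of `alignment` (a power of two)."""
    assert alignment > 0 and alignment & (alignment - 1) == 0, "alignment must be a power of two"
    return (offset + alignment - 1) & ~(alignment - 1)


def plan_dynamic_smem(sizes_in_bytes, alignment=16):
    """Assign each buffer a start offset inside one dynamic shared memory pool,
    aligning every start to `alignment` bytes. Returns (offsets, total_bytes)."""
    offsets = []
    cursor = 0
    for size in sizes_in_bytes:
        cursor = align_up(cursor, alignment)  # pad so this buffer starts aligned
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor, alignment)


offsets, total = plan_dynamic_smem([100, 256, 30], alignment=128)
print(offsets, total)  # [0, 128, 384] 512
```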
      
      * [Enhancement] Update dtype and argument defaults in GEMM autotuning example
      
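      - Changed the data type from `float16` to `bfloat16`, which shares the 8-bit exponent of `float32` and therefore has a wider dynamic range than `float16`, at the cost of fewer mantissa bits.
      - Changed the default of the `--with_roller` argument from `True` to `False`, so the autotuner no longer uses the roller unless it is explicitly enabled.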
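One way to express these defaults with Python's `argparse` is sketched below. The parser is a hypothetical reconstruction: only `--with_roller` and the `bfloat16` dtype come from the commit message, and the `--dtype` option is invented for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI; not copied from example_gemm_autotune.py.
    parser = argparse.ArgumentParser(description="GEMM autotuning example (sketch)")
    parser.add_argument("--dtype", default="bfloat16", choices=["float16", "bfloat16"])
    # store_true yields a False default; the roller is used only when the flag is passed.
    parser.add_argument("--with_roller", action="store_true", default=False)
    return parser


args = build_parser().parse_args([])
print(args.dtype, args.with_roller)  # bfloat16 False
```

Using `action="store_true"` avoids a common pitfall of `type=bool`, where `bool("False")` evaluates to `True` and any non-empty string enables the flag.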
      
      * [Enhancement] Improve thread range computation in storage access
      
      - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
      - Updated `AccessEntry` structure to include `thread_range`.
      - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
      - Ensured the thread range is computed and stored during buffer load and store operations, so that access analysis can take the accessing threads into account.
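To illustrate what the new `thread_range` field records, here is a sketch with hypothetical Python mirrors of `AccessEntry` and `ComputeThreadRange`. Assumptions: thread variables are keyed by name (the C++ code keys by `Var`), each thread variable ranges over `[0, extent)` of its launch extent, and the `buffer`, `indices`, and `kind` fields are simplified.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Range:
    min: int
    extent: int


@dataclass
class AccessEntry:
    # Hypothetical mirror of the C++ struct: the buffer, its index expressions,
    # the access kind, and the range of each thread variable at this access.
    buffer: str
    indices: List[str]
    kind: str  # "read" or "write"
    thread_range: Dict[str, Range] = field(default_factory=dict)


def compute_thread_range(launch_extents: Dict[str, int]) -> Dict[str, Range]:
    """Map each thread variable (keyed by name here) to the interval [0, extent)."""
    return {var: Range(0, extent) for var, extent in launch_extents.items()}


entry = AccessEntry("A_shared", ["threadIdx.x"], "write",
                    compute_thread_range({"threadIdx.x": 128, "threadIdx.y": 2}))
print(entry.thread_range["threadIdx.x"])  # Range(min=0, extent=128)
```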
      
      * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment
      
      - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
      - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
      - Added a blank line for improved readability in `thread_storage_sync.cc`.
      
      * [Refactor] Enhance storage access analysis and thread range computation
      
      - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
      - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
      - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.
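While the then-branch of an `if` is visited under `With<arith::ConstraintContext>`, the analyzer may assume the condition holds. The sketch below mimics that effect for the simplest case, a single comparison on one thread index; `Range` and `narrow_by_condition` are hypothetical, and the real analyzer handles general expressions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    min: int
    extent: int


def narrow_by_condition(r: Range, op: str, bound: int) -> Range:
    """Intersect the thread interval [r.min, r.min + r.extent) with the values
    satisfying `tid <op> bound`, as an enclosing `if` would constrain it."""
    lo, hi = r.min, r.min + r.extent  # half-open interval
    if op == "<":
        hi = min(hi, bound)
    elif op == "<=":
        hi = min(hi, bound + 1)
    elif op == ">=":
        lo = max(lo, bound)
    elif op == ">":
        lo = max(lo, bound + 1)
    else:
        raise ValueError(f"unsupported comparison: {op}")
    return Range(lo, max(0, hi - lo))  # empty interval if the condition is never true


full = Range(0, 128)
print(narrow_by_condition(full, "<", 64))   # Range(min=0, extent=64)
print(narrow_by_condition(full, ">=", 96))  # Range(min=96, extent=32)
```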
      
      * [Enhancement] Update default matrix dimensions in GEMM autotune example
      
      - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.
      
      * typo fix
      
      * enhancement
      
      * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync
      
      - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
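A simplified sketch of where such a size check sits in a conflict test. `Access` and `may_conflict` are hypothetical; index expressions are plain strings, and treating identical index lists as the no-conflict case (each thread revisits its own element) is an assumption modeled on the usual TVM-style heuristic, not a transcription of `thread_storage_sync.cc`.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Access:
    buffer: str
    indices: Sequence[str]  # index expressions, as strings for illustration


def may_conflict(prev: Access, curr: Access) -> bool:
    """Conservative conflict test between two shared memory accesses."""
    if prev.buffer != curr.buffer:
        return False
    # Index lists of different length cannot be compared element-wise,
    # so report a conflict rather than risk omitting a needed barrier.
    if len(prev.indices) != len(curr.indices):
        return True
    # Simplifying assumption: identical index expressions mean each thread
    # touches its own element again; any difference is treated as a conflict.
    return any(p != c for p, c in zip(prev.indices, curr.indices))


print(may_conflict(Access("S", ["tx", "0"]), Access("S", ["tx"])))  # True
```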
  2. 11 Apr, 2025 1 commit
    • [Language] Introduce `T.any_of` and `T.all_of` to reduce a bool array (#371) · c4638d65
      * [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks
      
      - Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements.
      - Implemented the corresponding intrinsic calls for the CUDA backend.
      - Updated `allocate.py` to handle boolean types correctly in shared memory allocations.
      - Introduced tests for the new logical operations to ensure correctness and performance.
      Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
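For reference, the reduction semantics of `any_of` and `all_of` correspond to an OR-reduction and an AND-reduction over a boolean array. The plain-Python sketch below (equivalent to the built-in `any` and `all`) shows that behavior, including short-circuiting and the empty-input case; it is not the TileLang API, whose exact signatures are not given here.

```python
from typing import Iterable


def any_of(flags: Iterable[bool]) -> bool:
    """OR-reduction: True if at least one element is True."""
    for f in flags:
        if f:
            return True  # short-circuit on the first True
    return False  # empty input: nothing is True


def all_of(flags: Iterable[bool]) -> bool:
    """AND-reduction: True only if every element is True."""
    for f in flags:
        if not f:
            return False  # short-circuit on the first False
    return True  # empty input: vacuously True


mask = [False, True, False, False]
print(any_of(mask), all_of(mask))  # True False
```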
      
      * lint fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
  3. 14 Feb, 2025 1 commit
    • [Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm (#85) · ec84188f
      Lei Wang authored
      * Bump version to v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
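The core idea of a thread synchronization pass can be sketched for straight-line code: insert a barrier before any shared memory access that would otherwise race with an access made since the last barrier. This is a deliberate simplification in the spirit of the TVM planner the pass was separated from: `plan_syncs` is hypothetical, accesses are reduced to (kind, buffer) pairs, only read-after-write and write-after-read hazards trigger a barrier, and indices, loops, and thread ranges are ignored.

```python
from typing import List, Tuple


def plan_syncs(stmts: List[Tuple[str, str]]) -> List[str]:
    """Insert "sync" markers into a straight-line list of (kind, buffer) accesses,
    where kind is "read" or "write".

    A barrier goes before a read of a buffer written since the last barrier
    (read-after-write) or a write of a buffer read since the last barrier
    (write-after-read).
    """
    out: List[str] = []
    reads, writes = set(), set()  # buffers touched since the last barrier
    for kind, buf in stmts:
        if kind == "read":
            hazard = buf in writes
        elif kind == "write":
            hazard = buf in reads
        else:
            raise ValueError(f"unknown access kind: {kind}")
        if hazard:
            out.append("sync")
            reads.clear()
            writes.clear()
        (reads if kind == "read" else writes).add(buf)
        out.append(f"{kind} {buf}")
    return out


print(plan_syncs([("write", "A"), ("read", "A"), ("write", "A")]))
# ['write A', 'sync', 'read A', 'sync', 'write A']
```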
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync