    [Enhancement] align shared memory allocations (#583) · fecc8336
    Lei Wang authored
    * [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper
    
    - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
    - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
    - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
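The `(type)value` cast notation described above can be illustrated with a minimal sketch. The `Var`, `Cast`, and `Add` classes and the `format_expr` function below are hypothetical stand-ins written for this illustration; they are not the project's `pythonic_expr` or its IR node types.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical expression nodes, defined only for this illustration.
@dataclass
class Var:
    name: str


@dataclass
class Cast:
    dtype: str
    value: "Expr"


@dataclass
class Add:
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Var, Cast, Add, int]


def format_expr(expr) -> str:
    """Render an expression as a string, writing casts as `(type)value`."""
    if isinstance(expr, int):
        return str(expr)
    if isinstance(expr, Var):
        return expr.name
    if isinstance(expr, Cast):
        inner = format_expr(expr.value)
        # Parenthesize compound operands so the cast visibly covers all of them.
        if isinstance(expr.value, Add):
            inner = f"({inner})"
        return f"({expr.dtype}){inner}"
    if isinstance(expr, Add):
        return f"{format_expr(expr.lhs)} + {format_expr(expr.rhs)}"
    raise TypeError(f"unsupported expression: {expr!r}")
```

For instance, `format_expr(Cast("float16", Var("x")))` yields `(float16)x`.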
    
    * [Feature] Implement dynamic shared memory allocation alignment
    
    - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
    - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
    - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
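The arithmetic behind aligning allocations to a byte boundary can be sketched as follows. This is a simplified illustration, not the `AlignDynamicSharedMemoryAllocations` pass itself: the helper names `align_up` and `aligned_offsets` are hypothetical, and the sketch assumes the alignment is a power of two and that buffers are packed in order into one shared region.

```python
def align_up(size: int, alignment: int) -> int:
    """Round `size` up to the next multiple of `alignment` (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (size + alignment - 1) & ~(alignment - 1)


def aligned_offsets(sizes, alignment):
    """Assign each buffer an aligned start offset within one shared region.

    Returns the list of offsets and the total (aligned) region size in bytes.
    """
    offsets = []
    cursor = 0
    for size in sizes:
        cursor = align_up(cursor, alignment)  # pad up to the next boundary
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor, alignment)
```

With 16-byte alignment, buffers of 100, 40, and 8 bytes would start at offsets 0, 112, and 160, and the region would occupy 176 bytes.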
    
    * [Enhancement] Update dtype and argument defaults in GEMM autotuning example
    
    - Changed data type from `float16` to `bfloat16`, which trades mantissa precision for a wider dynamic range.
    - Changed the default value of the `--with_roller` argument from `True` to `False`, so the roller is now opt-in for the autotuning example.
    
    * [Enhancement] Improve thread range computation in storage access
    
    - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
    - Updated `AccessEntry` structure to include `thread_range`.
    - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
    - Ensured thread range is computed and stored during buffer load and store operations, so each access entry carries it for later analysis.
    
    * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment
    
    - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
    - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
    - Added a blank line for improved readability in `thread_storage_sync.cc`.
    
    * [Refactor] Enhance storage access analysis and thread range computation
    
    - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
    - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
    - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.
    
    * [Enhancement] Update default matrix dimensions in GEMM autotune example
    
    - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.
    
    * typo fix
    
    * enhancement
    
    * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync
    
    - Added a check that returns true, indicating a conflict, when the previous and current buffer index lists differ in size.
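The size-mismatch rule can be sketched in isolation. The function `indices_conflict` below is a hypothetical, simplified model and not the code in `thread_storage_sync.cc`: it assumes two accesses are conflict-free only when their index lists match element for element, and it compares indices with plain equality rather than symbolic proof.

```python
def indices_conflict(prev_indices, curr_indices) -> bool:
    """Return True if two buffer accesses should be treated as conflicting.

    Simplified assumption: accesses are safe only when every index matches.
    """
    # Index lists of different lengths cannot be compared element-wise,
    # so conservatively report a conflict.
    if len(prev_indices) != len(curr_indices):
        return True
    # Same length: conflict unless every pair of indices is equal.
    return any(p != c for p, c in zip(prev_indices, curr_indices))
```

Treating a length mismatch as a conflict is the conservative choice: an extra synchronization costs time, while a missed one can produce a data race.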
thread_storage_sync.cc 18.2 KB