"src/vscode:/vscode.git/clone" did not exist on "2832520418fe81185b1a44f8421ec32f2efc1714"
[Enhancement] align shared memory allocations (#583)
* [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation. - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter. - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions. * [Feature] Implement dynamic shared memory allocation alignment - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency. - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations. - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures. * [Enhancement] Update dtype and argument defaults in GEMM autotuning example - Changed data type from `float16` to `bfloat16` for improved precision in computations. - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process. * [Enhancement] Improve thread range computation in storage access - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking. - Updated `AccessEntry` structure to include `thread_range`. - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits. - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency. * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding. - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`. - Added a blank line for improved readability in `thread_storage_sync.cc`. * [Refactor] Enhance storage access analysis and thread range computation - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits. - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency. - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations. * [Enhancement] Update default matrix dimensions in GEMM autotune example - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking. * typo fix * enhancement * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
Showing
Please register or sign in to comment