- 23 Jun, 2025 1 commit
-
-
Jianqiao Lu authored
* feat: add a easy version for online softmax * fix: set x & y to fragment memory to load data from global memory * feat: apply format check * Add License --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 20 Jun, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation. - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter. - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions. * [Feature] Implement dynamic shared memory allocation alignment - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency. - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations. - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures. * [Enhancement] Update dtype and argument defaults in GEMM autotuning example - Changed data type from `float16` to `bfloat16` for improved precision in computations. - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process. * [Enhancement] Improve thread range computation in storage access - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking. - Updated `AccessEntry` structure to include `thread_range`. - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits. - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency. * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding. - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`. - Added a blank line for improved readability in `thread_storage_sync.cc`. * [Refactor] Enhance storage access analysis and thread range computation - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits. - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency. - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations. * [Enhancement] Update default matrix dimensions in GEMM autotune example - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking. * typo fix * enhancement * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
-
- 16 Jun, 2025 2 commits
-
-
徐畅 authored
* [CI] Add flash_decoding example to CI * Add output of ref latency * format example_gqa_decode.py * [BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split * format example_gqa_decode.py
-
Tong WU authored
* Update FLA import path for `prepare_token_indices` * Update FLA import path for `prepare_token_indices` * Compare versions via packaging.version.parse
-
- 13 Jun, 2025 1 commit
-
-
Lei Wang authored
Fix assertion in GQA backward example to ensure correct tensor comparison for gradient validation (#568)
-
- 11 Jun, 2025 1 commit
-
-
Yu Cheng authored
* [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid. - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code. - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels. - Updated relevant documentation and comments to reflect the addition of new features and usage instructions. * [Example] Add MLA Decode With Persistent Threadblock Example * [Feature] Introduce Persistent Loop and Update GEMM Example - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution. - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication. - Introduced a `loop_break` intrinsic for better control flow within persistent loops. - Updated relevant files to support the new features, including changes in code generation and language interface. * lint fix
-
- 07 Jun, 2025 1 commit
-
-
Yu Cheng authored
* [Enhancement] Fix multi-version buffer index in nested-loop * [Feature] Support persistent kernels and add persistent GEMM example * lint fix * lint fix * [CI] Remove test_tilelang_transform_annotate_device_regions.py
-
- 06 Jun, 2025 1 commit
-
-
xs-keju authored
* [CI] Add CI test for flash_attention examples * Update example_gqa_fwd_bshd.py * Update example_mha_fwd_bshd_wgmma_pipelined.py * [CI] Added conditional annotations for tests in flash_attention * [CI] Added conditional annotations for tests in flash_attention --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 05 Jun, 2025 1 commit
-
-
Zhengju Tang authored
* [CI] Add FusedMoE example * Lint * Fix import bug * Fix comment bug * Update example_fusedmoe_torch.py --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 04 Jun, 2025 3 commits
-
-
alex_xiao authored
* [CI]Add norm and layout_plot * fix lint * Remove obsolete test files for RMS normalization and plot layout, streamlining the testing suite. * Add make_mma_load_base_layout function to create MMA result layouts - Introduced a new function `make_mma_load_base_layout` for generating layout functions for storing MMA results in fragment buffers. - Added detailed docstring explaining parameters, return values, and potential exceptions. - Implemented logic for handling different data types and matrix configurations, including assertions for input validation. - Defined internal functions for mapping fragment indices to threads and local indices, enhancing the layout functionality. * Enhance MMA load test with additional imports and functionality - Added imports for `tilelang.language`, `Literal`, `Callable`, `DataType`, `IndexMap`, and `get_mma_micro_size` to support extended functionality. - Improved the `make_mma_load_base_layout` function by ensuring it can handle various data types and configurations. - Updated the test function `test_mma_load_base_layout` to validate the layout for float16 matrix A. * Fix formatting in test_fragment_mma_load_a.py by adding a blank line for improved readability. * Add RMS normalization functions to test_rms_norm.py - Introduced `rms_norm` and `rms_norm_splitk` functions for RMS normalization, enhancing the testing capabilities. - Implemented kernel functions with shared memory allocation and parallel processing for improved performance. - Updated the test function to validate the new RMS normalization implementations. * Add reference program for RMS normalization in test_rms_norm.py - Introduced `ref_program` function to provide a reference implementation for RMS normalization. - This addition enhances the testing framework by allowing comparisons against a known reference output. * Enhance RMS normalization tests with additional imports and formatting - Added import for `tilelang.language` to support extended functionality in `test_rms_norm.py`. - Improved code readability by adding blank lines for better separation of code sections. * Update RMS normalization test parameters and enhance layout plotting - Increased matrix dimensions in `test_rms_norm` to 8192 for improved performance testing. - Removed obsolete test functions in `test_fragment_mma_load_a.py` to streamline the test suite. - Enhanced layout plotting functionality by ensuring proper visualization of base, warp, and block layouts in `test_fragment_mma_load_a.py`. * Refactor RMS normalization test parameters and improve layout plotting readability - Simplified the parameters in `test_rms_norm` by removing `blk_k` for clarity. - Enhanced code readability in `test_fragment_mma_load_a.py` by adjusting the formatting of the `block_layout` definition and removing the unused `warp_cols` variable. * Enhance RMS normalization with split-k implementation and additional profiling - Added a new function `test_rms_norm_splitk` to test the split-k variant of RMS normalization. - Updated the main RMS normalization script to include profiling for the split-k implementation. - Ensured all checks pass with appropriate latency measurements for both reference and tile-lang implementations. * Remove obsolete test file `test_fragment_mma_load_a.py` to streamline the test suite. * Refactor `rms_norm.py` to streamline benchmarking output and remove redundant code. Comment out the `plot_layout` call in `fragment_mma_load_a.py` for clarity. * Refactor `test_rms_norm.py` by removing redundant test function `test_rms_norm_splitk` to streamline the test suite and improve clarity. --------- Co-authored-by:Your Name <you@example.com>
-
Tong WU authored
* Add linear attention examples. * Add license * Remove comments * Run yapf and ruff
-
Lei Wang authored
* Remove unused 2D continuous cumulative sum example and related functions from the cumsum module. * lint fix * fix split k example * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
-
- 03 Jun, 2025 1 commit
-
-
Tong WU authored
* [CI] Add hadamard example to CI * Run yapf and ruff * Run yapf and ruff
-
- 01 Jun, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add support for FP8 types in CUDA and HIP code generation * Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`. * Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP. * Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types. * Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types. * Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types. * lint fix * Update src/target/codegen_hip.cc Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update tilelang/intrinsics/mfma_macro_generator.py Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * workaround * fix * Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c * bug fix * Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality. * Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency. * Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly. * lint fix --------- Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 28 May, 2025 3 commits
-
-
yyttt6 authored
-
Lei Wang authored
* [Enhancement] Add commit ID to versioning and improve logging initialization * Updated `get_tilelang_version` to include an optional commit ID in the version string. * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process. * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash. * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity. * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging. * [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching * Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability. * Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations. * Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking. * Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process. * Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity. * Minor adjustments in related modules to support the new caching and profiling features. * [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules * Consolidated import statements and removed unnecessary line breaks for better readability. * Standardized function argument formatting across the AutoTuner and CompileArgs classes. * Enhanced consistency in the use of whitespace and indentation throughout the codebase. * Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability. * Ensured that all changes adhere to the project's coding style guidelines. * [Refactor] Remove redundant type hints in AutoTuner modules * Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`. * Improved code readability and maintainability by streamlining type imports across the AutoTuner module. * [Refactor] Update AutoTuner configuration for improved profiling and target detection * Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings. * Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection. * Removed redundant target specifications in certain instances to streamline the configuration process. * Improved overall clarity and maintainability of the autotuning logic in various example scripts. * [Refactor] Simplify code formatting and improve readability in example scripts * Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity. * Removed unnecessary line breaks and standardized argument placement across multiple files. * Enhanced overall code readability and maintainability in autotuning examples and performance scripts. * [Refactor] Update JIT decorator usage across multiple files * Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code. * Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase. * Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions. * [Refactor] Standardize JIT decorator formatting across benchmark and example scripts * Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks. * Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts. * Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
-
Lei Wang authored
* Refactor convolution example to streamline configuration and remove unused code * Updated the `check_hopper` function to properly check for CUDA availability and compute capability. * Removed the `get_configs` and `get_best_config` functions, simplifying the example by eliminating unused autotuning logic. * Adjusted argument parsing in the `main` function to directly compile the convolution kernel without autotuning options. * Cleaned up the code for better readability and maintainability. * Update examples/convolution/example_convolution.py Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 27 May, 2025 1 commit
-
-
Leslin authored
* [CI] Add gemm and gemm_fp8 example to CI * Fix lint via format.sh * Resolved the issues with profiler API usage and parse_args
-
- 24 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516) * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions. * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity. * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process. * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability. * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies. * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions. * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise. * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios. * Add dynamic matrix multiplication example and corresponding test * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling. * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example. * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness. * lint fix * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device. * Updated the `__init__.py` file to include the new function in the module exports. * lint fix
-
- 23 May, 2025 3 commits
-
-
Taoyu Zhu authored
* fix deepgemm example * fix deepgemm example * make format * Update example_deepgemm_fp8_2xAcc.py --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yu Cheng authored
* Introduced `example_grouped_gemm_fwd.py` and `example_grouped_gemm_bwd.py` to demonstrate grouped matrix multiplication with forward and backward operations. * Implemented functions for grouped GEMM, input construction, and validation against PyTorch's implementation. * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions. * Included a test function to validate the functionality with various input scenarios.
-
Yu Cheng authored
* Introduced a new example script `example_grouped_gemm.py` demonstrating grouped matrix multiplication using TileLang and PyTorch. * Implemented functions for performing grouped GEMM, constructing inputs, and validating results against PyTorch's implementation. * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions. * Included a test function to validate the grouped GEMM functionality with various input scenarios.
-
- 18 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update JIT kernel functions and streamline GEMM tests * Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity. * Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function. * Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability. * Cleaned up commented-out code and unnecessary imports to improve overall code quality. * Update main function call in GEMM test to use tilelang testing framework * Update README and example scripts to include JIT decorator comments * Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions. * Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity. * Update GEMM test parameters for improved performance * Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.
-
- 17 May, 2025 1 commit
-
-
Lei Wang authored
* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully. * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management. * Add merge shared memory allocations pass and related configurations - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage. - Registered configuration options for debugging and controlling the merging behavior. - Updated relevant files to integrate the new pass into the TileLang engine and transform modules. - Adjusted import paths and added documentation for the new functionality. * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels. * lint fix
-
- 16 May, 2025 2 commits
-
-
Yu Cheng authored
* [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang * Refactor example_mla_decode.py to define a main function for better structure and clarity. * Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode. * Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing. * Ensure all new test files are integrated with tilelang testing framework. * [Test] Enhance test_example_mla_decode with argument mocking * Update test_example_mla_decode.py to mock sys.argv for better test isolation. * Ensure the main function of example_mla_decode is called with the correct arguments during testing.
-
Lei Wang authored
* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully. * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management. * Add merge shared memory allocations pass and related configurations - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage. - Registered configuration options for debugging and controlling the merging behavior. - Updated relevant files to integrate the new pass into the TileLang engine and transform modules. - Adjusted import paths and added documentation for the new functionality. * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
-
- 14 May, 2025 1 commit
-
-
Lei Wang authored
[Refactor] Introduce quantize components of TileLang and add testing for dequant gemm exmaple (#494) * Remove deprecated example_dequant_gemm.py and add DataType import in __init__.py * lint fix * lint fix * Refactor dequantization examples to use tilelang imports and update data type handling in quantization utilities * lint fix
-
- 13 May, 2025 1 commit
-
-
徐畅 authored
* [CI] Add flash_decoding example to CI * Add output of ref latency * format example_gqa_decode.py
-
- 10 May, 2025 3 commits
-
-
yyttt6 authored
* yes * [Bugfix] fix the unexpected keyword error of autotune * format * test * [CI] Add Analyzer and blocksparse_attention examples to CI * format * try * try * try * try * t * format * d --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Wenhao Xie authored
* add convolution example to CI * lint fix * Update test_example_convolution.py * fix bug --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Wenhao Xie authored
* add convolution example to CI * lint fix * Update test_example_convolution.py --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 09 May, 2025 2 commits
-
-
Zhengju Tang authored
* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`. * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic. * [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI. * Lint --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Cunxiao Ni authored
* [CI] Add elementwise and gemv examples to CI. * fix lint * test * fix gemv lint * fix lint
-
- 08 May, 2025 2 commits
-
-
Lei Wang authored
[Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py (#457) * Refactored barrier functions to use new signatures for improved clarity and consistency. * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively. * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
-
Lei Wang authored
* Add example for warp specialization with flash attention * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang. * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance. * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics. * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation. * Add memory barrier functions to builtin.py * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization. * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters. * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1. * Improved code organization and readability by adding blank lines for better separation of logical sections. * Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`. * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability. * [Refactor] Update barrier functions and add new example for GEMM with warp specialization * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency. * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation. * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations. * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios. * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations. * lint fix * Update loop variable checks in SIMT loop and buffer region validation * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling. * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging. * lint fix * test fix
-
- 06 May, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Introduce pass_configs parameter for kernel compilation * Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations. * Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase. * Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output. * Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution. * lint fix
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability. * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management. * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files * Updated function names from CamelCase to snake_case for better consistency across the codebase. * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc. * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions. * [Refactor] Rename operations to snake_case for consistency * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others. * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability. * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions. * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang. * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames. * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions. * Enhanced the TileLang API with new methods for retrieving block and thread extents. * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation. * Improved layout inference and kernel launch logic for better performance and clarity. * [Refactor] Clean up code formatting and improve readability * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`. * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity. * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs. * Ensured consistent spacing and formatting across multiple files to enhance overall code readability. * lint fix * [Refactor] Update mbarrier functions for improved clarity and consistency * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability. * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity. * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code. * Added detailed docstrings to clarify usage examples for memory barrier functions. * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections. * [Feature] Add examples for warp specialization and TMA barrier integration * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers. * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance. * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch. * Updated the `phase.py` to include TMA barrier injection in the optimization process. * Improved documentation and comments for better clarity on usage and functionality. * [Feature] Add example for warp specialization in GEMM with TMA barriers * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers. * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance. * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation. * Enhanced documentation and comments for clarity on usage and functionality. * lint fix * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement. * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results. * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis. * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness. * lint fix * [Feature] Add new examples for warp specialization and TMA integration * Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`. * Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance. * Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples. * Updated the TileLang API to support these examples and improve kernel compilation and testing processes. * Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities. * lint fix * Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
-
- 03 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability. * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management. * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files * Updated function names from CamelCase to snake_case for better consistency across the codebase. * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc. * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions. * [Refactor] Rename operations to snake_case for consistency * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others. * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability. * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions. * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang. * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames. * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions. * Enhanced the TileLang API with new methods for retrieving block and thread extents. * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation. * Improved layout inference and kernel launch logic for better performance and clarity. * [Refactor] Clean up code formatting and improve readability * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`. * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity. * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs. * Ensured consistent spacing and formatting across multiple files to enhance overall code readability. * lint fix * [Refactor] Update mbarrier functions for improved clarity and consistency * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability. * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity. * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code. * Added detailed docstrings to clarify usage examples for memory barrier functions. * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections. * [Feature] Add examples for warp specialization and TMA barrier integration * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers. * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance. * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch. * Updated the `phase.py` to include TMA barrier injection in the optimization process. * Improved documentation and comments for better clarity on usage and functionality. * [Feature] Add example for warp specialization in GEMM with TMA barriers * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers. * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance. * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation. * Enhanced documentation and comments for clarity on usage and functionality. * lint fix * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement. * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results. * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis. * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness. * lint fix
-
- 30 Apr, 2025 2 commits
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability. * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management. * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files * Updated function names from CamelCase to snake_case for better consistency across the codebase. * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc. * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions. * [Refactor] Rename operations to snake_case for consistency * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others. * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability. * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions. * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang. * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames. * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions. * Enhanced the TileLang API with new methods for retrieving block and thread extents. * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation. * Improved layout inference and kernel launch logic for better performance and clarity. * [Refactor] Clean up code formatting and improve readability * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`. * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity. * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs. * Ensured consistent spacing and formatting across multiple files to enhance overall code readability. * lint fix * [Refactor] Update mbarrier functions for improved clarity and consistency * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability. * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity. * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code. * Added detailed docstrings to clarify usage examples for memory barrier functions. * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.48.0 to 4.50.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.48.0...v4.50.0 ) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.50.0 dependency-type: direct:production ... Signed-off-by:
dependabot[bot] <support@github.com> Co-authored-by:
dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-