- 25 Jul, 2025 1 commit
-
-
徐畅 authored
-
- 24 Jul, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Improve buffer conflict detection in thread storage synchronization - Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`. - Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons. - Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity. * example fix * enhancement * improve ci
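The overlap test this commit describes can be pictured as an interval check. Below is a minimal sketch in Python, assuming each buffer index is summarised as a half-open range `[lo, hi)`. The function names and the range representation are hypothetical and are not taken from `thread_storage_sync.cc`.

```python
def ranges_overlap(a, b):
    """Return True when half-open ranges a=(lo, hi) and b=(lo, hi) share an element."""
    a_lo, a_hi = a
    b_lo, b_hi = b
    return a_lo < b_hi and b_lo < a_hi


def has_conflict(prev_indices, curr_indices):
    """Hypothetical rule: two accesses conflict only if every dimension overlaps."""
    if len(prev_indices) != len(curr_indices):
        # Ranks differ, so a per-dimension comparison is impossible; stay conservative.
        return True
    return all(ranges_overlap(p, c) for p, c in zip(prev_indices, curr_indices))
```

With this rule, `has_conflict([(0, 4)], [(4, 8)])` reports no conflict because the two ranges only touch at the boundary.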
-
Zhengju Tang authored
[BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653) * [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking * Lint * test fix * Update CI workflow to install dependencies without user site packages - Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory. * Update CI workflow to install pip without user site packages - Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment. * Update requirements-test.txt * reduce ci problem size, * Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 23 Jul, 2025 3 commits
-
-
Zhang Jason authored
Co-authored-by: zhangnju <ningzhan@SMC-SC-DI08-33.dh144.dcgpu>
-
Wenhao Xie authored
[Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656) * [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control * lint fix * upd * lint fix * fix typo * update typing * update the use case of compile flags * ci fix * fix * Fix CI workflow to correctly activate virtual environment from shared cache directory * use local cache * fix * fix * fix --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
Wenhao Xie authored
* fix CI bugs in hopper * lint fix * Update bulk_copy.cc * Refactor bulk copy logic in LowerBulkCopy function - Removed unnecessary blank lines for improved code readability. - Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides. - Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings. * test fix * ci fix * Update flash-attention dependencies and clean up example code - Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`. - Removed unused imports and commented-out code in various example files to enhance readability and maintainability. - Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`. - Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity. - Deleted the `example_mha_inference.py` file as it is no longer needed. * Update CI workflow to remove `--user` flag from pip install commands - Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment. * Update CI workflow to include `--no-user` flag in pip install commands - Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment. * Update CI workflow to include `--no-user` flag in pip install command for wheel mode - Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment. * test fix * avoid conflict with system environments * test fix * add comments --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
- 16 Jul, 2025 2 commits
-
-
YizhaoGao authored
* Add paged block-sparse flash-decoding kernel * Update example_tilelang_sparse_gqa_decode_paged.py * lint fix --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* [Enhancement] Improve memory access condition checks in GlobalMemChecker - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations. - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis. * lintfix * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation. - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function. * [Refactor] Update barrier and clear operations in warp specialization examples - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization. - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency. * [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling - Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management. - Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements. - Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior. - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output. - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements. 
* [Enhancement] Add support for disabling TMA in Copy operations - Introduced a new `disable_tma` parameter in the `Copy` class to control thread memory access behavior. - Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag. - Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations. - Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations. * [Refactor] Clean up whitespace and formatting in multiple files - Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`. - Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase. * [Enhancement] Refactor flash attention implementation for improved performance and configurability - Split the shared memory allocations for query and key-value pairs to optimize memory usage. - Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example. - Updated kernel execution parameters to improve thread management and synchronization. - Enhanced the overall structure of the flash attention function for better readability and maintainability. * fix * Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget * Refactor barrier handling and update example configurations - Replaced commented-out barrier creation with new barrier allocation in GEMM example. - Updated kernel configuration in warp specialization example to include async copy settings. - Enhanced barrier management in the phase optimization process to improve synchronization handling. - Introduced new barrier allocation function for better memory management in shared contexts. 
* Refactor barrier handling in LowerAndLegalize and OptimizeForTarget - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization. - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code. - Added exit() call in OptimizeForTarget to halt execution after barrier lowering. * Enhance CMake configuration and clean up example scripts - Enabled compile command export in CMakeLists.txt for better build integration. - Removed unnecessary print statement in the warp specialization example. - Cleaned up commented-out code in GEMM example for improved readability. - Updated barrier handling in shared memory allocation transformations for better synchronization. * Refactor barrier handling in warp specialization examples - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization. - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts. - Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management. * Update lower_shared_barrier.cc * Update phase.py * Update warp specialization example and Cython wrapper - Removed commented-out pass configuration options in the warp specialization example for clarity. - Added functionality to write the generated kernel source to a file named "kernel.cu". - Enhanced Cython wrapper to support boolean type conversion for improved type handling. * Add storage synchronization call in shared barrier transformation - Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process. * remove debug files * Remove kernel source output to file in warp specialization example * remove comments * Refactor tensor handling and update test execution in TileLang - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency. 
- Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`. - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing. - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity. * Update test_tilelang_language_reshape.py
-
- 15 Jul, 2025 3 commits
-
-
Yu Cheng authored
[Dev] Update benchmark and decoding scripts to refine condition checks and optimize tensor operations (#637) - Enhanced the condition in `compare_ab` to ensure baseline checks align with target exclusions. - Removed unnecessary tensor allocation in `mla_decode_tilelang`, optimizing memory usage and improving performance by directly using shared tensors in GEMM operations.
-
Lei Wang authored
* [Enhancement] Add argument simplification option to StmtSimplifier - Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior. - Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process. - Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase. - Added comments to clarify the purpose of the new flag and its impact on simplification logic. * lint fix * [Enhancement] Improve layout inference and reduce operation handling - Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic. - Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures. - Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization. * lint fix * [Enhancement] Add input validation for GEMM parameters - Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method. - Added informative error messages to assist in debugging when the input parameters do not meet the required conditions. * bug fix
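The GEMM input validation mentioned in this commit amounts to a divisibility check with a readable error message. Below is a minimal Python sketch of that idea. The default values for `m_per_warp` and `n_per_warp` are illustrative assumptions, and the function is not the actual `Gemm::ComputeWarpPartition` code.

```python
def check_warp_partition(M, N, m_per_warp=16, n_per_warp=8):
    """Raise ValueError when M or N is not a multiple of its per-warp tile size.

    The default tile sizes are placeholders for the kMPerWarp / kNPerWarp
    constants named in the commit; the real values may differ.
    """
    if M % m_per_warp != 0:
        raise ValueError(f"M={M} must be divisible by m_per_warp={m_per_warp}")
    if N % n_per_warp != 0:
        raise ValueError(f"N={N} must be divisible by n_per_warp={n_per_warp}")
```

Failing early with the offending value in the message is what makes a bad block shape easy to trace back to its configuration.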
-
Yuqing Xia authored
-
- 13 Jul, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Simplify and modularize autotuner implementation - Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability. - Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure. - Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic. - Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes. * [Refactor] Clean up and enhance capture and tuner modules - Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`. - Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning. - Streamlined the `CaptureStack` class for better thread-local context management. * lint fix * [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example - Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity. - Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning. - Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability. - Deleted obsolete test file for autotuning decorator, cleaning up the codebase. * [Refactor] Improve code formatting and readability in autotune test file - Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation. - Fixed a typo in the `enable_rasteration` parameter name to ensure consistency. - Cleaned up unnecessary blank lines to enhance overall code clarity. * Update example_blocksparse_gemm.py * Update capture.py
-
- 12 Jul, 2025 1 commit
-
-
Kadir Nar authored
-
- 10 Jul, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr - Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation. - Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code. * [Refactor] Simplify expression handling in pythonic_expr function - Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability. - Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation. * [Enhancement] Improve expression representation in pythonic_expr function - Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings. - Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation. - Included a docstring for better clarity on the function's purpose and usage. * test fix
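The precedence-aware parenthesisation described in this commit can be illustrated on a toy expression tree. The sketch below is not the `pythonic_expr` implementation: it assumes expressions are nested `(op, lhs, rhs)` tuples rather than TVM `PrimExpr` nodes, and `PRECEDENCE` and `to_pythonic` are hypothetical names.

```python
# Hypothetical precedence table; a higher number binds tighter (as in Python).
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "//": 2, "%": 2}


def to_pythonic(expr, parent_prec=0, is_right=False):
    """Render a nested (op, lhs, rhs) tuple as a Python-style string.

    Leaves are ints or variable-name strings. Parentheses are emitted only
    when dropping them would change how the expression parses.
    """
    if not isinstance(expr, tuple):
        return str(expr)
    op, lhs, rhs = expr
    if op in ("min", "max"):
        # Function-call syntax never needs surrounding parentheses.
        return f"{op}({to_pythonic(lhs)}, {to_pythonic(rhs)})"
    prec = PRECEDENCE[op]
    text = f"{to_pythonic(lhs, prec)} {op} {to_pythonic(rhs, prec, True)}"
    # All operators here are left-associative, so a right operand of equal
    # precedence keeps its parentheses: a - (b - c) is not a - b - c.
    needs_parens = prec < parent_prec or (is_right and prec == parent_prec)
    return f"({text})" if needs_parens else text
```

For example, `to_pythonic(("*", ("+", "a", 1), "b"))` yields `(a + 1) * b`, while `to_pythonic(("+", ("*", "m", 2), ("//", "n", 4)))` yields `m * 2 + n // 4` with no parentheses.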
-
- 09 Jul, 2025 2 commits
-
-
Tong WU authored
* Tune linear attention examples on H100 * Add retnet fwd kernel * fix lint
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.51.0 to 4.52.1. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.51.0...v4.52.1) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.52.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 08 Jul, 2025 2 commits
-
-
Lei Wang authored
* [Refactor] Update tilelang kernel functions and remove unused imports - Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance. - Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation. - Removed unused import of `cached` from `tilelang`, streamlining the code. - Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use. * [Refactor] Simplify configuration generation in benchmark and example scripts - Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability. - Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning. - Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized. * [Refactor] Update configuration handling in benchmark scripts - Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management. - Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning. - Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks. - Cleaned up redundant code and improved overall readability in the affected files. * [Refactor] Clean up formatting and update subproject commit - Updated the subproject commit reference in the TVM directory to indicate a dirty state. - Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability. 
- Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability. * [Refactor] Update AutoTuner configuration handling - Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling. - Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources. - Updated comments for clarity and maintainability. * lint fix * Update TVM subproject commit to indicate dirty state and modify MHA backward test cases - Updated the subproject commit reference in the TVM directory to reflect a dirty state. - Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256. - Uncommented the main testing function call for potential execution.
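The dictionary-based `get_configs` approach mentioned in this commit can be sketched as a Cartesian product over named parameter lists. The parameter names below reuse identifiers that appear elsewhere in this log (`block_M`, `block_N`, `num_stages`, `threads`); the candidate values are made up for illustration and are not the repository's actual search space.

```python
import itertools


def get_configs():
    """Expand a dict of candidate values into a list of per-kernel config dicts."""
    params = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "num_stages": [1, 2],
        "threads": [128, 256],
    }
    keys = list(params)
    # One dict per combination, keyed by parameter name so that it can be
    # passed to a kernel factory as keyword arguments.
    return [dict(zip(keys, values)) for values in itertools.product(*params.values())]
```

Adding a tunable parameter then means adding one dictionary entry instead of another nested loop.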
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.50.0 to 4.51.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.50.0...v4.51.0) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.51.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 03 Jul, 2025 1 commit
-
-
botbw authored
* [experimental] add a draft gemm_sp * [3rdparty] bump cutlass to v3.9.3 * [lint] run format.sh * [chore] rebase * [chore] use abs path * [gemm_sp] add metadata layout * [ci] add more example * [lint] run format.sh * [chore] polish * [chore] move gemm_sp to experimental * [chore] polish * [lint] run format.sh * [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test * Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy. * Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process. * Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality. * Implement Test * [Enhancement] Update GEMM SP and SM89 templates for improved functionality * Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture. * Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets. * Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types. * Added conditional inclusion for GEMM SP header based on CUDA architecture version. 
* lint fix * [gemm_sp] support more layout and data types * Enhancement: sync T.gemm_sp's layout inference with T.gemm * Enhancement: support more block_k in compress util * [Enhancement] enable block_k=64 * [Lint] run format.sh * [Enhancement] compressor support more dtype * Enhancement: enable block_K=32 * [Lint] format.sh * [Fixbug] fix shape * Refactor: sync gemm * [Enhancement] enable transpose * [Enhancement] enable fp8_e4m3 * [Enhancement] enable int8 * [Lint] run format.sh * [Benchmark] add gemm_sp benchmark * [Example] fix 256 threads hang * [CI] fix ci * [Chore] resolve gemini feedback * [Benchmark] increase search space * [Lint] format * [CI] skip sparse tensor core related tests as only sm90 is supported * [CI] pass local run * Update gemm_sm89.h * lint fix * lint fix * [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation. - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag. - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected. - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch. - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module. * Update test_compress_utils.py --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 01 Jul, 2025 1 commit
-
-
Lei Wang authored
- Added a line break in `quickstart.py` for better readability. - Simplified the JIT kernel compilation in `quickstart.py` by removing the unused execution backend option. - Modified `example_elementwise_add.py` to disable cache for `tilelang` and optimized the element-wise addition kernel by utilizing shared memory for input tensors, improving performance. - Updated default values for matrix dimensions and block sizes in the argument parser to enhance usability.
-
- 30 Jun, 2025 1 commit
-
-
Nathan Chen authored
-
- 25 Jun, 2025 1 commit
-
-
Cunxiao Ni authored
* [Example] Update kernel compilation in examples to use @tilelang.jit - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead. - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability. - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed. * format * Update example_tilelang_sparse_gqa_decode_varlen_indice.py * Update example_dequant_gemm_fine_grained.py * Update example_gemm_autotune.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 23 Jun, 2025 1 commit
-
-
Jianqiao Lu authored
* feat: add an easy version for online softmax * fix: set x & y to fragment memory to load data from global memory * feat: apply format check * Add License --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
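For readers unfamiliar with online softmax, the idea is to keep a running maximum and a running sum in a single pass, rescaling the sum whenever the maximum grows. Below is a plain-Python sketch of that technique, assuming finite inputs. It illustrates the algorithm only and is unrelated to the tile-level kernel this commit added.

```python
import math


def online_softmax(xs):
    """Softmax with a single streaming pass for the normaliser (finite inputs assumed)."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the accumulated sum to the new maximum, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]
```

Because every exponent is taken relative to the current maximum, large inputs such as `1000.0` do not overflow.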
-
- 20 Jun, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation. - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter. - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions. * [Feature] Implement dynamic shared memory allocation alignment - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency. - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations. - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures. * [Enhancement] Update dtype and argument defaults in GEMM autotuning example - Changed data type from `float16` to `bfloat16` for improved precision in computations. - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process. * [Enhancement] Improve thread range computation in storage access - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking. - Updated `AccessEntry` structure to include `thread_range`. - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits. - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency. 
* [Refactor] Update comments for clarity in dynamic shared memory allocation alignment - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding. - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`. - Added a blank line for improved readability in `thread_storage_sync.cc`. * [Refactor] Enhance storage access analysis and thread range computation - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits. - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency. - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations. * [Enhancement] Update default matrix dimensions in GEMM autotune example - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking. * typo fix * enhancement * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
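The dynamic shared memory alignment introduced in this commit rests on rounding each allocation offset up to a byte boundary. Below is a minimal sketch of that arithmetic in Python. `align_up` and `plan_offsets` are hypothetical helpers, and the 16-byte default is an assumption, not the value used by `AlignDynamicSharedMemoryAllocations`.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (alignment > 0)."""
    return (offset + alignment - 1) // alignment * alignment


def plan_offsets(sizes, alignment=16):
    """Place buffers back to back, starting each one on an aligned offset.

    Returns (offsets, total_bytes), where total_bytes is also aligned.
    """
    offsets, cursor = [], 0
    for size in sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor, alignment)
```

For buffers of 10, 20, and 4 bytes with 16-byte alignment, the offsets come out as 0, 16, and 48, with 64 bytes reserved in total.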
-
- 16 Jun, 2025 2 commits
-
-
徐畅 authored
* [CI] Add flash_decoding example to CI * Add output of ref latency * format example_gqa_decode.py * [BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split * format example_gqa_decode.py
-
Tong WU authored
* Update FLA import path for `prepare_token_indices` * Update FLA import path for `prepare_token_indices` * Compare versions via packaging.version.parse
-
- 13 Jun, 2025 1 commit
-
-
Lei Wang authored
Fix assertion in GQA backward example to ensure correct tensor comparison for gradient validation (#568)
-
- 11 Jun, 2025 1 commit
-
-
Yu Cheng authored
* [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid. - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code. - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels. - Updated relevant documentation and comments to reflect the addition of new features and usage instructions. * [Example] Add MLA Decode With Persistent Threadblock Example * [Feature] Introduce Persistent Loop and Update GEMM Example - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution. - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication. - Introduced a `loop_break` intrinsic for better control flow within persistent loops. - Updated relevant files to support the new features, including changes in code generation and language interface. * lint fix
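A persistent thread block stays resident and loops over many work tiles instead of handling exactly one. The Python sketch below simulates one common assignment, a round-robin stride over tile indices. That scheduling policy is an assumption for illustration; the commit does not state how `PersistThreadblock` distributes tiles.

```python
def persistent_schedule(num_tiles, num_blocks):
    """Map each persistent block id to the tile indices it would process.

    Block b handles tiles b, b + num_blocks, b + 2 * num_blocks, ...
    """
    return {b: list(range(b, num_tiles, num_blocks)) for b in range(num_blocks)}
```

With 10 tiles and 4 resident blocks, blocks 0 and 1 each take three tiles and blocks 2 and 3 take two, so every tile is covered exactly once.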
-
- 07 Jun, 2025 1 commit
-
-
Yu Cheng authored
* [Enhancement] Fix multi-version buffer index in nested-loop * [Feature] Support persistent kernels and add persistent GEMM example * lint fix * lint fix * [CI] Remove test_tilelang_transform_annotate_device_regions.py
-
- 06 Jun, 2025 1 commit
-
-
xs-keju authored
* [CI] Add CI test for flash_attention examples * Update example_gqa_fwd_bshd.py * Update example_mha_fwd_bshd_wgmma_pipelined.py * [CI] Added conditional annotations for tests in flash_attention * [CI] Added conditional annotations for tests in flash_attention --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 05 Jun, 2025 1 commit
-
-
Zhengju Tang authored
* [CI] Add FusedMoE example * Lint * Fix import bug * Fix comment bug * Update example_fusedmoe_torch.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 04 Jun, 2025 3 commits
-
-
alex_xiao authored
* [CI] Add norm and layout_plot
* fix lint
* Remove obsolete test files for RMS normalization and plot layout, streamlining the testing suite.
* Add make_mma_load_base_layout function to create MMA result layouts
  - Introduced a new function `make_mma_load_base_layout` for generating layout functions for storing MMA results in fragment buffers.
  - Added detailed docstring explaining parameters, return values, and potential exceptions.
  - Implemented logic for handling different data types and matrix configurations, including assertions for input validation.
  - Defined internal functions for mapping fragment indices to threads and local indices, enhancing the layout functionality.
* Enhance MMA load test with additional imports and functionality
  - Added imports for `tilelang.language`, `Literal`, `Callable`, `DataType`, `IndexMap`, and `get_mma_micro_size` to support extended functionality.
  - Improved the `make_mma_load_base_layout` function by ensuring it can handle various data types and configurations.
  - Updated the test function `test_mma_load_base_layout` to validate the layout for float16 matrix A.
* Fix formatting in test_fragment_mma_load_a.py by adding a blank line for improved readability.
* Add RMS normalization functions to test_rms_norm.py
  - Introduced `rms_norm` and `rms_norm_splitk` functions for RMS normalization, enhancing the testing capabilities.
  - Implemented kernel functions with shared memory allocation and parallel processing for improved performance.
  - Updated the test function to validate the new RMS normalization implementations.
* Add reference program for RMS normalization in test_rms_norm.py
  - Introduced `ref_program` function to provide a reference implementation for RMS normalization.
  - This addition enhances the testing framework by allowing comparisons against a known reference output.
* Enhance RMS normalization tests with additional imports and formatting
  - Added import for `tilelang.language` to support extended functionality in `test_rms_norm.py`.
  - Improved code readability by adding blank lines for better separation of code sections.
* Update RMS normalization test parameters and enhance layout plotting
  - Increased matrix dimensions in `test_rms_norm` to 8192 for improved performance testing.
  - Removed obsolete test functions in `test_fragment_mma_load_a.py` to streamline the test suite.
  - Enhanced layout plotting functionality by ensuring proper visualization of base, warp, and block layouts in `test_fragment_mma_load_a.py`.
* Refactor RMS normalization test parameters and improve layout plotting readability
  - Simplified the parameters in `test_rms_norm` by removing `blk_k` for clarity.
  - Enhanced code readability in `test_fragment_mma_load_a.py` by adjusting the formatting of the `block_layout` definition and removing the unused `warp_cols` variable.
* Enhance RMS normalization with split-k implementation and additional profiling
  - Added a new function `test_rms_norm_splitk` to test the split-k variant of RMS normalization.
  - Updated the main RMS normalization script to include profiling for the split-k implementation.
  - Ensured all checks pass with appropriate latency measurements for both reference and tile-lang implementations.
* Remove obsolete test file `test_fragment_mma_load_a.py` to streamline the test suite.
* Refactor `rms_norm.py` to streamline benchmarking output and remove redundant code. Comment out the `plot_layout` call in `fragment_mma_load_a.py` for clarity.
* Refactor `test_rms_norm.py` by removing redundant test function `test_rms_norm_splitk` to streamline the test suite and improve clarity.
---------
Co-authored-by: Your Name <you@example.com>
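The commit above mentions a `ref_program` that serves as a reference for RMS normalization. As a rough illustration only, the following pure-Python sketch shows what such a reference could compute, assuming the common definition in which each row is divided by the square root of its mean square. The name `rms_norm_ref` and the `eps` default are hypothetical and not taken from the repository, whose tests presumably operate on tensors rather than nested lists.

```python
import math


def rms_norm_ref(rows, eps=1e-12):
    """Normalize each row by its root mean square (hypothetical helper).

    `eps` is an assumed small constant that guards against division by zero.
    """
    out = []
    for row in rows:
        # Mean of the squared entries of this row.
        mean_sq = sum(v * v for v in row) / len(row)
        # Reciprocal of the root mean square.
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        out.append([v * inv_rms for v in row])
    return out


normalized = rms_norm_ref([[3.0, 4.0], [1.0, 1.0]])
```

After normalization, the mean of the squared entries of every non-zero row is approximately 1, which is the property a kernel output can be compared against.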
-
Tong WU authored
* Add linear attention examples.
* Add license
* Remove comments
* Run yapf and ruff
-
Lei Wang authored
* Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.
* lint fix
* fix split k example
* Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation
* Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
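The last item above refers to a `cdiv` function used for block calculations. Ceiling division is the usual way to count how many fixed-size tiles are needed to cover a dimension that may not be an exact multiple of the tile size. The sketch below is a local stand-in written for illustration, not tilelang's own implementation, and the matrix and block sizes are made-up values.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b (local stand-in)."""
    return (a + b - 1) // b


# Illustrative sizes only: a 1024 x 1000 output covered by 128 x 128 tiles.
m, n = 1024, 1000
block_m, block_n = 128, 128

# The ragged last column of tiles still counts as a full tile.
tiles = cdiv(m, block_m) * cdiv(n, block_n)
```

Plain floor division would give 7 tiles along `n` here and leave the last 104 columns uncovered, which is why block counts are normally computed with a ceiling.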
-
- 03 Jun, 2025 1 commit
-
-
Tong WU authored
* [CI] Add hadamard example to CI
* Run yapf and ruff
* Run yapf and ruff
-
- 01 Jun, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add support for FP8 types in CUDA and HIP code generation
* Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
* Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
* Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types.
* Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
* Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types.
* lint fix
* Update src/target/codegen_hip.cc
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update tilelang/intrinsics/mfma_macro_generator.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* workaround
* fix
* Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c
* bug fix
* Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality.
* Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency.
* Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly.
* lint fix
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 28 May, 2025 3 commits
-
-
yyttt6 authored
-
Lei Wang authored
* [Enhancement] Add commit ID to versioning and improve logging initialization
* Updated `get_tilelang_version` to include an optional commit ID in the version string.
* Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
* Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
* Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
* Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
* [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching
* Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
* Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
* Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
* Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
* Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
* Minor adjustments in related modules to support the new caching and profiling features.
* [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules
* Consolidated import statements and removed unnecessary line breaks for better readability.
* Standardized function argument formatting across the AutoTuner and CompileArgs classes.
* Enhanced consistency in the use of whitespace and indentation throughout the codebase.
* Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
* Ensured that all changes adhere to the project's coding style guidelines.
* [Refactor] Remove redundant type hints in AutoTuner modules
* Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
* Improved code readability and maintainability by streamlining type imports across the AutoTuner module.
* [Refactor] Update AutoTuner configuration for improved profiling and target detection
* Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
* Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
* Removed redundant target specifications in certain instances to streamline the configuration process.
* Improved overall clarity and maintainability of the autotuning logic in various example scripts.
* [Refactor] Simplify code formatting and improve readability in example scripts
* Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
* Removed unnecessary line breaks and standardized argument placement across multiple files.
* Enhanced overall code readability and maintainability in autotuning examples and performance scripts.
* [Refactor] Update JIT decorator usage across multiple files
* Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
* Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
* Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.
* [Refactor] Standardize JIT decorator formatting across benchmark and example scripts
* Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
* Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
* Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
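The commit above describes reading the current git commit hash and appending it, optionally, to the version string. A minimal sketch of that idea follows, using only the standard library. The function names `get_commit_id_sketch` and `format_version`, the `+` separator, and the eight-character truncation are all assumptions made for illustration; the repository's `get_git_commit_id` and `get_tilelang_version` may behave differently.

```python
import subprocess
from typing import Optional


def get_commit_id_sketch(repo_dir: str = ".") -> Optional[str]:
    """Return the current commit hash, or None if git is unavailable (hypothetical)."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Not a git checkout, git not installed, or the directory is missing.
        return None
    return result.stdout.strip() or None


def format_version(base: str, commit_id: Optional[str] = None) -> str:
    """Append a shortened commit ID to the base version when one is given.

    The "+<hash>" suffix is an assumed format, not necessarily the project's.
    """
    if commit_id:
        return f"{base}+{commit_id[:8]}"
    return base


version = format_version("0.0.0", get_commit_id_sketch())
```

Returning `None` instead of raising keeps the base version usable when the package is built from a source archive that has no `.git` directory.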
-
Lei Wang authored
* Refactor convolution example to streamline configuration and remove unused code
* Updated the `check_hopper` function to properly check for CUDA availability and compute capability.
* Removed the `get_configs` and `get_best_config` functions, simplifying the example by eliminating unused autotuning logic.
* Adjusted argument parsing in the `main` function to directly compile the convolution kernel without autotuning options.
* Cleaned up the code for better readability and maintainability.
* Update examples/convolution/example_convolution.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 27 May, 2025 1 commit
-
-
Leslin authored
* [CI] Add gemm and gemm_fp8 example to CI
* Fix lint via format.sh
* Resolved the issues with profiler API usage and parse_args
-