"testing/vscode:/vscode.git/clone" did not exist on "3de9f13cb110a4c2db3d7333631c7adead3b9778"
  1. 08 Jul, 2025 1 commit
  2. 03 Jul, 2025 1 commit
    • [Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526) · be44758c
      botbw authored
      
      
      * [experimental] add a draft gemm_sp
      
      * [3rdparty] bump cutlass to v3.9.3
      
      * [lint] run format.sh
      
      * [chore] rebase
      
      * [chore] use abs path
      
      * [gemm_sp] add metadata layout
      
      * [ci] add more example
      
      * [lint] run format.sh
      
      * [chore] polish
      
      * [chore] move gemm_sp to experimental
      
      * [chore] polish
      
      * [lint] run format.sh
      
      * [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test
      
      * Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
      * Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
      * Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.
      
      * Implement Test
      
      * [Enhancement] Update GEMM SP and SM89 templates for improved functionality
      
      * Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
      * Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
      * Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
      * Added conditional inclusion for GEMM SP header based on CUDA architecture version.
      
      * lint fix
      
      * [gemm_sp] support more layout and data types
      
      * Enhancement: sync T.gemm_sp's layout inference with T.gemm
      
      * Enhancement: support more block_k in compress util
      
      * [Enhancement] enable block_k=64
      
      * [Lint] run format.sh
      
      * [Enhancement] compressor support more dtype
      
      * Enhancement: enable block_K=32
      
      * [Lint] format.sh
      
      * [Fixbug] fix shape
      
      * Refactor: sync gemm
      
      * [Enhancement] enable transpose
      
      * [Enhancement] enable fp8_e4m3
      
      * [Enhancement] enable int8
      
      * [Lint] run format.sh
      
      * [Benchmark] add gemm_sp benchmark
      
      * [Example] fix 256 threads hang
      
      * [CI] fix ci
      
      * [Chore] resolve gemini feedback
      
      * [Benchmark] increase search space
      
      * [Lint] format
      
      * [CI] skip sparse tensor core related tests as only sm90 is supported
      
      * [CI] pass local run
      
      * Update gemm_sm89.h
      
      * lint fix
      
      * lint fix
      
      * [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags
      
      - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
      - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
      - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
      - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
      - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.
      
      * Update test_compress_utils.py
      
      ---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
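For orientation, a dense tile-level GEMM in TileLang's quickstart style is sketched below; per this PR, `T.gemm_sp` mirrors `T.gemm` but additionally consumes a metadata operand for the compressed A tile. The sparse call shown in the comment is an assumed form, not the verified merged signature.

```python
import tilelang
import tilelang.language as T

@tilelang.jit(out_idx=[-1])
def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
           dtype="float16", accum_dtype="float"):

    @T.prim_func
    def main(
            A: T.Tensor((M, K), dtype),
            B: T.Tensor((K, N), dtype),
            C: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                # Dense primitive; per this PR the sparse form is (assumed):
                # T.gemm_sp(A_sparse_shared, E_metadata_shared, B_shared, C_local)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main
```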
  3. 01 Jul, 2025 1 commit
    • [Enhancement] Support tf32 gemm_rs (#607) · 0ff81755
      Lei Wang authored
      - Added a line break in `quickstart.py` for better readability.
      - Simplified the JIT kernel compilation in `quickstart.py` by removing the unused execution backend option.
      - Modified `example_elementwise_add.py` to disable cache for `tilelang` and optimized the element-wise addition kernel by utilizing shared memory for input tensors, improving performance.
      - Updated default values for matrix dimensions and block sizes in the argument parser to enhance usability.
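A minimal sketch of the shared-memory staging this commit describes for the element-wise add example (block sizes and names assumed, dimensions assumed divisible by the block sizes):

```python
import tilelang
import tilelang.language as T

@tilelang.jit
def elementwise_add(M, N, block_M=128, block_N=256, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((M, N), dtype),
             B: T.Tensor((M, N), dtype),
             C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_N), dtype)
            B_shared = T.alloc_shared((block_M, block_N), dtype)
            # Stage both input tiles through shared memory, then add element-wise.
            T.copy(A[by * block_M, bx * block_N], A_shared)
            T.copy(B[by * block_M, bx * block_N], B_shared)
            for i, j in T.Parallel(block_M, block_N):
                C[by * block_M + i, bx * block_N + j] = A_shared[i, j] + B_shared[i, j]

    return main
```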
  4. 30 Jun, 2025 1 commit
  5. 25 Jun, 2025 1 commit
    • [Example] Update examples to use @tilelang.jit (#597) · 3db18726
      Cunxiao Ni authored
      
      
      * [Example] Update kernel compilation in examples to use @tilelang.jit
      
      - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
      - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
      - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.
      
      * format
      
      * Update example_tilelang_sparse_gqa_decode_varlen_indice.py
      
      * Update example_dequant_gemm_fine_grained.py
      
      * Update example_gemm_autotune.py
      
      ---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
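The pattern this PR rolls out, sketched on a made-up one-liner kernel (`add_one` is illustrative, not one of the refactored examples): the `@tilelang.jit` decorator with `out_idx` replaces an explicit `tilelang.compile(...)` call, and the decorated function returns a kernel callable directly on torch tensors.

```python
import torch
import tilelang
import tilelang.language as T

# before: kernel = tilelang.compile(add_one_func, out_idx=[-1])
@tilelang.jit(out_idx=[-1])   # after: last argument is allocated and returned
def add_one(M, block_M=256, dtype="float32"):

    @T.prim_func
    def main(A: T.Tensor((M,), dtype), C: T.Tensor((M,), dtype)):
        with T.Kernel(T.ceildiv(M, block_M), threads=128) as bx:
            for i in T.Parallel(block_M):
                C[bx * block_M + i] = A[bx * block_M + i] + 1.0

    return main

kernel = add_one(1024)                    # compiled once, cached afterwards
a = torch.randn(1024, device="cuda")
c = kernel(a)                             # c == a + 1
```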
  6. 23 Jun, 2025 1 commit
  7. 20 Jun, 2025 1 commit
    • [Enhancement] align shared memory allocations (#583) · fecc8336
      Lei Wang authored
      * [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper
      
      - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
      - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
      - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
      
      * [Feature] Implement dynamic shared memory allocation alignment
      
      - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
      - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
      - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
      
      * [Enhancement] Update dtype and argument defaults in GEMM autotuning example
      
      - Changed data type from `float16` to `bfloat16` for improved precision in computations.
      - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process.
      
      * [Enhancement] Improve thread range computation in storage access
      
      - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
      - Updated `AccessEntry` structure to include `thread_range`.
      - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
      - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency.
      
      * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment
      
      - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
      - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
      - Added a blank line for improved readability in `thread_storage_sync.cc`.
      
      * [Refactor] Enhance storage access analysis and thread range computation
      
      - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
      - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
      - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.
      
      * [Enhancement] Update default matrix dimensions in GEMM autotune example
      
      - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.
      
      * typo fix
      
      * enhancement
      
      * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync
      
      - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
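A toy illustration of the alignment rule such a pass applies: each dynamic shared-memory allocation's byte offset is rounded up to a boundary (16 bytes here; the actual pass chooses the boundary per target).

```python
def align_up(offset: int, alignment: int = 16) -> int:
    # Round offset up to the next multiple of alignment.
    return (offset + alignment - 1) // alignment * alignment

offsets = []
cursor = 0
for nbytes in [130, 64, 257]:      # hypothetical allocation sizes in bytes
    cursor = align_up(cursor)      # aligned start for this buffer
    offsets.append(cursor)
    cursor += nbytes
print(offsets)                     # [0, 144, 208]
```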
  8. 16 Jun, 2025 2 commits
  9. 13 Jun, 2025 1 commit
  10. 11 Jun, 2025 1 commit
    • [Feature] Introduce Persistent Loop and Update GEMM Example (#563) · e7b97be2
      Yu Cheng authored
      * [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation
      
      - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid.
      - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code.
      - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
      - Updated relevant documentation and comments to reflect the addition of new features and usage instructions.
      
      * [Example] Add MLA Decode With Persistent Threadblock Example
      
      * [Feature] Introduce Persistent Loop and Update GEMM Example
      
      - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
      - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
      - Introduced a `loop_break` intrinsic for better control flow within persistent loops.
      - Updated relevant files to support the new features, including changes in code generation and language interface.
      
      * lint fix
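Conceptually, the persistent-threadblock schedule replaces one-block-per-tile launches with a fixed pool of blocks striding over the tile space; a host-side Python illustration (not TileLang API):

```python
num_blocks = 8      # hypothetical: one persistent block per SM
num_tiles = 20      # more work tiles than blocks

for block_id in range(num_blocks):          # each persistent block...
    tile = block_id
    while tile < num_tiles:                 # ...strides through the tile space
        print(f"block {block_id} -> tile {tile}")
        tile += num_blocks                  # grid-stride step
```

The `loop_break` intrinsic mentioned above corresponds to exiting such a persistent loop early.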
  11. 07 Jun, 2025 1 commit
  12. 06 Jun, 2025 1 commit
  13. 05 Jun, 2025 1 commit
  14. 04 Jun, 2025 3 commits
    • [CI] Add norm and layout_plot (#534) · c9e503be
      alex_xiao authored
      
      
      * [CI]Add norm and layout_plot
      
      * fix lint
      
      * Remove obsolete test files for RMS normalization and plot layout, streamlining the testing suite.
      
      * Add make_mma_load_base_layout function to create MMA result layouts
      
      - Introduced a new function `make_mma_load_base_layout` for generating layout functions for storing MMA results in fragment buffers.
      - Added detailed docstring explaining parameters, return values, and potential exceptions.
      - Implemented logic for handling different data types and matrix configurations, including assertions for input validation.
      - Defined internal functions for mapping fragment indices to threads and local indices, enhancing the layout functionality.
      
      * Enhance MMA load test with additional imports and functionality
      
      - Added imports for `tilelang.language`, `Literal`, `Callable`, `DataType`, `IndexMap`, and `get_mma_micro_size` to support extended functionality.
      - Improved the `make_mma_load_base_layout` function by ensuring it can handle various data types and configurations.
      - Updated the test function `test_mma_load_base_layout` to validate the layout for float16 matrix A.
      
      * Fix formatting in test_fragment_mma_load_a.py by adding a blank line for improved readability.
      
      * Add RMS normalization functions to test_rms_norm.py
      
      - Introduced `rms_norm` and `rms_norm_splitk` functions for RMS normalization, enhancing the testing capabilities.
      - Implemented kernel functions with shared memory allocation and parallel processing for improved performance.
      - Updated the test function to validate the new RMS normalization implementations.
      
      * Add reference program for RMS normalization in test_rms_norm.py
      
      - Introduced `ref_program` function to provide a reference implementation for RMS normalization.
      - This addition enhances the testing framework by allowing comparisons against a known reference output.
      
      * Enhance RMS normalization tests with additional imports and formatting
      
      - Added import for `tilelang.language` to support extended functionality in `test_rms_norm.py`.
      - Improved code readability by adding blank lines for better separation of code sections.
      
      * Update RMS normalization test parameters and enhance layout plotting
      
      - Increased matrix dimensions in `test_rms_norm` to 8192 for improved performance testing.
      - Removed obsolete test functions in `test_fragment_mma_load_a.py` to streamline the test suite.
      - Enhanced layout plotting functionality by ensuring proper visualization of base, warp, and block layouts in `test_fragment_mma_load_a.py`.
      
      * Refactor RMS normalization test parameters and improve layout plotting readability
      
      - Simplified the parameters in `test_rms_norm` by removing `blk_k` for clarity.
      - Enhanced code readability in `test_fragment_mma_load_a.py` by adjusting the formatting of the `block_layout` definition and removing the unused `warp_cols` variable.
      
      * Enhance RMS normalization with split-k implementation and additional profiling
      
      - Added a new function `test_rms_norm_splitk` to test the split-k variant of RMS normalization.
      - Updated the main RMS normalization script to include profiling for the split-k implementation.
      - Ensured all checks pass with appropriate latency measurements for both reference and tile-lang implementations.
      
      * Remove obsolete test file `test_fragment_mma_load_a.py` to streamline the test suite.
      
      * Refactor `rms_norm.py` to streamline benchmarking output and remove redundant code. Comment out the `plot_layout` call in `fragment_mma_load_a.py` for clarity.
      
      * Refactor `test_rms_norm.py` by removing redundant test function `test_rms_norm_splitk` to streamline the test suite and improve clarity.
      
      ---------
Co-authored-by: Your Name <you@example.com>
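The `ref_program` mentioned above amounts to a plain PyTorch RMS normalization; a minimal version (epsilon value assumed) that the tile-lang kernels can be checked against:

```python
import torch

def ref_program(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rms_norm(x) = x / sqrt(mean(x^2) + eps), reduced over the last dimension
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
```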
    • [CI] Add linear attention examples to CI (#552) · eec07578
      Tong WU authored
      * Add linear attention examples.
      
      * Add license
      
      * Remove comments
      
      * Run yapf and ruff
    • [Refactor] Include several examples into ci (#531) · 3ca3a8af
      Lei Wang authored
      * Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.
      
      * lint fix
      
      * fix split k example
      
      * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation
      
      * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
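`cdiv` here is ceiling division; assuming the top-level export the commit message implies, the usage in the stream-K example is:

```python
import tilelang

M, block_M = 1000, 128
grid_m = tilelang.cdiv(M, block_M)   # == (1000 + 128 - 1) // 128 == 8
print(grid_m)
```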
  15. 03 Jun, 2025 1 commit
  16. 01 Jun, 2025 1 commit
    • [AMD] Support float8 matrix core (#537) · 5872e647
      Lei Wang authored
      
      
      * [Enhancement] Add support for FP8 types in CUDA and HIP code generation
      
      * Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
      * Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
      * Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types.
      * Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
      * Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types.
      
      * lint fix
      
      * Update src/target/codegen_hip.cc
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * Update tilelang/intrinsics/mfma_macro_generator.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * workaround
      
      * fix
      
      * Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c
      
      * bug fix
      
      * Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality.
      
      * Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency.
      
      * Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly.
      
      * lint fix
      
      ---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
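The version-guarded dtype mapping these FP8 commits extend follows the pattern below (dictionary name and string values are illustrative); `float8_e4m3fnuz` is the ROCm-flavored type named in this PR:

```python
import torch

TORCH_TO_TL_DTYPE = {torch.float16: "float16", torch.bfloat16: "bfloat16"}

# float8 dtypes only exist in newer PyTorch builds, so guard each entry.
if hasattr(torch, "float8_e4m3fn"):
    TORCH_TO_TL_DTYPE[torch.float8_e4m3fn] = "float8_e4m3"
if hasattr(torch, "float8_e5m2"):
    TORCH_TO_TL_DTYPE[torch.float8_e5m2] = "float8_e5m2"
if hasattr(torch, "float8_e4m3fnuz"):   # ROCm-flavored fp8 from this PR
    TORCH_TO_TL_DTYPE[torch.float8_e4m3fnuz] = "float8_e4m3fnuz"
```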
  17. 28 May, 2025 3 commits
    • yyttt6 · 8af5eb77
    • [Autotune] Introduce cache mechanism for auto tuner (#527) · 7171aff6
      Lei Wang authored
      * [Enhancement] Add commit ID to versioning and improve logging initialization
      
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
      
      * [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching
      
      * Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
      * Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
      * Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
      * Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
      * Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
      * Minor adjustments in related modules to support the new caching and profiling features.
      
      * [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules
      
      * Consolidated import statements and removed unnecessary line breaks for better readability.
      * Standardized function argument formatting across the AutoTuner and CompileArgs classes.
      * Enhanced consistency in the use of whitespace and indentation throughout the codebase.
      * Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
      * Ensured that all changes adhere to the project's coding style guidelines.
      
      * [Refactor] Remove redundant type hints in AutoTuner modules
      
      * Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
      * Improved code readability and maintainability by streamlining type imports across the AutoTuner module.
      
      * [Refactor] Update AutoTuner configuration for improved profiling and target detection
      
      * Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
      * Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
      * Removed redundant target specifications in certain instances to streamline the configuration process.
      * Improved overall clarity and maintainability of the autotuning logic in various example scripts.
      
      * [Refactor] Simplify code formatting and improve readability in example scripts
      
      * Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
      * Removed unnecessary line breaks and standardized argument placement across multiple files.
      * Enhanced overall code readability and maintainability in autotuning examples and performance scripts.
      
      * [Refactor] Update JIT decorator usage across multiple files
      
      * Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
      * Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
      * Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.
      
      * [Refactor] Standardize JIT decorator formatting across benchmark and example scripts
      
      * Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
      * Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
      * Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
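A sketch of the caching idea described above, with hypothetical names: tuning results are keyed by a stable hash of the kernel source, the tuning space, and the library version, so identical reruns can load the best config from disk instead of re-tuning.

```python
import hashlib
import json

def cache_key(kernel_source: str, configs: list, version: str) -> str:
    # Stable key: identical source + search space + version => same cache entry.
    payload = json.dumps(
        {"src": kernel_source, "cfgs": configs, "ver": version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

key = cache_key("def main(): ...", [{"block_M": 128}, {"block_M": 64}], "0.1.0+g7171aff")
print(key[:16])   # e.g. used as the directory name for the saved tuning result
```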
    • [Refactor] Refactor convolution example to streamline configuration and remove unused code (#530) · 09581e4e
      Lei Wang authored
      
      
      * Refactor convolution example to streamline configuration and remove unused code
      
      * Updated the `check_hopper` function to properly check for CUDA availability and compute capability.
      * Removed the `get_configs` and `get_best_config` functions, simplifying the example by eliminating unused autotuning logic.
      * Adjusted argument parsing in the `main` function to directly compile the convolution kernel without autotuning options.
      * Cleaned up the code for better readability and maintainability.
      
      * Update examples/convolution/example_convolution.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      ---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
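A plausible minimal `check_hopper` matching the description (exact body assumed):

```python
import torch

def check_hopper() -> bool:
    # Hopper requires CUDA and compute capability 9.x (sm90).
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major == 9
```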
  18. 27 May, 2025 1 commit
  19. 24 May, 2025 1 commit
    • [Refactor] Support auto index bitwidth casting (#517) · 6ad73f6f
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
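`get_num_sms` as described lives in tilelang's `cuda_driver` module; an equivalent result via public PyTorch APIs:

```python
import torch

def get_num_sms(device_id: int = 0) -> int:
    # Streaming multiprocessor count of the given CUDA device.
    return torch.cuda.get_device_properties(device_id).multi_processor_count
```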
  20. 23 May, 2025 3 commits
    • Fix deepgemm example (#513) · 0d1eab57
      Taoyu Zhu authored
      
      
      * fix deepgemm example
      
      * fix deepgemm example
      
      * make format
      
      * Update example_deepgemm_fp8_2xAcc.py
      
      ---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Dev] Add grouped GEMM backward example scripts (#515) · de028927
      Yu Cheng authored
      * Introduced `example_grouped_gemm_fwd.py` and `example_grouped_gemm_bwd.py` to demonstrate grouped matrix multiplication with forward and backward operations.
      * Implemented functions for grouped GEMM, input construction, and validation against PyTorch's implementation.
      * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
      * Included a test function to validate the functionality with various input scenarios.
    • [Dev] Add grouped GEMM example with TileLang and PyTorch integration (#514) · fb801940
      Yu Cheng authored
      * Introduced a new example script `example_grouped_gemm.py` demonstrating grouped matrix multiplication using TileLang and PyTorch.
      * Implemented functions for performing grouped GEMM, constructing inputs, and validating results against PyTorch's implementation.
      * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
      * Included a test function to validate the grouped GEMM functionality with various input scenarios.
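The validation path in these grouped-GEMM examples reduces to one `torch.matmul` per group; a minimal reference with assumed shapes:

```python
import torch

def grouped_gemm_ref(As, Bs):
    """Per-group reference: one matmul per (A_i, B_i) pair."""
    return [a @ b for a, b in zip(As, Bs)]

As = [torch.randn(m, 64) for m in (8, 16, 32)]   # hypothetical group sizes
Bs = [torch.randn(64, 128) for _ in As]
outs = grouped_gemm_ref(As, Bs)                  # shapes: (8,128), (16,128), (32,128)
```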
  21. 18 May, 2025 1 commit
    • [Refactor] refactor `tilelang.jit` to support a faster and more flexible kernel cache (#501) · 25a50f1a
      Lei Wang authored
      * [Refactor] Update JIT kernel functions and streamline GEMM tests
      
      * Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity.
      * Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function.
      * Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability.
      * Cleaned up commented-out code and unnecessary imports to improve overall code quality.
      
      * Update main function call in GEMM test to use tilelang testing framework
      
      * Update README and example scripts to include JIT decorator comments
      
      * Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions.
      * Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity.
      
      * Update GEMM test parameters for improved performance
      
      * Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.
  22. 17 May, 2025 1 commit
    • [Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
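A sketch of the signal-based timeout guard described in these commits (names taken from the description, implementation assumed; `SIGALRM` is Unix- and main-thread-only):

```python
import signal

class TimeoutException(Exception):
    pass

def run_with_timeout(fn, timeout_s: int, *args, **kwargs):
    def _handler(signum, frame):
        raise TimeoutException(f"timed out after {timeout_s}s")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(timeout_s)              # deliver SIGALRM after timeout_s seconds
    try:
        return fn(*args, **kwargs)
    finally:
        signal.alarm(0)                  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```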
  23. 16 May, 2025 2 commits
    • [Refactor] Update main function structure in example scripts and add tests (#475) · 73ae8087
      Yu Cheng authored
      * [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang
      
      * Refactor example_mla_decode.py to define a main function for better structure and clarity.
      * Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode.
      * Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing.
      * Ensure all new test files are integrated with tilelang testing framework.
      
      * [Test] Enhance test_example_mla_decode with argument mocking
      
      * Update test_example_mla_decode.py to mock sys.argv for better test isolation.
      * Ensure the main function of example_mla_decode is called with the correct arguments during testing.
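The test-isolation trick described above: patch `sys.argv` so the example's argparse-based `main()` sees a fixed command line.

```python
import sys
from unittest import mock

import example_mla_decode  # the example module under test (per this PR)

def test_example_mla_decode():
    # Pin argv so argparse inside main() parses a known, empty argument list.
    with mock.patch.object(sys, "argv", ["example_mla_decode.py"]):
        example_mla_decode.main()
```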
    • [Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
  24. 14 May, 2025 1 commit
    • [Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example (#494) · cde1886f
      Lei Wang authored
      
      * Remove deprecated example_dequant_gemm.py and add DataType import in __init__.py
      
      * lint fix
      
      * lint fix
      
      * Refactor dequantization examples to use tilelang imports and update data type handling in quantization utilities
      
      * lint fix
  25. 13 May, 2025 1 commit
  26. 10 May, 2025 3 commits
  27. 09 May, 2025 2 commits
  28. 08 May, 2025 2 commits
    • [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py (#457) · b0122d74
      Lei Wang authored
      
      * Refactored barrier functions to use new signatures for improved clarity and consistency.
      * Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
      * Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
    • [Refactor] Update barrier functions and add new example for GEMM with warp specialization (#456) · a91bc2a9
      Lei Wang authored
      * Add example for warp specialization with flash attention
      
      * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
      * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
      * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
      * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.
      
      * Add memory barrier functions to builtin.py
      
      * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
      * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
      * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
      * Improved code organization and readability by adding blank lines for better separation of logical sections.
      
      * Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py
      
      * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
      * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.
      
      * [Refactor] Update barrier functions and add new example for GEMM with warp specialization
      
      * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
      * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
      * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
      * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
      * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.
      
      * lint fix
      
      * Update loop variable checks in SIMT loop and buffer region validation
      
      * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
      * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.
      
      * lint fix
      
      * test fix
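The renamed primitives slot into a producer/consumer split roughly as below; this is a comment-level sketch, since barrier allocation and exact signatures are assumptions drawn from the descriptions above, not verified API.

```python
import tilelang.language as T

# Inside a warp-specialized kernel body (surrounding context elided); `bar`
# is a shared mbarrier and `phase` alternates between 0 and 1 across loop
# iterations, matching the parity values the barrier_wait docstring allows.
#
#   if producer:
#       T.copy(A[ko * block_K, 0], A_shared)
#       T.barrier_arrive(bar)            # signal: tile staged in shared memory
#   else:
#       T.barrier_wait(bar, phase)       # wait on parity 0/1 for the staged tile
#       T.gemm(A_shared, B_shared, C_local)
```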