  1. 22 Apr, 2025 2 commits
    • [Refactor] Enhance layout inference logic in ParallelOp (#420) · bf27e641
      Yu Cheng authored
      * Updated the layout inference in ParallelOp to improve the selection of source buffers for layout accuracy.
      * Introduced logic to choose the read source buffer based on the number of indices, ensuring more precise layout inference.
      * Refactored the loop handling to maintain clarity and improve the overall robustness of the layout inference process.
    • [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint (#417) · 73a6cb8b
      Lei Wang authored
      * [Enhancement] Introduce thread range management in layout and operation handling
      
      * Added `SetThreadRange` method to `FragmentNode` for managing thread ranges.
      * Updated `LayoutNode::Inverse` to provide more informative error messages.
      * Refactored layout inference and operation lowering to utilize `thread_bounds` instead of `block_size`, enhancing flexibility for thread management.
      * Introduced new tests for tilelang operations to validate thread range functionality and ensure correctness in parallel execution scenarios.
      
      * lint fix
      
      * [Refactor] Improve thread variable handling in layout inference and operation lowering
      
      * Removed workaround for undefined thread_var in layout inference, ensuring proper handling of thread bounds.
      * Updated logic to define thread bounds based on the presence of thread_var, enhancing robustness in thread management.
      * Refactored thread_var initialization in lower_tile_op to maintain consistency across the codebase.
      
      * [Refactor] Update thread variable handling in layout inference and operation lowering
      
      * Refactored thread variable checks to ensure bounds are only accessed when defined, improving safety and clarity.
      * Initialized thread_var with a default range to prevent undefined behavior.
      * Updated logic in lower_tile_op to align with new thread variable handling, enhancing consistency across the codebase.
  2. 21 Apr, 2025 1 commit
    • [Bugfix] Support larger than 256 box size tma copy (#413) · bf824406
      Lei Wang authored
      * [New Feature] Add FP8 Flash Attention Implementation (#412)
      
      * Introduce a new example script for FP8 Flash Attention in `example_mla_decode_kv_fp8.py`, showcasing the use of tilelang for efficient attention computation.
      * Implement the `flashattn` function with optimized memory management and kernel execution.
      * Include a reference program for comparison and performance evaluation.
      * Add command-line argument parsing for batch size, number of heads, and dimensions to facilitate testing and experimentation.
      * Enhance the overall structure and readability of the code.
      
      This addition aims to improve the performance of attention mechanisms in deep learning models by leveraging FP8 precision and optimized kernel execution.
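
      As a rough illustration of the command-line parsing mentioned above, a minimal argparse sketch might look like the following; the flag names and defaults are placeholders, not the exact options defined in `example_mla_decode_kv_fp8.py`:

      ```python
      import argparse

      def parse_args():
          # Illustrative flags only; the example script may name or default these differently.
          parser = argparse.ArgumentParser(description="FP8 flash-attention decode example")
          parser.add_argument("--batch", type=int, default=1, help="batch size")
          parser.add_argument("--heads", type=int, default=32, help="number of attention heads")
          parser.add_argument("--dim", type=int, default=128, help="head dimension")
          return parser.parse_args()

      if __name__ == "__main__":
          args = parse_args()
          print(args.batch, args.heads, args.dim)
      ```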
      
      * lint fix
      
      * optimize quick start
      
      * lint fix
  3. 19 Apr, 2025 1 commit
    • [Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier (#408) · e8c2e794
      Lei Wang authored
      * Update TVM submodule and enhance vectorization logic in loop_vectorize.cc
      
      - Updated the TVM submodule to the latest commit.
      - Simplified the vectorization process by ensuring that the vectorized expression is simplified after vectorization, improving expression handling.
      - Added checks in loop_fusion_utils.h to prevent fusion of loops with non-power-of-2 extents, enhancing robustness in loop transformations.
      
      * lint fix
  4. 17 Apr, 2025 2 commits
    • [CI] Update CI configuration to run pytest with automatic parallelization (#393) · 6d3d4743
      Lei Wang authored
      * Update CI configuration to run pytest with automatic parallelization using the '-n auto' option.
      
      * Enhance Cython JIT Adapter Compilation Logic
      
      - Improved the locking mechanism during the compilation of the Cython JIT adapter to prevent race conditions.
      - Added checks to determine if another process has already compiled the library, reducing unnecessary recompilation.
      - Cleaned up the code by removing redundant imports and ensuring proper handling of temporary files during compilation failures.
      - Updated vectorization logic in loop_vectorize.cc to allow optional simplification of vectorized expressions.
      
      This update enhances performance and reliability in the JIT compilation process.
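
      The locking described here follows a common compile-once pattern; a minimal sketch, assuming a filelock-style lock file next to the cached library (names are illustrative, not the adapter's actual code):

      ```python
      import os
      from filelock import FileLock  # assumed dependency; the adapter may lock differently

      def ensure_compiled(cache_dir: str, lib_name: str, compile_fn) -> str:
          """Compile the shared library at most once; concurrent processes wait on the lock."""
          lib_path = os.path.join(cache_dir, lib_name)
          os.makedirs(cache_dir, exist_ok=True)
          with FileLock(lib_path + ".lock"):
              # Another process may have finished compiling while we waited for the lock.
              if not os.path.exists(lib_path):
                  compile_fn(lib_path)
          return lib_path
      ```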
      
      * lint fix
      
      * Update CI configuration to run pytest with 4 parallel jobs instead of auto-detection
      
      * Add pytest markers for serial execution in MHA tests
      
      - Added @pytest.mark.serial to multiple MHA test functions to ensure they run sequentially.
      - This change improves test reliability by preventing potential race conditions during execution.
      
      * Update TVM submodule and enhance vectorization logic in loop_vectorize.cc
      
      - Updated the TVM submodule to the latest commit.
      - Modified the vectorization logic to include optional simplification of vectorized expressions and added checks to ensure the usage of vectorized variables, improving performance and reliability in expression handling.
      
      * Remove @pytest.mark.serial from multiple MHA test functions to allow parallel execution. This change enhances test performance by enabling concurrent test runs while maintaining reliability.
      
      * Remove tvm_simplify_test.py file, eliminating the test for expression simplification in TVM. This cleanup helps streamline the codebase by removing unused test cases.
      
      * Remove unused pytest import from test_tilelang_kernel_mha.py to streamline the test file.
      
      * lint fix
      
      * Update TVM submodule and refine vectorization logic in loop_vectorize.cc
      
      - Updated the TVM submodule to the latest commit.
      - Adjusted the return statements in loop_vectorize.cc to improve expression handling and ensure consistency in the visitor pattern.
      
      * Refactor vectorization logic in loop_vectorize.cc
      
      - Removed the check for the usage of the vectorized variable in the vectorization logic, simplifying the expression handling.
      - This change enhances the clarity and efficiency of the vectorization process.
      
      * Enhance vectorization checks in loop_vectorize.cc
      
      - Added a check to ensure the vectorized expression uses the vectorized variable, improving the robustness of the vectorization logic.
      - This change refines the expression handling and ensures that only valid vectorized expressions are processed.
      
      * Implement non-local buffer checks for loop vectorization in layout_inference.cc
      
      - Added logic to check for non-local buffer loads and stores before applying vectorization to loops. This enhancement ensures that vectorization is only applied when appropriate, improving the correctness of the loop transformations.
      
      * Refactor buffer handling in pipeline planning and layout inference
      
      - Renamed GlobalCopyPatternDetector to BufferRegionCollector for clarity and updated its logic to collect buffer read/write regions.
      - Enhanced the handling of conditional expressions in pipeline planning, allowing for better management of stages related to conditional statements.
      - Improved the processing of buffer regions during read/write operations, ensuring accurate tracking of buffer usage across different stages.
      
      * Refactor vectorization checks in loop_vectorize.cc
      
      - Removed the check for the usage of the vectorized variable in the vectorization logic, simplifying the expression handling.
      - This change enhances the clarity and efficiency of the vectorization process, ensuring that valid vectorized expressions are processed without unnecessary checks.
  5. 16 Apr, 2025 4 commits
    • Add preliminary support for bf16 for AMD (#388) · c091668f
      Oscar Savolainen authored
      * Add bf16 support for AMD in quickstart example
      
      * Reduced git diff
      
      * Move bf16 vector definition into common.h
      
      * Added unit tests for basic AMD bf16 matmul
      
      * lint fix
      
      ---------
      Co-authored-by: OscarSavNS <oscar.savolainen@nscale.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Enhancement] Move T.any_of and T.all_of op registration from python into cpp (#398) · 7c266adf
      Cunxiao Ni authored
      * [Enhancement] Move T.any_of and T.all_of op registration from python into cpp
      
      * format
      
      * add license
    • [BugFix] Address should be aligned with access size in tail split (#401) · cffcf1c2
      Zhengju Tang authored
      * [BugFix] Address should be aligned with access size in tail split
      
      * Lint
      
      * Lint
    • [Enhancement] Introduce a smarter warp partition strategy (#396) · ca730c0a
      Lei Wang authored
      * make it python 3.8- happy
      
      * [Enhancement] Improve loop partitioning and vectorization logic in layout inference and loop vectorization
      
      - Enhanced the VisitStmt_ method to support local buffer handling in parallel loops, allowing for register usage without explicit thread binding.
      - Updated loop vectorization logic to simplify expressions and ensure accurate vector size calculations, improving performance and clarity in the vectorization process.
      
      * lint fix
      
      * [Refactor] Update warp size checks and enhance warp partitioning logic in GEMM
      
      - Changed warp_n size check from 16 to 8 in gemm_layouts.cc to improve compatibility with specific configurations.
      - Refactored warp partitioning logic in gemm.cc to prioritize N dimension for better performance based on aspect ratio.
      - Introduced a new CompileArgs dataclass in autotuner to streamline compile argument management and improve code clarity.
      
      * lint fix
      
      * [Enhancement] Initialize jit_compile in AutoTuner class
      
      - Added initialization for jit_compile attribute in the AutoTuner class to ensure it is set to None by default.
      - Updated the assignment logic for jit_compile to prevent overwriting an existing compile function, enhancing the flexibility of the AutoTuner's compilation process.
  6. 15 Apr, 2025 2 commits
    • [Bugfix] Support `T.Parallel` with local register assignment (#395) · 8c5b1341
      Lei Wang authored
      * make it python 3.8- happy
      
      * [Enhancement] Improve loop partitioning and vectorization logic in layout inference and loop vectorization
      
      - Enhanced the VisitStmt_ method to support local buffer handling in parallel loops, allowing for register usage without explicit thread binding.
      - Updated loop vectorization logic to simplify expressions and ensure accurate vector size calculations, improving performance and clarity in the vectorization process.
      
      * lint fix
    • [Enhancement] Report Error Body in ParallelOp Layout Inference (#394) · 192a3995
      Yu Cheng authored
      Added detailed error messages in the InferLayout method to provide better context when layout conflicts occur. This includes the body of the operation that triggered the error, aiding in debugging and layout validation.
  7. 14 Apr, 2025 2 commits
    • [Refactor] Refactor warp_specialized_rewriter to support multiple acquire/release patterns. (#391) · 44243542
      Yu Cheng authored
      Updated SyncPatternMap to use vectors for acquire and release, enhancing flexibility in handling synchronization patterns. Improved barrier handling logic for both producer and consumer cases, ensuring accurate synchronization in the pipeline.
    • [Pipeline][Enhancement] Add copy_prepare stage to support mask and index caching (#392) · bf0032f8
      Lei Wang authored
      * [Enhancement][Pipeline] Improve pipeline stage information handling and copy stage detection
      
      - Added detailed documentation for the PipelineStageInfo structure to clarify its parameters.
      - Enhanced the VisitStmt_ method to handle annotations for pipeline order and stage more effectively.
      - Implemented logic to determine if a stage is used by a copy operation, adjusting the stage assignment accordingly.
      - Processed the tail copy stage to ensure correct ordering and stage assignment in the pipeline planning process.
      
      * lint fix
  8. 13 Apr, 2025 1 commit
  9. 12 Apr, 2025 3 commits
    • [Revert] Revert modifications for pass FlattenBuffer (#385) · 310fea95
      Lei Wang authored
      * fix
      
      * Update submodule TVM to latest commit and enhance FlattenBuffer pass in TileLang engine. Added boolean handling in buffer loading and improved address_of detection in flattening logic.
      
      * lint fix
    • [Enhancement][Pipeline] More precise copy code block detection in pipeline (#384) · abaacde5
      Lei Wang authored
      * Update legalize_safe_memory_access.cc
      
      * Add cache path handling and file locking in Cython adapter
      
      - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
      - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling.
      - These changes aim to optimize the compilation process and prevent race conditions during library loading.
      
      * lint fix
      
      * refactor
      
      * refactor
      
      * Add GlobalCopyPatternDetector to identify global memory copy patterns
      
      - Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements.
      - Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes.
      - Improved code clarity and maintainability by encapsulating detection logic within the new class.
      
      * Refactor copy stage detection logic in pipeline planning
      
      - Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage.
      - Removed redundant checks for read and write scopes, enhancing code clarity and maintainability.
      
      * lint fix
    • [Refactor] Remove debug message in pass legalize_safe_memory_access (#381) · ad465a72
      Lei Wang authored
      * Update legalize_safe_memory_access.cc
      
      * Add cache path handling and file locking in Cython adapter
      
      - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
      - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling.
      - These changes aim to optimize the compilation process and prevent race conditions during library loading.
      
      * lint fix
  10. 11 Apr, 2025 2 commits
    • [Typo] Remove debug print (#373) · 137dab67
      Lei Wang authored
      * [Enhancement] Add variable check in GlobalMemChecker for safe memory access validation
      
      - Introduced a check in the GlobalMemChecker to determine if the index used in memory access has any variable components, enhancing the safety of memory access validation.
      - Updated the condition handling in store operations to ensure that only boolean conditions are processed, improving type safety and error handling in memory operations.
      
      * [Refactor] Rename VecAllocAccess to TLVecAllocAccess and enhance buffer access handling
      
      - Renamed the `VecAllocAccess` class to `TLVecAllocAccess` for clarity in its purpose.
      - Improved the handling of buffer access by mutating extents and rewriting access in the body, ensuring compatibility with vectorized operations.
      - Added a TODO comment to suggest moving this pass to occur before StorageFlatten/FlattenBuffer for better optimization.
      - Introduced a print statement in the phase optimization process for debugging purposes.
      
      * lint fix
    • [Language] Introduce `T.any_of` and `T.all_of` to reduce a bool array (#371) · c4638d65
      Lei Wang authored
      * [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks
      
      - Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements.
      - Implemented corresponding intrinsic calls for CUDA, enhancing the functionality of the TileLang framework.
      - Updated the `allocate.py` to handle boolean types correctly in shared memory allocations.
      - Introduced tests for the new logical operations to ensure correctness and performance.
      Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
      
      * lint fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
  11. 09 Apr, 2025 2 commits
    • [Bugfix] Fix compilation issues for AMD CDNA element size check (#364) · d627fd58
      Lei Wang authored
      * [Refactor] Update AutoTuner run method and timeout handling
      
      - Modified the `run` method to reduce the default timeout from 100 to 30 seconds for improved responsiveness.
      - Changed the `get_input_tensors_supply` call to disable output generation, enhancing performance during tensor supply retrieval.
      - Refactored the latency measurement to streamline the benchmarking process, ensuring proper timeout handling with `ThreadPoolExecutor`.
      - Added logging for timeout occurrences to aid in debugging and performance analysis.
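
      A minimal sketch of the timeout pattern described above; `bench_fn` stands in for the latency-measurement callable and is not the AutoTuner's actual method:

      ```python
      import logging
      from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

      logger = logging.getLogger(__name__)

      def benchmark_with_timeout(bench_fn, timeout: float = 30.0):
          """Run one benchmark callable; return None and log a warning if it exceeds the timeout."""
          with ThreadPoolExecutor(max_workers=1) as pool:
              future = pool.submit(bench_fn)
              try:
                  return future.result(timeout=timeout)
              except FuturesTimeout:
                  logger.warning("Benchmark timed out after %.1f s", timeout)
                  return None
      ```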
      
      * bug fix
      
      * lint fix
    • [AMD] Implement DeepSeek MLA for AMD (#363) · e3065f0b
      Lei Wang authored
      * [Bugfix] Correct dynamic shared memory size error handling in HIP wrapper
      
      - Updated the error handling logic in `PREDEF_ATTRIBUTE_SET_DYNAMIC_MEMORY_HIP` to check if the dynamic shared memory size exceeds the maximum limit of 65536.
      - Improved error message clarity by specifying the function name and the attempted size, ensuring better debugging information.
      - Ensured the function returns 0 upon successful setting of the dynamic shared memory size.
      
      * [Add] Implement example for MLA decoding with AMD support
      
      - Introduced a new example script `example_mla_decode_amd.py` demonstrating the use of the flash attention mechanism with AMD hardware.
      - Implemented functions for attention calculation, including support for split processing and combining outputs.
      - Added command-line argument parsing for customizable input parameters such as batch size, number of heads, and dimensions.
      - Included a reference implementation for validation against the Tile-AI output, ensuring correctness of the implementation.
      - Enhanced performance profiling and output comparison for debugging and optimization purposes.
      
      * lint fix
  12. 08 Apr, 2025 1 commit
    • [Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on hopper (#357) · 7fdcedd0
      Lei Wang authored
      
      * [Enhancement] Add warp specialization configuration option and update related functionality
      
      * [Add] Introduced a new pass configuration option `kDisableWarpSpecialized` to control warp specialization behavior.
      * [Refactor] Updated `WarpSpecializedRewriter` and `WSCodeEmitter` to utilize the new configuration option, allowing for more flexible optimization strategies.
      * [Update] Modified the optimization pipeline in `phase.py` to include pipeline planning when warp specialization is disabled, enhancing performance with async copy.
      * [Documentation] Updated JIT compilation parameters to reflect the new configuration option for better clarity.
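
      A hedged sketch of toggling the option through the JIT path; the `pass_configs` keyword and the exact key string are assumptions inferred from this commit's description of the JIT parameters, not verified API:

      ```python
      import tilelang

      def compile_gemm(gemm_func, disable_ws: bool):
          # Assumed pass-config key; the C++ side names the option kDisableWarpSpecialized.
          return tilelang.compile(
              gemm_func,
              out_idx=[-1],
              pass_configs={"tl.disable_warp_specialized": disable_ws},
          )
      ```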
      
      * lint fix
      
      * [Add] Implement test for GEMM with warp specialization configuration
      
      * Introduced a new test file `test_tilelang_pass_config_disable_warp_specialized.py` to validate the functionality of the warp specialization configuration option.
      * Added a `run_gemm` function to execute matrix multiplication tests with and without warp specialization, ensuring correctness through profiling against reference results.
      * Included a specific test case for GEMM with float16 data types, enhancing test coverage for the new configuration feature.
      
      * [Refactor] Improve formatting in test_tilelang_pass_config_disable_warp_specialized.py
      
      * Reformatted the `tilelang.compile` call in the `run_gemm` function for better readability by breaking it into multiple lines.
      * Added a blank line for improved code structure and clarity in the `test_gemm_f16f16f16_nn` function.
  13. 07 Apr, 2025 1 commit
    • [Bugfix] Fix Transposed Fragment Layout for AMD GEMM_RS matrix core (#346) · 0acb8586
      Lei Wang authored
      * [Refactor] Update GEMM Fragment Layout and Improve Matrix Multiplication Functionality
      
      - Adjusted the layout configuration in `gemm_layouts.cc` to correct the repetition parameters for warp and block layouts, enhancing the efficiency of the GEMM fragment generation.
      - Refactored the `matmul_rs` function in `test_tilelang_test_amd.py` to improve readability by restructuring the function signature and ensuring consistent formatting.
      - Updated the test execution call to run the new `test_gemm_rs_f16f32f32_nt` function, enhancing test coverage for the GEMM functionality.
      
      * lint fix
      
      * bugfix
  14. 06 Apr, 2025 2 commits
    • [Enhancement] Support index bit width configuration (#343) · 70546adc
      Lei Wang authored
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation
      
      - Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
      - Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
      - Integrated the new transformation into the optimization pipeline in `phase.py`.
      - Documented the new transformation in `__init__.py` for clarity.
      
      * lint fix
      
      * Add configuration option for index bitwidth and remove ExpandIndexDataType transformation
      
      - Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
      - Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
      - Documented the new configuration option in the JIT compilation function's parameters for clarity.
      - Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.
      
      * lint fix
      
      * Refactor index bitwidth configuration handling
      
      - Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
      - Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
    • [Enhancement] Support region padding when converting buffer load to buffer region (#342) · 10804a0d
      Lei Wang authored
      * Enhance error checking in RegionOp and buffer_load_to_tile_region
      
      - Added detailed error messages to the index size check in `RegionOp` to aid debugging.
      - Implemented a check in `buffer_load_to_tile_region` to ensure the length of indices matches extents, with a fallback to expand extents if necessary. This improves robustness in handling buffer loads with mismatched dimensions.
      
      * lint fix
  15. 05 Apr, 2025 1 commit
    • [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen (#323) · 89725f7f
      Lei Wang authored
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Update submodule and enhance FP8 type handling in CUDA codegen
      
      - Updated the TVM submodule to the latest commit.
      - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
      - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
      - Added error handling for missing configuration keys in the AutoTuner class.
      
      * lint fix
      
      * Remove print statement from example script
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
  16. 04 Apr, 2025 3 commits
    • [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options (#331) · 9e5a757e
      Lei Wang authored
      
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
    • [Dynamic Symbolic] Adaptively vectorize with different condition expressions (#326) · 5ee58ec7
      Zhengju Tang authored
      * [Dynamic Symbolic] Adaptively vectorize with different condition expressions
      
      * Format
      
      * Format
      
      * Format
      
      * Format
      
      * Add MIT License headers to Python files
      
      * Simplify return statement in loop vectorization
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [AMD] Adapt ROCm and support `T.gemm` with transpose_b=False for the AMD backend (#327) · eab47249
      Lei Wang authored
      * [Enhancement] Update GEMM and ROCm Integration
      
      - Removed the restriction on transposing matrix B for CDNA in `gemm.cc`, allowing for more flexible matrix operations.
      - Added a new debug header file `debug.h` for enhanced debugging capabilities in ROCm kernels.
      - Updated `codegen_hip.cc` to include the new debug header and improved handling of float16 and bfloat16 types in vector element stores.
      - Refactored `rt_mod_hip.cc` to return a ROCM module directly from `BuildTileLangHIPWithoutCompile`, enhancing the module creation process.
      - Introduced a new ROCm utility in `rocm.py` for linking and managing ROCm paths, improving the build process for ROCm applications.
      - Updated tests to reflect changes in GEMM configurations and ensure compatibility with the new features.
      
      These changes enhance the flexibility and debugging capabilities of the GEMM operations and improve the integration with the ROCm backend.
      
      * [Fix] Corrected syntax error in pyproject.toml and improved error message formatting in rocm.py
      
      - Added missing quotation mark for "HSA" in the `select` section of `pyproject.toml`.
      - Simplified the error message formatting in `get_rocm_arch` function of `rocm.py` for better readability and consistency.
      
      * lint fix
      
      * Update tilelang/jit/adapter/wrapper.py
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * lint fix
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  17. 03 Apr, 2025 2 commits
    • [Bugfix] add a patch to fix T.abs on float16 (#325) · 2cec52aa
      botbw authored
      * [bug] fix T.abs on float16
      
      * [lint] lint
    • [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support (#320) · 4b705eb2
      Yu Cheng authored
      * [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support
      
      * Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation.
      * Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation.
      * Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h.
      * Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers.
      * Updated tilelang.language module to include the new `reduce_absmax` function.
      
      These changes enhance FP8 quantization capabilities and extend reduction operation support.
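
      A reference-style sketch of the per-token idea in plain PyTorch (not the tilelang or Triton example scripts); it assumes a PyTorch build with `float8_e4m3fn` support:

      ```python
      import torch

      def per_token_cast_to_fp8_ref(x: torch.Tensor, eps: float = 1e-4):
          """Scale each token (row) by its absolute maximum, then cast to float8 e4m3."""
          fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
          absmax = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps)
          scale = absmax / fp8_max
          return (x / scale).to(torch.float8_e4m3fn), scale
      ```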
      
      * [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization
      
      * Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations.
      * Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy.
      * Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking.
      
      These changes enhance the flexibility and efficiency of the FP8 quantization process.
      
      * lint
      
      * [Dev] Update per_token_cast_fp8.py
  18. 01 Apr, 2025 2 commits
  19. 31 Mar, 2025 2 commits
    • [Bugfix] Fix layout conflict issue for GQA decoding examples (#314) · 0fd82ed5
      Lei Wang authored
      * Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.
      
      * Refactor flashattn example to improve CUDA configuration handling
      
      - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
      - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
      - Adjusted the `do_bench` method call to streamline performance profiling.
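
      A minimal sketch of such a capability-based heuristic; the parameter names (`block_N`, `num_stages`) and thresholds are illustrative, not the values used in `example_gqa_decode.py`:

      ```python
      import torch

      def pick_tile_config():
          """Choose tile parameters from the GPU's compute capability (illustrative only)."""
          major, _minor = torch.cuda.get_device_capability()
          if major >= 9:   # Hopper-class
              return {"block_N": 128, "num_stages": 2}
          if major >= 8:   # Ampere/Ada-class
              return {"block_N": 64, "num_stages": 2}
          return {"block_N": 32, "num_stages": 1}
      ```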
      
      * lint fix
    • [Bugfix] Fix dynamic axis with variable extent (#311) · c30904ea
      Lei Wang authored
      * [Enhancement] Improve error message for RampNode in CUDA codegen
      
      - Updated the error message in the VisitExpr_ method for RampNode to include the specific Ramp node and lane count when the lane count exceeds the limit of 4. This change enhances debugging by providing clearer context for the error.
      - Refactored the loop vectorization logic in loop_vectorize_dynamic.cc to improve readability and maintainability, ensuring that dynamic vectorization checks are performed correctly and efficiently.
      
      * lint fix
  20. 30 Mar, 2025 1 commit
    • [Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
  21. 29 Mar, 2025 1 commit
  22. 28 Mar, 2025 2 commits
    • [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h (#297) · 9ad9d9cd
      Lei Wang authored
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
    • [Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061
      Lei Wang authored
      * [Feature] Implement ParallelLoopTransformer for enhanced loop analysis
      
      - Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
      - Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
      - Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
      - Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.
      
      * test fix
      
      * Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.
      
      * Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.