"vscode:/vscode.git/clone" did not exist on "8e953fce608c26d112bd6c4220d730e9a6cdae4b"
  1. 15 Oct, 2025 1 commit
    • fix bug & add AMD examples (#966) · 80665cd1
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
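
      A minimal sketch of the swizzle annotation described above, in the `annotate_layout`/`make_swizzled_layout` style used by TileLang examples (shapes and names here are illustrative, not quoted from the commit):

      ```python
      import tilelang.language as T
      from tilelang.layout import make_swizzled_layout

      @T.prim_func
      def load_q(Q: T.Tensor((128, 64), "float16")):
          with T.Kernel(2, threads=128) as bx:
              Q_shared = T.alloc_shared((64, 64), "float16")
              # Swizzling the shared buffer spreads each warp's accesses across
              # memory banks, avoiding the bank conflicts the commit refers to.
              T.annotate_layout({Q_shared: make_swizzled_layout(Q_shared)})
              T.copy(Q[bx * 64, 0], Q_shared)
      ```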
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      * Add new AMD FlashAttention example and test script
      
      - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
      - Added `test.sh` script to facilitate running the new example with specified parameters.
      - Enhanced the overall structure and organization of the example for better clarity and usability.
      
      * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
      
      - Reduced the number of threads and `num_split_q` options for improved performance.
      - Adjusted `panel_size` options to streamline configuration settings.
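
      The tuning-space bullets above imply a generator along these lines; the knob names echo the commit text, while the helper itself is a hypothetical sketch:

      ```python
      import itertools

      def get_configs():
          """Hypothetical autotuner config generator for the forward example."""
          block_M = [64, 128]
          block_N = [32, 64]
          threads = [128, 256]      # reduced thread options, per the commit
          num_split_q = [1, 2]      # trimmed num_split_q choices
          panel_size = [7, 8]       # streamlined panel_size settings
          return [
              dict(block_M=m, block_N=n, threads=t, num_split_q=s, panel_size=p)
              for m, n, t, s, p in itertools.product(
                  block_M, block_N, threads, num_split_q, panel_size)
          ]
      ```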
      
      * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
      
      * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
      
      * Add example for AMD Flash Attention backward pass implementation
      
      - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
      - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
      - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
      - Included reference implementation for validation against PyTorch's attention mechanism.
      
      This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
      
      * Enhance AMD Flash Attention example with additional testing capabilities
      
      - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
      - Improved the main function to allow for better parameter configuration and benchmarking.
      - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
      
      This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
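
      The validation step the commit describes reduces to a comparison against PyTorch's built-in attention; a hedged reference check (tolerances are placeholders):

      ```python
      import torch
      import torch.nn.functional as F

      def check_against_torch(kernel_out, q, k, v, causal=True):
          """Compare a Flash Attention kernel's output with PyTorch's reference.

          q, k, v are (batch, heads, seq_len, dim) tensors; kernel_out is the
          tensor produced by the TileLang example."""
          ref = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
          torch.testing.assert_close(kernel_out, ref, rtol=1e-2, atol=1e-2)
      ```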
      
      * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * Refactor HIP intrinsic rules to CUDA
      
      - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
      - Adjusted include paths for better organization and clarity in the code structure.
      
      * Update AMD CI workflow to uninstall specific PyTorch packages before installation
      
      - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
      - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
      
      * Remove unused shared memory allocations in AMD Flash Attention backward example
      
      - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
      - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
      
      * Remove unnecessary pip uninstall command from AMD CI workflow
      
      - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
      - This change simplifies the CI process and reduces potential overhead during package management.
      
      * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
      
      - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
      - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
      
      * Refactor formatting of HIP intrinsic rule registrations
      
      - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
      - No functional changes were made; this update focuses on code style improvements to enhance maintainability.
      
      * Update file name and documentation for HIP intrinsic rules
      
      - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
      - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
      
      * Enhance DispatchHIPShuffle function with clang-analyzer comments
      
      - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
      - This change improves code clarity and maintains compliance with static analysis tools.
      
      * lint fix
      
      * fix
      
      * Enhance autotuner configurations in example_amd_flash_attn_fwd.py by adding new block sizes, stages, and panel sizes. Update test script to use relative Python path and adjust parameters for consistency.
      
      * Add backward attention example to test script
      
      - Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
      - Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
      - Updated the command for the backward tile example to match the new configuration.
      
      * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py
      
      - Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
      - Updated the FlashAttention forward and backward functions to improve performance and maintainability.
      - Adjusted test script parameters for consistency and clarity, including the addition of group handling.
      - Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
      - Updated the main function to reflect changes in parameter names and types for better usability.
      
      * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
      
      - Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
      - Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
      - Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
      - Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
      - Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.
      
      * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
      
      - Introduced a scaling factor for improved numerical stability in gradient calculations.
      - Optimized shared memory usage by adding new shared buffers for intermediate calculations.
      - Refined the handling of tensor fragments to improve performance and maintainability.
      - Updated the main function to ensure compatibility with the new output parameters for backward operations.
      - Removed unnecessary parameters from the test script to streamline execution.
      
      * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py
      
      - Updated the forward and backward functions to improve numerical stability and performance.
      - Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
      - Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
      - Added debugging and benchmarking functions for improved correctness verification and performance analysis.
      - Updated the main function to reflect changes in parameter handling and ensure consistency across examples.
      
      * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
      
      - Updated scaling factor application for improved numerical stability in gradient calculations.
      - Refined tensor handling to ensure consistency with forward pass operations.
      - Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision.
      - Adjusted comments for clarity and alignment with standard implementation practices.
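
      A hedged TileLang sketch of the fp32 atomic accumulation mentioned above: every Q block contributes to the same dK/dV tiles, so the writes must be atomic, and accumulating in fp32 preserves precision across many partial sums (shapes and names are illustrative):

      ```python
      import tilelang.language as T

      @T.prim_func
      def accumulate_dk(partial: T.Tensor((64, 64), "float32"),
                        dK: T.Tensor((64, 64), "float32")):
          # Four blocks write into the same dK tile concurrently.
          with T.Kernel(4, threads=128) as bx:
              local = T.alloc_fragment((64, 64), "float32")
              T.copy(partial, local)
              for i, j in T.Parallel(64, 64):
                  T.atomic_add(dK[i, j], local[i, j])
      ```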
      
      * Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh
      
      - Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
      - Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
      - Improved comments for clarity and alignment with the updated configurations.
      
      * Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py
      
      - Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost.
      - Modified benchmarking functions to evaluate the complete forward and backward performance of both the reference and TileLang implementations.
      - Improved comments for better understanding of the performance metrics and implementation details.
      - Removed unnecessary parameter from test.sh to streamline execution.
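
      The FLOPs accounting referred to above matches the convention FlashAttention benchmarks commonly use; as arithmetic (the 2.5x backward factor reflects five backward matmuls versus two in the forward pass):

      ```python
      def attention_flops(batch, heads, seq_len, dim, causal=False):
          """Total FLOPs for one forward + backward pass of attention."""
          # forward: QK^T and P@V are each 2*B*H*S^2*D FLOPs
          fwd = 4 * batch * heads * seq_len * seq_len * dim
          if causal:
              fwd //= 2            # only the lower triangle is computed
          bwd = int(2.5 * fwd)     # dV, dP, dQ, dK plus recomputing S
          return fwd + bwd
      ```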
      
      * Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.
      
      * Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py
      
      - Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
      - Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
      - Refined scaling factor calculations for numerical stability in both forward and backward passes.
      - Improved comments and documentation for clarity and consistency across implementations.
      - Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.
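
      The reason the forward pass returns LSE: with the per-row log-sum-exp stored, the backward pass can rebuild the softmax matrix directly instead of re-running the online-softmax reduction. In reference form:

      ```python
      import torch

      q, k = torch.randn(8, 128, 64), torch.randn(8, 128, 64)
      scale = q.shape[-1] ** -0.5
      S = torch.einsum("bqd,bkd->bqk", q, k) * scale
      lse = torch.logsumexp(S, dim=-1)             # what the kernel stores
      P = torch.exp(S - lse.unsqueeze(-1))         # recomputed in backward
      assert torch.allclose(P, torch.softmax(S, dim=-1), atol=1e-6)
      ```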
      
      * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py
      
      - Removed outdated comments and improved clarity in the code.
      - Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
      - Updated autotuner configurations to include new parameters for better performance tuning.
      - Refined tensor handling and scaling factor calculations for improved numerical stability.
      - Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.
      
      * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
      
      - Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
      - Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns.
      - Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
      - Enhanced GEMM operations with k_pack for improved computational efficiency.
      - Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.
      
      * Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py
      
      - Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
      - Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
      - Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
      - Adjusted GEMM operations to improve computational efficiency without the use of k_pack.
      
      * Enhance HIP code generation and FP8 type support
      
      - Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
      - Updated error logging to include unsupported FP8 type details for better debugging.
      - Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
      - Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
      - Added overloads for AtomicAdd in common.h to support both pointer and value arguments.
      
      * Enhance FP8 type support and clarify accumulator handling in HIP
      
      - Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
      - Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
      - Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.
      
      * Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py
      
      * Update print statement formatting for clarity in example_amd_flash_attn_bwd.py
      
      * Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.
      
      * Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.
      
      * Refactor and enhance HIP code generation for improved FP8 support
      
      - Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
      - Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
      - Updated AtomicAdd function in common.h to streamline its implementation.
      - Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
      - Added function to manage the addition of new functions in the code generation process.
      
      * Fix formatting issue in HIP code generation for MFMA call
      
      - Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.
      
      * Refactor HIP code generation and enhance FP8 type handling
      
      - Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
      - Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
      - Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
      - Refined the AddFunction method to streamline function addition in the code generation process.
      
      * Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.
      
      * Refactor backward attention implementation in example_amd_flash_attn_bwd.py
      
      - Updated the GEMM operation to use shared memory for improved performance.
      - Adjusted parallelization parameters to enhance efficiency in the backward pass.
      
      * Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.
      
      * Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations
      
      * Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  2. 03 Aug, 2025 1 commit
    • [Refactor] Rebase pipeline injector from upstream tvm (#687) · 73bf8346
      Lei Wang authored
      * [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling
      
      - Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline.
      - Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling.
      - Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction.
      - Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure.
      - Removed obsolete code related to previous buffer access methods for clarity and maintainability.
      
      * test fix
  3. 01 Jun, 2025 1 commit
    • [AMD] Support float8 matrix core (#537) · 5872e647
      Lei Wang authored
      
      
      * [Enhancement] Add support for FP8 types in CUDA and HIP code generation
      
      * Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
      * Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
      * Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types.
      * Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
      * Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types.
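
      On the Python side, the dlpack.py change amounts to extending a dtype mapping with the fnuz float8 variants AMD hardware uses; a hypothetical sketch (the exact table is not shown in this log):

      ```python
      import torch

      TORCH_TO_TL_DTYPE = {
          torch.float16: "float16",
          torch.bfloat16: "bfloat16",
          # fnuz float8 formats, the variants supported by AMD matrix cores
          torch.float8_e4m3fnuz: "float8_e4m3fnuz",
          torch.float8_e5m2fnuz: "float8_e5m2fnuz",
      }
      ```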
      
      * lint fix
      
      * Update src/target/codegen_hip.cc
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * Update tilelang/intrinsics/mfma_macro_generator.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * workaround
      
      * fix
      
      * Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c
      
      * bug fix
      
      * Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality.
      
      * Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency.
      
      * Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly.
      
      * lint fix
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  4. 26 Mar, 2025 1 commit
    • [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
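
      The semantics of combining `lse_local_split` reduce to a per-row log-sum-exp over the splits; the commit parallelizes that row loop. In reference form:

      ```python
      import torch

      lse_split = torch.randn(4, 128)            # (num_splits, block_M)
      lse = torch.logsumexp(lse_split, dim=0)    # combined row-wise LSE
      ```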
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
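
      The `matmul_test` kernel itself is not shown in the log; a minimal tile-level GEMM in TileLang style looks roughly like this (block sizes, dtypes, and pipelining depth are placeholders):

      ```python
      import tilelang.language as T

      def matmul(M, N, K, block_M=128, block_N=128, block_K=32, dtype="float16"):

          @T.prim_func
          def main(A: T.Tensor((M, K), dtype),
                   B: T.Tensor((K, N), dtype),
                   C: T.Tensor((M, N), dtype)):
              with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                            threads=128) as (bx, by):
                  A_shared = T.alloc_shared((block_M, block_K), dtype)
                  B_shared = T.alloc_shared((block_K, block_N), dtype)
                  C_local = T.alloc_fragment((block_M, block_N), "float32")
                  T.clear(C_local)
                  for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                      T.copy(A[by * block_M, k * block_K], A_shared)
                      T.copy(B[k * block_K, bx * block_N], B_shared)
                      T.gemm(A_shared, B_shared, C_local)
                  T.copy(C_local, C[by * block_M, bx * block_N])

          return main
      ```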
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
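
      A hedged sketch of `T.make_tensor` as described above: binding raw pointer arguments to typed tensor views inside a TIR program (the exact signature is an assumption based on the commit text):

      ```python
      import tilelang.language as T

      M, N = 1024, 512

      @T.prim_func
      def copy_kernel(a_ptr: T.handle, b_ptr: T.handle):
          # Create typed tensor views over the raw pointers.
          A = T.make_tensor(a_ptr, (M, N), "float16")
          B = T.make_tensor(b_ptr, (M, N), "float16")
          with T.Kernel(M, threads=128) as bx:
              for j in T.Parallel(N):
                  B[bx, j] = A[bx, j]
      ```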
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
  5. 16 Mar, 2025 1 commit
    • [Refactor] Introduce KernelParam integration across modules (#223) · 3de9f13c
      Lei Wang authored
      * [Refactor] Update KernelParam integration across modules
      
      - Replaced instances of TensorType with KernelParam in various modules to standardize parameter handling.
      - Updated JITKernel, BaseKernelAdapter, and CythonKernelAdapter to utilize KernelParam for improved type consistency.
      - Enhanced Profiler class to include KernelParam in its parameters, ensuring better integration with the new parameter structure.
      - Adjusted tensor handling in utility functions to accommodate the new KernelParam type, improving overall code clarity and maintainability.
      - Updated copyright headers to reflect the correct organization.
      
      * [Refactor] Clean up whitespace in kernel, profiler, and tensor modules
      
      - Added blank lines for improved readability in kernel.py, __init__.py, and tensor.py.
      - Enhanced code clarity by ensuring consistent formatting across these modules.
      
      * [Enhancement] Add detailed docstrings to KernelParam and Profiler classes
      
      - Enhanced KernelParam class with comprehensive docstrings for better understanding of its purpose and methods.
      - Updated Profiler class to include detailed docstrings for its attributes and methods, improving code documentation and usability.
      - Removed unused do_bench function to streamline the profiler module and improve clarity.
      
      * [Refactor] Update type hints in do_bench function and clean up whitespace in profiler module
      
      - Changed type hints for grad_to_none and quantiles parameters in do_bench function to use Optional for better clarity.
      - Added a blank line in __init__.py for improved readability and consistency in the profiler module.
      
      * [Refactor] Update type hint in do_bench function for consistency
      
      - Changed the return type hint in the do_bench function from a union type to a more explicit List type for better clarity and consistency in type annotations.
      
      * [Refactor] Update return type hint in do_bench function for clarity
      
      - Changed the return type hint in the do_bench function from a union type to Union[float, List[float]] for improved clarity and consistency in type annotations.
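
      Put together, the signature these commits converge on looks like the following (timing loop elided; the parameter set is an assumption):

      ```python
      from typing import List, Optional, Union

      import torch

      def do_bench(fn,
                   warmup: int = 25,
                   rep: int = 100,
                   grad_to_none: Optional[List[torch.Tensor]] = None,
                   quantiles: Optional[List[float]] = None) -> Union[float, List[float]]:
          """Benchmark `fn`, returning mean latency or requested quantiles."""
          times = [0.0] * rep  # would hold per-iteration latencies; loop elided
          if quantiles is not None:
              return [sorted(times)[int(q * (rep - 1))] for q in quantiles]
          return sum(times) / rep
      ```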
      
      * [Enhancement] Add func property to Profiler class for adapter access
      
      - Introduced a new property `func` in the Profiler class to provide access to the adapter, ensuring that the adapter is set before retrieval. This enhancement improves the usability of the Profiler class by allowing easier access to the adapter functionality.
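
      An illustrative fragment of that property (the surrounding class is elided):

      ```python
      class Profiler:
          def __init__(self, adapter=None):
              self.adapter = adapter

          @property
          def func(self):
              # Fail loudly if no adapter has been attached yet.
              assert self.adapter is not None, "Profiler adapter is not set"
              return self.adapter
      ```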
      
      * [Refactor] Update kernel compilation and profiling in tests
      
      - Replaced instances of `TL.lower` and `TL.Profiler` with `tilelang.compile` and the new profiler interface across multiple test files.
      - Enhanced the kernel compilation process to utilize the updated API, improving consistency and maintainability in the testing framework.
      - Updated assertions to use the new profiler methods for better clarity and functionality in performance testing.
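
      In current API terms, the migration looks like this (reusing the `matmul` sketch shown earlier; method names follow today's TileLang tests and may postdate this commit):

      ```python
      import tilelang

      kernel = tilelang.compile(matmul(1024, 1024, 1024))  # was: TL.lower(...)
      profiler = kernel.get_profiler()                     # was: TL.Profiler(...)
      latency = profiler.do_bench()
      print(f"matmul latency: {latency:.3f} ms")
      ```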
      
      * [Refactor] Simplify kernel invocation and remove unused parameters in tests
      
      - Updated the kernel invocation in `test_tilelang_dynamic_symbolic.py` to directly assign the result to `C`, improving clarity.
      - Removed the `execution_backend` parameter from `tilelang.compile` calls in `test_tilelang_jit_callback.py` and `test_tilelang_jit_gemm.py` for consistency with the updated API.
      - Commented out the call to `tilelang.testing.main()` in `test_tilelang_jit_callback.py` and replaced it with a direct call to `test_gemm_jit_kernel()` to streamline test execution.
      - Adjusted the dtype mapping in `TorchDLPackKernelAdapter` to use the parameter's dtype directly, enhancing code simplicity.
      
      * [Refactor] Remove unused imports in test files for cleaner code
      
      - Eliminated unnecessary imports of `tilelang` as `TL` in various test files to enhance code clarity and maintainability.
      - Updated multiple test files to streamline the codebase and reduce potential confusion from unused references.
      
      * [Refactor] Simplify kernel invocation in tilelang kernel test
      
      - Updated the kernel invocation in `test_tilelang_kernel_bf16_gemm_mma.py` to directly assign the result to `C`, enhancing code clarity and consistency with recent changes in the API.
      
      * [Refactor] Simplify kernel invocation in tilelang kernel tests
      
      - Updated kernel invocations in multiple test files to directly assign the result to `C`, improving code clarity and consistency with the updated API.
      - Removed unnecessary initialization of `C` as a zero tensor, streamlining the code further.
      
      * [Refactor] Update kernel invocation in tilelang transform tests
      
      - Replaced the use of `TL.Profiler` with `tilelang.compile` in `test_tilelang_transform_simplify.py`, enhancing code clarity and consistency with the updated API.
      - Streamlined the kernel invocation process by directly assigning the result to `C`, improving readability and maintainability of the test code.
  6. 23 Jan, 2025 2 commits
    • [Refactor] Simplify interface via replacing argument thread binding of intrinsics with `KernelFrame.Current` (#34) · 362b3520
      Lei Wang authored
      
      * installation script fix
      
      * readme typo fix
      
      * doc fix for dequantize gemm
      
      * [Doc] remove CODE_OF_CONDUCT.md and SECURITY.md; update references in CONTRIBUTING.md
      
      * [Doc] add unit tests for AnnotateDeviceRegions transform; remove SUPPORT.md
      
      * update license
      
      * [Enhancement] add tensor supply handling for unsigned integers; improve error message for execution backend assertion
      
      * [Refactor] improve code readability by reformatting function signatures and assertions
      
      * [Refactor] replace torch.manual_seed with tilelang.testing.set_random_seed for consistency in random seed handling
      
      * [Refactor] unify thread binding variable naming across kernel and example files
      
      * [Refactor] remove unused thread binding parameter from matrix multiplication functions
      
      * [Refactor] remove unused thread binding parameter from matrix multiplication functions
      
      * [Refactor] enable main testing function in tilelang kernel gemm test
      
      * bug fix
    • [CI] Comprehensive Test cases Implementation of Matmul Dequantize (#32) · 7959d786
      Lei Wang authored
      * installation script fix
      
      * readme typo fix
      
      * doc fix for dequantize gemm
      
      * [Doc] remove CODE_OF_CONDUCT.md and SECURITY.md; update references in CONTRIBUTING.md
      
      * [Doc] add unit tests for AnnotateDeviceRegions transform; remove SUPPORT.md
      
      * update license
      
      * [Enhancement] add tensor supply handling for unsigned integers; improve error message for execution backend assertion
      
      * [Refactor] improve code readability by reformatting function signatures and assertions
      
      * [Refactor] replace torch.manual_seed with tilelang.testing.set_random_seed for consistency in random seed handling
  7. 11 Jan, 2025 2 commits
    • [Lint] Overall Typo and Linting Fixes (#13) · fa511857
      Lei Wang authored
      * README.md fixed
      
      * update test ci
      
      * Lint and Typo Fix
      
      * Clang Format Lint Fix
    • [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * use TVM_IMPORT_PYTHON_PATH to import the Python package from a custom TVM build
      
      * annotate lowering-transform global functions with the `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resolved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build script
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: Microsoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: Yu Cheng <yu.cheng@pku.edu.cn>