- 11 Sep, 2025 1 commit
-
-
Tang Xinsheng authored
* [AMD] support fp8 T.gemm * format --------- Co-authored-by:tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>
-
- 01 Jul, 2025 1 commit
-
-
Lei Wang authored
- Added a line break in `quickstart.py` for better readability. - Simplified the JIT kernel compilation in `quickstart.py` by removing the unused execution backend option. - Modified `example_elementwise_add.py` to disable cache for `tilelang` and optimized the element-wise addition kernel by utilizing shared memory for input tensors, improving performance. - Updated default values for matrix dimensions and block sizes in the argument parser to enhance usability.
-
- 11 Jun, 2025 1 commit
-
-
Lei Wang authored
* [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes. - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts. - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate. - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns. * lint fix * [Refactor] Update GEMM Layout Functions and Inference Logic - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature. - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture. - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection. - Adjusted related layout functions to ensure consistent behavior across different architectures. * Update bulk_copy.cc * Update __init__.py
-
- 20 May, 2025 1 commit
-
-
Lei Wang authored
* Modified the layout creation in makeGemmFragmentB to enhance the order of operations, ensuring the Replicate method is called before Repeat for better readability and performance. * This change improves the logical flow of fragment creation, aligning with best practices for GEMM layout management.
-
- 16 May, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Improve GEMM layout function and documentation * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies. * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations. * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases. * lint fix
-
- 13 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Enhance makeGemmFragmentB to support transposition * Updated the `makeGemmFragmentB` function to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition. * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation. * Modified the function signature in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter. * Added a new `matmul_sr` function in the test suite to validate the behavior of the updated fragment generation with transposition support. * [Refactor] Enhance makeGemmFragmentA and makeGemmFragmentB for transposition support * Updated the `makeGemmFragmentA` and `makeGemmFragmentB` functions to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition. * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation. * Modified function signatures in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter. * Added a new `matmul_rs` function in the test suite to validate the behavior of the updated fragment generation with transposition support. * * Improve error messaging in layout equality checks * Enhanced the error output in layout equality checks to provide clearer context by adding line breaks for better readability in the debug output. * This change ensures that when layouts are structurally unequal, the current and previous layouts are displayed more distinctly, aiding in debugging.
-
- 03 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability. * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management. * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files * Updated function names from CamelCase to snake_case for better consistency across the codebase. * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc. * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions. * [Refactor] Rename operations to snake_case for consistency * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others. * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability. * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions. * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang. * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames. * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions. * Enhanced the TileLang API with new methods for retrieving block and thread extents. * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation. * Improved layout inference and kernel launch logic for better performance and clarity. * [Refactor] Clean up code formatting and improve readability * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`. * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity. * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs. * Ensured consistent spacing and formatting across multiple files to enhance overall code readability. * lint fix * [Refactor] Update mbarrier functions for improved clarity and consistency * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability. * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity. * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code. * Added detailed docstrings to clarify usage examples for memory barrier functions. * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections. * [Feature] Add examples for warp specialization and TMA barrier integration * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers. * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance. * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch. * Updated the `phase.py` to include TMA barrier injection in the optimization process. * Improved documentation and comments for better clarity on usage and functionality. * [Feature] Add example for warp specialization in GEMM with TMA barriers * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers. * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance. * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation. * Enhanced documentation and comments for clarity on usage and functionality. * lint fix * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement. * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results. * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis. * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness. * lint fix
-
- 16 Apr, 2025 1 commit
-
-
Lei Wang authored
* make it python 3.8- happy * [Enhancement] Improve loop partitioning and vectorization logic in layout inference and loop vectorization - Enhanced the VisitStmt_ method to support local buffer handling in parallel loops, allowing for register usage without explicit thread binding. - Updated loop vectorization logic to simplify expressions and ensure accurate vector size calculations, improving performance and clarity in the vectorization process. * lint fix * [Refactor] Update warp size checks and enhance warp partitioning logic in GEMM - Changed warp_n size check from 16 to 8 in gemm_layouts.cc to improve compatibility with specific configurations. - Refactored warp partitioning logic in gemm.cc to prioritize N dimension for better performance based on aspect ratio. - Introduced a new CompileArgs dataclass in autotuner to streamline compile argument management and improve code clarity. * lint fix * [Enhancement] Initialize jit_compile in AutoTuner class - Added initialization for jit_compile attribute in the AutoTuner class to ensure it is set to None by default. - Updated the assignment logic for jit_compile to prevent overwriting an existing compile function, enhancing the flexibility of the AutoTuner's compilation process.
-
- 09 Apr, 2025 2 commits
-
-
Lei Wang authored
* [Refactor] Update AutoTuner run method and timeout handling - Modified the `run` method to reduce the default timeout from 100 to 30 seconds for improved responsiveness. - Changed the `get_input_tensors_supply` call to disable output generation, enhancing performance during tensor supply retrieval. - Refactored the latency measurement to streamline the benchmarking process, ensuring proper timeout handling with `ThreadPoolExecutor`. - Added logging for timeout occurrences to aid in debugging and performance analysis. * bug fix * lint fix
-
Lei Wang authored
* [Bugfix] Correct dynamic shared memory size error handling in HIP wrapper - Updated the error handling logic in `PREDEF_ATTRIBUTE_SET_DYNAMIC_MEMORY_HIP` to check if the dynamic shared memory size exceeds the maximum limit of 65536. - Improved error message clarity by specifying the function name and the attempted size, ensuring better debugging information. - Ensured the function returns 0 upon successful setting of the dynamic shared memory size. * [Add] Implement example for MLA decoding with AMD support - Introduced a new example script `example_mla_decode_amd.py` demonstrating the use of the flash attention mechanism with AMD hardware. - Implemented functions for attention calculation, including support for split processing and combining outputs. - Added command-line argument parsing for customizable input parameters such as batch size, number of heads, and dimensions. - Included a reference implementation for validation against the Tile-AI output, ensuring correctness of the implementation. - Enhanced performance profiling and output comparison for debugging and optimization purposes. * lint fix
-
- 07 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update GEMM Fragment Layout and Improve Matrix Multiplication Functionality - Adjusted the layout configuration in `gemm_layouts.cc` to correct the repetition parameters for warp and block layouts, enhancing the efficiency of the GEMM fragment generation. - Refactored the `matmul_rs` function in `test_tilelang_test_amd.py` to improve readability by restructuring the function signature and ensuring consistent formatting. - Updated the test execution call to run the new `test_gemm_rs_f16f32f32_nt` function, enhancing test coverage for the GEMM functionality. * lint fix * bugfix
-
- 06 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Clean up whitespace in CUDA-related files - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability. - This change enhances the overall organization of the codebase without altering functionality. * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`. - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement. - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration. - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations. * lint fix * Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation - Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety. - Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance. - Integrated the new transformation into the optimization pipeline in `phase.py`. - Documented the new transformation in `__init__.py` for clarity. * lint fix * Add configuration option for index bitwidth and remove ExpandIndexDataType transformation - Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth. - Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation. - Documented the new configuration option in the JIT compilation function's parameters for clarity. - Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process. * lint fix * Refactor index bitwidth configuration handling - Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values. - Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility. * lint fix * lint fix --------- Co-authored-by:LeiWang1999 <wyatuestc@gmail.com>
-
- 20 Mar, 2025 1 commit
-
-
Lei Wang authored
* remove llvm build * [Refactor] Update kernel compilation and profiling in examples - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation. - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency. - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations. * lint fix * License Update * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields. - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability. * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files - Improved comment alignment and readability in `cuda.h`. - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability. * lint fix * lint fix * lint fix * lint fix * fix * License update * [Enhancement] Update JITKernel to use artifact for kernel source - Assigned the generated artifact to `self.artifact` for better management. - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling. * lint fix * Add @tilelang.testing.requires_llvm decorator to vectorization tests * Enhance setup.py and env.py for library management - Added functionality to remove original files after copying in CMakeBuild. - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration. * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py * Refactor CMakeBuild file handling in setup.py - Added a check to ensure the target library directory exists before copying .so files. - Improved the logic for creating the target directory and copying files to enhance robustness. * bugfix * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement. * lint fix * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility. * lint fix * Add support for C target in device code generation - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function. * [Enhancement] Implement auto-clear cache feature based on environment variable * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing. * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing. * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true. * [Refactor] Update kernel invocation and import paths in tests and cache * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result. * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`. * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability. * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class. * Enhanced overall code formatting to align with project standards. * [Enhancement] Add bfloat16 test case and improve kernel caching logic * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`. * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading. * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management. * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database. * Improved code formatting and readability across several files. * lint fix * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
-
- 16 Mar, 2025 1 commit
-
-
zqh-wz authored
* add test for issue 101 * use ss_smem_selector from cutlass * fix mismatch between smem layout and mma * only fix for sm90 * Add CUDA requirements to GEMM thread tests * lint fix --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 27 Feb, 2025 1 commit
-
-
Lei Wang authored
* refactor code * enhance tutorial * Enhance error handling and code generation in CUDA and TileLang components This commit introduces several improvements across multiple files: - Added more informative error messages in GEMM layout checks - Updated CUDA codegen to support more flexible function signature generation - Improved TMA descriptor initialization and kernel dispatch logic - Refined library generation and source code parsing utilities - Enhanced error handling in various adapter and wrapper classes * Add thread tag validation for warp specialization Introduce a ThreadTagChecker to validate that a PrimFunc only uses threadIdx.x before applying warp specialization. This prevents unintended transformations on kernels with complex thread binding and provides a clear warning to users about potential issues with warp specialization. * Update TileLang Profiling and Compilation in Flash Decoding Examples Refactor the profiling and compilation workflow in two flash decoding example scripts: - Replace `tilelang.lower()` and `tilelang.Profiler()` with `tilelang.compile()` - Simplify profiler initialization using `get_profiler()` - Update method calls to use the new profiler and compiled kernel objects - Maintain existing performance benchmarking and validation logic * Refactor and clean up code formatting in TileLang testing and adapter modules This commit includes several code style and formatting improvements: - Adjust whitespace and line breaks in test files - Improve code formatting in CUDA source wrapper and adapter utilities - Enhance readability of function calls and argument handling - Remove unnecessary whitespace and standardize indentation - Simplify function signatures and argument parsing * Refactor CUDA codegen and improve code formatting This commit includes several improvements to CUDA code generation and formatting: - Enhance function signature generation in CodeGenTileLangCUDA - Improve code formatting and readability in CUDA-related files - Simplify parameter handling and type annotations - Clean up whitespace and line breaks in codegen and layout files --------- Co-authored-by:Ubuntu <dlisuser@h100testl730RPS.xu5snccwrbtejcqqalluoku5hb.xx.internal.cloudapp.net>
-
- 11 Jan, 2025 2 commits
-
-
Lei Wang authored
* README.md fixed * update test ci * Lint and Typo Fix * Clang Format Lint Fix
-
Lei Wang authored
* Add format.sh script for code formatting and linting * docs update * center align the title * lint fix * add ignore * Add .gitignore for 3rdparty directory * Add requirements-dev.txt, requirements-test.txt, and requirements.txt * 3rdparty * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h * Refactor CMakeLists.txt and include statements - Update CMakeLists.txt to use a newer version of CMake and add project name - Remove unnecessary include directories Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc - Update include paths to use relative paths instead of absolute paths * Update submodule for 3rdparty/tvm * update * load dll first * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * git keep update * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * refactor code structure * Update Readme * CMakeLists Customized * update readme * update README * update readme * update usage * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import * annotate lower transform global func with `transform` prefix * Migrate Simplify Pass from tilelang tvm branch * enhance system environment handling with __init__ and CMake * Initial commit * CODE_OF_CONDUCT.md committed * LICENSE committed * README.md committed * SECURITY.md committed * SUPPORT.md committed * CODE_OF_CONDUCT Commit * LICENSE Commit * SECURITY Commit * SUPPORT Commit * Modify Support * Update README.md * security ci update * remove examples * Update and implement clang-format * add composable kernel components * Migrate from latest update * submodule update * Test update * Update License * Spell check * lint fix * add clang-tidy to apply static analysis for c source * update tilelang examples * Update Install Docs * Refactor filetree * Enhance Install * conflict resloved * annotate_version * Initial Update * test fix * install * Implement setup.py * lint fix * Separate Init * Separate test * docker file commit * add logo * Update Readme and Examples * update readme * update logo * Implement AMD Installation * Add License * Update AMD MI300x Benchmark * update README * update mi300 benchmark scripts * update ignore * enhance build scirpt * update image * enhance setup.py to remove duplicated libraries * remove debug files * update readme * update image * update gemm examples * update flashattention README * readme update * add cmake into requirements * libinfo fix * auto update submodule * lint fix * Fix AMD Build and Test * Update check for transpose attribute for CDNA Arch * typo fix for amd * Implement Matmul Benchmark * Refactor Code * [TypoFix] Fix GEMM Example * [Docs] Init Linear Attention README * [TYPO] Typo fix * [Lint] Lint Fix * enhance example with intrinsics * [Enhancement] Improve Buffer Collection during IR Parser * [Dev] Introduce Current classmethod to get current frame * submodule update * fake test pass update * support thread_extent_api * code optimize * Add GEMM function implementation for matrix multiplication * Update logging format to reflect TileLang in logger messages * Refactor CMakeLists.txt for improved readability and set default build type to Release * Support Gemm SS Primitives Implementation * [README] Upload Tile Language Logo (#5) * update logo * Update README.md to enhance formatting and center the title --------- Co-authored-by:
microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com> Co-authored-by:
Microsoft Open Source <microsoftopensource@users.noreply.github.com> Co-authored-by:
Yu Cheng <yu.cheng@pku.edu.cn>
-