1. 12 Dec, 2025 1 commit
  2. 01 Dec, 2025 1 commit
    • [Language] support `T.gemm_sp_v2` on sm80 and sm89 (#1056) · 283a9a00
      botbw authored
      * [misc] add a cpp side wrapper for gemm_sp_py
      
      * [misc] typing
      
      * [IR] bind GemmSPWarpPolicy
      
      * [chore] add wrapper code
      
      * [IR] fix GemmSPWarpPolicy
      
      * [codegen] apply ptxas instructions
      
      * [intrinsic] add typical (unused) mma layout
      
      * [template] add uint16 debug func
      
      * [intrinsic] add b matrix layout
      
      * [gemm_sp] enable fp16/bf16 on sm8x
      
      * [layout] refactor fp16/bf16 layout
      
      * [gemm_sp] enable int8
      
      * [chore] update test case dtype
      
      * [gemm_sp] enable fp32
      
      * [layout] refactor layouts
      
      * [intrinsic] enable ldmatrix for mat A
      
      * [layout] enable ldsm for matrix b
      
      * [layout] add ldmatrix for fp32 and fp8
      
      * [chore] refine
      
      * [chore] refactor
      
      * [chore] add fp8 refactor
      
      * [chore] refactor
      
      * [chore] add remove negative zero util
      
      * [example] add a custom compress kernel
      
      * [chore] minor update
      
      * [test] refactor gemm_sp test
      
      * [refactor] make metadata layout func
      
      * [example] add option for using cutlass layout
      
      * [doc] add a gemm_sp doc
      
      * [doc] minor polish
      
      * [chore] remove unused
      
      * [bugfix] fix non replicate b case
      
      * [test] refactor
      
      * [chore] add a check
      
      * [bugfix] fix util bug
      
      * [wip] init a new test case for v2
      
      * [chore] minor refactor
      
      * [chore] minor update
      
      * [bugfix] enable 16bit rs
      
      * [language] enable rs
      
      * [language] enable gemm_sp_sr
      
      * [language] enable gemm_sp_rr
      
      * [test] enable more tests
      
      * [tvm] update ffi binding
      
      * [chore] remove print
      
      * [chore] fix benchmark script
      
      * [lint] precommit lint
      
      * [chore] apply feedback
      
      * [test] use arch 8.0
      
      * [chore] rollback ::ordered_metadata for backward compatibility
      
      * [bugfix] fix capitalized
      
      * [example] keep gemm_sp on hopper
      
      * [test] fix no fp8 normal kernel
      
      * [test] reduce matmul size to satisfy accum error
      
      * [test] use cal_diff for assertion
      
      * [bugfix] expand float8 type
      
      * [lib] add make_int4 for short type
      
      * [language] add transpose E
      
      * [bugfix] fix wrong var
      
      * [format] format
      
      * [chore] refactor binding
      
      * [chore] fix wrong passing var
  3. 15 Sep, 2025 1 commit
    • [feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf
      botbw authored
      
      
      * [feat] add an example mma atom
      
      * [fix] fix typo naming
      
      * [feat] add a template to enable compilation
      
      * [feat] add print util
      
      * [WIP] pass on single block tile
      
      * [feat] add sm80 metadata layout
      
      * [chore] clean codebase
      
      * [CI] format.sh
      
      * [feat] add sm80 compress utils
      
      * [bugfix] fix C fragment layout
      
      * [refactor] use nvcc version instead of str
      
      * [test] add test cases
      
      * [chore] add a param check
      
      * [chore] format a bit
      
      * [chore] rename func to satisfy PEP 8 and appease gemini
      
      * [chore] add check
      
      * [feat] support sm75 layout && add assertion && chore
      
      * [bug] fix illegal memory access when using two warps over N=32
      
      This could be a missing check related to the cutlass 2.x implementation.
      The cutlass example can't trigger it because padding the input bypasses
      the problem.
      
      For now it is probably safe to increase the atom size and investigate
      further in the future.
      
      * [chore] add example
      
      * [chore] format
      
      * [example] update benchmark
      
      * [bugfix] fix namespace and format
      
      * [bugfix] fix incorrect param passing
      
      * [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp
      
      * [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py
      
      * [CI] fix arch
      
      * [example] add torch sparse benchmark
      
      * [misc] polish && add reference && apply review suggestions && format
      
      * [CI] format with clang-tidy
      
      * [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h
      
      * [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty
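
The "compress utils" and "metadata layout" items above refer to preparing a sparse operand for the sparse tensor cores. As a rough illustration only, the sketch below assumes the 2:4 structured-sparsity pattern (at most two non-zeros in every group of four elements) and shows the idea in plain Python: keep two values per group and record their positions as metadata. The names `compress_2_4` and `decompress_2_4` are hypothetical, and the real utilities use hardware-specific metadata layouts that this sketch does not reproduce.

```python
from typing import List, Tuple


def compress_2_4(row: List[float]) -> Tuple[List[float], List[int]]:
    """Compress a row that already satisfies 2:4 sparsity (hypothetical helper).

    Returns the kept values (two per group of four) and, for each kept value,
    its position (0..3) inside the group.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values: List[float] = []
    meta: List[int] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        kept = [i for i, v in enumerate(group) if v != 0]
        if len(kept) > 2:
            raise ValueError("group has more than 2 non-zeros; not 2:4 sparse")
        # Pad with unused positions so every group stores exactly two slots.
        for i in range(4):
            if len(kept) == 2:
                break
            if i not in kept:
                kept.append(i)
        kept.sort()
        values.extend(group[i] for i in kept)
        meta.extend(kept)
    return values, meta


def decompress_2_4(values: List[float], meta: List[int]) -> List[float]:
    """Rebuild the dense row from kept values and their in-group positions."""
    out = [0.0] * (len(values) * 2)
    for slot, (value, idx) in enumerate(zip(values, meta)):
        group = slot // 2  # two stored slots per group of four
        out[group * 4 + idx] = value
    return out
```

A round trip through `decompress_2_4(*compress_2_4(row))` returns the original row whenever the input obeys the 2:4 constraint, which makes a convenient sanity check for a compressor.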
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  4. 13 Sep, 2025 1 commit
  5. 22 Aug, 2025 1 commit
    • [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) · 5c11d245
      Lei Wang authored
      * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy
      
      * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
      * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
      * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.
      
      * lint fix
      
      * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
      
      * remove bulk copy
      
      * Refactor copy and atomic add operations to support TMA lower configuration
      
      - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
      - Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
      - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
      - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
      - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
      
      * Enhance TMA bulk copy logic in `LowerBulkCopy` method
      
      - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
      - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
      
      * lint fix
      
      * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
      
      * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions
      
      - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
      - Added TODO comments to indicate the need for further improvements in shared memory handling.
      
      * Update `native_sparse_attention` function to include TMA configuration options
      
      - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
      - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
      
      * Refactor JIT decorator formatting in `native_sparse_attention` function
      
      - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
      - No functional changes were made; this update focuses on code clarity and maintainability.
      
      * Enhance thread management and logging in TileLang compilation
      
      - Added a method to check if printing is enabled during compilation, improving control over logging behavior.
      - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
      - Added comments to clarify the purpose of changes and improve code readability.
      
      * Add warp specialization scope and refactor register management in TileLang
      
      - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
      - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
      - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
      - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
      
      * Refactor test for InjectSetMaxNReg pass in TileLang
      
      - Improved readability by restructuring conditional checks and assertions in the test cases.
      - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
      - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
      
      * Enhance bulk copy and store checks in `Copy` class
      
      - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
      - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
      - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
      
      * lint fix
  6. 03 Jul, 2025 1 commit
    • [Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526) · be44758c
      botbw authored
      
      
      * [experimental] add a draft gemm_sp
      
      * [3rdparty] bump cutlass to v3.9.3
      
      * [lint] run format.sh
      
      * [chore] rebase
      
      * [chore] use abs path
      
      * [gemm_sp] add metadata layout
      
      * [ci] add more example
      
      * [lint] run format.sh
      
      * [chore] polish
      
      * [chore] move gemm_sp to experimental
      
      * [chore] polish
      
      * [lint] run format.sh
      
      * [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test
      
      * Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
      * Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
      * Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.
      
      * Implement Test
      
      * [Enhancement] Update GEMM SP and SM89 templates for improved functionality
      
      * Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
      * Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
      * Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
      * Added conditional inclusion for GEMM SP header based on CUDA architecture version.
      
      * lint fix
      
      * [gemm_sp] support more layout and data types
      
      * Enhancement: sync T.gemm_sp's layout inference with T.gemm
      
      * Enhancement: support more block_k in compress util
      
      * [Enhancement] enable block_k=64
      
      * [Lint] run format.sh
      
      * [Enhancement] compressor support more dtype
      
      * Enhancement: enable block_K=32
      
      * [Lint] format.sh
      
      * [Fixbug] fix shape
      
      * Refactor: sync gemm
      
      * [Enhancement] enable transpose
      
      * [Enhancement] enable fp8_e4m3
      
      * [Enhancement] enable int8
      
      * [Lint] run format.sh
      
      * [Benchmark] add gemm_sp benchmark
      
      * [Example] fix 256 threads hang
      
      * [CI] fix ci
      
      * [Chore] resolve gemini feedback
      
      * [Benchmark] increase search space
      
      * [Lint] format
      
      * [CI] skip sparse tensor core related tests as only sm90 is supported
      
      * [CI] pass local run
      
      * Update gemm_sm89.h
      
      * lint fix
      
      * lint fix
      
      * [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags
      
      - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
      - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
      - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
      - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
      - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.
      
      * Update test_compress_utils.py
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  7. 28 May, 2025 1 commit
    • [Autotune] Introduce cache mechanism for auto tuner (#527) · 7171aff6
      Lei Wang authored
      * [Enhancement] Add commit ID to versioning and improve logging initialization
      
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
      
      * [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching
      
      * Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
      * Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
      * Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
      * Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
      * Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
      * Minor adjustments in related modules to support the new caching and profiling features.
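
The caching items above (cache keys plus saving and loading tuning results on disk) can be pictured with a minimal sketch. It is not the AutoTuner's actual code: the function names, the JSON file format, and the choice of SHA-256 over a sorted JSON payload are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict, Optional


def make_cache_key(kernel_source: str,
                   compile_args: Dict[str, Any],
                   profile_args: Dict[str, Any]) -> str:
    """Derive a stable key from everything that can change the tuning result."""
    payload = json.dumps(
        {"source": kernel_source, "compile": compile_args, "profile": profile_args},
        sort_keys=True,  # key must not depend on dict insertion order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_result(cache_dir: Path, key: str,
                latency_ms: float, config: Dict[str, Any]) -> None:
    """Persist the best latency and configuration under the cache key."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {"latency_ms": latency_ms, "config": config}
    (cache_dir / f"{key}.json").write_text(json.dumps(record))


def load_result(cache_dir: Path, key: str) -> Optional[Dict[str, Any]]:
    """Return the cached record, or None when this key was never tuned."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

With this shape, a tuner would call `load_result` first and only run the search (followed by `save_result`) on a miss. Hashing the kernel source together with the compile and profile arguments means any change to either invalidates the entry.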
      
      * [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules
      
      * Consolidated import statements and removed unnecessary line breaks for better readability.
      * Standardized function argument formatting across the AutoTuner and CompileArgs classes.
      * Enhanced consistency in the use of whitespace and indentation throughout the codebase.
      * Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
      * Ensured that all changes adhere to the project's coding style guidelines.
      
      * [Refactor] Remove redundant type hints in AutoTuner modules
      
      * Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
      * Improved code readability and maintainability by streamlining type imports across the AutoTuner module.
      
      * [Refactor] Update AutoTuner configuration for improved profiling and target detection
      
      * Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
      * Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
      * Removed redundant target specifications in certain instances to streamline the configuration process.
      * Improved overall clarity and maintainability of the autotuning logic in various example scripts.
      
      * [Refactor] Simplify code formatting and improve readability in example scripts
      
      * Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
      * Removed unnecessary line breaks and standardized argument placement across multiple files.
      * Enhanced overall code readability and maintainability in autotuning examples and performance scripts.
      
      * [Refactor] Update JIT decorator usage across multiple files
      
      * Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
      * Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
      * Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.
      
      * [Refactor] Standardize JIT decorator formatting across benchmark and example scripts
      
      * Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
      * Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
      * Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
  8. 31 Mar, 2025 1 commit
    • [Bugfix] Updated autotune usage in the examples to align with the latest changes (#309) · 66c7f6a1
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
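
The KernelCache change described above layers an in-memory cache in front of the disk cache. The sketch below shows that lookup order in a self-contained form. `TwoLevelCache` and `get_or_build` are hypothetical names, pickle files stand in for whatever the real cache stores, and the sketch uses a plain `__init__` rather than the `__new__`-based setup the commit mentions.

```python
import pickle
from pathlib import Path
from typing import Any, Callable, Dict


class TwoLevelCache:
    """Memory-first cache backed by files on disk (illustrative only)."""

    def __init__(self, cache_dir: str) -> None:
        self._memory: Dict[str, Any] = {}
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def get_or_build(self, key: str, build: Callable[[], Any]) -> Any:
        # 1. Fast path: already loaded in this process.
        if key in self._memory:
            return self._memory[key]
        # 2. Slower path: reuse a result persisted by an earlier run.
        path = self._dir / f"{key}.pkl"
        if path.exists():
            value = pickle.loads(path.read_bytes())
        else:
            # 3. Miss on both levels: build once and persist.
            value = build()
            path.write_bytes(pickle.dumps(value))
        self._memory[key] = value
        return value

    def clear_cache(self) -> None:
        """Drop both the in-memory entries and the files on disk."""
        self._memory.clear()
        for path in self._dir.glob("*.pkl"):
            path.unlink()
```

Checking memory before disk is what saves the repeated file reads. Clearing both levels together keeps the two from disagreeing after an invalidation.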
      
      * lint fix
      
      * typofix
      
      * [Refactor] Update matmul and flashattn function calls to return structured results
      
      - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
      - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
      
      * lint fix
  9. 26 Mar, 2025 1 commit
    • [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
  10. 23 Mar, 2025 1 commit
    • Refactor matrix multiplication benchmark and autotuner logging (#263) · 8c94de32
      Lei Wang authored
      - Updated `ref_program` in `benchmark_matmul.py` to remove the unused parameter `C`, simplifying the function signature.
      - Changed logging level in `autotuner/__init__.py` from `INFO` to `DEBUG` for more detailed logging during autotuning.
      - Modified the error handling in the autotuner to provide clearer messages and log errors at the debug level.
      - Enhanced error reporting in the JIT adapter by adding detailed context to error messages in `cython_wrapper.pyx` when kernel calls fail.
  11. 22 Mar, 2025 1 commit
    • [Bugfix] Fix Benchmark/Example Code for Autotuning (#254) · 0430cfe7
      Chaofan Lin authored
      
      
      * fix tune args
      
      * lint
      
      * Refactor gemm example and autotuner logging
      
      - Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter.
      - Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity.
      - Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility.
      - Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds.
      
      * Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, simplifying the implementation by removing the unused parameter `C`.
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  12. 06 Mar, 2025 2 commits
    • [Carver] Multi-Threads Compilation for Fast Auto Tuning (#156) · 18be9e07
      Chaofan Lin authored
      * [Carver] Multi-Threads Compilation for Fast Auto Tuning
      
      * Add progress bar for compilation
      
      * lint
    • [Carver] Enhance Carver Adaptation for MatMul Benchmarking (#153) · 3c53297b
      Lei Wang authored
      * [Refactor] Consolidate GemmWarpPolicy Enum and Add Utility Method
      
      - Move GemmWarpPolicy from copy.py and gemm.py to primitives/gemm/base.py
      - Implement from_warp_partition class method to determine warp policy
      - Add docstring with examples for policy determination
      - Remove duplicate GemmWarpPolicy class definitions
      
      * [Enhancement] Add TensorCore Intrinsic Matrix Multiplication Benchmarks
      
      - Implement two new matrix multiplication benchmark scripts:
        1. `benchmark_matmul_intrinsic.py`: Uses TensorCore intrinsics with advanced configuration
        2. `benchmark_matmul.py`: Provides a more generic matrix multiplication benchmark
      
      - Add support for roller-based configuration generation in both benchmarks
      - Enhance MMA macro generator to handle 2D and 4D output buffer layouts
      - Implement flexible autotuning configurations with multiple parameters
      - Support different data types and accumulation modes
      - Add command-line arguments for matrix dimensions and roller configuration
      
      * lint fix
      
      * Fix roller hints generation in get_roller_hints_from_func
      
      - Simplify roller hints generation logic
      - Ensure policy-based configuration is always emitted when a policy is available
      - Remove redundant None check for roller hints
      
      * Add shared memory for matrix multiplication in benchmark and quickstart examples
      
      - Modify benchmark_matmul.py and quickstart.py to include C_shared allocation
      - Change accumulation dtype from float16 to float in benchmark_matmul.py
      - Update matrix multiplication kernels to use shared memory for result storage
      - Enable CUDA kernel source printing in quickstart example
  13. 05 Mar, 2025 2 commits
  14. 11 Jan, 2025 1 commit
    • [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
      
      * annotate lower transform global func with `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resolved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build script
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: Microsoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: Yu Cheng <yu.cheng@pku.edu.cn>