1. 31 Mar, 2025 3 commits
    • Lei Wang's avatar
      [Bugfix] Updated autotune usage in the examples to align with the latest changes (#309) · 66c7f6a1
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
      
      * lint fix
      
      * typofix
      
      * [Refactor] Update matmul and flashattn function calls to return structured results
      
      - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
      - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
      
      * lint fix
      66c7f6a1
    • Lei Wang's avatar
      [Cache] Implement in-memory cache (#308) · 5802c01b
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
      
      * lint fix
      5802c01b
    • Wenhao Xie's avatar
  2. 30 Mar, 2025 4 commits
    • Lei Wang's avatar
      [Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      edbb9b6d
    • Leslin's avatar
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError (#305) · 6e294de9
      Leslin authored
      
      
      * Update elementwise_add.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Update rms_norm.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Remove adapter argument from do_bench call
      
      * Remove adapter argument from do_bench call
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      6e294de9
    • Haodong Tian's avatar
      [Bugfix] Resolve autotuner bugs for blocksparse GEMM example (#300) · 92e8d5f4
      Haodong Tian authored
      * [Bugfix] Configure autotuner specific logger for correct level handling
      - Previously, logging relied on basicConfig, which configured the root logger. This caused the named autotuner logger to ignore DEBUG messages.
      - This commit sets up a dedicated logger for autotuner, correctly route DEBUG messages to 'autotuner.log' and INFO+ messages to the console.
      
      * [Bugfix] Fix tensor_supply for boolean type
      - Previously `get_tensor_supply` used `torch.randint(-2, 3)` as a fallback, which caused error when the dtype was `torch.bool`.
      - This commits adds an `is_boolean` check in `KernelParam` and updates `get_tensor_supply` to specifically use `torch.randint(0, 2)` for boolean dtypes.
      
      * [Bugfix] Always regenerate JIT inputs during tuning
      - Removes the caching for `self.jit_input_tensors` within `AutoTuner`. When different autotuning configurations can alter the required input tensor shapes or other properties, reusing cached inputs from a previous configuration lead to errors or incorrect assessments.
      - This change ensures that `profiler._get_inputs()` is called unconditionally for each configuration evaluation. Since `_get_inputs` is assumed to be relatively inexpensive, the potential overhead is considered acceptable.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Feature] Enable custom tensor supply and input caching control in Autotuner
      - Previously, tensor generation was tied to `supply_type` and input caching behavior across configurations was less explicit/controlled.
      - This commit introduces a `supply_prog` parameter to allow providing a custom function for generating input tensors, overriding the default mechanism.
      - Adds a `cache_input_tensors` flag (default True) to control input tensor caching:
          - If True, tensors are generated once per configuration and reused for repetitions, with a check for potential shape mismatches between configurations.
          - If False, tensors are regenerated for every configuration trial.
      - Refactors internal input tensor handling using supplier functions for clarity.
      - Adds a `check_tensor_list_compatibility` utility for shape comparison.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Example] Small fix in example_blocksparse_gemm
      
      * [Fix] Raise error if autotuning yields no valid configuration
      92e8d5f4
    • yyttt6's avatar
      [Example] Add autotune to conv example (#301) · 1873dc00
      yyttt6 authored
      
      
      * add autotune to example_gemm.py
      
      * add autotune to conv
      
      * still coding ...
      
      * version 0
      
      * version 0
      
      * version 0
      
      * refactor autotune
      
      * refactor autotune
      
      * add autotune to conv example
      
      * add conv template to carver
      
      * add conv template to carver
      
      * add conv template to carver
      
      * Update num_stages configuration values
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      1873dc00
  3. 29 Mar, 2025 1 commit
  4. 28 Mar, 2025 5 commits
    • NaOHCC's avatar
    • Lei Wang's avatar
      [Refactor] Improve documentation and add detailed docstrings across multiple modules (#298) · 3f294650
      Lei Wang authored
      * [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h
      
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
      
      * [Enhancement] Improve documentation and add detailed docstrings across multiple modules
      
      - Updated the `__init__.py` file to enhance module documentation, providing clarity on auto-tuning functionalities.
      - Added comprehensive docstrings to the `JITContext`, `AutotuneResult`, and `AutoTuner` classes, detailing their attributes and methods.
      - Enhanced memory allocation utilities in `allocate.py` with detailed descriptions for each allocation function.
      - Improved documentation for various intrinsic operations in `builtin.py`, `copy.py`, `customize.py`, `frame.py`, `gemm.py`, `memscope.py`, and `reduce.py`, ensuring clear explanations of parameters and return values.
      - Refactored the `KernelCache` class to improve clarity and maintainability, including detailed comments and docstrings for methods.
      - Overall, these changes aim to enhance code readability and provide better guidance for future developers and users of the Tile-AI framework.
      3f294650
    • Lei Wang's avatar
      [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h (#297) · 9ad9d9cd
      Lei Wang authored
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
      9ad9d9cd
    • Lei Wang's avatar
      [Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061
      Lei Wang authored
      * [Feature] Implement ParallelLoopTransformer for enhanced loop analysis
      
      - Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
      - Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
      - Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
      - Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.
      
      * test fix
      
      * Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.
      
      * Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.
      5c8de061
    • botbw's avatar
      [doc/example] add gemv doc and examples (#293) · ff3cfa59
      botbw authored
      * [doc/example] init gemv doc and examples
      
      * [example] add vectorized read
      
      * [example] use local register instead of smem
      
      * [example] add bench
      
      * [doc] update doc
      
      * [doc] refine doc
      
      * [lint] format code
      
      * [doc] add tips
      
      * [doc/example] fix typo
      
      * [example] use tmv_all_reduce
      
      * [doc] update doc accordingly
      
      * [doc] add benchmark table
      
      * [lint] format code
      ff3cfa59
  5. 27 Mar, 2025 5 commits
    • penguin_wwy's avatar
      [Dev] Correcting cxx compiler (#294) · 304b4465
      penguin_wwy authored
      304b4465
    • Lei Wang's avatar
      Remove citation page (#292) · 5079e2a5
      Lei Wang authored
      5079e2a5
    • Wenhao Xie's avatar
      [Doc] Python API docs generation (#278) · 5501b31c
      Wenhao Xie authored
      * fix bug
      
      * update performance.py
      
      * update python api docs
      
      * test workflow
      
      * fix dependency
      
      * fix bug
      
      * fix
      
      * update correct git config
      
      * test workflow
      
      * clear cache
      
      * lint fix
      
      * fix exclude path
      5501b31c
    • Lei Wang's avatar
      [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 (#291) · 83412458
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      
      * [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py
      
      - Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
      - Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
      - Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
      
      * [Refactor] Update imports in __init__.py for tir compatibility
      
      - Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
      - Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.
      
      * lint fix
      
      * [Refactor] Update imports in tir.ir.py for improved compatibility
      
      - Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
      - Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.
      
      * [Refactor] Update function calls in tir.ir.py to return values
      
      - Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.
      
      * bugfix
      
      * [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper
      
      - Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.
      
      * bugfix
      
      * [Update] Sync subproject commit and modify CUDA atomic add functions
      
      - Updated the subproject commit for TVM to edd35139a0481e9359aa269e3e50450b95ba2f5a.
      - Commented out the CUDA capability check in the example convolution script to prevent execution errors.
      - Refactored atomic add functions for BFLOAT16 in common.h to include a conditional compilation directive for improved compatibility with CUDA architectures.
      83412458
    • Lei Wang's avatar
      [Language] Proxy tvm ir to make linter happy (#287) · be0bf36d
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      
      * [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py
      
      - Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
      - Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
      - Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
      
      * [Refactor] Update imports in __init__.py for tir compatibility
      
      - Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
      - Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.
      
      * lint fix
      
      * [Refactor] Update imports in tir.ir.py for improved compatibility
      
      - Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
      - Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.
      
      * [Refactor] Update function calls in tir.ir.py to return values
      
      - Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.
      
      * bugfix
      
      * [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper
      
      - Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.
      
      * bugfix
      
      * Uncomment main function call
      be0bf36d
  6. 26 Mar, 2025 4 commits
    • Yu Cheng's avatar
      [Feature] Introduce NoSetMaxNReg for warp specialization (#289) · 76435ca8
      Yu Cheng authored
      - Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches.
      - Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management.
      - Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.
      76435ca8
    • Yu Cheng's avatar
    • Lei Wang's avatar
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
    • yyttt6's avatar
      add autotune to example_gemm.py (#285) · 73d2c62e
      yyttt6 authored
      73d2c62e
  7. 25 Mar, 2025 5 commits
    • Lei Wang's avatar
      [Refactor] Update cache key generation in KernelCache (#283) · 7bd59f21
      Lei Wang authored
      - Changed the cache key generation to use the serialized script of the function instead of the function object itself, improving the uniqueness of cache keys.
      7bd59f21
    • yyttt6's avatar
      [Refactor] Enhance Autotune (#266) · 541e1685
      yyttt6 authored
      * add autotune to example_gemm.py
      
      * format init.py
      541e1685
    • Lei Wang's avatar
      [Language] Introduce `T.ptr` and `T.Tensor` (#276) · 8ad53855
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      8ad53855
    • Wenhao Xie's avatar
      [CI] Add gemm performance test (#274) · 18f29277
      Wenhao Xie authored
      * [Typo] Fix formatting in installation instructions in README.md
      
      * [Enhancement] Improve CUDA path detection and update configuration handling
      
      * fix typo
      
      * remove IS_WINDOWS constant
      
      * lint fix
      
      * Improve error messages for CUDA detection failure
      
      * lint fix
      
      * lint fix
      
      * Fix .gitignore to correctly include venv directory
      
      * [Doc] Add instructions for installing nightly version of TileLang
      
      * update installation instructions
      
      * update install instruction
      
      * update performance ci
      
      * update
      
      * update
      
      * update
      
      * update ci workflow
      
      * delete test.yml
      
      * lint fix
      
      * update bot.yml
      
      * update bot.yml
      
      * remove changes in ci.yml
      18f29277
    • Xiaochuan Ye's avatar
      [Bugfix]Add CUDA availability check in CtypesKernelAdapter (#267) · 29b7d374
      Xiaochuan Ye authored
      * fix: Add CUDA availability check in CtypesKernelAdapter
      
      * fix: Add CUDA availability check in CythonKernelWrapper
      29b7d374
  8. 24 Mar, 2025 3 commits
    • Lei Wang's avatar
      [Refactor] Improve flash attention example and layout comparison logic (#270) · 5f5bf53c
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      5f5bf53c
    • Yu Cheng's avatar
      [Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter (#269) · 2abd6ab7
      Yu Cheng authored
      - Introduced TMAFinder and ProducerUsedBufferFinder classes to analyze TMA loads and identify buffers used in producer conditions.
      - Enhanced WarpSpecializedRoleMarker to prepare and utilize the identified buffers during role marking.
      - Updated VisitStmt methods to incorporate new analysis logic for IfThenElse and For nodes, improving the handling of TMA loads in the warp specialization process.
      2abd6ab7
    • Lei Wang's avatar
      [Bugfix] Support `T.clear` for let binding (#268) · 47caf219
      Lei Wang authored
      * Fix indentation in JIT adapter wrapper to ensure consistent formatting of return statement in generated C code.
      
      * Enhance Fill Operation in TileLang
      
      - Updated the Fill constructor to support BufferLoad instances, adding checks for ramp indices and ensuring only stride 1 ramps are processed.
      - Introduced a region array to manage the bounds of the fill operation, improving error checking for static regions.
      - Modified the MakeSIMTLoop method to utilize the new region array for loop variable bounds, enhancing flexibility in kernel generation.
      - Updated the fill and clear functions in fill.py to accept both tir.Buffer and tir.BufferRegion types, improving usability and type handling.
      
      * Refactor Fill Operation and Improve Readability
      
      - Simplified the Fill constructor by enhancing the handling of BufferLoad instances and ensuring proper checks for ramp indices.
      - Improved error messages for region size checks to enhance clarity.
      - Cleaned up formatting in the Fill method for better readability.
      - Added a blank line in the matmul function test to improve code organization.
      - Introduced a blank line in the fill function to enhance readability in fill.py.
      
      * Add matrix multiplication functionality and test in TileLang
      
      - Introduced a new test file `test_tilelang_language_clear.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `__init__.py` in the utils module to include `map_torch_type`, enhancing type handling for tensor operations.
      
      * lint fix
      47caf219
  9. 23 Mar, 2025 3 commits
    • Lei Wang's avatar
      [Release] Bump version to 0.1.3 (#264) · 9981ac59
      Lei Wang authored
      * Bump version to 0.1.3
      
      * Refactor Docker script to streamline installation commands
      
      - Removed the installation of the Python environment and CMake from the Docker run command, simplifying the execution process.
      - Updated the command to focus on pip installation and running tox for testing across multiple Python versions.
      9981ac59
    • Lei Wang's avatar
      Refactor matrix multiplication benchmark and autotuner logging (#263) · 8c94de32
      Lei Wang authored
      - Updated `ref_program` in `benchmark_matmul.py` to remove the unused parameter `C`, simplifying the function signature.
      - Changed logging level in `autotuner/__init__.py` from `INFO` to `DEBUG` for more detailed logging during autotuning.
      - Modified the error handling in the autotuner to provide clearer messages and log errors at the debug level.
      - Enhanced error reporting in the JIT adapter by adding detailed context to error messages in `cython_wrapper.pyx` when kernel calls fail.
      8c94de32
    • Lei Wang's avatar
      [Language] Enhance alias to support blockwise memory load (#261) · 927e50d9
      Lei Wang authored
      * [Enhancement] Introduce caching control and frame management in TileLang
      
      - Added cache control functions (`enable_cache`, `disable_cache`, `is_cache_enabled`) in `env.py` to manage kernel caching behavior.
      - Updated `kernel_cache.py` to utilize the cache state, preventing unnecessary kernel compilation when caching is disabled.
      - Introduced a new `frame.py` module to manage LetFrame instances, including a stack for variable-value mapping and enhanced frame management.
      - Updated imports in various modules to accommodate new caching and frame functionalities, improving overall organization and clarity.
      
      * [Refactor] Clean up and enhance caching and frame management in TileLang
      
      - Added spacing for improved readability in `env.py` and `frame.py`.
      - Refactored `LetFrame` class to enhance clarity in buffer region assignment.
      - Ensured consistent formatting and organization across caching control and frame management functions.
      
      * [Feature] Add matrix multiplication functionality in TileLang
      
      - Introduced a new test file `test_tilelang_language_alias.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul` function defines a kernel for performing tile-level GEMM operations, with support for customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated `gemm.py` to allow `tir.Buffer` or `tir.Var` as valid argument types for the `gemm` function, enhancing flexibility in argument handling.
      
      * [Refactor] Improve formatting and readability in test_tilelang_language_alias.py
      
      - Adjusted spacing and alignment in the `matmul` and `run_matmul` functions for better readability.
      - Cleaned up unnecessary blank lines and ensured consistent formatting throughout the file.
      - Enhanced overall code clarity without altering functionality.
      927e50d9
  10. 22 Mar, 2025 5 commits
    • Chaofan Lin's avatar
      [Bugfix] Fix Benchmark/Example Code for Autotuning (#254) · 0430cfe7
      Chaofan Lin authored
      
      
      * fix tune args
      
      * lint
      
      * Refactor gemm example and autotuner logging
      
      - Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter.
      - Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity.
      - Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility.
      - Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds.
      
      * Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, simplifying the implementation by removing the unused parameter `C`.
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      0430cfe7
    • Yichen Yan's avatar
      [CI] Use auditwheel to generate manylinux wheels (#251) · 60923344
      Yichen Yan authored
      
      
      * use auditwheel to get correct manylinux wheels
      
      * fix
      
      * make py3.8 happy
      
      * trivial updates
      
      * Add typing.Tuple import and update annotations
      
      * fmt
      
      * Remove unused import and update type hints
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      60923344
    • You Jiacheng's avatar
      [Refactor] Move compilation outside critical section (#260) · 001e7b2a
      You Jiacheng authored
      
      
      * move compilation outside critical section
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      001e7b2a
    • Lei Wang's avatar
      [Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5
      Lei Wang authored
      * Add GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
      - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
      - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
      - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
      
      * Refactor TileLang examples and enhance kernel compilation
      
      - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
      - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
      - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
      - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
      - Cleaned up import statements in `__init__.py` for better organization and clarity.
      
      * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
      - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
      
      * Refactor CUDA post-processing callback registration in TileLang
      
      - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
      - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
      - Added debug prints to the CUDA code generation process for better traceability during development.
      - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
      - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.
      
      * lint fix
      
      * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters
      
      - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
      - Implemented a function to generate various kernel configurations for autotuning.
      - Refactored the main execution block to support both autotuned and default configurations.
      - Improved the block mask generation to accommodate specified sparsity levels.
      - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
      f47b43c5
    • Lei Wang's avatar
      [Example] Implement Kernel Example cumsum (#258) · cd9ec62e
      Lei Wang authored
      * Add GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
      - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
      - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
      - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
      
      * Refactor TileLang examples and enhance kernel compilation
      
      - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
      - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
      - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
      - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
      - Cleaned up import statements in `__init__.py` for better organization and clarity.
      
      * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
      - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
      cd9ec62e
  11. 21 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Language] Introduce `T.alloc_var` to define a variable like `int var;` (#255) · c770a58f
      Lei Wang authored
      * [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT
      
      - Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters.
      - Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing.
      - Updated test cases to validate the new matrix multiplication implementations.
      - Enhanced error handling in library initialization and compilation processes across various modules.
      - Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting.
      
      * lint fix
      
      * optimize
      
      * Support var defiine
      
      * lint fix
      
      * Update TVM submodule and add alloc_variable function to allocate local variables in TileLang
      
      - Updated the TVM submodule to the latest commit.
      - Introduced `alloc_variable` function in `allocate.py` to support local variable allocation with specified data types and scopes.
      
      * lint fix
      
      * Refactor variable allocation functions for consistency
      
      - Renamed `alloc_variable` to `alloc_var` across multiple files for improved consistency.
      - Updated corresponding test functions to reflect the new naming convention.
      - Adjusted imports in `__init__.py` to align with the changes.
      c770a58f
    • yyttt6's avatar
      add autotune to example_gemm.py (#252) · 316d3b97
      yyttt6 authored
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      316d3b97