1. 05 Apr, 2025 1 commit
• [Example] Add sparse gqa decode example (#332) · 8fdfdf03
      Yuqing Xia authored
      
      
      * add example gqa decode wgmma pipelined
      
      * add sparse gqa
      
      * support num split
      
      * support num split
      
      * add if condition
      
      * add heuristic num split
      
      * clean code
      
      * add ref
      
      * fix bug
      
      * add torch ref
      
      * fix bug
      
      * integrate to torch
      
      * symbolic
      
      * clean mask
      
      * rm actual_num_blocks
      
      * clean code
      
      * get num_sm via torch
      
      * add sparse gqa decode example
      
      * format
      
      * rm example_gqa_decode_wgmma_pipelined.py
      
      * Add license headers to example scripts
      
      * format
      
      * Remove commented-out cache disabling lines
      
      ---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      8fdfdf03
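For context, a GQA decode step like the one this example targets can be checked against a plain (dense, non-sparse) PyTorch reference, in the spirit of the "add torch ref" commit above. The sketch below is illustrative only; tensor names and head layout are assumptions, not the example's actual code.

```python
import torch

def gqa_decode_ref(q, k_cache, v_cache):
    """Reference GQA decode: one query token attends over the full KV cache.

    q:       [batch, heads_q, dim]        (single decode step)
    k_cache: [batch, seqlen, heads_kv, dim]
    v_cache: [batch, seqlen, heads_kv, dim]
    heads_q must be a multiple of heads_kv (grouped-query attention).
    """
    b, hq, d = q.shape
    _, s, hkv, _ = k_cache.shape
    group = hq // hkv
    # Expand KV heads so each query head has a matching KV head.
    k = k_cache.repeat_interleave(group, dim=2)   # [b, s, hq, d]
    v = v_cache.repeat_interleave(group, dim=2)
    scores = torch.einsum("bhd,bshd->bhs", q, k) / d**0.5
    probs = torch.softmax(scores.float(), dim=-1).to(v.dtype)
    return torch.einsum("bhs,bshd->bhd", probs, v)

# Hypothetical usage:
# q = torch.randn(1, 32, 128, device="cuda", dtype=torch.float16)
# k = torch.randn(1, 4096, 8, 128, device="cuda", dtype=torch.float16)
# v = torch.randn_like(k)
# out = gqa_decode_ref(q, k, v)
```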
  2. 04 Apr, 2025 6 commits
• [AMD] Fix missing Composable Kernel include path when compiling kernels on AMD GPUs (#334) · eb757608
      Lei Wang authored
      * [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options
      
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
      
      * [Enhancement] Add Composable Kernel Path Handling in Environment Setup
      
      - Introduced support for the Composable Kernel by adding a new environment variable `TL_COMPOSABLE_KERNEL_PATH`.
      - Updated the environment setup to check for the existence of the Composable Kernel and log a warning if not found.
      - Modified the `LibraryGenerator` to include the Composable Kernel include directory during compilation for HIP targets.
      
      These changes improve the integration of the Composable Kernel into the TileLang environment, enhancing flexibility for users.
      eb757608
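The environment-variable handling described above can be pictured roughly as follows. Only the variable name `TL_COMPOSABLE_KERNEL_PATH` comes from the commit; the lookup and flag construction here are a sketch, not TileLang's actual implementation.

```python
import os
import logging

logger = logging.getLogger(__name__)

def composable_kernel_include_flags():
    """Return extra -I flags for HIP compilation if Composable Kernel is available."""
    ck_path = os.environ.get("TL_COMPOSABLE_KERNEL_PATH", "")
    if not ck_path or not os.path.isdir(ck_path):
        # Mirrors the warning-on-missing behavior described in the commit.
        logger.warning(
            "Composable Kernel not found; set TL_COMPOSABLE_KERNEL_PATH to its include directory.")
        return []
    return [f"-I{ck_path}"]
```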
• [Refactor] Optimize RMS normalization kernel in rms_norm.py (#333) · 85e411c8
      Yu Cheng authored
      - Introduced a new local fragment for squared values to improve performance.
      - Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency.
      - Refactored the final multiplication step to operate on the local fragment instead of shared memory.
      - Added a configuration option to the kernel compilation for better control over TMA lowering.
      
      These changes enhance the efficiency and clarity of the RMS normalization implementation.
      85e411c8
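As a point of reference for the kernel being optimized, RMS normalization itself is simple to express in PyTorch. This is a reference formula, not the TileLang kernel.

```python
import torch

def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm without a learned weight: x / sqrt(mean(x^2) + eps), row-wise."""
    # Accumulate squared values in float32 for numerical stability,
    # loosely mirroring the "local fragment for squared values" idea in the commit.
    sq = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(sq + eps)).to(x.dtype)
```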
• [Enhancement] Add new matrix multiplication functions and tests for GEMM with... · 9e5a757e
      Lei Wang authored
      [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options (#331)
      
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
      9e5a757e
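A PyTorch reference for GEMM with optional transposition is useful when checking implementations like `matmul_rs`; the argument names below are illustrative.

```python
import torch

def gemm_ref(a: torch.Tensor, b: torch.Tensor,
             transpose_a: bool = False, transpose_b: bool = False) -> torch.Tensor:
    """C = op(A) @ op(B), where op() optionally transposes its 2-D argument."""
    if transpose_a:
        a = a.T
    if transpose_b:
        b = b.T
    return a @ b

# gemm_ref(A, B, transpose_b=True) checks the A @ B^T configuration.
```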
• [Enhancement] Improve flashattn function in example_gqa_decode.py (#329) · 32060ecd
      Lei Wang authored
      - Added a manual seed for reproducibility in PyTorch.
      - Refactored local variable allocations for better memory management.
      - Enhanced parallel processing in the flashattn function to improve performance.
      - Updated layout annotations for clarity and efficiency.
      
      These changes optimize the flash attention mechanism and ensure consistent behavior across runs.
      32060ecd
• [Dynamic Symbolic] Adaptively vectorize with different condition expressions (#326) · 5ee58ec7
      Zhengju Tang authored
      
      
      * [Dynamic Symbolic] Adaptively vectorize with different condition expressions
      
      * Format
      
      * Format
      
      * Format
      
      * Format
      
      * Add MIT License headers to Python files
      
      * Simplify return statement in loop vectorization
      
      ---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      5ee58ec7
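Conceptually, adaptive vectorization under dynamic (symbolic) extents means picking the widest vector width that the runtime condition still permits, e.g. one that evenly divides the loop extent. The actual work happens in a TIR pass; the snippet below is purely conceptual and not the pass implementation.

```python
def choose_vector_width(extent: int, max_width: int = 4) -> int:
    """Largest power-of-two width (<= max_width) that evenly divides the loop extent."""
    width = 1
    while width * 2 <= max_width and extent % (width * 2) == 0:
        width *= 2
    return width

# choose_vector_width(128) -> 4, choose_vector_width(6) -> 2, choose_vector_width(7) -> 1
```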
• [AMD] Adapt ROCm and support `T.gemm` with transpose_b=False for the AMD backend (#327) · eab47249
      Lei Wang authored
      
      
      * [Enhancement] Update GEMM and ROCm Integration
      
      - Removed the restriction on transposing matrix B for CDNA in `gemm.cc`, allowing for more flexible matrix operations.
      - Added a new debug header file `debug.h` for enhanced debugging capabilities in ROCm kernels.
      - Updated `codegen_hip.cc` to include the new debug header and improved handling of float16 and bfloat16 types in vector element stores.
      - Refactored `rt_mod_hip.cc` to return a ROCM module directly from `BuildTileLangHIPWithoutCompile`, enhancing the module creation process.
      - Introduced a new ROCm utility in `rocm.py` for linking and managing ROCm paths, improving the build process for ROCm applications.
      - Updated tests to reflect changes in GEMM configurations and ensure compatibility with the new features.
      
      These changes enhance the flexibility and debugging capabilities of the GEMM operations and improve the integration with the ROCm backend.
      
      * [Fix] Corrected syntax error in pyproject.toml and improved error message formatting in rocm.py
      
      - Added missing quotation mark for "HSA" in the `select` section of `pyproject.toml`.
      - Simplified the error message formatting in `get_rocm_arch` function of `rocm.py` for better readability and consistency.
      
      * lint fix
      
      * Update tilelang/jit/adapter/wrapper.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * lint fix
      
      ---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      eab47249
  3. 03 Apr, 2025 5 commits
• [Bugfix] add a patch to fix T.abs on float16 (#325) · 2cec52aa
      botbw authored
      * [bug] fix T.abs on float16
      
      * [lint] lint
      2cec52aa
• [Feat] Enhance CUDA Property Handling (#322) · c0378aa9
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      ---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
      c0378aa9
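The FP8 benchmark's autotuning step sweeps a grid of tile configurations; config generation for such a sweep typically looks like the sketch below. Parameter names and value ranges are assumptions, not the benchmark's actual search space.

```python
import itertools

def gen_configs():
    """Cartesian product of candidate tile sizes and pipeline depths for autotuning."""
    block_M = [64, 128, 256]
    block_N = [64, 128, 256]
    block_K = [64, 128]
    num_stages = [2, 3]
    threads = [128, 256]
    return [
        dict(block_M=m, block_N=n, block_K=k, num_stages=s, threads=t)
        for m, n, k, s, t in itertools.product(block_M, block_N, block_K, num_stages, threads)
    ]
```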
• [Tools] Summarize TFLOPS Information from a tilelang program (#321) · 853898a7
      yyttt6 authored
      * refactor autotune
      
      * refactor autotune
      
      * refactor autotune
      
      * refactor autotune
      
      * format init.py
      
      * add tutorial for autotune
      
      * merge
      
      * merge
      
      * format analyzer
      
      * add readme for analyzer
      
      * format
      
      * [Tools] Summarize TFLOPS Information from a tilelang program
      
      * Summarize TFLOPS Information from a tilelang program
      853898a7
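The TFLOPS summary such a tool reports boils down to flop counting divided by measured latency; for a GEMM the arithmetic is as in this simplified sketch.

```python
def gemm_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
    """TFLOPS for a dense GEMM: 2*M*N*K flops over the measured latency."""
    flops = 2.0 * M * N * K
    return flops / (latency_ms * 1e-3) / 1e12

# Example: a 4096 x 4096 x 4096 GEMM finishing in 1.5 ms -> ~91.6 TFLOPS
# print(gemm_tflops(4096, 4096, 4096, 1.5))
```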
• [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support (#320) · 4b705eb2
      Yu Cheng authored
      * [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support
      
      * Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation.
      * Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation.
      * Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h.
      * Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers.
      * Updated tilelang.language module to include the new `reduce_absmax` function.
      
      These changes enhance FP8 quantization capabilities and extend reduction operation support.
      
      * [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization
      
      * Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations.
      * Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy.
      * Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking.
      
      These changes enhance the flexibility and efficiency of the FP8 quantization process.
      
      * lint
      
      * [Dev] Update per_token_cast_fp8.py
      4b705eb2
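A per-token FP8 cast of this kind can be expressed directly in PyTorch as a reference. This uses `torch.float8_e4m3fn` (largest finite value 448); the function name and return layout are assumptions, not the example's exact interface.

```python
import torch

def per_token_cast_to_fp8_ref(x: torch.Tensor):
    """Quantize each row (token) of x to float8_e4m3fn with a per-token scale.

    Returns the FP8 tensor and the per-token scales needed to dequantize.
    """
    fp8_max = 448.0  # largest finite value of e4m3fn
    absmax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scale = absmax / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale.squeeze(-1)
```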
  4. 02 Apr, 2025 1 commit
• [CostModel] Introduce CUDA driver API to get precise shared memory capacity (#317) · 3b660b67
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      ---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
      3b660b67
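Querying shared-memory capacity through the CUDA driver API, as this cost-model change does, can be done from Python via `ctypes`. A minimal sketch with error handling omitted; enum value 8 is `CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK`, and the exact code in the `cuda_driver` module may differ.

```python
import ctypes

def get_shared_memory_per_block(device_id: int = 0) -> int:
    """Shared memory per block in bytes, via the CUDA driver API (libcuda)."""
    cuda = ctypes.CDLL("libcuda.so.1")
    cuda.cuInit(0)
    dev = ctypes.c_int()
    cuda.cuDeviceGet(ctypes.byref(dev), device_id)
    value = ctypes.c_int()
    CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK = 8
    cuda.cuDeviceGetAttribute(
        ctypes.byref(value), CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, dev)
    return value.value
```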
  5. 01 Apr, 2025 2 commits
  6. 31 Mar, 2025 5 commits
• [Bugfix] Fix layout conflict issue for GQA decoding examples (#314) · 0fd82ed5
      Lei Wang authored
      * Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.
      
      * Refactor flashattn example to improve CUDA configuration handling
      
      - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
      - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
      - Adjusted the `do_bench` method call to streamline performance profiling.
      
      * lint fix
      0fd82ed5
• [Bugfix] Fix dynamic axis with variable extent (#311) · c30904ea
      Lei Wang authored
      * [Enhancement] Improve error message for RampNode in CUDA codegen
      
      - Updated the error message in the VisitExpr_ method for RampNode to include the specific Ramp node and lane count when the lane count exceeds the limit of 4. This change enhances debugging by providing clearer context for the error.
      - Refactored the loop vectorization logic in loop_vectorize_dynamic.cc to improve readability and maintainability, ensuring that dynamic vectorization checks are performed correctly and efficiently.
      
      * lint fix
      c30904ea
• [Bugfix] Updated autotune usage in the examples to align with the latest changes (#309) · 66c7f6a1
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
      
      * lint fix
      
      * typofix
      
      * [Refactor] Update matmul and flashattn function calls to return structured results
      
      - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
      - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
      
      * lint fix
      66c7f6a1
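The "structured results" return shape described above amounts to bundling latency, the best configuration, and the reference latency into one object rather than a tuple. Conceptually it looks like the sketch below; the attribute names are hypothetical and not necessarily TileLang's.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class TuneResult:
    latency: float          # best measured latency (ms)
    config: Dict[str, Any]  # configuration that achieved it
    ref_latency: float      # latency of the reference implementation (ms)

# Callers then read result.latency / result.config / result.ref_latency
# instead of unpacking three separate return values.
```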
• [Cache] Implement in-memory cache (#308) · 5802c01b
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
      
      * lint fix
      5802c01b
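The two-level caching pattern the commit describes (memory first, disk second) is roughly the following; the class name, hashing scheme, and pickle-based persistence are illustrative, not the actual KernelCache API.

```python
import hashlib
import os
import pickle

class TwoLevelCache:
    """Check an in-process dict before falling back to the on-disk cache."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        self.memory_cache = {}
        os.makedirs(cache_dir, exist_ok=True)

    def _key(self, source: str) -> str:
        return hashlib.sha256(source.encode()).hexdigest()

    def get(self, source: str):
        key = self._key(source)
        if key in self.memory_cache:              # fast path: no disk access
            return self.memory_cache[key]
        path = os.path.join(self.cache_dir, key)
        if os.path.exists(path):                  # slow path: load from disk and promote
            with open(path, "rb") as f:
                self.memory_cache[key] = pickle.load(f)
            return self.memory_cache[key]
        return None

    def put(self, source: str, kernel) -> None:
        key = self._key(source)
        self.memory_cache[key] = kernel
        with open(os.path.join(self.cache_dir, key), "wb") as f:
            pickle.dump(kernel, f)

    def clear_cache(self) -> None:
        """Clear both the in-memory and on-disk caches."""
        self.memory_cache.clear()
        for name in os.listdir(self.cache_dir):
            os.remove(os.path.join(self.cache_dir, name))
```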
  7. 30 Mar, 2025 4 commits
• [Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      edbb9b6d
• [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError (#305) · 6e294de9
      Leslin authored
      
      
      * Update elementwise_add.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Update rms_norm.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Remove adapter argument from do_bench call
      
      * Remove adapter argument from do_bench call
      
      ---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      6e294de9
• [Bugfix] Resolve autotuner bugs for blocksparse GEMM example (#300) · 92e8d5f4
      Haodong Tian authored
      * [Bugfix] Configure autotuner specific logger for correct level handling
      - Previously, logging relied on basicConfig, which configured the root logger. This caused the named autotuner logger to ignore DEBUG messages.
- This commit sets up a dedicated logger for the autotuner, correctly routing DEBUG messages to 'autotuner.log' and INFO+ messages to the console.
      
      * [Bugfix] Fix tensor_supply for boolean type
- Previously `get_tensor_supply` used `torch.randint(-2, 3)` as a fallback, which caused an error when the dtype was `torch.bool`.
       - This commit adds an `is_boolean` check in `KernelParam` and updates `get_tensor_supply` to specifically use `torch.randint(0, 2)` for boolean dtypes.
      
      * [Bugfix] Always regenerate JIT inputs during tuning
- Removes the caching of `self.jit_input_tensors` within `AutoTuner`. Because different autotuning configurations can alter the required input tensor shapes or other properties, reusing cached inputs from a previous configuration could lead to errors or incorrect assessments.
      - This change ensures that `profiler._get_inputs()` is called unconditionally for each configuration evaluation. Since `_get_inputs` is assumed to be relatively inexpensive, the potential overhead is considered acceptable.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Feature] Enable custom tensor supply and input caching control in Autotuner
      - Previously, tensor generation was tied to `supply_type` and input caching behavior across configurations was less explicit/controlled.
      - This commit introduces a `supply_prog` parameter to allow providing a custom function for generating input tensors, overriding the default mechanism.
      - Adds a `cache_input_tensors` flag (default True) to control input tensor caching:
          - If True, tensors are generated once per configuration and reused for repetitions, with a check for potential shape mismatches between configurations.
          - If False, tensors are regenerated for every configuration trial.
      - Refactors internal input tensor handling using supplier functions for clarity.
      - Adds a `check_tensor_list_compatibility` utility for shape comparison.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Example] Small fix in example_blocksparse_gemm
      
      * [Fix] Raise error if autotuning yields no valid configuration
      92e8d5f4
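The boolean-dtype fix is easy to picture: the generic integer fallback `torch.randint(-2, 3, ...)` is invalid for `torch.bool`, so the supply function special-cases it. A simplified sketch of the behavior described above; the real `get_tensor_supply` handles more cases.

```python
import torch

def supply_tensor(shape, dtype, device="cuda"):
    """Generate an input tensor for profiling, handling boolean dtypes separately."""
    if dtype == torch.bool:
        # torch.randint(-2, 3) fails for bool; 0/1 is the valid range.
        return torch.randint(0, 2, shape, device=device, dtype=torch.bool)
    if dtype.is_floating_point:
        return torch.randn(shape, device=device, dtype=dtype)
    return torch.randint(-2, 3, shape, device=device, dtype=dtype)
```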
• [Example] Add autotune to conv example (#301) · 1873dc00
      yyttt6 authored
      
      
      * add autotune to example_gemm.py
      
      * add autotune to conv
      
      * still coding ...
      
      * version 0
      
      * version 0
      
      * version 0
      
      * refactor autotune
      
      * refactor autotune
      
      * add autotune to conv example
      
      * add conv template to carver
      
      * add conv template to carver
      
      * add conv template to carver
      
      * Update num_stages configuration values
      
      ---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      1873dc00
  8. 29 Mar, 2025 1 commit
  9. 28 Mar, 2025 5 commits
• [Refactor] Improve documentation and add detailed docstrings across multiple modules (#298) · 3f294650
      Lei Wang authored
      * [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h
      
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
      
      * [Enhancement] Improve documentation and add detailed docstrings across multiple modules
      
      - Updated the `__init__.py` file to enhance module documentation, providing clarity on auto-tuning functionalities.
      - Added comprehensive docstrings to the `JITContext`, `AutotuneResult`, and `AutoTuner` classes, detailing their attributes and methods.
      - Enhanced memory allocation utilities in `allocate.py` with detailed descriptions for each allocation function.
      - Improved documentation for various intrinsic operations in `builtin.py`, `copy.py`, `customize.py`, `frame.py`, `gemm.py`, `memscope.py`, and `reduce.py`, ensuring clear explanations of parameters and return values.
      - Refactored the `KernelCache` class to improve clarity and maintainability, including detailed comments and docstrings for methods.
      - Overall, these changes aim to enhance code readability and provide better guidance for future developers and users of the Tile-AI framework.
      3f294650
• [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h (#297) · 9ad9d9cd
      Lei Wang authored
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
      9ad9d9cd
• [Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061
      Lei Wang authored
      * [Feature] Implement ParallelLoopTransformer for enhanced loop analysis
      
      - Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
      - Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
      - Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
      - Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.
      
      * test fix
      
      * Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.
      
      * Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.
      5c8de061
• [doc/example] add gemv doc and examples (#293) · ff3cfa59
      botbw authored
      * [doc/example] init gemv doc and examples
      
      * [example] add vectorized read
      
      * [example] use local register instead of smem
      
      * [example] add bench
      
      * [doc] update doc
      
      * [doc] refine doc
      
      * [lint] format code
      
      * [doc] add tips
      
      * [doc/example] fix typo
      
      * [example] use tmv_all_reduce
      
      * [doc] update doc accordingly
      
      * [doc] add benchmark table
      
      * [lint] format code
      ff3cfa59
  10. 27 Mar, 2025 5 commits
• [Dev] Correcting cxx compiler (#294) · 304b4465
      penguin_wwy authored
      304b4465
• Remove citation page (#292) · 5079e2a5
      Lei Wang authored
      5079e2a5
• [Doc] Python API docs generation (#278) · 5501b31c
      Wenhao Xie authored
      * fix bug
      
      * update performance.py
      
      * update python api docs
      
      * test workflow
      
      * fix dependency
      
      * fix bug
      
      * fix
      
      * update correct git config
      
      * test workflow
      
      * clear cache
      
      * lint fix
      
      * fix exclude path
      5501b31c
• [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 (#291) · 83412458
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      
      * [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py
      
      - Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
      - Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
      - Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
      
      * [Refactor] Update imports in __init__.py for tir compatibility
      
      - Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
      - Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.
      
      * lint fix
      
      * [Refactor] Update imports in tir.ir.py for improved compatibility
      
      - Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
      - Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.
      
      * [Refactor] Update function calls in tir.ir.py to return values
      
      - Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.
      
      * bugfix
      
      * [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper
      
      - Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.
      
      * bugfix
      
      * [Update] Sync subproject commit and modify CUDA atomic add functions
      
      - Updated the subproject commit for TVM to edd35139a0481e9359aa269e3e50450b95ba2f5a.
      - Commented out the CUDA capability check in the example convolution script to prevent execution errors.
      - Refactored atomic add functions for BFLOAT16 in common.h to include a conditional compilation directive for improved compatibility with CUDA architectures.
      83412458
• [Language] Proxy tvm ir to make linter happy (#287) · be0bf36d
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      
      * [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py
      
      - Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
      - Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
      - Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
      
      * [Refactor] Update imports in __init__.py for tir compatibility
      
      - Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
      - Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.
      
      * lint fix
      
      * [Refactor] Update imports in tir.ir.py for improved compatibility
      
      - Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
      - Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.
      
      * [Refactor] Update function calls in tir.ir.py to return values
      
      - Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.
      
      * bugfix
      
      * [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper
      
      - Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.
      
      * bugfix
      
      * Uncomment main function call
      be0bf36d
  11. 26 Mar, 2025 4 commits
• [Feature] Introduce NoSetMaxNReg for warp specialization (#289) · 76435ca8
      Yu Cheng authored
      - Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches.
      - Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management.
      - Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.
      76435ca8
• [Refactor] Deprecate `T.Buffer` as arguments and rename related calls to `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
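After this deprecation, kernel parameters are annotated with `T.Tensor` rather than `T.Buffer`. The skeleton below is adapted from the style of TileLang's public GEMM example; exact signatures at this commit may differ slightly, so treat it as an illustrative sketch rather than the repository's code.

```python
import tilelang.language as T

def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
           dtype="float16", accum_dtype="float"):

    @T.prim_func
    def main(
            A: T.Tensor((M, K), dtype),   # previously written as T.Buffer((M, K), dtype)
            B: T.Tensor((K, N), dtype),
            C: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main
```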
• add autotune to example_gemm.py (#285) · 73d2c62e
      yyttt6 authored
      73d2c62e
  12. 25 Mar, 2025 1 commit