1. 06 Apr, 2025 4 commits
    • [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 (#345) · 847a461b
      Yu Cheng authored
      - Modified the `group_per_split_token_cast_to_fp8` function to include a conditional check for batch sizes, ensuring that the scaling factor is applied only when within the valid range. This change enhances the robustness of the FP8 conversion process for grouped per-split tokens.
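      In plain PyTorch terms the guard looks roughly like the sketch below; `split_scales`, `FP8_MAX`, and the 1e-4 clamp are illustrative assumptions, not the kernel's actual code:

      ```python
      import torch

      FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

      def split_scales(x_amax: torch.Tensor, valid_batches: int) -> torch.Tensor:
          # Only slots inside the valid batch range receive a real scaling factor;
          # out-of-range slots keep a neutral scale instead of stale amax values.
          scale = torch.ones_like(x_amax)
          scale[:valid_batches] = x_amax[:valid_batches].clamp(min=1e-4) / FP8_MAX
          return scale
      ```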
    • [Enhancement] Support index bit width configuration (#343) · 70546adc
      Lei Wang authored
      
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
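      As context for the reported metrics, here is a sketch of how a matmul benchmark typically derives TFLOPS from measured latency, plus a float-upcast reference for FP8 correctness checks; the names are assumptions, not the script's actual code:

      ```python
      import torch  # requires a PyTorch build with float8 support

      def matmul_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
          # a matmul performs 2*M*N*K floating-point operations
          return 2 * M * N * K / (latency_ms * 1e-3) / 1e12

      # reference result: upcast the FP8 inputs and multiply in float32
      a = torch.randn(256, 512).to(torch.float8_e4m3fn)
      b = torch.randn(512, 128).to(torch.float8_e4m3fn)
      ref = a.to(torch.float) @ b.to(torch.float)
      ```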
      
      * lint fix
      
      * Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation
      
      - Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
      - Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
      - Integrated the new transformation into the optimization pipeline in `phase.py`.
      - Documented the new transformation in `__init__.py` for clarity.
      
      * lint fix
      
      * Add configuration option for index bitwidth and remove ExpandIndexDataType transformation
      
      - Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
      - Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
      - Documented the new configuration option in the JIT compilation function's parameters for clarity.
      - Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.
      
      * lint fix
      
      * Refactor index bitwidth configuration handling
      
      - Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
      - Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.
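      To see why the index bit width matters, here is a NumPy stand-in for the flat offsets a kernel computes: with int32 arithmetic the product wraps once it passes 2**31 - 1, which is exactly what promoting indices to int64 (or setting `tl.config_index_bitwidth`) avoids.

      ```python
      import numpy as np

      # 60_000 * 40_000 = 2.4e9, beyond the int32 range (2**31 - 1 ≈ 2.147e9)
      i, cols, j = 60_000, 40_000, 123
      off32 = np.int32(i) * np.int32(cols) + np.int32(j)  # wraps (RuntimeWarning)
      off64 = np.int64(i) * np.int64(cols) + np.int64(j)  # 2_400_000_123, correct
      print(off32, off64)
      ```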
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
    • [Example] Add triton block sparse gqa decode (#341) · bee5618e
      YizhaoGao authored
      
      
      * [Example] Add triton block sparse gqa decode
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Enhancement] Support region padding when convert buffer load to buffer region (#342) · 10804a0d
      Lei Wang authored
      * Enhance error checking in RegionOp and buffer_load_to_tile_region
      
      - Added detailed error messages to the index size check in `RegionOp` to aid debugging.
      - Implemented a check in `buffer_load_to_tile_region` to ensure the length of indices matches extents, with a fallback to expand extents if necessary. This improves robustness in handling buffer loads with mismatched dimensions.
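      A hypothetical Python rendering of that fallback (the real check lives in the C++ pass; the rule sketched here, missing dimensions become extent 1, is an assumption):

      ```python
      def match_indices_to_extents(indices, extents):
          # pad the extents so every index has one; a padded dimension selects
          # a single element, hence extent 1
          if len(indices) == len(extents):
              return list(extents)
          missing = len(indices) - len(extents)
          assert missing > 0, "more extents than indices is a genuine mismatch"
          return [1] * missing + list(extents)
      ```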
      
      * lint fix
  2. 05 Apr, 2025 4 commits
    • [Dev] Add Group Cast FP8 Example (#338) · 73885cfd
      Yu Cheng authored
      Implements FP8 type conversion functionality for grouped per-split tokens. The script includes several helper functions for handling tensor TMA alignment and FP8 conversion, enhancing support for FP8 data types and providing performance benchmarks. This change provides users with more flexible examples of FP8 operations.
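      One helper such a script typically needs is TMA alignment padding; a minimal sketch, assuming a hypothetical alignment factor of 16:

      ```python
      def tma_align(n: int, alignment: int = 16) -> int:
          # round n up to the next multiple of `alignment`
          return (n + alignment - 1) // alignment * alignment

      assert tma_align(100) == 112
      ```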
    • [Doc] Fix typo and heading level in GEMV tutorial (#337) · 17386d7d
      yeh-sudo authored
      This pull request updates the `gemv.md` file, fixing a typo and
      raising the document title to the correct heading level.
    • [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen (#323) · 89725f7f
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Update submodule and enhance FP8 type handling in CUDA codegen
      
      - Updated the TVM submodule to the latest commit.
      - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
      - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
      - Added error handling for missing configuration keys in the AutoTuner class.
      
      * lint fix
      
      * Remove print statement from example script
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
    • [Example] Add sparse gqa decode example (#332) · 8fdfdf03
      Yuqing Xia authored
      
      
      * add example gqa decode wgmma pipelined
      
      * add sparse gqa
      
      * support num split
      
      * support num split
      
      * add if condition
      
      * add heuristic num split
      
      * clean code
      
      * add ref
      
      * fix bug
      
      * add torch ref
      
      * fix bug
      
      * integrate to torch
      
      * symbolic
      
      * clean mask
      
      * rm actual_num_blocks
      
      * clean code
      
      * get num_sm via torch
      
      * add sparse gqa decode example
      
      * format
      
      * rm example_gqa_decode_wgmma_pipelined.py
      
      * Add license headers to example scripts
      
      * format
      
      * Remove commented-out cache disabling lines
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  3. 04 Apr, 2025 6 commits
    • [AMD] Fix for missing composable kernel include path when compile kernels on amd gpus (#334) · eb757608
      Lei Wang authored
      * [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options
      
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
      
      * [Enhancement] Add Composable Kernel Path Handling in Environment Setup
      
      - Introduced support for the Composable Kernel by adding a new environment variable `TL_COMPOSABLE_KERNEL_PATH`.
      - Updated the environment setup to check for the existence of the Composable Kernel and log a warning if not found.
      - Modified the `LibraryGenerator` to include the Composable Kernel include directory during compilation for HIP targets.
      
      These changes improve the integration of the Composable Kernel into the TileLang environment, enhancing flexibility for users.
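      A sketch of that lookup in Python; only the variable name `TL_COMPOSABLE_KERNEL_PATH` comes from the commit, while the fallback behaviour and flag shape are assumptions:

      ```python
      import logging
      import os

      ck_path = os.environ.get("TL_COMPOSABLE_KERNEL_PATH")
      if ck_path and os.path.isdir(ck_path):
          # pass the include directory to the HIP compiler invocation
          hip_include_flags = [f"-I{os.path.join(ck_path, 'include')}"]
      else:
          logging.warning("Composable Kernel not found; set TL_COMPOSABLE_KERNEL_PATH")
      ```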
    • [Refactor] Optimize RMS normalization kernel in rms_norm.py (#333) · 85e411c8
      Yu Cheng authored
      - Introduced a new local fragment for squared values to improve performance.
      - Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency.
      - Refactored the final multiplication step to operate on the local fragment instead of shared memory.
      - Added a configuration option to the kernel compilation for better control over TMA lowering.
      
      These changes enhance the efficiency and clarity of the RMS normalization implementation.
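      For reference, the semantics the kernel implements, as a PyTorch routine (the `eps` value is an assumption):

      ```python
      import torch

      def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
          # y = x / sqrt(mean(x^2) + eps); the squared values are what the new
          # local fragment accumulates before the final multiply
          return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
      ```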
    • [Enhancement] Add new matrix multiplication functions and tests for GEMM with... · 9e5a757e
      Lei Wang authored
      [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options (#331)
      
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
    • [Enhancement] Improve flashattn function in example_gqa_decode.py (#329) · 32060ecd
      Lei Wang authored
      - Added a manual seed for reproducibility in PyTorch.
      - Refactored local variable allocations for better memory management.
      - Enhanced parallel processing in the flashattn function to improve performance.
      - Updated layout annotations for clarity and efficiency.
      
      These changes optimize the flash attention mechanism and ensure consistent behavior across runs.
    • [Dynamic Symbolic] Adaptively vectorize with different condition expressions (#326) · 5ee58ec7
      Zhengju Tang authored
      
      
      * [Dynamic Symbolic] Adaptively vectorize with different condition expressions
      
      * Format
      
      * Format
      
      * Format
      
      * Format
      
      * Add MIT License headers to Python files
      
      * Simplify return statement in loop vectorization
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [AMD] Adapt rocm and support `T.gemm` with transpose_b=False for amd backend (#327) · eab47249
      Lei Wang authored
      
      
      * [Enhancement] Update GEMM and ROCm Integration
      
      - Removed the restriction on transposing matrix B for CDNA in `gemm.cc`, allowing for more flexible matrix operations.
      - Added a new debug header file `debug.h` for enhanced debugging capabilities in ROCm kernels.
      - Updated `codegen_hip.cc` to include the new debug header and improved handling of float16 and bfloat16 types in vector element stores.
      - Refactored `rt_mod_hip.cc` to return a ROCM module directly from `BuildTileLangHIPWithoutCompile`, enhancing the module creation process.
      - Introduced a new ROCm utility in `rocm.py` for linking and managing ROCm paths, improving the build process for ROCm applications.
      - Updated tests to reflect changes in GEMM configurations and ensure compatibility with the new features.
      
      These changes enhance the flexibility and debugging capabilities of the GEMM operations and improve the integration with the ROCm backend.
      
      * [Fix] Corrected syntax error in pyproject.toml and improved error message formatting in rocm.py
      
      - Added missing quotation mark for "HSA" in the `select` section of `pyproject.toml`.
      - Simplified the error message formatting in `get_rocm_arch` function of `rocm.py` for better readability and consistency.
      
      * lint fix
      
      * Update tilelang/jit/adapter/wrapper.py
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * lint fix
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  4. 03 Apr, 2025 5 commits
    • [Bugfix] add a patch to fix T.abs on float16 (#325) · 2cec52aa
      botbw authored
      * [bug] fix T.abs on float16
      
      * [lint] lint
    • [Feat] Enhance CUDA Property Handling (#322) · c0378aa9
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
    • [Tools] Summarize TFLOPS Information from a tilelang program (#321) · 853898a7
      yyttt6 authored
      * refactor autotune
      
      * refactor autotune
      
      * refactor autotune
      
      * refactor autotune
      
      * format init.py
      
      * add tutorial for autotune
      
      * merge
      
      * merge
      
      * format analyzer
      
      * add readme for analyzer
      
      * format
      
      * [Tools] Summarize TFLOPS Information from a tilelang program
      
      * Summarize TFLOPS Information from a tilelang program
    • [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support (#320) · 4b705eb2
      Yu Cheng authored
      * [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support
      
      * Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation.
      * Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation.
      * Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h.
      * Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers.
      * Updated tilelang.language module to include the new `reduce_absmax` function.
      
      These changes enhance FP8 quantization capabilities and extend reduction operation support.
      
      * [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization
      
      * Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations.
      * Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy.
      * Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking.
      
      These changes enhance the flexibility and efficiency of the FP8 quantization process.
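      A hedged PyTorch reference of the per-token cast, built on the same absolute-maximum reduction that `reduce_absmax` provides on-device; the helper name and the clamp value are assumptions:

      ```python
      import torch

      FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

      def per_token_cast_to_fp8_ref(x: torch.Tensor):
          # scale each token (row) by its absmax so its values fit the FP8 range
          amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
          y = (x * (FP8_MAX / amax)).to(torch.float8_e4m3fn)
          return y, amax / FP8_MAX  # quantized tensor and per-token scale
      ```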
      
      * lint
      
      * [Dev] Update per_token_cast_fp8.py
  5. 02 Apr, 2025 1 commit
    • [CostModel] Introduce cuda driver api to get precise shared memory capacity (#317) · 3b660b67
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
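      A minimal sketch of what such a driver shim does, using `ctypes` against libcuda with error checking omitted; the real module wraps this behind `get_device_name` and `get_shared_memory_per_block`:

      ```python
      import ctypes

      cuda = ctypes.CDLL("libcuda.so")
      cuda.cuInit(0)
      device = ctypes.c_int()
      cuda.cuDeviceGet(ctypes.byref(device), 0)

      name = ctypes.create_string_buffer(256)
      cuda.cuDeviceGetName(name, 256, device)

      # 97 = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN
      smem = ctypes.c_int()
      cuda.cuDeviceGetAttribute(ctypes.byref(smem), 97, device)
      print(name.value.decode(), smem.value, "bytes of shared memory per block")
      ```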
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
  6. 01 Apr, 2025 2 commits
  7. 31 Mar, 2025 5 commits
    • [Bugfix] Fix layout conflict issue for gqa decoding examples (#314) · 0fd82ed5
      Lei Wang authored
      * Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.
      
      * Refactor flashattn example to improve CUDA configuration handling
      
      - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
      - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
      - Adjusted the `do_bench` method call to streamline performance profiling.
      
      * lint fix
    • [Bugfix] Fix dynamic axis with variable extent (#311) · c30904ea
      Lei Wang authored
      * [Enhancement] Improve error message for RampNode in CUDA codegen
      
      - Updated the error message in the VisitExpr_ method for RampNode to include the specific Ramp node and lane count when the lane count exceeds the limit of 4. This change enhances debugging by providing clearer context for the error.
      - Refactored the loop vectorization logic in loop_vectorize_dynamic.cc to improve readability and maintainability, ensuring that dynamic vectorization checks are performed correctly and efficiently.
      
      * lint fix
    • [Bugfix] Updated autotune usage in the examples to align with the latest changes (#309) · 66c7f6a1
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the `__new__` method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
      
      * lint fix
      
      * typofix
      
      * [Refactor] Update matmul and flashattn function calls to return structured results
      
      - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
      - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
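      The shape of that object, sketched as a dataclass; `AutotuneResult` matches the class name documented elsewhere in this log, though the exact field names here are assumptions:

      ```python
      from dataclasses import dataclass

      @dataclass
      class AutotuneResult:
          latency: float      # best measured latency
          config: dict        # configuration that achieved it
          ref_latency: float  # latency of the reference implementation

      result = AutotuneResult(latency=0.42, config={"block_M": 128}, ref_latency=0.91)
      ```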
      
      * lint fix
    • [Cache] Implement in-memory cache (#308) · 5802c01b
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the `__new__` method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
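      A minimal sketch of the memory-then-disk lookup described above; `_load_from_disk` is a stand-in helper, not the class's real method name:

      ```python
      class KernelCache:
          _memory_cache: dict = {}

          def load(self, key):
              # check the in-memory cache first, then fall back to disk and
              # populate the memory cache on a hit
              kernel = self._memory_cache.get(key)
              if kernel is None:
                  kernel = self._load_from_disk(key)
                  if kernel is not None:
                      self._memory_cache[key] = kernel
              return kernel

          def _load_from_disk(self, key):
              return None  # stand-in; the real class deserializes a cached kernel

          def clear_cache(self):
              self._memory_cache.clear()  # the real method clears disk as well
      ```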
      
      * lint fix
  8. 30 Mar, 2025 4 commits
    • [Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
    • [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError (#305) · 6e294de9
      Leslin authored
      
      
      * Update elementwise_add.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Update rms_norm.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Remove adapter argument from do_bench call
      
      * Remove adapter argument from do_bench call
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Bugfix] Resolve autotuner bugs for blocksparse GEMM example (#300) · 92e8d5f4
      Haodong Tian authored
      * [Bugfix] Configure an autotuner-specific logger for correct level handling
      - Previously, logging relied on basicConfig, which configures the root logger; the named autotuner logger therefore ignored DEBUG messages.
      - This commit sets up a dedicated logger for the autotuner, routing DEBUG messages to 'autotuner.log' and INFO+ messages to the console (see the sketch below).
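      A sketch of that setup with the standard `logging` module; handler details beyond the described file/console split are assumptions:

      ```python
      import logging

      logger = logging.getLogger("autotuner")
      logger.setLevel(logging.DEBUG)

      file_handler = logging.FileHandler("autotuner.log")
      file_handler.setLevel(logging.DEBUG)    # full detail goes to the file
      console_handler = logging.StreamHandler()
      console_handler.setLevel(logging.INFO)  # only INFO+ reaches the console

      logger.addHandler(file_handler)
      logger.addHandler(console_handler)
      logger.propagate = False  # keep records away from the root logger
      ```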
      
      * [Bugfix] Fix tensor_supply for boolean type
      - Previously `get_tensor_supply` used `torch.randint(-2, 3)` as a fallback, which caused an error when the dtype was `torch.bool`.
      - This commit adds an `is_boolean` check in `KernelParam` and updates `get_tensor_supply` to use `torch.randint(0, 2)` for boolean dtypes (see the sketch below).
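      A sketch of that dtype-aware supply; the ranges come from the commit message, the function shape is assumed:

      ```python
      import torch

      def supply_tensor(shape, dtype):
          # booleans only admit {0, 1}; the generic randint(-2, 3) fallback
          # raises for torch.bool, hence the dedicated branch
          if dtype == torch.bool:
              return torch.randint(0, 2, shape, dtype=torch.bool)
          return torch.randint(-2, 3, shape).to(dtype)
      ```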
      
      * [Bugfix] Always regenerate JIT inputs during tuning
      - Removes the caching of `self.jit_input_tensors` within `AutoTuner`. Because different autotuning configurations can alter the required input tensor shapes or other properties, reusing cached inputs from a previous configuration can lead to errors or incorrect assessments.
      - This change ensures that `profiler._get_inputs()` is called unconditionally for each configuration evaluation. Since `_get_inputs` is relatively inexpensive, the overhead is considered acceptable.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Feature] Enable custom tensor supply and input caching control in Autotuner
      - Previously, tensor generation was tied to `supply_type` and input caching behavior across configurations was less explicit/controlled.
      - This commit introduces a `supply_prog` parameter to allow providing a custom function for generating input tensors, overriding the default mechanism.
      - Adds a `cache_input_tensors` flag (default True) to control input tensor caching:
          - If True, tensors are generated once per configuration and reused for repetitions, with a check for potential shape mismatches between configurations.
          - If False, tensors are regenerated for every configuration trial.
      - Refactors internal input tensor handling using supplier functions for clarity.
      - Adds a `check_tensor_list_compatibility` utility for shape comparison.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Example] Small fix in example_blocksparse_gemm
      
      * [Fix] Raise error if autotuning yields no valid configuration
    • [Example] Add autotune to conv example (#301) · 1873dc00
      yyttt6 authored
      
      
      * add autotune to example_gemm.py
      
      * add autotune to conv
      
      * still coding ...
      
      * version 0
      
      * version 0
      
      * version 0
      
      * refactor autotune
      
      * refactor autotune
      
      * add autotune to conv example
      
      * add conv template to carver
      
      * add conv template to carver
      
      * add conv template to carver
      
      * Update num_stages configuration values
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  9. 29 Mar, 2025 1 commit
  10. 28 Mar, 2025 5 commits
    • [Refactor] Improve documentation and add detailed docstrings across multiple modules (#298) · 3f294650
      Lei Wang authored
      * [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h
      
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
      
      * [Enhancement] Improve documentation and add detailed docstrings across multiple modules
      
      - Updated the `__init__.py` file to enhance module documentation, providing clarity on auto-tuning functionalities.
      - Added comprehensive docstrings to the `JITContext`, `AutotuneResult`, and `AutoTuner` classes, detailing their attributes and methods.
      - Enhanced memory allocation utilities in `allocate.py` with detailed descriptions for each allocation function.
      - Improved documentation for various intrinsic operations in `builtin.py`, `copy.py`, `customize.py`, `frame.py`, `gemm.py`, `memscope.py`, and `reduce.py`, ensuring clear explanations of parameters and return values.
      - Refactored the `KernelCache` class to improve clarity and maintainability, including detailed comments and docstrings for methods.
      - Overall, these changes aim to enhance code readability and provide better guidance for future developers and users of the Tile-AI framework.
    • [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h (#297) · 9ad9d9cd
      Lei Wang authored
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
    • [Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061
      Lei Wang authored
      * [Feature] Implement ParallelLoopTransformer for enhanced loop analysis
      
      - Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
      - Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
      - Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
      - Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.
      
      * test fix
      
      * Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.
      
      * Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.
    • [doc/example] add gemv doc and examples (#293) · ff3cfa59
      botbw authored
      * [doc/example] init gemv doc and examples
      
      * [example] add vectorized read
      
      * [example] use local register instead of smem
      
      * [example] add bench
      
      * [doc] update doc
      
      * [doc] refine doc
      
      * [lint] format code
      
      * [doc] add tips
      
      * [doc/example] fix typo
      
      * [example] use tmv_all_reduce
      
      * [doc] update doc accordingly
      
      * [doc] add benchmark table
      
      * [lint] format code
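      For readers following the tutorial, the reference semantics the GEMV examples implement, in PyTorch (shapes here are arbitrary):

      ```python
      import torch

      A = torch.randn(1024, 1024)
      x = torch.randn(1024)
      y_ref = A @ x  # the kernel the doc builds computes this matrix-vector product
      ```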
  11. 27 Mar, 2025 3 commits