1. 06 Apr, 2025 4 commits
    • [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 (#345) · 847a461b
      Yu Cheng authored
      - Modified the `group_per_split_token_cast_to_fp8` function to include a conditional check for batch sizes, ensuring that the scaling factor is applied only when within the valid range. This change enhances the robustness of the FP8 conversion process for grouped per-split tokens.
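      In plain PyTorch terms the guard looks roughly like the sketch below; `split_scales`, `FP8_MAX`, and the 1e-4 clamp are illustrative assumptions, not the kernel's actual code:

      ```python
      import torch

      FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

      def split_scales(x_amax: torch.Tensor, valid_batches: int) -> torch.Tensor:
          # Only slots inside the valid batch range receive a real scaling factor;
          # out-of-range slots keep a neutral scale instead of stale amax values.
          scale = torch.ones_like(x_amax)
          scale[:valid_batches] = x_amax[:valid_batches].clamp(min=1e-4) / FP8_MAX
          return scale
      ```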
    • [Enhancement] Support index bit width configuration (#343) · 70546adc
      Lei Wang authored
      
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
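      As context for the reported metrics, here is a sketch of how a matmul benchmark typically derives TFLOPS from measured latency, plus a float-upcast reference for FP8 correctness checks; the names are assumptions, not the script's actual code:

      ```python
      import torch  # requires a PyTorch build with float8 support

      def matmul_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
          # a matmul performs 2*M*N*K floating-point operations
          return 2 * M * N * K / (latency_ms * 1e-3) / 1e12

      # reference result: upcast the FP8 inputs and multiply in float32
      a = torch.randn(256, 512).to(torch.float8_e4m3fn)
      b = torch.randn(512, 128).to(torch.float8_e4m3fn)
      ref = a.to(torch.float) @ b.to(torch.float)
      ```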
      
      * lint fix
      
      * Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation
      
      - Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
      - Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
      - Integrated the new transformation into the optimization pipeline in `phase.py`.
      - Documented the new transformation in `__init__.py` for clarity.
      
      * lint fix
      
      * Add configuration option for index bitwidth and remove ExpandIndexDataType transformation
      
      - Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
      - Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
      - Documented the new configuration option in the JIT compilation function's parameters for clarity.
      - Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.
      
      * lint fix
      
      * Refactor index bitwidth configuration handling
      
      - Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
      - Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.
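      To see why the index bit width matters, here is a NumPy stand-in for the flat offsets a kernel computes: with int32 arithmetic the product wraps once it passes 2**31 - 1, which is exactly what promoting indices to int64 (or setting `tl.config_index_bitwidth`) avoids.

      ```python
      import numpy as np

      # 60_000 * 40_000 = 2.4e9, beyond the int32 range (2**31 - 1 ≈ 2.147e9)
      i, cols, j = 60_000, 40_000, 123
      off32 = np.int32(i) * np.int32(cols) + np.int32(j)  # wraps (RuntimeWarning)
      off64 = np.int64(i) * np.int64(cols) + np.int64(j)  # 2_400_000_123, correct
      print(off32, off64)
      ```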
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
    • [Example] Add triton block sparse gqa decode (#341) · bee5618e
      YizhaoGao authored
      
      
      * [Example] Add triton block sparse gqa decode
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Enhancement] Support region padding when convert buffer load to buffer region (#342) · 10804a0d
      Lei Wang authored
      * Enhance error checking in RegionOp and buffer_load_to_tile_region
      
      - Added detailed error messages to the index size check in `RegionOp` to aid debugging.
      - Implemented a check in `buffer_load_to_tile_region` to ensure the length of indices matches extents, with a fallback to expand extents if necessary. This improves robustness in handling buffer loads with mismatched dimensions.
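      A hypothetical Python rendering of that fallback (the real check lives in the C++ pass; the rule sketched here, missing dimensions become extent 1, is an assumption):

      ```python
      def match_indices_to_extents(indices, extents):
          # pad the extents so every index has one; a padded dimension selects
          # a single element, hence extent 1
          if len(indices) == len(extents):
              return list(extents)
          missing = len(indices) - len(extents)
          assert missing > 0, "more extents than indices is a genuine mismatch"
          return [1] * missing + list(extents)
      ```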
      
      * lint fix
  2. 05 Apr, 2025 4 commits
    • [Dev] Add Group Cast FP8 Example (#338) · 73885cfd
      Yu Cheng authored
      Implements FP8 type conversion functionality for grouped per-split tokens. The script includes several helper functions for handling tensor TMA alignment and FP8 conversion, enhancing support for FP8 data types and providing performance benchmarks. This change provides users with more flexible examples of FP8 operations.
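      One helper such a script typically needs is TMA alignment padding; a minimal sketch, assuming a hypothetical alignment factor of 16:

      ```python
      def tma_align(n: int, alignment: int = 16) -> int:
          # round n up to the next multiple of `alignment`
          return (n + alignment - 1) // alignment * alignment

      assert tma_align(100) == 112
      ```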
    • [Doc] Fix typo and heading level in GEMV tutorial (#337) · 17386d7d
      yeh-sudo authored
      This pull request updates the `gemv.md` file, fixing a typo and
      raising the document title to the correct heading level.
    • [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen (#323) · 89725f7f
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Update submodule and enhance FP8 type handling in CUDA codegen
      
      - Updated the TVM submodule to the latest commit.
      - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
      - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
      - Added error handling for missing configuration keys in the AutoTuner class.
      
      * lint fix
      
      * Remove print statement from example script
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
    • [Example] Add sparse gqa decode example (#332) · 8fdfdf03
      Yuqing Xia authored
      
      
      * add example gqa decode wgmma pipelined
      
      * add sparse gqa
      
      * support num split
      
      * support num split
      
      * add if condition
      
      * add heuristic num split
      
      * clean code
      
      * add ref
      
      * fix bug
      
      * add torch ref
      
      * fix bug
      
      * integrate to torch
      
      * symbolic
      
      * clean mask
      
      * rm actual_num_blocks
      
      * clean code
      
      * get num_sm via torch
      
      * add sparse gqa decode example
      
      * format
      
      * rm example_gqa_decode_wgmma_pipelined.py
      
      * Add license headers to example scripts
      
      * format
      
      * Remove commented-out cache disabling lines
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  3. 04 Apr, 2025 6 commits
    • [AMD] Fix for missing composable kernel include path when compile kernels on amd gpus (#334) · eb757608
      Lei Wang authored
      * [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options
      
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
      
      * [Enhancement] Add Composable Kernel Path Handling in Environment Setup
      
      - Introduced support for the Composable Kernel by adding a new environment variable `TL_COMPOSABLE_KERNEL_PATH`.
      - Updated the environment setup to check for the existence of the Composable Kernel and log a warning if not found.
      - Modified the `LibraryGenerator` to include the Composable Kernel include directory during compilation for HIP targets.
      
      These changes improve the integration of the Composable Kernel into the TileLang environment, enhancing flexibility for users.
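      A sketch of that lookup in Python; only the variable name `TL_COMPOSABLE_KERNEL_PATH` comes from the commit, while the fallback behaviour and flag shape are assumptions:

      ```python
      import logging
      import os

      ck_path = os.environ.get("TL_COMPOSABLE_KERNEL_PATH")
      if ck_path and os.path.isdir(ck_path):
          # pass the include directory to the HIP compiler invocation
          hip_include_flags = [f"-I{os.path.join(ck_path, 'include')}"]
      else:
          logging.warning("Composable Kernel not found; set TL_COMPOSABLE_KERNEL_PATH")
      ```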
    • [Refactor] Optimize RMS normalization kernel in rms_norm.py (#333) · 85e411c8
      Yu Cheng authored
      - Introduced a new local fragment for squared values to improve performance.
      - Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency.
      - Refactored the final multiplication step to operate on the local fragment instead of shared memory.
      - Added a configuration option to the kernel compilation for better control over TMA lowering.
      
      These changes enhance the efficiency and clarity of the RMS normalization implementation.
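      For reference, the semantics the kernel implements, as a PyTorch routine (the `eps` value is an assumption):

      ```python
      import torch

      def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
          # y = x / sqrt(mean(x^2) + eps); the squared values are what the new
          # local fragment accumulates before the final multiply
          return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
      ```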
    • [Enhancement] Add new matrix multiplication functions and tests for GEMM with... · 9e5a757e
      Lei Wang authored
      [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options (#331)
      
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
    • [Enhancement] Improve flashattn function in example_gqa_decode.py (#329) · 32060ecd
      Lei Wang authored
      - Added a manual seed for reproducibility in PyTorch.
      - Refactored local variable allocations for better memory management.
      - Enhanced parallel processing in the flashattn function to improve performance.
      - Updated layout annotations for clarity and efficiency.
      
      These changes optimize the flash attention mechanism and ensure consistent behavior across runs.
    • [Dynamic Symbolic] Adaptively vectorize with different condition expressions (#326) · 5ee58ec7
      Zhengju Tang authored
      
      
      * [Dynamic Symbolic] Adaptively vectorize with different condition expressions
      
      * Format
      
      * Format
      
      * Format
      
      * Format
      
      * Add MIT License headers to Python files
      
      * Simplify return statement in loop vectorization
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [AMD] Adapt rocm and support `T.gemm` with transpose_b=False for amd backend (#327) · eab47249
      Lei Wang authored
      
      
      * [Enhancement] Update GEMM and ROCm Integration
      
      - Removed the restriction on transposing matrix B for CDNA in `gemm.cc`, allowing for more flexible matrix operations.
      - Added a new debug header file `debug.h` for enhanced debugging capabilities in ROCm kernels.
      - Updated `codegen_hip.cc` to include the new debug header and improved handling of float16 and bfloat16 types in vector element stores.
      - Refactored `rt_mod_hip.cc` to return a ROCM module directly from `BuildTileLangHIPWithoutCompile`, enhancing the module creation process.
      - Introduced a new ROCm utility in `rocm.py` for linking and managing ROCm paths, improving the build process for ROCm applications.
      - Updated tests to reflect changes in GEMM configurations and ensure compatibility with the new features.
      
      These changes enhance the flexibility and debugging capabilities of the GEMM operations and improve the integration with the ROCm backend.
      
      * [Fix] Corrected syntax error in pyproject.toml and improved error message formatting in rocm.py
      
      - Added missing quotation mark for "HSA" in the `select` section of `pyproject.toml`.
      - Simplified the error message formatting in `get_rocm_arch` function of `rocm.py` for better readability and consistency.
      
      * lint fix
      
      * Update tilelang/jit/adapter/wrapper.py
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * lint fix
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  4. 03 Apr, 2025 5 commits
    • [Bugfix] add a patch to fix T.abs on float16 (#325) · 2cec52aa
      botbw authored
      * [bug] fix T.abs on float16
      
      * [lint] lint
    • [Feat] Enhance CUDA Property Handling (#322) · c0378aa9
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
    • [Tools] Summarize TFLOPS Information from a tilelang program (#321) · 853898a7
      yyttt6 authored
      * refactor autotune
      
      * refactor autotune
      
      * refactor autotune
      
      * refactor autotune
      
      * format init.py
      
      * add tutorial for autotune
      
      * merge
      
      * merge
      
      * format analyzer
      
      * add readme for analyzer
      
      * format
      
      * [Tools] Summarize TFLOPS Information from a tilelang program
      
      * Summarize TFLOPS Information from a tilelang program
    • [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support (#320) · 4b705eb2
      Yu Cheng authored
      * [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support
      
      * Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation.
      * Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation.
      * Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h.
      * Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers.
      * Updated tilelang.language module to include the new `reduce_absmax` function.
      
      These changes enhance FP8 quantization capabilities and extend reduction operation support.
      
      * [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization
      
      * Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations.
      * Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy.
      * Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking.
      
      These changes enhance the flexibility and efficiency of the FP8 quantization process.
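      A hedged PyTorch reference of the per-token cast, built on the same absolute-maximum reduction that `reduce_absmax` provides on-device; the helper name and the clamp value are assumptions:

      ```python
      import torch

      FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

      def per_token_cast_to_fp8_ref(x: torch.Tensor):
          # scale each token (row) by its absmax so its values fit the FP8 range
          amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
          y = (x * (FP8_MAX / amax)).to(torch.float8_e4m3fn)
          return y, amax / FP8_MAX  # quantized tensor and per-token scale
      ```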
      
      * lint
      
      * [Dev] Update per_token_cast_fp8.py
  5. 02 Apr, 2025 1 commit
    • [CostModel] Introduce cuda driver api to get precise shared memory capacity (#317) · 3b660b67
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
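      A minimal sketch of what such a driver shim does, using `ctypes` against libcuda with error checking omitted; the real module wraps this behind `get_device_name` and `get_shared_memory_per_block`:

      ```python
      import ctypes

      cuda = ctypes.CDLL("libcuda.so")
      cuda.cuInit(0)
      device = ctypes.c_int()
      cuda.cuDeviceGet(ctypes.byref(device), 0)

      name = ctypes.create_string_buffer(256)
      cuda.cuDeviceGetName(name, 256, device)

      # 97 = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN
      smem = ctypes.c_int()
      cuda.cuDeviceGetAttribute(ctypes.byref(smem), 97, device)
      print(name.value.decode(), smem.value, "bytes of shared memory per block")
      ```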
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
  6. 01 Apr, 2025 2 commits
  7. 31 Mar, 2025 5 commits
    • [Bugfix] Fix layout conflict issue for gqa decoding examples (#314) · 0fd82ed5
      Lei Wang authored
      * Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.
      
      * Refactor flashattn example to improve CUDA configuration handling
      
      - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
      - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
      - Adjusted the `do_bench` method call to streamline performance profiling.
      
      * lint fix
    • [Bugfix] Fix dynamic axis with variable extent (#311) · c30904ea
      Lei Wang authored
      * [Enhancement] Improve error message for RampNode in CUDA codegen
      
      - Updated the error message in the VisitExpr_ method for RampNode to include the specific Ramp node and lane count when the lane count exceeds the limit of 4. This change enhances debugging by providing clearer context for the error.
      - Refactored the loop vectorization logic in loop_vectorize_dynamic.cc to improve readability and maintainability, ensuring that dynamic vectorization checks are performed correctly and efficiently.
      
      * lint fix
    • [Bugfix] Updated autotune usage in the examples to align with the latest changes (#309) · 66c7f6a1
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the `__new__` method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
      
      * lint fix
      
      * typofix
      
      * [Refactor] Update matmul and flashattn function calls to return structured results
      
      - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
      - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
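      The shape of that object, sketched as a dataclass; `AutotuneResult` matches the class name documented elsewhere in this log, though the exact field names here are assumptions:

      ```python
      from dataclasses import dataclass

      @dataclass
      class AutotuneResult:
          latency: float      # best measured latency
          config: dict        # configuration that achieved it
          ref_latency: float  # latency of the reference implementation

      result = AutotuneResult(latency=0.42, config={"block_M": 128}, ref_latency=0.91)
      ```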
      
      * lint fix
    • [Cache] Implement in-memory cache (#308) · 5802c01b
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the `__new__` method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
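      A minimal sketch of the memory-then-disk lookup described above; `_load_from_disk` is a stand-in helper, not the class's real method name:

      ```python
      class KernelCache:
          _memory_cache: dict = {}

          def load(self, key):
              # check the in-memory cache first, then fall back to disk and
              # populate the memory cache on a hit
              kernel = self._memory_cache.get(key)
              if kernel is None:
                  kernel = self._load_from_disk(key)
                  if kernel is not None:
                      self._memory_cache[key] = kernel
              return kernel

          def _load_from_disk(self, key):
              return None  # stand-in; the real class deserializes a cached kernel

          def clear_cache(self):
              self._memory_cache.clear()  # the real method clears disk as well
      ```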
      
      * lint fix
  8. 30 Mar, 2025 4 commits
    • [Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
    • [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError (#305) · 6e294de9
      Leslin authored
      
      
      * Update elementwise_add.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Update rms_norm.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Remove adapter argument from do_bench call
      
      * Remove adapter argument from do_bench call
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Bugfix] Resolve autotuner bugs for blocksparse GEMM example (#300) · 92e8d5f4
      Haodong Tian authored
      * [Bugfix] Configure an autotuner-specific logger for correct level handling
      - Previously, logging relied on basicConfig, which configures the root logger; the named autotuner logger therefore ignored DEBUG messages.
      - This commit sets up a dedicated logger for the autotuner, routing DEBUG messages to 'autotuner.log' and INFO+ messages to the console (see the sketch below).
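      A sketch of that setup with the standard `logging` module; handler details beyond the described file/console split are assumptions:

      ```python
      import logging

      logger = logging.getLogger("autotuner")
      logger.setLevel(logging.DEBUG)

      file_handler = logging.FileHandler("autotuner.log")
      file_handler.setLevel(logging.DEBUG)    # full detail goes to the file
      console_handler = logging.StreamHandler()
      console_handler.setLevel(logging.INFO)  # only INFO+ reaches the console

      logger.addHandler(file_handler)
      logger.addHandler(console_handler)
      logger.propagate = False  # keep records away from the root logger
      ```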
      
      * [Bugfix] Fix tensor_supply for boolean type
      - Previously `get_tensor_supply` used `torch.randint(-2, 3)` as a fallback, which caused an error when the dtype was `torch.bool`.
      - This commit adds an `is_boolean` check in `KernelParam` and updates `get_tensor_supply` to use `torch.randint(0, 2)` for boolean dtypes (see the sketch below).
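      A sketch of that dtype-aware supply; the ranges come from the commit message, the function shape is assumed:

      ```python
      import torch

      def supply_tensor(shape, dtype):
          # booleans only admit {0, 1}; the generic randint(-2, 3) fallback
          # raises for torch.bool, hence the dedicated branch
          if dtype == torch.bool:
              return torch.randint(0, 2, shape, dtype=torch.bool)
          return torch.randint(-2, 3, shape).to(dtype)
      ```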
      
      * [Bugfix] Always regenerate JIT inputs during tuning
      - Removes the caching of `self.jit_input_tensors` within `AutoTuner`. Because different autotuning configurations can alter the required input tensor shapes or other properties, reusing cached inputs from a previous configuration can lead to errors or incorrect assessments.
      - This change ensures that `profiler._get_inputs()` is called unconditionally for each configuration evaluation. Since `_get_inputs` is relatively inexpensive, the overhead is considered acceptable.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Feature] Enable custom tensor supply and input caching control in Autotuner
      - Previously, tensor generation was tied to `supply_type` and input caching behavior across configurations was less explicit/controlled.
      - This commit introduces a `supply_prog` parameter to allow providing a custom function for generating input tensors, overriding the default mechanism.
      - Adds a `cache_input_tensors` flag (default True) to control input tensor caching:
          - If True, tensors are generated once per configuration and reused for repetitions, with a check for potential shape mismatches between configurations.
          - If False, tensors are regenerated for every configuration trial.
      - Refactors internal input tensor handling using supplier functions for clarity.
      - Adds a `check_tensor_list_compatibility` utility for shape comparison.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Example] Small fix in example_blocksparse_gemm
      
      * [Fix] Raise error if autotuning yields no valid configuration
    • [Example] Add autotune to conv example (#301) · 1873dc00
      yyttt6 authored
      
      
      * add autotune to example_gemm.py
      
      * add autotune to conv
      
      * still coding ...
      
      * version 0
      
      * version 0
      
      * version 0
      
      * refactor autotune
      
      * refactor autotune
      
      * add autotune to conv example
      
      * add conv template to carver
      
      * add conv template to carver
      
      * add conv template to carver
      
      * Update num_stages configuration values
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  9. 29 Mar, 2025 1 commit
  10. 28 Mar, 2025 5 commits
    • [Refactor] Improve documentation and add detailed docstrings across multiple modules (#298) · 3f294650
      Lei Wang authored
      * [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h
      
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
      
      * [Enhancement] Improve documentation and add detailed docstrings across multiple modules
      
      - Updated the `__init__.py` file to enhance module documentation, providing clarity on auto-tuning functionalities.
      - Added comprehensive docstrings to the `JITContext`, `AutotuneResult`, and `AutoTuner` classes, detailing their attributes and methods.
      - Enhanced memory allocation utilities in `allocate.py` with detailed descriptions for each allocation function.
      - Improved documentation for various intrinsic operations in `builtin.py`, `copy.py`, `customize.py`, `frame.py`, `gemm.py`, `memscope.py`, and `reduce.py`, ensuring clear explanations of parameters and return values.
      - Refactored the `KernelCache` class to improve clarity and maintainability, including detailed comments and docstrings for methods.
      - Overall, these changes aim to enhance code readability and provide better guidance for future developers and users of the Tile-AI framework.
    • [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h (#297) · 9ad9d9cd
      Lei Wang authored
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
    • [Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061
      Lei Wang authored
      * [Feature] Implement ParallelLoopTransformer for enhanced loop analysis
      
      - Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
      - Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
      - Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
      - Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.
      
      * test fix
      
      * Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.
      
      * Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.
    • [doc/example] add gemv doc and examples (#293) · ff3cfa59
      botbw authored
      * [doc/example] init gemv doc and examples
      
      * [example] add vectorized read
      
      * [example] use local register instead of smem
      
      * [example] add bench
      
      * [doc] update doc
      
      * [doc] refine doc
      
      * [lint] format code
      
      * [doc] add tips
      
      * [doc/example] fix typo
      
      * [example] use tmv_all_reduce
      
      * [doc] update doc accordingly
      
      * [doc] add benchmark table
      
      * [lint] format code
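      For readers following the tutorial, the reference semantics the GEMV examples implement, in PyTorch (shapes here are arbitrary):

      ```python
      import torch

      A = torch.randn(1024, 1024)
      x = torch.randn(1024)
      y_ref = A @ x  # the kernel the doc builds computes this matrix-vector product
      ```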
  11. 27 Mar, 2025 3 commits