Commits · 924225ed5f393fc620344e1e907769209ed11f06 · OpenDAS / tilelang


06 Dec, 2025 1 commit

[Tool] Provide layout visualization tool (#1353) · 924225ed

Cunxiao Ni authored Dec 06, 2025

* Provide layout visualization tool

Adds a layout visualization tool to TileLang, which helps users understand and debug the layout transformations applied during compilation.

This tool visualizes the memory layout of tensors at different stages of the compilation process, allowing developers to identify potential inefficiencies and optimize their code for better performance.

The visualization can be enabled via a pass config option.

* format

* add layout visual example

* Adds vis extra with matplotlib dependency

* refactor pass config name

* fix lint

* Enables configurable layout visualization formats

Allows users to specify the output formats (png, pdf, svg) for layout visualization through a pass config option.

This change provides more flexibility in how layout visualizations are generated, allowing users to choose the formats that best suit their needs.

It also fixes a bug where layout visualization was not correctly disabled when the config option was set to "false".

* Adds visual layout inference tool docs

* fix lint

* fix lint

* Refactor configurable layout visualization formats

* fix lint

* fix typo

* add some comments

* fix lints

* add some warnings for user

* Moves layout visualization

* Refactors layout visualization pass configuration

Updates the layout visualization pass configuration to use a boolean flag for enabling the tool and a string for specifying the output formats.
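As a rough illustration of the "boolean flag plus format string" shape described above, the sketch below parses and validates a format string. The function names and the comma-separated syntax are assumptions made for this example; they are not taken from TileLang's actual pass config.

```python
from typing import List

# Formats named in the commit message above.
SUPPORTED_FORMATS = ("png", "pdf", "svg")


def parse_formats(spec: str) -> List[str]:
    """Split a comma-separated format string and validate each entry.

    The comma-separated syntax is an assumption for this sketch.
    """
    formats = [item.strip().lower() for item in spec.split(",") if item.strip()]
    unknown = [f for f in formats if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported layout visualization format(s): {unknown}")
    # Preserve the requested order while dropping duplicates.
    return list(dict.fromkeys(formats))


def resolve_visualization(enabled: bool, spec: str) -> List[str]:
    """Return the formats to emit, or an empty list when the flag is off."""
    return parse_formats(spec) if enabled else []
```

Keeping the enable switch a real boolean, separate from the format string, avoids the kind of bug mentioned earlier in this commit, where the string "false" did not disable the tool.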

* Enables multiple layout visualization formats

* Updates layout visualization docs

* Moves layout visualization to analysis

924225ed

06 Oct, 2025 1 commit

[Profiler] Adds CUPTI profiler support (#936) · 91d5ef54

Cunxiao Ni authored Oct 06, 2025



* [Profiler] Adds CUPTI profiler support

* format

* refactor cupti profiler

* format

* refactor

* refactor

* fix lint

* fix lint

* refactor

* add profiler tests

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

91d5ef54

25 Jun, 2025 1 commit

[Example] Update examples to use @tilelang.jit (#597) · 3db18726

Cunxiao Ni authored Jun 25, 2025



* [Example] Update kernel compilation in examples to use @tilelang.jit

- Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
- Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
- Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.

* format

* Update example_tilelang_sparse_gqa_decode_varlen_indice.py

* Update example_dequant_gemm_fine_grained.py

* Update example_gemm_autotune.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

3db18726

27 May, 2025 1 commit

[CI] Add gemm and gemm_fp8 example to CI (#516) · ee4e708d

Leslin authored May 28, 2025

* [CI] Add gemm and gemm_fp8 example to CI

* Fix lint via format.sh

* Resolved the issues with profiler API usage and parse_args

ee4e708d

09 Apr, 2025 1 commit

[Example] Introduce autotuning example for GEMM with enhanced configuration options (#360) · d4194222

Yu Cheng authored Apr 09, 2025

* Added a new example script `example_gemm_autotune.py` to demonstrate autotuning for matrix multiplication (GEMM) using TileLang.
* Implemented functions for generating configurations, selecting the best configuration, and benchmarking performance.
* Refactored the existing `matmul` function to support dynamic configuration parameters and improved kernel compilation.
* Updated the main execution block to include command-line argument parsing for matrix dimensions and autotuning options.
* Enhanced the example to validate results against a reference implementation, ensuring correctness in matrix multiplication operations.
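The "generate configurations, then select the best one" flow described above can be sketched in plain Python. The parameter names and candidate values below are illustrative assumptions, and `measure_latency` stands in for whatever benchmarking routine the real example uses.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def get_configs() -> List[Config]:
    """Enumerate the Cartesian product of a small, hypothetical search space."""
    space = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "block_K": [32, 64],
        "num_stages": [2, 3],
    }
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def select_best(
    configs: List[Config], measure_latency: Callable[[Config], float]
) -> Tuple[Config, float]:
    """Time every config once and return the fastest one with its latency."""
    # The index breaks ties so that dicts are never compared directly.
    timed = [(measure_latency(cfg), i, cfg) for i, cfg in enumerate(configs)]
    latency, _, best = min(timed)
    return best, latency
```

An exhaustive product grows quickly with each added parameter, so a real search space usually has to be pruned or sampled.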

d4194222

07 Apr, 2025 1 commit

[AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler (#344) · f005db9f

Lei Wang authored Apr 07, 2025

* [Enhancement] Update GEMM examples and autotuner for improved performance

- Modified `example_gemm_intrinsics.py` to enhance matrix multiplication configurations, increasing warp sizes and adjusting data types for better performance.
- Updated the kernel compilation process to utilize the new `tilelang.compile` method and improved latency measurement with the profiler.
- Refactored `example_gemm.py` to include a new autotuning configuration and ensure consistency in latency checks against reference results.
- Adjusted tensor supply generation in `tilelang/utils/tensor.py` to use `torch.randn` for better randomness in tensor initialization.
- Enhanced the `JITContext` in `tilelang/autotuner/__init__.py` to replace the profiler with a kernel instance for performance measurement, improving the overall structure of the autotuner.

* bug fix

* fix

* [Enhancement] Update convolution tests and profiling assertions

- Added a random seed setting for reproducibility in convolution tests.
- Removed several redundant convolution test cases to streamline the testing process.
- Updated the assertion in the matrix multiplication profiling to include a maximum mismatched ratio for improved accuracy in results.
- Enabled the main testing function for better test execution.
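To make the "maximum mismatched ratio" idea concrete, here is a minimal sketch of a tolerance check that allows a bounded fraction of elements to differ. The helper names and default tolerances are hypothetical and operate on flat Python sequences rather than on the tensors the real tests compare.

```python
import math
from typing import Sequence


def mismatched_ratio(
    actual: Sequence[float],
    expected: Sequence[float],
    rtol: float = 1e-2,
    atol: float = 1e-2,
) -> float:
    """Return the fraction of element pairs that are not close."""
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same length")
    if not actual:
        return 0.0
    mismatched = sum(
        1
        for a, e in zip(actual, expected)
        if not math.isclose(a, e, rel_tol=rtol, abs_tol=atol)
    )
    return mismatched / len(actual)


def assert_allclose_with_ratio(
    actual: Sequence[float],
    expected: Sequence[float],
    max_mismatched_ratio: float = 0.01,
) -> None:
    """Fail only when more than the allowed fraction of elements mismatch."""
    ratio = mismatched_ratio(actual, expected)
    assert ratio <= max_mismatched_ratio, (
        f"mismatched ratio {ratio:.4f} exceeds {max_mismatched_ratio:.4f}"
    )
```

A ratio-based bound tolerates a few outliers from low-precision accumulation while still catching systematic errors.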

* lint fix

f005db9f

03 Apr, 2025 1 commit

[Feat] Enhance CUDA Property Handling (#322) · c0378aa9

Lei Wang authored Apr 03, 2025

* [Enhancement] Introduce CUDA driver module and refactor CUDA device handling

- Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
- Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
- Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
- This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.

* [Refactor] Clean up whitespace in CUDA-related files

- Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
- This change enhances the overall organization of the codebase without altering functionality.

* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script

- Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
- The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
- Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
- The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.

* lint fix

---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>

c0378aa9

30 Mar, 2025 1 commit

[Example] Add autotune to conv example (#301) · 1873dc00

yyttt6 authored Mar 30, 2025



* add autotune to example_gemm.py

* add autotune to conv

* still coding ...

* version 0

* version 0

* version 0

* refactor autotune

* refactor autotune

* add autotune to conv example

* add conv template to carver

* add conv template to carver

* add conv template to carver

* Update num_stages configuration values

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

1873dc00

26 Mar, 2025 1 commit

[Refactor] Deprecate `T.Buffer` as arguments and rename related calls to `T.Tensor` (#281) · bf8a6fc1

Lei Wang authored Mar 26, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
- Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

* [Refactor] Update tensor creation in matrix multiplication test

- Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
- Updated imports in `__init__.py` to include `make_tensor`.
- Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.

* [Refactor] Update tensor definitions across multiple files

- Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
- Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
- Improved documentation in README and example files to reflect changes in tensor usage.

* lint fix

* [Refactor] Update tensor types in attention and matrix multiplication examples

- Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
- Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
- Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.

* lint fix

* [Refactor] Update tensor types in GEMM example and test files

- Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
- Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.

* [Refactor] Update tensor usage in customize.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the file.

* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the test file.

* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer

- Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
- Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.

* [Refactor] Introduce Tensor alias for Buffer in proxy.py

- Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
- This change enhances clarity and consistency in tensor usage across the codebase.

bf8a6fc1

25 Mar, 2025 1 commit

[Refactor] Enhance Autotune (#266) · 541e1685

yyttt6 authored Mar 25, 2025

* add autotune to example_gemm.py

* format init.py

541e1685

22 Mar, 2025 2 commits

[Bugfix] Fix Benchmark/Example Code for Autotuning (#254) · 0430cfe7

Chaofan Lin authored Mar 23, 2025



* fix tune args

* lint

* Refactor gemm example and autotuner logging

- Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter.
- Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity.
- Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility.
- Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds.
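The bounds check on `result_idx` mentioned above might look roughly like the following. This is a hypothetical helper, not the code in `BaseKernelAdapter`; it assumes `result_idx` may be `None`, a single integer, or a sequence of integers, and that negative values count from the end of the parameter list.

```python
from typing import List, Optional, Sequence, Union


def legalize_result_idx(
    result_idx: Optional[Union[int, Sequence[int]]], num_params: int
) -> List[int]:
    """Normalize output indices to non-negative positions within the parameter list."""
    if result_idx is None:
        return []
    indices = [result_idx] if isinstance(result_idx, int) else list(result_idx)
    legalized = []
    for idx in indices:
        # Accept the same range Python indexing accepts: [-num_params, num_params).
        if not -num_params <= idx < num_params:
            raise ValueError(
                f"result_idx {idx} is out of range for {num_params} parameters"
            )
        legalized.append(idx % num_params)
    return legalized
```

With three kernel parameters, `legalize_result_idx(-1, 3)` maps the last parameter to position 2, which matches the common convention of marking the final argument as the output.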

* Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, simplifying the implementation by removing the unused parameter `C`.

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

0430cfe7

[Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
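The power-of-two check and the inclusive scan it guards can be sketched on the host side in plain Python. The block-wise decomposition below is an assumption about how such a kernel is typically organized, not a description of `example_tilelang_cumsum.py` itself.

```python
from itertools import accumulate
from typing import List, Sequence


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero, negatives, and everything else."""
    return n > 0 and (n & (n - 1)) == 0


def blocked_inclusive_scan(values: Sequence[int], block_size: int) -> List[int]:
    """Inclusive prefix sum computed block by block with a running carry."""
    if not is_power_of_two(block_size):
        raise ValueError(f"block_size must be a power of two, got {block_size}")
    out: List[int] = []
    carry = 0
    for start in range(0, len(values), block_size):
        # Scan within the block, then offset by the total of all earlier blocks.
        block = list(accumulate(values[start:start + block_size]))
        out.extend(carry + x for x in block)
        carry = out[-1]
    return out
```

Comparing the blocked result against `list(accumulate(values))` plays the same role as validating a kernel against PyTorch's built-in cumulative sum.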

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.

* Refactor CUDA post-processing callback registration in TileLang

- Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
- Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
- Added debug prints to the CUDA code generation process for better traceability during development.
- Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
- Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.

* lint fix

* Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters

- Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
- Implemented a function to generate various kernel configurations for autotuning.
- Refactored the main execution block to support both autotuned and default configurations.
- Improved the block mask generation to accommodate specified sparsity levels.
- Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
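A block mask for a given sparsity ratio could be generated along the following lines. This sketch assumes that "sparsity" means the fraction of blocks that are skipped and that an exact count is preferred over per-block random sampling; neither detail is stated in the commit message.

```python
import random
from typing import List


def make_block_mask(
    rows: int, cols: int, sparsity: float, seed: int = 0
) -> List[List[bool]]:
    """Return a rows x cols mask where True marks a block that is computed."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError(f"sparsity must be in [0, 1], got {sparsity}")
    rng = random.Random(seed)  # seeded so the mask is reproducible
    total = rows * cols
    num_skipped = round(total * sparsity)
    flat = [False] * num_skipped + [True] * (total - num_skipped)
    rng.shuffle(flat)
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]
```

Fixing the number of skipped blocks makes the measured speedup comparable across runs, whereas independent sampling per block would let the effective sparsity drift.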

f47b43c5

21 Mar, 2025 1 commit

add autotune to example_gemm.py (#252) · 316d3b97

yyttt6 authored Mar 21, 2025

* add autotune to example_gemm.py

* add autotune to example_gemm.py

* add autotune to example_gemm.py

* add autotune to example_gemm.py

316d3b97

12 Mar, 2025 1 commit

[Enhancement] Simplify GEMM example with direct kernel compilation (#191) · 79ea77e8

Lei Wang authored Mar 12, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process
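The job-count rule from the two build commits above (90% of the available cores, never fewer than one) reduces to a one-line calculation. The function name and the `cpu_count` override are assumptions added for this sketch so the logic can be exercised without depending on the host machine.

```python
import multiprocessing
from typing import Optional


def compute_build_jobs(fraction: float = 0.9, cpu_count: Optional[int] = None) -> int:
    """Return the number of parallel build jobs: a fraction of the cores, at least 1."""
    total = cpu_count if cpu_count is not None else multiprocessing.cpu_count()
    # int() truncates, so a single-core machine would yield 0 without the max().
    return max(1, int(total * fraction))
```

The result would typically be passed to the build tool's parallelism flag, leaving a little headroom so the machine stays responsive during compilation.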

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic
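Resolving symbolic output dimensions from input tensors, with a descriptive error when a symbol is missing, can be sketched as below. Only the name `dynamic_symbolic_map` comes from the commit message; its layout here (symbol name mapped to an input index and a dimension index) and the helper itself are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple, Union

Dim = Union[int, str]  # a static extent or the name of a symbolic dimension


def resolve_shape(
    shape: Sequence[Dim],
    dynamic_symbolic_map: Dict[str, Tuple[int, int]],
    input_shapes: Sequence[Sequence[int]],
) -> List[int]:
    """Replace each symbolic dimension with the concrete extent of an input tensor."""
    resolved: List[int] = []
    for dim in shape:
        if isinstance(dim, int):
            resolved.append(dim)
            continue
        if dim not in dynamic_symbolic_map:
            raise ValueError(
                f"Cannot find dynamic symbolic dimension {dim!r} in "
                f"dynamic_symbolic_map; known symbols: {sorted(dynamic_symbolic_map)}"
            )
        tensor_idx, dim_idx = dynamic_symbolic_map[dim]
        resolved.append(input_shapes[tensor_idx][dim_idx])
    return resolved
```

For instance, an output declared as `["m", 1024]` with `"m"` bound to dimension 0 of the first input resolves to `[256, 1024]` when that input has shape `(256, 512)`.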

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

* Simplify GEMM example with direct kernel compilation

- Update copyright header to Tile-AI Corporation
- Remove Profiler import and usage
- Replace tilelang.lower() with tilelang.compile()
- Simplify kernel execution workflow
- Update kernel source retrieval method

79ea77e8

11 Jan, 2025 2 commits

[Lint] Overall Typo and Linting Fixes (#13) · fa511857

Lei Wang authored Jan 11, 2025

* README.md fixed

* update test ci

* Lint and Typo Fix

* Clang Format Lint Fix

fa511857

[Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c

Lei Wang authored Jan 11, 2025



* Add format.sh script for code formatting and linting

* docs update

* center align the title

* lint fix

* add ignore

* Add .gitignore for 3rdparty directory

* Add requirements-dev.txt, requirements-test.txt, and requirements.txt

* 3rdparty

* Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h

* Refactor CMakeLists.txt and include statements

- Update CMakeLists.txt to use a newer version of CMake and add project name
- Remove unnecessary include directories

Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc

- Update include paths to use relative paths instead of absolute paths

* Update submodule for 3rdparty/tvm

* update

* load dll first

* Refactor CMakeLists.txt and include statements

* Refactor CMakeLists.txt and include statements

* git keep update

* Refactor CMakeLists.txt and include statements

* Refactor CMakeLists.txt and include statements

* refactor code structure

* Update Readme

* CMakeLists Customized

* update readme

* update README

* update readme

* update usage

* with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import

* annotate lower transform global func with `transform` prefix

* Migrate Simplify Pass from tilelang tvm branch

* enhance system environment handling with __init__ and CMake

* Initial commit

* CODE_OF_CONDUCT.md committed

* LICENSE committed

* README.md committed

* SECURITY.md committed

* SUPPORT.md committed

* CODE_OF_CONDUCT Commit

* LICENSE Commit

* SECURITY Commit

* SUPPORT Commit

* Modify Support

* Update README.md

* security ci update

* remove examples

* Update and implement clang-format

* add composable kernel components

* Migrate from latest update

* submodule update

* Test update

* Update License

* Spell check

* lint fix

* add clang-tidy to apply static analysis for c source

* update tilelang examples

* Update Install Docs

* Refactor filetree

* Enhance Install

* conflict resolved

* annotate_version

* Initial Update

* test fix

* install

* Implement setup.py

* lint fix

* Separate Init

* Separate test

* docker file commit

* add logo

* Update Readme and Examples

* update readme

* update logo

* Implement AMD Installation

* Add License

* Update AMD MI300x Benchmark

* update README

* update mi300 benchmark scripts

* update ignore

* enhance build script

* update image

* enhance setup.py to remove duplicated libraries

* remove debug files

* update readme

* update image

* update gemm examples

* update flashattention README

* readme update

* add cmake into requirements

* libinfo fix

* auto update submodule

* lint fix

* Fix AMD Build and Test

* Update check for transpose attribute for CDNA Arch

* typo fix for amd

* Implement Matmul Benchmark

* Refactor Code

* [TypoFix] Fix GEMM Example

* [Docs] Init Linear Attention README

* [TYPO] Typo fix

* [Lint] Lint Fix

* enhance example with intrinsics

* [Enhancement] Improve Buffer Collection during IR Parser

* [Dev] Introduce Current classmethod to get current frame

* submodule update

* fake test pass update

* support thread_extent_api

* code optimize

* Add GEMM function implementation for matrix multiplication

* Update logging format to reflect TileLang in logger messages

* Refactor CMakeLists.txt for improved readability and set default build type to Release

* Support Gemm SS Primitives Implementation

* [README] Upload Tile Language Logo (#5)

* update logo

* Update README.md to enhance formatting and center the title

---------
Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
Co-authored-by: Microsoft Open Source <microsoftopensource@users.noreply.github.com>
Co-authored-by: Yu Cheng <yu.cheng@pku.edu.cn>

57ab687c