Commits · 6d3d474306ba4b4b0beb35a00a0554c8204457d0 · OpenDAS / tilelang

12 Apr, 2025 2 commits

[Doc] Add deepseek_mla to documentation index (#380) · aa85ddc7
Lei Wang authored Apr 12, 2025
```
* Add deepseek_mla to documentation index (#380)

* lint fix
```
aa85ddc7

[Docs] Add AMD Flash MLA Documentation to Tutorials Section (#376) · 0997c333

Lei Wang authored Apr 12, 2025

* [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang

- Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
- Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
- Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
- Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.

* doc update

* [Add] Enhance AMD FlashMLA implementation and documentation

- Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation.
- Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
- Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.

* lint fix

* [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies

- Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
- Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling.
- Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
- Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.

0997c333

05 Apr, 2025 1 commit

[Doc] Fix typo and heading level in GEMV tutorial (#337) · 17386d7d

yeh-sudo authored Apr 05, 2025

This pull request includes a change to the `gemv.md` file. The changes
add heading level to title in the document to make the heading level
right.

17386d7d

28 Mar, 2025 1 commit

[doc/example] add gemv doc and examples (#293) · ff3cfa59

botbw authored Mar 28, 2025

* [doc/example] init gemv doc and examples

* [example] add vectorized read

* [example] use local register instead of smem

* [example] add bench

* [doc] update doc

* [doc] refine doc

* [lint] format code

* [doc] add tips

* [doc/example] fix typo

* [example] use tmv_all_reduce

* [doc] update doc accordingly

* [doc] add benchmark table

* [lint] format code

ff3cfa59

27 Mar, 2025 1 commit

[Doc] Python API docs generation (#278) · 5501b31c

Wenhao Xie authored Mar 27, 2025

* fix bug

* update performance.py

* update python api docs

* test workflow

* fix dependency

* fix bug

* fix

* update correct git config

* test workflow

* clear cache

* lint fix

* fix exclude path

5501b31c

26 Mar, 2025 3 commits

[Doc] Update README.md to correct documentation link for TileLang debug tools (#286) · eee45f17
Yu Cheng authored Mar 26, 2025

eee45f17

[Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1

Lei Wang authored Mar 26, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

* [Refactor] Remove deprecated decorator and enhance Cython kernel handling

- Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
- Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
- Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
- Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
- Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.

* [Feature] Add matrix multiplication test and kernel implementation

- Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
- Minor formatting improvements in `deprecated.py` for better readability.

* lint fix

* [Refactor] Update tensor creation in matrix multiplication test

- Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
- Updated imports in `__init__.py` to include `make_tensor`.
- Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.

* [Refactor] Update tensor definitions across multiple files

- Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
- Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
- Improved documentation in README and example files to reflect changes in tensor usage.

* lint fix

* [Refactor] Update tensor types in attention and matrix multiplication examples

- Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
- Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
- Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.

* lint fix

* [Refactor] Update tensor types in GEMM example and test files

- Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
- Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.

* [Refactor] Update tensor usage in customize.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the file.

* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py

- Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
- Improved code clarity by standardizing buffer usage across the test file.

* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer

- Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
- Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.

* [Refactor] Introduce Tensor alias for Buffer in proxy.py

- Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
- This change enhances clarity and consistency in tensor usage across the codebase.

bf8a6fc1

add autotune to example_gemm.py (#285) · 73d2c62e
yyttt6 authored Mar 26, 2025

73d2c62e

22 Mar, 2025 1 commit

[Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.

* Refactor CUDA post-processing callback registration in TileLang

- Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
- Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
- Added debug prints to the CUDA code generation process for better traceability during development.
- Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
- Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.

* lint fix

* Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters

- Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
- Implemented a function to generate various kernel configurations for autotuning.
- Refactored the main execution block to support both autotuned and default configurations.
- Improved the block mask generation to accommodate specified sparsity levels.
- Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.

f47b43c5

19 Mar, 2025 1 commit

[Examples] Implement elementwise add kernel (#219) · 43bd9d3e

Chenghua authored Mar 19, 2025

* [Example] Modify tuning configurations for FlashAttention example

* [Examples] formatting example_gqa_fwd_bshd.py

* [Examples] Implement elementwise add kernel

* [Doc] Update ElementWise Operators document

* [Examples] Replace the example of elementwise add.

43bd9d3e

17 Mar, 2025 1 commit

[Doc] Add instructions for installing nightly version (#226) · 3824adab

Wenhao Xie authored Mar 17, 2025

* [Typo] Fix formatting in installation instructions in README.md

* [Enhancement] Improve CUDA path detection and update configuration handling

* fix typo

* remove IS_WINDOWS constant

* lint fix

* Improve error messages for CUDA detection failure

* lint fix

* lint fix

* Fix .gitignore to correctly include venv directory

* [Doc] Add instructions for installing nightly version of TileLang

* update installation instructions

* update install instruction

3824adab

21 Feb, 2025 1 commit

[JIT] Support Cython jit and make cython a default execution backend (#102) · 3471904f

Lei Wang authored Feb 21, 2025

* [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution

- Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
- Add support for multi-stream kernel execution in CTypes backend
- Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
- Add symbolic shape utility function in tilelang.language
- Update profiler to improve flexibility in benchmark selection

* Remove redundant thread binding in GEMM kernel implementations

- Remove unnecessary `thread_binding` line in GEMM kernel functions
- Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
- Enhance code readability by removing redundant thread binding annotation

* Fix indentation in int4 GEMM kernel test file

- Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
- Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
- Improve code formatting for better readability

* [Feature] Add Cython JIT kernel support for dynamic shapes and multi-stream execution

- Implement CythonKernelAdapter to handle dynamic symbolic shapes
- Add support for multi-stream kernel execution in Cython backend
- Create comprehensive test suite for Cython GEMM kernel in test_tilelang_jit_gemm_cython.py
- Update JITKernel to include "cython" as a valid execution backend
- Add Cython-specific wrapper and library generation modules
- Update .gitignore to exclude Cython cache directory
- Modify setup.py to include Cython source files in package data

* lint fix

* [Refactor] Replace JITKernel with compile() function for kernel compilation

- Add new `compile()` function in tilelang/jit/__init__.py as a wrapper for JITKernel
- Update multiple test files and examples to use `tilelang.compile()` instead of `tilelang.JITKernel()`
- Modify kernel adapters to support optional kernel-only source retrieval
- Update `__init__.py` to import the new `compile()` function
- Improve kernel source retrieval for different execution backends

* lint fix

* remove debug print

* Add C/C++ compiler utility module and update Cython JIT kernel support

- Introduce new `tilelang/contrib/cc.py` module with cross-platform C/C++ compiler utilities
- Add functions to detect and retrieve system C/C++ compilers
- Implement cross-compilation and shared library creation support
- Update Cython JIT kernel to validate C++ compiler availability
- Modify Cython adapter to use detected C++ compiler for library generation

* Refactor float8 dtype mapping in tensor utility module

- Move float8_dtype_map inside adapt_torch2tvm function
- Simplify global scope by localizing the dtype mapping
- Maintain existing functionality for converting torch float8 tensors to TVM ndarray

* Refactor float8 dtype mapping in tensor utility module

- Move float8_dtype_map inside adapt_torch2tvm function
- Simplify global scope by localizing the dtype mapping
- Maintain existing functionality for converting torch float8 tensors to TVM ndarray

* revert

* Enhance Cython JIT adapter with Cython compiler detection

- Add `get_cython_compiler()` function to dynamically locate Cython executable
- Update Cython adapter to use detected Cython compiler instead of hardcoded command
- Raise an exception if no Cython compiler is found
- Update requirements.txt to specify minimum PyTorch version (>=2.2.0)

* Fix Cython kernel wrapper stream handling and type annotations

- Update stream parameter type to int64_t for better compatibility
- Directly use torch.cuda.current_stream().cuda_stream instead of casting
- Improve type safety and precision in Cython kernel wrapper

3471904f

13 Feb, 2025 2 commits

[Bugfix] Bugfix of installing with develop mode (#81) · b43ab2dd

Lei Wang authored Feb 13, 2025

* bump version into v0.1.0

* [Enhancement] Add custom develop command for editable installs and update .gitignore

* [Documentation] Update README to include system dependencies installation instructions

* [Build] Update setup.py to support library file copying for both release and develop modes

* [Build] Refactor library file copying logic in setup.py

* [Documentation] Remove unnecessary install section header in Installation.md

b43ab2dd

[Doc] Convert docs from rst format to Markdown format. (#82) · d44e291c

Wenhao Xie authored Feb 13, 2025

* [CI] Clean up target repository before publishing documentation.

* [Doc] Convert docs from rst format to Markdown format.

d44e291c

09 Feb, 2025 1 commit

[Tools] Introduce `plot_layout` to visualize the fragment layout (#68) · f9b6a92e

Lei Wang authored Feb 10, 2025

* [Enhancement] Add VectorizeLoop function and update imports for compatibility

* [CI][Test] Improve test cases for vectorization and fix typos in parser comments

* lint fix

* Fix incorrect module reference for VectorizeLoop transformation

* Refactor vectorize_loop transformation by removing unused extent mutation logic

* [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen

* Fix formatting in CUDA FP8 header file for consistency

* Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity

* Update submodule 'tvm' to latest commit for improved functionality

* Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.

* Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.

* Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency

* Add CUDA requirements to FP8 test cases and update references for clarity

* Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py

* Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py

* Add CUDA requirements and FP8 test cases for matmul and gemv simulations

* Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py

* Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py

* Add BF16 support to matrix multiplication and introduce corresponding test cases

* Add a blank line for improved readability in BF16 GEMM test

* Update acknowledgements in README to include supervision by Zhi Yang at Peking University

* enhance acknowledgement

* Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives

* Update subproject commit for TVM dependency

* Update subproject commit for TVM dependency

* Add int4_t type and functions for packing char values in CUDA common header

* Add plot_layout example and implement GetForwardVars method in layout classes

* Refactor code for improved readability by adjusting line breaks and formatting in layout and test files

* Fix formatting by removing unnecessary line break in layout.h

* Refactor make_int4 function for improved readability by adjusting parameter formatting

f9b6a92e

02 Feb, 2025 1 commit

[Doc] Add matmul kernel tutorial documentations with tile library (#60) · ea612446

Lei Wang authored Feb 02, 2025

* implement jit test case

* [Dev] implement auto tune test case for matrix multiplication

* Implement test for legalize memory access and vectorized loop

* lint fix

* introduce run_once

* Refactor callback function names for consistency and improve code readability

* enhance documentations

* lint fix

* lint fix

* lint fix

* lint fix

* fix formatting issues in rt_mod_hip.cc

* add random seed initialization for deterministic testing

* Add documentation images and comprehensive GEMM tutorial for TileLang

* Update MATMUL documentation title to highlight Tile Library

ea612446

26 Jan, 2025 1 commit

[Doc] Addd debug relevant testing and documentations (#58) · 5e259239

Lei Wang authored Jan 26, 2025

* implement jit test case

* [Dev] implement auto tune test case for matrix multiplication

* Implement test for legalize memory access and vectorized loop

* lint fix

* introduce run_once

* Refactor callback function names for consistency and improve code readability

* enhance documentations

* lint fix

* lint fix

* lint fix

* lint fix

* fix formatting issues in rt_mod_hip.cc

* add random seed initialization for deterministic testing

5e259239

24 Jan, 2025 1 commit
- [Doc] Update documentation structure and content: add overview section, revise... · 3f38c10c
  Lei Wang authored Jan 24, 2025
```
[Doc] Update documentation structure and content: add overview section, revise project name, and change theme to Furo (#39)
```
  3f38c10c
23 Jan, 2025 1 commit

[Doc] Update GitHub Actions workflow for documentation deployment and add CNAME file. (#33) · 1b63d3a2

Wenhao Xie authored Jan 23, 2025

* [Doc] Use sphinx to generate docs.

* [Doc] Fix a bug on tlcpack_sphinx_addon.

* [Doc] Fix linting issues.

* [Doc] Create a workflow to host docs using GitHub Pages.

* [Doc] Remove all deprecated docs.

* [Doc] Update GitHub Actions workflow for documentation deployment and add CNAME file.

* [CI] Remove conditional check for GitHub Pages deployment on main branch.

1b63d3a2

22 Jan, 2025 1 commit
- [Doc] fix a typo in installation.rst (#24) · b089aa9a
  Cunxiao Ni authored Jan 22, 2025
```
* [Doc] fix installation.rst

* [Doc] fix installation
```
  b089aa9a
21 Jan, 2025 2 commits

[Doc] update installation.md and readme (#22) · 43f55f7d

Cunxiao Ni authored Jan 22, 2025

* [Doc] update installation.md and readme

* solve conflicts

* change readme

* fix installation.rst

* fix readme

* fix installation

43f55f7d

[Doc] Use sphinx to generate docs. (#21) · 8e7feca9

Wenhao Xie authored Jan 22, 2025

* [Doc] Use sphinx to generate docs.

* [Doc] Fix a bug on tlcpack_sphinx_addon.

* [Doc] Fix linting issues.

8e7feca9

20 Jan, 2025 1 commit

[Dev][jit] Introduce jit for kernel functions (#12) · 39fc5a6d

Lei Wang authored Jan 20, 2025

* instruction update

* replace link with TileLang/tile-lang

* [Dev][Adapter] Implement Torch DLPack Kernel Adapter and related utilities

* lint fix

* Implement JIT Compiler Components

* Documents update

* lint fix

* update logo

* install script fix

39fc5a6d

18 Jan, 2025 1 commit
- [Doc] Revert repo link (#11) · 18718446
  Lei Wang authored Jan 18, 2025
```
* instruction update

* replace link with TileLang/tile-lang
```
  18718446
11 Jan, 2025 2 commits

[Lint] Overall Typo and Linting Fixes (#13) · fa511857
Lei Wang authored Jan 11, 2025
```
* README.md fixed

* update test ci

* Lint and Typo Fix

* Clang Format Lint Fix
```
fa511857

[Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c

Lei Wang authored Jan 11, 2025



* Add format.sh script for code formatting and linting

* docs update

* center align the title

* lint fix

* add ignore

* Add .gitignore for 3rdparty directory

* Add requirements-dev.txt, requirements-test.txt, and requirements.txt

* 3rdparty

* Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h

* Refactor CMakeLists.txt and include statements

- Update CMakeLists.txt to use a newer version of CMake and add project name
- Remove unnecessary include directories

Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc

- Update include paths to use relative paths instead of absolute paths

* Update submodule for 3rdparty/tvm

* update

* load dll first

* Refactor CMakeLists.txt and include statements

* Refactor CMakeLists.txt and include statements

* git keep update

* Refactor CMakeLists.txt and include statements

* Refactor CMakeLists.txt and include statements

* refactor code structure

* Update Readme

* CMakeLists Customized

* update readme

* update README

* update readme

* update usage

* with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import

* annotate lower transform global func with `transform` prefix

* Migrate Simplify Pass from tilelang tvm branch

* enhance system environment handling with __init__ and CMake

* Initial commit

* CODE_OF_CONDUCT.md committed

* LICENSE committed

* README.md committed

* SECURITY.md committed

* SUPPORT.md committed

* CODE_OF_CONDUCT Commit

* LICENSE Commit

* SECURITY Commit

* SUPPORT Commit

* Modify Support

* Update README.md

* security ci update

* remove examples

* Update and implement clang-format

* add composable kernel components

* Migrate from latest update

* submodule update

* Test update

* Update License

* Spell check

* lint fix

* add clang-tidy to apply static analysis for c source

* update tilelang examples

* Update Install Docs

* Refactor filetree

* Enhance Install

* conflict resloved

* annotate_version

* Initial Update

* test fix

* install

* Implement setup.py

* lint fix

* Separate Init

* Separate test

* docker file commit

* add logo

* Update Readme and Examples

* update readme

* update logo

* Implement AMD Installation

* Add License

* Update AMD MI300x Benchmark

* update README

* update mi300 benchmark scripts

* update ignore

* enhance build scirpt

* update image

* enhance setup.py to remove duplicated libraries

* remove debug files

* update readme

* update image

* update gemm examples

* update flashattention README

* readme update

* add cmake into requirements

* libinfo fix

* auto update submodule

* lint fix

* Fix AMD Build and Test

* Update check for transpose attribute for CDNA Arch

* typo fix for amd

* Implement Matmul Benchmark

* Refactor Code

* [TypoFix] Fix GEMM Example

* [Docs] Init Linear Attention README

* [TYPO] Typo fix

* [Lint] Lint Fix

* enhance example with intrinsics

* [Enhancement] Improve Buffer Collection during IR Parser

* [Dev] Introduce Current classmethod to get current frame

* submodule update

* fake test pass update

* support thread_extent_api

* code optimize

* Add GEMM function implementation for matrix multiplication

* Update logging format to reflect TileLang in logger messages

* Refactor CMakeLists.txt for improved readability and set default build type to Release

* Support Gemm SS Primitives Implementation

* [README] Upload Tile Language Logo (#5)

* update logo

* Update README.md to enhance formatting and center the title

---------
Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
Co-authored-by: Microsoft Open Source <microsoftopensource@users.noreply.github.com>
Co-authored-by: Yu Cheng <yu.cheng@pku.edu.cn>

57ab687c