1. 09 Apr, 2025 1 commit
    • Yu Cheng's avatar
      [Example] Introduce autotuning example for GEMM with enhanced configuration options (#360) · d4194222
      Yu Cheng authored
      * Added a new example script `example_gemm_autotune.py` to demonstrate autotuning for matrix multiplication (GEMM) using TileLang.
      * Implemented functions for generating configurations, selecting the best configuration, and benchmarking performance.
      * Refactored the existing `matmul` function to support dynamic configuration parameters and improved kernel compilation.
      * Updated the main execution block to include command-line argument parsing for matrix dimensions and autotuning options.
      * Enhanced the example to validate results against a reference implementation, ensuring correctness in matrix multiplication operations.
      d4194222
  2. 07 Apr, 2025 1 commit
    • Lei Wang's avatar
      [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler (#344) · f005db9f
      Lei Wang authored
      * [Enhancement] Update GEMM examples and autotuner for improved performance
      
      - Modified `example_gemm_intrinsics.py` to enhance matrix multiplication configurations, increasing warp sizes and adjusting data types for better performance.
      - Updated the kernel compilation process to utilize the new `tilelang.compile` method and improved latency measurement with the profiler.
      - Refactored `example_gemm.py` to include a new autotuning configuration and ensure consistency in latency checks against reference results.
      - Adjusted tensor supply generation in `tilelang/utils/tensor.py` to use `torch.randn` for better randomness in tensor initialization.
      - Enhanced the `JITContext` in `tilelang/autotuner/__init__.py` to replace the profiler with a kernel instance for performance measurement, improving the overall structure of the autotuner.
      
      * bug fix
      
      * fix
      
      * [Enhancement] Update convolution tests and profiling assertions
      
      - Added a random seed setting for reproducibility in convolution tests.
      - Removed several redundant convolution test cases to streamline the testing process.
      - Updated the assertion in the matrix multiplication profiling to include a maximum mismatched ratio for improved accuracy in results.
      - Enabled the main testing function for better test execution.
      
      * lint fix
      f005db9f
  3. 03 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Feat] Enhance CUDA Property Handling (#322) · c0378aa9
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <wyatuestc@gmail.com>
      c0378aa9
  4. 30 Mar, 2025 1 commit
  5. 26 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
  6. 25 Mar, 2025 1 commit
  7. 22 Mar, 2025 2 commits
    • Chaofan Lin's avatar
      [Bugfix] Fix Benchmark/Example Code for Autotuning (#254) · 0430cfe7
      Chaofan Lin authored
      
      
      * fix tune args
      
      * lint
      
      * Refactor gemm example and autotuner logging
      
      - Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter.
      - Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity.
      - Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility.
      - Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds.
      
      * Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, simplifying the implementation by removing the unused parameter `C`.
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      0430cfe7
    • Lei Wang's avatar
      [Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5
      Lei Wang authored
      * Add GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
      - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
      - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
      - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
      
      * Refactor TileLang examples and enhance kernel compilation
      
      - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
      - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
      - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
      - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
      - Cleaned up import statements in `__init__.py` for better organization and clarity.
      
      * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
      - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
      
      * Refactor CUDA post-processing callback registration in TileLang
      
      - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
      - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
      - Added debug prints to the CUDA code generation process for better traceability during development.
      - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
      - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.
      
      * lint fix
      
      * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters
      
      - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
      - Implemented a function to generate various kernel configurations for autotuning.
      - Refactored the main execution block to support both autotuned and default configurations.
      - Improved the block mask generation to accommodate specified sparsity levels.
      - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
      f47b43c5
  8. 21 Mar, 2025 1 commit
    • yyttt6's avatar
      add autotune to example_gemm.py (#252) · 316d3b97
      yyttt6 authored
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      316d3b97
  9. 20 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
      f2e99180
  10. 12 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Simplify GEMM example with direct kernel compilation (#191) · 79ea77e8
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      79ea77e8
  11. 20 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Feature] Add CTypes JIT kernel support (#100) · 7c817d51
      Lei Wang authored
      * [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution
      
      - Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
      - Add support for multi-stream kernel execution in CTypes backend
      - Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
      - Add symbolic shape utility function in tilelang.language
      - Update profiler to improve flexibility in benchmark selection
      
      * Remove redundant thread binding in GEMM kernel implementations
      
      - Remove unnecessary `thread_binding` line in GEMM kernel functions
      - Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
      - Enhance code readability by removing redundant thread binding annotation
      
      * Fix indentation in int4 GEMM kernel test file
      
      - Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
      - Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
      - Improve code formatting for better readability
      7c817d51
  12. 23 Jan, 2025 2 commits
    • Lei Wang's avatar
      [Refactor] Simplify interface via replacing argument thread binding of... · 362b3520
      Lei Wang authored
      [Refactor] Simplify interface via replacing argument thread binding of intrinsics with `KernelFrame.Current` (#34)
      
      * installation script fix
      
      * readme typo fix
      
      * doc fix for dequantize gemm
      
      * [Doc] remove CODE_OF_CONDUCT.md and SECURITY.md; update references in CONTRIBUTING.md
      
      * [Doc] add unit tests for AnnotateDeviceRegions transform; remove SUPPORT.md
      
      * update license
      
      * [Enhancement] add tensor supply handling for unsigned integers; improve error message for execution backend assertion
      
      * [Refactor] improve code readability by reformatting function signatures and assertions
      
      * [Refactor] replace torch.manual_seed with tilelang.testing.set_random_seed for consistency in random seed handling
      
      * [Refactor] unify thread binding variable naming across kernel and example files
      
      * [Refactor] remove unused thread binding parameter from matrix multiplication functions
      
      * [Refactor] remove unused thread binding parameter from matrix multiplication functions
      
      * [Refactor] enable main testing function in tilelang kernel gemm test
      
      * bug fix
      362b3520
    • Lei Wang's avatar
      [CI] Comprehensive Test cases Implementation of Matmul Dequantize (#32) · 7959d786
      Lei Wang authored
      * installation script fix
      
      * readme typo fix
      
      * doc fix for dequantize gemm
      
      * [Doc] remove CODE_OF_CONDUCT.md and SECURITY.md; update references in CONTRIBUTING.md
      
      * [Doc] add unit tests for AnnotateDeviceRegions transform; remove SUPPORT.md
      
      * update license
      
      * [Enhancement] add tensor supply handling for unsigned integers; improve error message for execution backend assertion
      
      * [Refactor] improve code readability by reformatting function signatures and assertions
      
      * [Refactor] replace torch.manual_seed with tilelang.testing.set_random_seed for consistency in random seed handling
      7959d786
  13. 11 Jan, 2025 2 commits
    • Lei Wang's avatar
      [Lint] Overall Typo and Linting Fixes (#13) · fa511857
      Lei Wang authored
      * README.md fixed
      
      * update test ci
      
      * Lint and Typo Fix
      
      * Clang Format Lint Fix
      fa511857
    • Lei Wang's avatar
      [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
      
      * annotate lower transform global func with `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resloved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build scirpt
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: default avatarmicrosoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: default avatarMicrosoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: default avatarYu Cheng <yu.cheng@pku.edu.cn>
      57ab687c