- 17 Apr, 2025 1 commit
-
-
Lei Wang authored
* Update CI configuration to run pytest with automatic parallelization using the '-n auto' option. * Enhance Cython JIT Adapter Compilation Logic - Improved the locking mechanism during the compilation of the Cython JIT adapter to prevent race conditions. - Added checks to determine if another process has already compiled the library, reducing unnecessary recompilation. - Cleaned up the code by removing redundant imports and ensuring proper handling of temporary files during compilation failures. - Updated vectorization logic in loop_vectorize.cc to allow optional simplification of vectorized expressions. This update enhances performance and reliability in the JIT compilation process. * lint fix * Update CI configuration to run pytest with 4 parallel jobs instead of auto-detection * Add pytest markers for serial execution in MHA tests - Added @pytest.mark.serial to multiple MHA test functions to ensure they run sequentially. - This change improves test reliability by preventing potential race conditions during execution. * Update TVM submodule and enhance vectorization logic in loop_vectorize.cc - Updated the TVM submodule to the latest commit. - Modified the vectorization logic to include optional simplification of vectorized expressions and added checks to ensure the usage of vectorized variables, improving performance and reliability in expression handling. * Remove @pytest.mark.serial from multiple MHA test functions to allow parallel execution. This change enhances test performance by enabling concurrent test runs while maintaining reliability. * Remove tvm_simplify_test.py file, eliminating the test for expression simplification in TVM. This cleanup helps streamline the codebase by removing unused test cases. * Remove unused pytest import from test_tilelang_kernel_mha.py to streamline the test file. * lint fix * Update TVM submodule and refine vectorization logic in loop_vectorize.cc - Updated the TVM submodule to the latest commit. - Adjusted the return statements in loop_vectorize.cc to improve expression handling and ensure consistency in the visitor pattern. * Refactor vectorization logic in loop_vectorize.cc - Removed the check for the usage of the vectorized variable in the vectorization logic, simplifying the expression handling. - This change enhances the clarity and efficiency of the vectorization process. * Enhance vectorization checks in loop_vectorize.cc - Added a check to ensure the vectorized expression uses the vectorized variable, improving the robustness of the vectorization logic. - This change refines the expression handling and ensures that only valid vectorized expressions are processed. * Implement non-local buffer checks for loop vectorization in layout_inference.cc - Added logic to check for non-local buffer loads and stores before applying vectorization to loops. This enhancement ensures that vectorization is only applied when appropriate, improving the correctness of the loop transformations. * Refactor buffer handling in pipeline planning and layout inference - Renamed GlobalCopyPatternDetector to BufferRegionCollector for clarity and updated its logic to collect buffer read/write regions. - Enhanced the handling of conditional expressions in pipeline planning, allowing for better management of stages related to conditional statements. - Improved the processing of buffer regions during read/write operations, ensuring accurate tracking of buffer usage across different stages. * Refactor vectorization checks in loop_vectorize.cc - Removed the check for the usage of the vectorized variable in the vectorization logic, simplifying the expression handling. - This change enhances the clarity and efficiency of the vectorization process, ensuring that valid vectorized expressions are processed without unnecessary checks.
-
- 14 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement][Pipeline] Improve pipeline stage information handling and copy stage detection - Added detailed documentation for the PipelineStageInfo structure to clarify its parameters. - Enhanced the VisitStmt_ method to handle annotations for pipeline order and stage more effectively. - Implemented logic to determine if a stage is used by a copy operation, adjusting the stage assignment accordingly. - Processed the tail copy stage to ensure correct ordering and stage assignment in the pipeline planning process. * lint fix
-
- 12 Apr, 2025 1 commit
-
-
Lei Wang authored
* Update legalize_safe_memory_access.cc * Add cache path handling and file locking in Cython adapter - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management. - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling. - These changes aim to optimize the compilation process and prevent race conditions during library loading. * lint fix * refactor * refactor * Add GlobalCopyPatternDetector to identify global memory copy patterns - Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements. - Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes. - Improved code clarity and maintainability by encapsulating detection logic within the new class. * Refactor copy stage detection logic in pipeline planning - Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage. - Removed redundant checks for read and write scopes, enhancing code clarity and maintainability. * lint fix
-
- 11 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks - Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements. - Implemented corresponding intrinsic calls for CUDA, enhancing the functionality of the TileLang framework. - Updated the `allocate.py` to handle boolean types correctly in shared memory allocations. - Introduced tests for the new logical operations to ensure correctness and performance. Co-authored-by:
Zhiwen Mo <zhiwen.mo25@ic.ac.uk> * lint fix --------- Co-authored-by:
Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
-
- 12 Mar, 2025 1 commit
-
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment * Simplify GEMM example with direct kernel compilation - Update copyright header to Tile-AI Corporation - Remove Profiler import and usage - Replace tilelang.lower() with tilelang.compile() - Simplify kernel execution workflow - Update kernel source retrieval method * Enhance block sparse attention implementation - Update `blocksparse_flashattn` to use 2 stages for improved performance. - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency. - Modify condition checks in the kernel to utilize boolean values. - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention. - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling. * Refactor and clean up code formatting across multiple files - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`. - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`. - Updated comments and documentation for clarity in `__init__.py` and `phase.py`. - Ensured consistent formatting and style across the codebase.
-
- 11 Jan, 2025 2 commits
-
-
Lei Wang authored
* README.md fixed * update test ci * Lint and Typo Fix * Clang Format Lint Fix
-
Lei Wang authored
* Add format.sh script for code formatting and linting * docs update * center align the title * lint fix * add ignore * Add .gitignore for 3rdparty directory * Add requirements-dev.txt, requirements-test.txt, and requirements.txt * 3rdparty * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h * Refactor CMakeLists.txt and include statements - Update CMakeLists.txt to use a newer version of CMake and add project name - Remove unnecessary include directories Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc - Update include paths to use relative paths instead of absolute paths * Update submodule for 3rdparty/tvm * update * load dll first * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * git keep update * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * refactor code structure * Update Readme * CMakeLists Customized * update readme * update README * update readme * update usage * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import * annotate lower transform global func with `transform` prefix * Migrate Simplify Pass from tilelang tvm branch * enhance system environment handling with __init__ and CMake * Initial commit * CODE_OF_CONDUCT.md committed * LICENSE committed * README.md committed * SECURITY.md committed * SUPPORT.md committed * CODE_OF_CONDUCT Commit * LICENSE Commit * SECURITY Commit * SUPPORT Commit * Modify Support * Update README.md * security ci update * remove examples * Update and implement clang-format * add composable kernel components * Migrate from latest update * submodule update * Test update * Update License * Spell check * lint fix * add clang-tidy to apply static analysis for c source * update tilelang examples * Update Install Docs * Refactor filetree * Enhance Install * conflict resloved * annotate_version * Initial Update * test fix * install * Implement setup.py * lint fix * Separate Init * Separate test * docker file commit * add logo * Update Readme and Examples * update readme * update logo * Implement AMD Installation * Add License * Update AMD MI300x Benchmark * update README * update mi300 benchmark scripts * update ignore * enhance build scirpt * update image * enhance setup.py to remove duplicated libraries * remove debug files * update readme * update image * update gemm examples * update flashattention README * readme update * add cmake into requirements * libinfo fix * auto update submodule * lint fix * Fix AMD Build and Test * Update check for transpose attribute for CDNA Arch * typo fix for amd * Implement Matmul Benchmark * Refactor Code * [TypoFix] Fix GEMM Example * [Docs] Init Linear Attention README * [TYPO] Typo fix * [Lint] Lint Fix * enhance example with intrinsics * [Enhancement] Improve Buffer Collection during IR Parser * [Dev] Introduce Current classmethod to get current frame * submodule update * fake test pass update * support thread_extent_api * code optimize * Add GEMM function implementation for matrix multiplication * Update logging format to reflect TileLang in logger messages * Refactor CMakeLists.txt for improved readability and set default build type to Release * Support Gemm SS Primitives Implementation * [README] Upload Tile Language Logo (#5) * update logo * Update README.md to enhance formatting and center the title --------- Co-authored-by:
microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com> Co-authored-by:
Microsoft Open Source <microsoftopensource@users.noreply.github.com> Co-authored-by:
Yu Cheng <yu.cheng@pku.edu.cn>
-