"docs/archive_en_US/TrialExample/Trials.md" did not exist on "3afe882a965fb19a740d2020cd8200f7d8e9ca89"
- 01 Apr, 2025 1 commit
-
-
Yu Cheng authored
* [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding * [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture - Added a check for the existence of the target attribute "arch" to ensure that there is no undefined behavior when handling the specific architecture "sm_90". This change improves the robustness and compatibility of the code.
-
- 26 Mar, 2025 1 commit
-
-
Yu Cheng authored
- Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches. - Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management. - Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.
-
- 24 Mar, 2025 2 commits
-
-
Lei Wang authored
* [Refactor] Improve flash attention example and layout comparison logic - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code. - Updated the handling of `lse_local_split` to utilize parallel processing for better performance. - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example. - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons. * lint fix * [Enhancement] Add support for shared memory scope in Fill operation - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation. - Implemented parallel operation and layout inference for improved performance in shared memory scenarios. - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
-
Lei Wang authored
* Fix indentation in JIT adapter wrapper to ensure consistent formatting of return statement in generated C code. * Enhance Fill Operation in TileLang - Updated the Fill constructor to support BufferLoad instances, adding checks for ramp indices and ensuring only stride 1 ramps are processed. - Introduced a region array to manage the bounds of the fill operation, improving error checking for static regions. - Modified the MakeSIMTLoop method to utilize the new region array for loop variable bounds, enhancing flexibility in kernel generation. - Updated the fill and clear functions in fill.py to accept both tir.Buffer and tir.BufferRegion types, improving usability and type handling. * Refactor Fill Operation and Improve Readability - Simplified the Fill constructor by enhancing the handling of BufferLoad instances and ensuring proper checks for ramp indices. - Improved error messages for region size checks to enhance clarity. - Cleaned up formatting in the Fill method for better readability. - Added a blank line in the matmul function test to improve code organization. - Introduced a blank line in the fill function to enhance readability in fill.py. * Add matrix multiplication functionality and test in TileLang - Introduced a new test file `test_tilelang_language_clear.py` that implements a matrix multiplication function using TileLang's primitives. - The `matmul` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types. - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation. - Updated the `__init__.py` in the utils module to include `map_torch_type`, enhancing type handling for tensor operations. * lint fix
-
- 20 Mar, 2025 1 commit
-
-
Lei Wang authored
* remove llvm build * [Refactor] Update kernel compilation and profiling in examples - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation. - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency. - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations. * lint fix * License Update * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields. - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability. * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files - Improved comment alignment and readability in `cuda.h`. - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability. * lint fix * lint fix * lint fix * lint fix * fix * License update * [Enhancement] Update JITKernel to use artifact for kernel source - Assigned the generated artifact to `self.artifact` for better management. - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling. * lint fix * Add @tilelang.testing.requires_llvm decorator to vectorization tests * Enhance setup.py and env.py for library management - Added functionality to remove original files after copying in CMakeBuild. - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration. * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py * Refactor CMakeBuild file handling in setup.py - Added a check to ensure the target library directory exists before copying .so files. - Improved the logic for creating the target directory and copying files to enhance robustness. * bugfix * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement. * lint fix * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility. * lint fix * Add support for C target in device code generation - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function. * [Enhancement] Implement auto-clear cache feature based on environment variable * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing. * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing. * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true. * [Refactor] Update kernel invocation and import paths in tests and cache * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result. * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`. * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability. * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class. * Enhanced overall code formatting to align with project standards. * [Enhancement] Add bfloat16 test case and improve kernel caching logic * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`. * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading. * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management. * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database. * Improved code formatting and readability across several files. * lint fix * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
-
- 19 Mar, 2025 1 commit
-
-
Yu Cheng authored
* [Enhancement] Add zero initialization option to GEMM operations - Introduced a new `zero_init` parameter to the GEMM function, allowing for optional zero initialization of the accumulator. - Updated the GEMM implementation across various CUDA architectures to support the new parameter. - Modified the Python interface for GEMM to include the `zero_init` argument, enhancing flexibility in kernel execution. - Ensured compatibility with existing functionality while improving initialization control for performance optimization. * rename zero_init to clear_accum * lint
-
- 18 Mar, 2025 2 commits
-
-
Yu Cheng authored
* [BugFix] Fix bug of missing MBarrierExpectTX * [Dev] Implement FlashAttention3 Backward - Added a new example for Flash Attention using pipelined WGMMA, including forward and backward pass implementations. - Introduced functions for forward and backward processing, leveraging tilelang for optimized tensor operations. - Enhanced the attention mechanism with support for both causal and non-causal configurations. - Included command-line arguments for batch size, number of heads, context size, and head dimension for flexibility in testing. - Updated GEMM operations to support a new `wg_wait` parameter for improved synchronization in kernel execution.
-
Lei Wang authored
* [Feature] Add reduce_max functionality and corresponding tests * Introduced a new test file for the reduce_max operation in the tilelang language module. * Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying. * Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation. * Enhanced profiling assertions to validate the output against reference implementations. * Fix whitespace issues in reduce_max test file for improved readability * [Refactor] Update DebugOutput methods to return strings instead of void * Modified DebugOutput methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to return std::string instead of void, enhancing usability for logging and debugging. * Updated corresponding header files to reflect the new return types. * Improved layout inference error messages by incorporating DebugOutput for better clarity in layout conflicts. * lint fix * Fix typo in matmul function: changed loop from T.Parallel to T.grid for correct parallel execution in webgpu code generation tests. * [Enhancement] Improve layout inference conflict handling in ParallelOp * Updated the layout inference logic in ParallelOp to better handle conflicts for local.fragment buffers. * Added checks to ensure that layout conflicts are reported only when both source and destination buffers are defined, improving clarity in error messages. * Enhanced the overall robustness of the layout inference process by addressing specific cases where conflicts may arise. * [Feature] Add IsEqual methods for layout comparison * Introduced IsEqual methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to facilitate structural equality checks, allowing for optional index comparison. * Enhanced layout inference logic in Copy and ParallelOp to utilize the new IsEqual methods for better conflict detection in local.fragment layouts. * Improved error messages for layout conflicts to provide clearer guidance on potential issues.houm * [Refactor] Update profiler usage in benchmark_nsa_fwd.py and improve layout inference in elem.cc and parallel.cc * Modified the profiler call in benchmark_nsa_fwd.py to streamline latency measurement. * Updated layout inference logic in elem.cc and parallel.cc to use const pointers for FragmentNode, enhancing type safety and clarity. * Improved error messages in layout conflict checks to provide better guidance on potential issues. * [Refactor] Clean up pointer formatting in layout inference files * Standardized pointer formatting for FragmentNode in elem.cc and parallel.cc to improve code readability. * Minor adjustments to error message formatting in layout conflict checks for better clarity.
-
- 16 Mar, 2025 1 commit
-
-
zqh-wz authored
* add test for issue 101 * use ss_smem_selector from cutlass * fix mismatch between smem layout and mma * only fix for sm90 * Add CUDA requirements to GEMM thread tests * lint fix --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 14 Mar, 2025 1 commit
-
-
Lei Wang authored
* Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging. * Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support. - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types. - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture. - Include necessary header files for built-in operations and layout inference improvements. * lint fix * Remove unused builtin.h include directive * Update include path for builtin.h
-
- 13 Mar, 2025 1 commit
-
-
zqh-wz authored
* upgrade cutlass to upstream v3.8.0 * Implement fp8 gemm and add example script * Fix dtype retrieval with map_torch_type for fp8 inputs * Disable vectorization of fp8 values * Make MMA declaration compatible with cutlass 3.4.0+ * Add test for fp8 T.gemm * fix indent * fix indent * Add copyright and license header * Add copyright and license header * lint fix * Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting. * clang format lint --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
- 12 Mar, 2025 2 commits
-
-
Yu Cheng authored
- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization - Add new builtin operations in op/builtin.cc and op/builtin.h - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls - Update CUDA codegen to support new TMA store synchronization intrinsics - Add Python language bindings for new TMA store synchronization operations
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment
-
- 09 Mar, 2025 1 commit
-
-
Lei Wang authored
* Add TMA lowering configuration option and update copyright notices This commit introduces a new configuration option to disable TMA (Tensor Memory Access) lowering and updates copyright notices across multiple files. Key changes include: - Add `kDisableTMALower` configuration option in builtin.h and builtin.cc - Update copyright notices from Microsoft Corporation to Tile-AI Corporation - Modify `LowerArgs` struct to include `disable_tma_lower` flag - Update JIT compilation interfaces to support pass configuration - Enhance error reporting in bulk copy lowering - Propagate pass configuration through various adapter layers * lint fix
-
- 05 Mar, 2025 1 commit
-
-
Yu Cheng authored
[Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 (#141) - Remove redundant `acc_s_0` fragment in flash attention kernel - Simplify memory copy and reduction operations - Reorder memory copy and scaling steps for improved performance - Add Hopper-specific synchronization method in CUDA reduce template - Update reduce operation to use architecture-specific synchronization
-
- 27 Jan, 2025 1 commit
-
-
Yu Cheng authored
* [Dev] Add FlashDecoding example * [CI][Test] Add test cases for tilelang kernel convolution * [CI][Test] Add test cases for tilelang kernel FlashAttention * Reduce the number of stages to ensure the shared memory allocation is valid * Temporarily remove the dim128 case * lint * update einops in requirements-dev.txt * update einops in requirements-test.txt * remove einops in requirements-dev.txt * [CI][Test] Add test cases for tilelang transform ClusterPlanning * [CI][Test] Add test cases for tilelang transform LowerHopperIntrin
-
- 17 Jan, 2025 1 commit
-
-
Lei Wang authored
* README.md fixed * test fix * cpu backend update * cpu test case
-
- 11 Jan, 2025 2 commits
-
-
Lei Wang authored
* README.md fixed * update test ci * Lint and Typo Fix * Clang Format Lint Fix
-
Lei Wang authored
* Add format.sh script for code formatting and linting * docs update * center align the title * lint fix * add ignore * Add .gitignore for 3rdparty directory * Add requirements-dev.txt, requirements-test.txt, and requirements.txt * 3rdparty * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h * Refactor CMakeLists.txt and include statements - Update CMakeLists.txt to use a newer version of CMake and add project name - Remove unnecessary include directories Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc - Update include paths to use relative paths instead of absolute paths * Update submodule for 3rdparty/tvm * update * load dll first * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * git keep update * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * refactor code structure * Update Readme * CMakeLists Customized * update readme * update README * update readme * update usage * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import * annotate lower transform global func with `transform` prefix * Migrate Simplify Pass from tilelang tvm branch * enhance system environment handling with __init__ and CMake * Initial commit * CODE_OF_CONDUCT.md committed * LICENSE committed * README.md committed * SECURITY.md committed * SUPPORT.md committed * CODE_OF_CONDUCT Commit * LICENSE Commit * SECURITY Commit * SUPPORT Commit * Modify Support * Update README.md * security ci update * remove examples * Update and implement clang-format * add composable kernel components * Migrate from latest update * submodule update * Test update * Update License * Spell check * lint fix * add clang-tidy to apply static analysis for c source * update tilelang examples * Update Install Docs * Refactor filetree * Enhance Install * conflict resloved * annotate_version * Initial Update * test fix * install * Implement setup.py * lint fix * Separate Init * Separate test * docker file commit * add logo * Update Readme and Examples * update readme * update logo * Implement AMD Installation * Add License * Update AMD MI300x Benchmark * update README * update mi300 benchmark scripts * update ignore * enhance build scirpt * update image * enhance setup.py to remove duplicated libraries * remove debug files * update readme * update image * update gemm examples * update flashattention README * readme update * add cmake into requirements * libinfo fix * auto update submodule * lint fix * Fix AMD Build and Test * Update check for transpose attribute for CDNA Arch * typo fix for amd * Implement Matmul Benchmark * Refactor Code * [TypoFix] Fix GEMM Example * [Docs] Init Linear Attention README * [TYPO] Typo fix * [Lint] Lint Fix * enhance example with intrinsics * [Enhancement] Improve Buffer Collection during IR Parser * [Dev] Introduce Current classmethod to get current frame * submodule update * fake test pass update * support thread_extent_api * code optimize * Add GEMM function implementation for matrix multiplication * Update logging format to reflect TileLang in logger messages * Refactor CMakeLists.txt for improved readability and set default build type to Release * Support Gemm SS Primitives Implementation * [README] Upload Tile Language Logo (#5) * update logo * Update README.md to enhance formatting and center the title --------- Co-authored-by:
microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com> Co-authored-by:
Microsoft Open Source <microsoftopensource@users.noreply.github.com> Co-authored-by:
Yu Cheng <yu.cheng@pku.edu.cn>
-