- 03 Jul, 2025 1 commit
-
-
botbw authored
* [experimental] add a draft gemm_sp * [3rdparty] bump cutlass to v3.9.3 * [lint] run format.sh * [chore] rebase * [chore] use abs path * [gemm_sp] add metadata layout * [ci] add more example * [lint] run format.sh * [chore] polish * [chore] move gemm_sp to experimental * [chore] polish * [lint] run format.sh * [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test * Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy. * Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process. * Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality. * Implement Test * [Enhancement] Update GEMM SP and SM89 templates for improved functionality * Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture. * Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets. * Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types. * Added conditional inclusion for GEMM SP header based on CUDA architecture version. * lint fix * [gemm_sp] support more layout and data types * Enhancement: sync T.gemm_sp's layout inference with T.gemm * Enhancement: support more block_k in compress util * [Enhancement] enable block_k=64 * [Lint] run format.sh * [Enhancement] compressor support more dtype * Enhancement: enable block_K=32 * [Lint] format.sh * [Fixbug] fix shape * Refactor: sync gemm * [Enhancement] enable transpose * [Enhancement] enable fp8_e4m3 * [Enhancement] enable int8 * [Lint] run format.sh * [Benchmark] add gemm_sp benchmark * [Example] fix 256 threads hang * [CI] fix ci * [Chore] resolve gemini feedback * [Benchmark] increase search space * [Lint] format * [CI] skip sparse tensor core related tests as only sm90 is supported * [CI] pass local run * Update gemm_sm89.h * lint fix * lint fix * [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation. - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag. - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected. - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch. - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module. * Update test_compress_utils.py --------- Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 11 Jun, 2025 1 commit
-
-
Yu Cheng authored
* [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid. - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code. - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels. - Updated relevant documentation and comments to reflect the addition of new features and usage instructions. * [Example] Add MLA Decode With Persistent Threadblock Example * [Feature] Introduce Persistent Loop and Update GEMM Example - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution. - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication. - Introduced a `loop_break` intrinsic for better control flow within persistent loops. - Updated relevant files to support the new features, including changes in code generation and language interface. * lint fix
-
- 25 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516) * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions. * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity. * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process. * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability. * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies. * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions. * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise. * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios. * Add dynamic matrix multiplication example and corresponding test * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling. * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example. * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness. * lint fix * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device. * Updated the `__init__.py` file to include the new function in the module exports. * lint fix * Add global barrier state and expectation handling in CUDA code generation * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator. * Updated `Finish` method to declare the global barrier state if needed. * Implemented handling for `EvaluateNode` to initialize the barrier expectation. * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method. * Enhanced CUDA FP8 type definitions for better alignment and structure.
-
- 20 Mar, 2025 1 commit
-
-
Lei Wang authored
* remove llvm build * [Refactor] Update kernel compilation and profiling in examples - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation. - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency. - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations. * lint fix * License Update * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields. - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability. * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files - Improved comment alignment and readability in `cuda.h`. - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability. * lint fix * lint fix * lint fix * lint fix * fix * License update * [Enhancement] Update JITKernel to use artifact for kernel source - Assigned the generated artifact to `self.artifact` for better management. - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling. * lint fix * Add @tilelang.testing.requires_llvm decorator to vectorization tests * Enhance setup.py and env.py for library management - Added functionality to remove original files after copying in CMakeBuild. - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration. * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py * Refactor CMakeBuild file handling in setup.py - Added a check to ensure the target library directory exists before copying .so files. - Improved the logic for creating the target directory and copying files to enhance robustness. * bugfix * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement. * lint fix * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility. * lint fix * Add support for C target in device code generation - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function. * [Enhancement] Implement auto-clear cache feature based on environment variable * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing. * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing. * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true. * [Refactor] Update kernel invocation and import paths in tests and cache * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result. * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`. * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability. * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class. * Enhanced overall code formatting to align with project standards. * [Enhancement] Add bfloat16 test case and improve kernel caching logic * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`. * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading. * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management. * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database. * Improved code formatting and readability across several files. * lint fix * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
-
- 27 Feb, 2025 1 commit
-
-
Lei Wang authored
* refactor code * enhance tutorial * Enhance error handling and code generation in CUDA and TileLang components This commit introduces several improvements across multiple files: - Added more informative error messages in GEMM layout checks - Updated CUDA codegen to support more flexible function signature generation - Improved TMA descriptor initialization and kernel dispatch logic - Refined library generation and source code parsing utilities - Enhanced error handling in various adapter and wrapper classes * Add thread tag validation for warp specialization Introduce a ThreadTagChecker to validate that a PrimFunc only uses threadIdx.x before applying warp specialization. This prevents unintended transformations on kernels with complex thread binding and provides a clear warning to users about potential issues with warp specialization. * Update TileLang Profiling and Compilation in Flash Decoding Examples Refactor the profiling and compilation workflow in two flash decoding example scripts: - Replace `tilelang.lower()` and `tilelang.Profiler()` with `tilelang.compile()` - Simplify profiler initialization using `get_profiler()` - Update method calls to use the new profiler and compiled kernel objects - Maintain existing performance benchmarking and validation logic * Refactor and clean up code formatting in TileLang testing and adapter modules This commit includes several code style and formatting improvements: - Adjust whitespace and line breaks in test files - Improve code formatting in CUDA source wrapper and adapter utilities - Enhance readability of function calls and argument handling - Remove unnecessary whitespace and standardize indentation - Simplify function signatures and argument parsing * Refactor CUDA codegen and improve code formatting This commit includes several improvements to CUDA code generation and formatting: - Enhance function signature generation in CodeGenTileLangCUDA - Improve code formatting and readability in CUDA-related files - Simplify parameter handling and type annotations - Clean up whitespace and line breaks in codegen and layout files --------- Co-authored-by:Ubuntu <dlisuser@h100testl730RPS.xu5snccwrbtejcqqalluoku5hb.xx.internal.cloudapp.net>
-
- 06 Feb, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add VectorizeLoop function and update imports for compatibility * [CI][Test] Improve test cases for vectorization and fix typos in parser comments * lint fix * Fix incorrect module reference for VectorizeLoop transformation * Refactor vectorize_loop transformation by removing unused extent mutation logic * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen * Fix formatting in CUDA FP8 header file for consistency * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity * Update submodule 'tvm' to latest commit for improved functionality * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule. * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files. * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency * Add CUDA requirements to FP8 test cases and update references for clarity * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py * Add CUDA requirements and FP8 test cases for matmul and gemv simulations * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
-
- 11 Jan, 2025 2 commits
-
-
Lei Wang authored
* README.md fixed * update test ci * Lint and Typo Fix * Clang Format Lint Fix
-
Lei Wang authored
* Add format.sh script for code formatting and linting * docs update * center align the title * lint fix * add ignore * Add .gitignore for 3rdparty directory * Add requirements-dev.txt, requirements-test.txt, and requirements.txt * 3rdparty * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h * Refactor CMakeLists.txt and include statements - Update CMakeLists.txt to use a newer version of CMake and add project name - Remove unnecessary include directories Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc - Update include paths to use relative paths instead of absolute paths * Update submodule for 3rdparty/tvm * update * load dll first * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * git keep update * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * refactor code structure * Update Readme * CMakeLists Customized * update readme * update README * update readme * update usage * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import * annotate lower transform global func with `transform` prefix * Migrate Simplify Pass from tilelang tvm branch * enhance system environment handling with __init__ and CMake * Initial commit * CODE_OF_CONDUCT.md committed * LICENSE committed * README.md committed * SECURITY.md committed * SUPPORT.md committed * CODE_OF_CONDUCT Commit * LICENSE Commit * SECURITY Commit * SUPPORT Commit * Modify Support * Update README.md * security ci update * remove examples * Update and implement clang-format * add composable kernel components * Migrate from latest update * submodule update * Test update * Update License * Spell check * lint fix * add clang-tidy to apply static analysis for c source * update tilelang examples * Update Install Docs * Refactor filetree * Enhance Install * conflict resloved * annotate_version * Initial Update * test fix * install * Implement setup.py * lint fix * Separate Init * Separate test * docker file commit * add logo * Update Readme and Examples * update readme * update logo * Implement AMD Installation * Add License * Update AMD MI300x Benchmark * update README * update mi300 benchmark scripts * update ignore * enhance build scirpt * update image * enhance setup.py to remove duplicated libraries * remove debug files * update readme * update image * update gemm examples * update flashattention README * readme update * add cmake into requirements * libinfo fix * auto update submodule * lint fix * Fix AMD Build and Test * Update check for transpose attribute for CDNA Arch * typo fix for amd * Implement Matmul Benchmark * Refactor Code * [TypoFix] Fix GEMM Example * [Docs] Init Linear Attention README * [TYPO] Typo fix * [Lint] Lint Fix * enhance example with intrinsics * [Enhancement] Improve Buffer Collection during IR Parser * [Dev] Introduce Current classmethod to get current frame * submodule update * fake test pass update * support thread_extent_api * code optimize * Add GEMM function implementation for matrix multiplication * Update logging format to reflect TileLang in logger messages * Refactor CMakeLists.txt for improved readability and set default build type to Release * Support Gemm SS Primitives Implementation * [README] Upload Tile Language Logo (#5) * update logo * Update README.md to enhance formatting and center the title --------- Co-authored-by:
microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com> Co-authored-by:
Microsoft Open Source <microsoftopensource@users.noreply.github.com> Co-authored-by:
Yu Cheng <yu.cheng@pku.edu.cn>
-