- 11 Jun, 2025 1 commit
-
Yu Cheng authored
* [Feature] Added support for grid synchronization and a persistent-threadblock transformation
  - Defined the `sync_grid` operation in `builtin.cc` and `builtin.h`, allowing synchronization of all threads within a grid.
  - Implemented support for `sync_grid` in `codegen_cuda.cc`, ensuring proper handling of this operation in the generated CUDA code.
  - Added the `PersistThreadblock` transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
  - Updated relevant documentation and comments to reflect the new features and usage instructions.
* [Example] Add MLA decode with persistent threadblock example
* [Feature] Introduce persistent loop and update GEMM example
  - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
  - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
  - Introduced a `loop_break` intrinsic for better control flow within persistent loops.
  - Updated relevant files to support the new features, including changes in code generation and the language interface.
* lint fix
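For context, the persistent-kernel pattern these commits target replaces a one-block-per-tile launch with a fixed grid whose blocks loop over tiles, with a grid-wide barrier (what `sync_grid` provides, per the commit) available between iterations. A minimal Python sketch of the work distribution only, with illustrative names rather than the TileLang API:

```python
# Hypothetical sketch of the persistent-threadblock schedule: launch a fixed
# number of "blocks" (typically one per SM) and let each loop over tiles in
# grid-stride order, instead of launching one block per tile.
def persistent_schedule(num_tiles: int, num_blocks: int) -> list[list[int]]:
    """Return the tile indices each persistent block would process."""
    return [list(range(bid, num_tiles, num_blocks)) for bid in range(num_blocks)]

if __name__ == "__main__":
    # 10 tiles over 4 persistent blocks: block 0 -> [0, 4, 8], block 1 -> [1, 5, 9], ...
    for bid, tiles in enumerate(persistent_schedule(10, 4)):
        print(f"block {bid}: {tiles}")
```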
-
- 09 Jun, 2025 1 commit
-
Lei Wang authored
* [Enhancement] Update AutoTuner and JIT compilation arguments
  - Added functionality to return compile arguments from the JIT implementation, enhancing the autotuner's caching capabilities.
  - Modified the `CompileArgs` and `AutotuneResult` classes to support an optional `out_idx` parameter, improving flexibility in compile-argument handling.
  - Refactored `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning.
* Update tilelang/autotuner/param.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* remove redundant comments
* Refactor kernel source retrieval and logging levels
  - Updated `AutotuneResult` to use `kernel.get_kernel_source()` instead of `kernel.adapter.get_kernel_source()`.
  - Changed the logging level in `KernelCache` from `ERROR` to `DEBUG` for more appropriate verbosity during kernel caching operations.
  - Removed unnecessary verbose logging from the JIT compilation process to streamline output.
* Merge branch 'main' of https://github.com/tile-ai/tilelang into bugfix_autotune_0604
* lint fix
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
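A rough sketch of how compile arguments can feed an autotuner cache key, which is the caching idea this commit describes; the class shape, field names, and hashing scheme below are illustrative assumptions, not TileLang's actual implementation:

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple, Union

@dataclass
class CompileArgs:
    # out_idx selects which buffer(s) are treated as outputs; optional, as in the commit.
    out_idx: Optional[Union[int, Tuple[int, ...]]] = None
    target: str = "auto"
    execution_backend: str = "cython"

    def cache_key(self, kernel_source: str) -> str:
        """Stable key over the kernel source plus every compile argument."""
        payload = json.dumps(
            {"source": kernel_source, **asdict(self)}, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()

print(CompileArgs(out_idx=-1).cache_key("dummy kernel source")[:16])
```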
-
- 07 Jun, 2025 2 commits
-
Yu Cheng authored
* [Enhancement] Fix multi-version buffer index in nested loops
* [Feature] Support persistent kernels and add a persistent GEMM example
* lint fix
* lint fix
* [CI] Remove test_tilelang_transform_annotate_device_regions.py
-
Lei Wang authored
* Add tf32 casting functionality to GEMM templates
  - Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across the gemm_sm80, gemm_sm89, and gemm_sm90 templates.
  - Implemented conditional casting in the relevant sections of the GEMM operations to ensure compatibility with tfloat32 types.
  - Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations.
* lint fix
* Refactor tfloat32 casting logic in GEMM templates
  - Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32.
  - Updated the relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic.
  - Ensured consistent application of the cast across tensor views.
* Refactor GEMM template functions for improved readability
  - Simplified the function signature of `body_rs` in both the `gemm_sm80` and `gemm_sm90` templates.
  - Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views.
* Enhance tf32 casting logic in GEMM templates
  - Updated `cast_float_to_tf32` in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to apply the cast only when the input is finite, improving robustness.
  - Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required.
* Refactor GEMM template functions and layout inference logic
  - Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated `body_sr` to streamline the casting process for tensor views.
  - Improved layout inference in `layout_inference.cc` by adding checks for whether the layout map is defined, ensuring robustness in handling layout annotations.
  - Simplified the handling of layout maps in `annotate_layout`, allowing more flexible layout definitions and error handling.
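The numeric transform behind this commit is standard: tf32 keeps fp32's 8-bit exponent but only 10 mantissa bits. A numpy sketch of the conversion, including the finiteness guard the commit describes; this illustrates the arithmetic only, not the CUDA template code (real tensor-core hardware rounds to nearest even, whereas this sketch truncates):

```python
import numpy as np

def cast_float_to_tf32(x: np.ndarray) -> np.ndarray:
    """Emulate float32 -> tfloat32 by truncating the mantissa to 10 bits.

    Zeroing the low 13 mantissa bits of the fp32 encoding yields a
    tf32-representable value. Non-finite inputs pass through untouched,
    matching the 'only if the input is finite' condition in the commit.
    """
    bits = x.astype(np.float32).view(np.uint32)
    truncated = (bits & np.uint32(0xFFFFE000)).view(np.float32)
    return np.where(np.isfinite(x), truncated, x).astype(np.float32)

print(cast_float_to_tf32(np.array([1.0000001, np.inf], dtype=np.float32)))
```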
-
- 06 Jun, 2025 1 commit
-
xs-keju authored
* [CI] Add CI test for flash_attention examples
* Update example_gqa_fwd_bshd.py
* Update example_mha_fwd_bshd_wgmma_pipelined.py
* [CI] Added conditional annotations for tests in flash_attention
* [CI] Added conditional annotations for tests in flash_attention
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 05 Jun, 2025 3 commits
-
Gabriel Wu authored
* [wip] feat: add nvrtc backend
* [wip] fix: handle out_idx
* [wip] refactor: move lib logic to libgen
* feat: cache for nvrtc backend
* fmt: run format
* fix: handle cuda bindings import error
* fix: handle cuda bindings import error
* fix: handle cuda bindings import error
* fix: handle cuda bindings import error
* fix: get kernel source
* refactor: speedup pyimport
* Improve error handling for the missing cuda-python dependency in the nvrtc backend: raise ImportError with detailed installation instructions instead of logging a warning.
* Enhance nvrtc backend error handling by introducing a flag that checks for cuda-python availability; raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
* Update README.md to include the recent NVRTC backend addition, highlighting reduced compilation time for CUDA templates.
* fix tl_templates
* ensure CUDA context
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
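A minimal sketch of the guarded-import-plus-deferred-failure pattern these commits describe; the flag name, class name, and error message are illustrative, and TileLang's actual wording may differ:

```python
# Probe the optional cuda-python dependency once at import time; defer the
# hard failure to backend initialization so other backends keep working.
try:
    from cuda import cuda, nvrtc  # provided by the 'cuda-python' package
    CUDA_PYTHON_AVAILABLE = True
except ImportError:
    CUDA_PYTHON_AVAILABLE = False

class NVRTCBackend:
    def __init__(self) -> None:
        if not CUDA_PYTHON_AVAILABLE:
            raise ImportError(
                "The nvrtc backend requires the 'cuda-python' package.\n"
                "Install it with: pip install cuda-python")
```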
-
Zhengju Tang authored
* [CI] Add FusedMoE example
* Lint
* Fix import bug
* Fix comment bug
* Update example_fusedmoe_torch.py
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Lei Wang authored
* Update VERSION to 0.1.5
* Add DEBUG_MODE support in setup.py and update the CMake build type; enhance pypi.Dockerfile with git installation
-
- 04 Jun, 2025 6 commits
-
alex_xiao authored
* [CI] Add norm and layout_plot
* fix lint
* Remove obsolete test files for RMS normalization and plot layout, streamlining the test suite.
* Add a make_mma_load_base_layout function to create MMA result layouts
  - Introduced `make_mma_load_base_layout` for generating layout functions that store MMA results in fragment buffers.
  - Added a detailed docstring explaining parameters, return values, and potential exceptions.
  - Implemented logic for handling different data types and matrix configurations, including assertions for input validation.
  - Defined internal functions for mapping fragment indices to threads and local indices.
* Enhance the MMA load test with additional imports and functionality
  - Added imports for `tilelang.language`, `Literal`, `Callable`, `DataType`, `IndexMap`, and `get_mma_micro_size` to support the extended functionality.
  - Improved `make_mma_load_base_layout` so it handles various data types and configurations.
  - Updated the test function `test_mma_load_base_layout` to validate the layout for a float16 matrix A.
* Fix formatting in test_fragment_mma_load_a.py by adding a blank line for readability.
* Add RMS normalization functions to test_rms_norm.py
  - Introduced `rms_norm` and `rms_norm_splitk` functions for RMS normalization.
  - Implemented kernel functions with shared-memory allocation and parallel processing for improved performance.
  - Updated the test function to validate the new implementations.
* Add a reference program for RMS normalization in test_rms_norm.py
  - Introduced a `ref_program` function providing a reference implementation, so results can be compared against a known-good output.
* Enhance the RMS normalization tests with additional imports and formatting
  - Added an import for `tilelang.language` and blank lines for better separation of code sections.
* Update RMS normalization test parameters and enhance layout plotting
  - Increased the matrix dimensions in `test_rms_norm` to 8192 for performance testing.
  - Removed obsolete test functions from `test_fragment_mma_load_a.py` to streamline the test suite.
  - Ensured proper visualization of the base, warp, and block layouts in `test_fragment_mma_load_a.py`.
* Refactor RMS normalization test parameters and improve layout-plotting readability
  - Simplified the parameters of `test_rms_norm` by removing `blk_k` for clarity.
  - Adjusted the formatting of the `block_layout` definition and removed the unused `warp_cols` variable in `test_fragment_mma_load_a.py`.
* Enhance RMS normalization with a split-k implementation and additional profiling
  - Added a new function `test_rms_norm_splitk` to test the split-k variant.
  - Updated the main RMS normalization script to include profiling for the split-k implementation.
  - Ensured all checks pass, with latency measurements for both the reference and tile-lang implementations.
* Remove the obsolete test file `test_fragment_mma_load_a.py` to streamline the test suite.
* Refactor `rms_norm.py` to streamline benchmarking output and remove redundant code; comment out the `plot_layout` call in `fragment_mma_load_a.py` for clarity.
* Refactor `test_rms_norm.py` by removing the redundant test function `test_rms_norm_splitk`.
---------
Co-authored-by: Your Name <you@example.com>
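For reference, RMS normalization divides each row by the root of its mean square. A torch sketch of the kind of `ref_program` this commit mentions; the exact signature and epsilon in the repo may differ:

```python
import torch

def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm without a learned scale: x / sqrt(mean(x^2) + eps), row-wise."""
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(4, 8192)  # the commit's tests use 8192-wide rows
print(rms_norm_ref(x).shape)  # torch.Size([4, 8192])
```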
-
Tong WU authored
* Add linear attention examples.
* Add license
* Remove comments
* Run yapf and ruff
-
Lei Wang authored
* [Enhancement] Update AutoTuner and JIT compilation arguments
  - Added functionality to return compile arguments from the JIT implementation, enhancing the autotuner's caching capabilities.
  - Modified the `CompileArgs` and `AutotuneResult` classes to support an optional `out_idx` parameter, improving flexibility in compile-argument handling.
  - Refactored `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning.
* Update tilelang/autotuner/param.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* remove redundant comments
* Update tilelang/jit/__init__.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
Lei Wang authored
* Enhance Layout
* strict update
* lint fix
* Refactor layout inference by removing unnecessary logging statements in `parallel.cc` and `layout_inference.cc`, reducing log clutter during layout-inference steps.
* lint fix
* Refactor the file-copying logic in setup.py to simplify directory creation and file copying; removed the unnecessary existence check before copying source files to the target directory.
-
Lei Wang authored
* [Enhancement] Add support for new FP8 types in HIP code generation
  - Updated the `PrintConst` function in `codegen_hip.cc` to handle the `float8_e4m3fnuz` type.
  - Introduced new functions in `hip_fp8.h` for creating FP8 types, including `make_fp8_e4_4_t` and `make_fp8_e4_8_t`, enhancing type handling for FP8 data structures.
  - Improved overall compatibility and performance for FP8 data types in HIP.
* workaround for competition
* enhance autotune
* autotune cache fix
* Implement validation for unused keys in the AutoTuner configuration
  - Added a check in the AutoTuner class that raises a ValueError if the configuration contains unused keys, enhancing error handling and ensuring configuration integrity.
* lint fix
* revert changes of threads
* Update pipelining in `example_mla_decode.py` to improve performance
  - Changed the number of stages in the pipelined loop from 0 to 2, enhancing the efficiency of the attention mechanism during decoding.
* Enhance Cython kernel validation by adding tensor attribute checks
  - Updated `CythonKernelWrapper` with dedicated methods for validating tensor device, dtype, and static shape.
  - Modified the `forward` method to utilize the new validation methods, improving error handling and ensuring input integrity.
  - Updated the `lambda_forward` function in `CythonKernelAdapter` to reflect the changed validation parameters.
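A small sketch of the unused-key check described above; the function shape is a hypothetical stand-in, since the real AutoTuner tracks its tunable parameters through its own classes:

```python
def validate_config(config: dict, tunable_params: set[str]) -> None:
    """Reject autotuner configs that carry keys the kernel never consumes."""
    unused = set(config) - tunable_params
    if unused:
        raise ValueError(
            f"Unused keys in AutoTuner configuration: {sorted(unused)}; "
            f"expected a subset of {sorted(tunable_params)}")

# ok: every key is a known tunable parameter
validate_config({"block_M": 128, "block_N": 128}, {"block_M", "block_N", "num_stages"})
```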
-
Lei Wang authored
* Remove the unused 2D continuous cumulative-sum example and related functions from the cumsum module.
* lint fix
* fix split k example
* Enable cache disabling in the gemm_streamk example and add validation checks in the if_stmt_binding transformation
* Update the gemm_streamk example to use tilelang's cdiv function for block calculations and add a copyright notice
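The `cdiv` mentioned here is ceiling division, the standard way to count how many blocks cover a dimension. A sketch of the semantics only, not tilelang's exact implementation:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: number of size-b blocks needed to cover a elements."""
    return (a + b - 1) // b

assert cdiv(1024, 128) == 8
assert cdiv(1025, 128) == 9  # a partial tail block still needs a full launch slot
```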
-
- 03 Jun, 2025 2 commits
- 02 Jun, 2025 1 commit
-
Wenhao Xie authored
* [Doc] Include DeepWiki badge in README
* include more badges
-
- 01 Jun, 2025 1 commit
-
Lei Wang authored
* [Enhancement] Add support for FP8 types in CUDA and HIP code generation
  - Updated the `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
  - Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
  - Modified the type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate the new FP8 types.
  - Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
  - Added the necessary includes and logic to support FP8 in the code-generation process, improving performance and compatibility with FP8 data types.
* lint fix
* Update src/target/codegen_hip.cc
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update tilelang/intrinsics/mfma_macro_generator.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* workaround
* fix
* Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c
* bug fix
* Refactor tilelang matrix multiplication to support transposition and packing options; adjusted shared-memory shapes and loading logic for the A and B matrices, and updated test cases to validate the new functionality.
* Refactor the assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code; cleaned up whitespace in the intrinsic layout functions for consistency.
* Update bfloat16 type definitions in common.h and gemm.h for consistency: changed __hip_bfloat16 to hip_bfloat16 and updated the MfmaTraits specialization accordingly.
* lint fix
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 31 May, 2025 1 commit
-
Lei Wang authored
-
- 29 May, 2025 1 commit
-
Lei Wang authored
* Refactor the OptimizeForTarget function by removing a redundant buffer-allocation step and cleaning up code
  - Removed the PlanAndUpdateBufferAllocationLocation step from OptimizeForTarget to streamline the optimization process.
  - Cleaned up unnecessary whitespace for improved readability.
* Refactor AllocateNode handling in vectorize_loop.cc
  - Simplified the VisitStmt_ method for AllocateNode by removing the complex extent-mutation logic; the allocation now directly calls the base-class method.
  - Removed unnecessary comments and code related to extent handling.
* Remove the `tl_kernel.c` file, eliminating the unused backward-kernel implementation and its associated error-handling functions.
* Add a buffer-allocation planning step to the OptimizeForTarget function
  - Reintroduced the PlanAndUpdateBufferAllocationLocation step, improving buffer allocation during the target-optimization phase.
* Update submodule TVM to latest commit db50d4e, ensuring alignment with upstream changes.
* Add L2 persistent annotation support and related functionality
  - Introduced a new file `lower_l2_persistent_annotation.cc` to handle the lowering of L2 persistent annotations.
  - Added functions to annotate L2 hit ratios for buffers, ensuring compatibility with global-buffer requirements.
  - Updated the `LowerAndLegalize` function to include the new L2 persistent-map lowering step.
  - Enhanced the CUDA driver with a function to retrieve the maximum size of the persisting L2 cache.
  - Modified the `TLCUDASourceWrapper` class to integrate L2 persistent-map handling during kernel launches.
  These changes improve the framework's ability to manage L2 cache optimizations, enhancing performance for CUDA applications.
* lint fix
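The persisting-L2 query mentioned above corresponds to a standard CUDA runtime device attribute. A sketch using the cuda-python bindings, assuming cuda-python is installed; TileLang's own `cuda_driver.py` wrapper may expose this differently:

```python
from cuda import cudart  # pip install cuda-python

def max_persisting_l2_cache_size(device: int = 0) -> int:
    """Bytes of L2 that can be set aside for persisting accesses on this device."""
    err, size = cudart.cudaDeviceGetAttribute(
        cudart.cudaDeviceAttr.cudaDevAttrMaxPersistingL2CacheSize, device)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaDeviceGetAttribute failed: {err}")
    return size

print(max_persisting_l2_cache_size())
```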
-
- 28 May, 2025 4 commits
-
-
* Refactor the OptimizeForTarget function by removing a redundant buffer-allocation step and cleaning up code
  - Removed the PlanAndUpdateBufferAllocationLocation step from OptimizeForTarget to streamline the optimization process.
  - Cleaned up unnecessary whitespace for improved readability.
* Refactor AllocateNode handling in vectorize_loop.cc
  - Simplified the VisitStmt_ method for AllocateNode by removing the complex extent-mutation logic; the allocation now directly calls the base-class method.
  - Removed unnecessary comments and code related to extent handling.
* Remove the `tl_kernel.c` file, eliminating the unused backward-kernel implementation and its associated error-handling functions.
* Add a buffer-allocation planning step to the OptimizeForTarget function
  - Reintroduced the PlanAndUpdateBufferAllocationLocation step, improving buffer allocation during the target-optimization phase.
-
yyttt6 authored
-
Lei Wang authored
* [Enhancement] Add the commit ID to versioning and improve logging initialization
  - Updated `get_tilelang_version` to include an optional commit ID in the version string.
  - Enhanced `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
  - Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
  - Refactored logger initialization in `autotuner/__init__.py` so handlers are set up only once, improving performance and clarity.
  - Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
* [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching
  - Refactored the AutoTuner class with new methods for setting compilation and profiling arguments, improving configurability.
  - Introduced caching for tuning results, allowing faster retrieval of previously computed configurations.
  - Updated JITKernel to store tuning results, including latency and configuration details, improving performance tracking.
  - Added methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
  - Minor adjustments in related modules to support the new caching and profiling features.
* [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules
  - Consolidated import statements, standardized function-argument formatting across the AutoTuner and CompileArgs classes, and made whitespace and indentation consistent.
  - Minor adjustments in the Profiler and JITKernel classes, in line with the project's coding-style guidelines.
* [Refactor] Remove redundant type hints in the AutoTuner modules
  - Simplified imports in `__init__.py` and `param.py` by removing duplicate `Any` type hints.
* [Refactor] Update the AutoTuner configuration for improved profiling and target detection
  - Added `set_profile_args` across multiple examples to better manage profiling settings.
  - Standardized `target="auto"` in compile arguments for automatic target detection, removing redundant target specifications.
* [Refactor] Simplify code formatting and improve readability in example scripts
  - Consolidated function-argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py`; removed unnecessary line breaks.
* [Refactor] Update JIT decorator usage across multiple files
  - Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
  - Standardized the import of the JIT decorator from `tilelang` and consolidated import statements.
* [Refactor] Standardize JIT decorator formatting across benchmark and example scripts
  - Removed unnecessary line breaks in decorator usage, ensuring uniform function definitions.
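A plausible shape for the `get_git_commit_id` helper this commit introduces, sketched with the standard subprocess approach; the repo's actual signature and error handling may differ:

```python
import subprocess

def get_git_commit_id(short: bool = True) -> str | None:
    """Return the current git commit hash, or None outside a git checkout."""
    cmd = ["git", "rev-parse"] + (["--short"] if short else []) + ["HEAD"]
    try:
        return subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode().strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None

# Example of folding the commit into a version string, as the commit describes:
version, commit = "0.1.5", get_git_commit_id()
print(f"{version}+{commit}" if commit else version)
```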
-
Lei Wang authored
* Refactor the convolution example to streamline configuration and remove unused code
  - Updated the `check_hopper` function to properly check for CUDA availability and compute capability.
  - Removed the `get_configs` and `get_best_config` functions, simplifying the example by eliminating unused autotuning logic.
  - Adjusted argument parsing in the `main` function to compile the convolution kernel directly, without autotuning options.
  - Cleaned up the code for better readability and maintainability.
* Update examples/convolution/example_convolution.py
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 27 May, 2025 2 commits
-
Leslin authored
* [CI] Add gemm and gemm_fp8 examples to CI
* Fix lint via format.sh
* Resolve issues with profiler API usage and parse_args
-
Yu Cheng authored
* Introduced an `AttrFrame` for warp specialization in the IR, enhancing the handling of warp-specific optimizations.
* Refactored the `VisitStmt_` method in `warp_specialized_rewriter.cc` to check for the new warp-specialization attribute, improving the detection of warp-specialization conditions.
* Removed outdated code related to condition checks in `IfThenElseNode`, streamlining the specialization logic.
-
- 26 May, 2025 4 commits
-
Lei Wang authored
* Updated `get_tilelang_version` to include an optional commit ID in the version string.
* Enhanced `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
* Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
* Refactored logger initialization in `autotuner/__init__.py` so handlers are set up only once, improving performance and clarity.
* Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
-
Lei Wang authored
[Refactor] Reorganize thread-synchronization steps so that global synchronization can be correctly lowered (#521)
* [Refactor] Reorganize thread-synchronization steps in the OptimizeForTarget function
  - Removed redundant thread-synchronization steps for "global" and "shared" memory, streamlining the optimization process.
  - Reintroduced the necessary synchronization for "shared" and "shared.dyn" after the injection of the PTX async copy, ensuring correct memory-access patterns.
  - Restructured the order of operations in OptimizeForTarget for clarity and maintainability.
* [Refactor] Reorder thread synchronization and PTX async copy in the OptimizeForTarget function
  - Removed the redundant global thread-synchronization step and adjusted the order of operations for shared-memory synchronization.
  - Ensured that the PTX async-copy injection occurs after the global thread sync, improving memory-access validity.
-
Lei Wang authored
* [Enhancement] Add atomic-addition functions for FLOAT16x2 and FLOAT16x4 in CUDA
  - Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic additions on packed float16 vector types (two and four halves) in CUDA.
  - Updated `customize.py` to include the new `atomic_addx4` function for external calls.
  - Modified `__init__.py` to export the new atomic-addition function, ensuring accessibility in the module.
* lint fix
-
Lei Wang authored
* [Refactor] Enhance GEMM warp-partitioning logic and introduce buffer remapping (#516)
  - Improved the warp-partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate the various GEMM policies (FullRow, FullCol, and Square), ensuring optimal performance based on matrix dimensions.
  - Introduced a new `RemapBufferRewriter` class to handle buffer-reference updates and padding annotations during statement transformations, enhancing memory-access safety and clarity.
  - Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth.
  - Refactored existing code to utilize constants for warp sizes, improving maintainability and readability.
  - Added checks to ensure correct warp allocation and padding-map handling, improving robustness in memory-management strategies.
* [Refactor] Update ConfigIndexBitwidthRewriter to support an auto-check feature
  - Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing dynamic bitwidth adjustments based on input conditions.
  - Enhanced the `VisitExpr_` methods to apply the new auto-check logic, upgrading integer types to 64 bits when necessary, or to the specified index bitwidth otherwise.
  - Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of a configuration, improving flexibility.
* Add a dynamic matrix-multiplication example and corresponding test
  - Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
  - Added `test_example_dynamic.py` to validate the example; it includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
* lint fix
* Add a get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
  - Implemented `get_num_sms` in `cuda_driver.py` to return the streaming-multiprocessor count for a specified CUDA device.
  - Updated `__init__.py` to include the new function in the module exports.
* lint fix
* Add global barrier state and expectation handling in CUDA code generation
  - Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
  - Updated the `Finish` method to declare the global barrier state when needed.
  - Implemented handling for `EvaluateNode` to initialize the barrier expectation.
  - Removed an unnecessary extern declaration for the global barrier state in the `PrintStorageSync` method.
  - Enhanced the CUDA FP8 type definitions for better alignment and structure.
* Enhance CUDA FP8 type handling and debug printing
  - Updated `cuda_fp8.h` to replace NVIDIA's FP8 types with CuTe's FP8 types for better compatibility and structure.
  - Added specializations of `debug_print_var` and `debug_print_buffer_value` for the new FP8 types, improving debugging capabilities for these data types.
  - Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.
* Refactor CUDA code generation to remove the unnecessary managed qualifier for the global barrier state
  - Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
  - Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.
* Remove the deprecated CUDA kernel and Python script for FP8 E4M3 casting
  - Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, removing unused components related to FP8 E4M3 type casting.
* lint fix
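The SM count that `get_num_sms` exposes can also be read through PyTorch's device properties. A sketch of the equivalent query, not TileLang's actual implementation (which goes through its own CUDA driver wrapper in `cuda_driver.py`):

```python
import torch

def get_num_sms(device_id: int = 0) -> int:
    """Number of streaming multiprocessors, e.g. for sizing a persistent grid."""
    return torch.cuda.get_device_properties(device_id).multi_processor_count

if torch.cuda.is_available():
    print(f"SMs on device 0: {get_num_sms(0)}")
```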
-
- 25 May, 2025 1 commit
-
Lei Wang authored
* [Refactor] Enhance GEMM warp-partitioning logic and introduce buffer remapping (#516)
  - Improved the warp-partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate the various GEMM policies (FullRow, FullCol, and Square), ensuring optimal performance based on matrix dimensions.
  - Introduced a new `RemapBufferRewriter` class to handle buffer-reference updates and padding annotations during statement transformations, enhancing memory-access safety and clarity.
  - Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth.
  - Refactored existing code to utilize constants for warp sizes, improving maintainability and readability.
  - Added checks to ensure correct warp allocation and padding-map handling, improving robustness in memory-management strategies.
* [Refactor] Update ConfigIndexBitwidthRewriter to support an auto-check feature
  - Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing dynamic bitwidth adjustments based on input conditions.
  - Enhanced the `VisitExpr_` methods to apply the new auto-check logic, upgrading integer types to 64 bits when necessary, or to the specified index bitwidth otherwise.
  - Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of a configuration, improving flexibility.
* Add a dynamic matrix-multiplication example and corresponding test
  - Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
  - Added `test_example_dynamic.py` to validate the example; it includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
* lint fix
* Add a get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
  - Implemented `get_num_sms` in `cuda_driver.py` to return the streaming-multiprocessor count for a specified CUDA device.
  - Updated `__init__.py` to include the new function in the module exports.
* lint fix
* Add global barrier state and expectation handling in CUDA code generation
  - Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
  - Updated the `Finish` method to declare the global barrier state when needed.
  - Implemented handling for `EvaluateNode` to initialize the barrier expectation.
  - Removed an unnecessary extern declaration for the global barrier state in the `PrintStorageSync` method.
  - Enhanced the CUDA FP8 type definitions for better alignment and structure.
-
- 24 May, 2025 1 commit
-
Lei Wang authored
* [Refactor] Enhance GEMM warp-partitioning logic and introduce buffer remapping (#516)
  - Improved the warp-partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate the various GEMM policies (FullRow, FullCol, and Square), ensuring optimal performance based on matrix dimensions.
  - Introduced a new `RemapBufferRewriter` class to handle buffer-reference updates and padding annotations during statement transformations, enhancing memory-access safety and clarity.
  - Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth.
  - Refactored existing code to utilize constants for warp sizes, improving maintainability and readability.
  - Added checks to ensure correct warp allocation and padding-map handling, improving robustness in memory-management strategies.
* [Refactor] Update ConfigIndexBitwidthRewriter to support an auto-check feature
  - Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing dynamic bitwidth adjustments based on input conditions.
  - Enhanced the `VisitExpr_` methods to apply the new auto-check logic, upgrading integer types to 64 bits when necessary, or to the specified index bitwidth otherwise.
  - Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of a configuration, improving flexibility.
* Add a dynamic matrix-multiplication example and corresponding test
  - Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
  - Added `test_example_dynamic.py` to validate the example; it includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
* lint fix
* Add a get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
  - Implemented `get_num_sms` in `cuda_driver.py` to return the streaming-multiprocessor count for a specified CUDA device.
  - Updated `__init__.py` to include the new function in the module exports.
* lint fix
-
- 23 May, 2025 4 commits
-
Taoyu Zhu authored
* fix deepgemm example
* fix deepgemm example
* make format
* Update example_deepgemm_fp8_2xAcc.py
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yu Cheng authored
* Introduced `example_grouped_gemm_fwd.py` and `example_grouped_gemm_bwd.py` to demonstrate grouped matrix multiplication with forward and backward operations.
* Implemented functions for grouped GEMM, input construction, and validation against PyTorch's implementation.
* Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
* Included a test function to validate the functionality with various input scenarios.
-
Yu Cheng authored
* Introduced a new example script `example_grouped_gemm.py` demonstrating grouped matrix multiplication using TileLang and PyTorch.
* Implemented functions for performing grouped GEMM, constructing inputs, and validating results against PyTorch's implementation.
* Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
* Included a test function to validate the grouped GEMM functionality with various input scenarios.
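For reference, grouped GEMM runs one independent matmul per group, where groups may differ in shape. A torch sketch of the kind of reference these examples validate against; it is not the examples' exact code:

```python
import torch

def grouped_gemm_ref(a_list: list[torch.Tensor], b_list: list[torch.Tensor]) -> list[torch.Tensor]:
    """One independent matmul per group; shapes may vary across groups."""
    return [a @ b for a, b in zip(a_list, b_list)]

a_list = [torch.randn(m, 64) for m in (32, 128)]   # two groups with different M
b_list = [torch.randn(64, 16) for _ in range(2)]
print([c.shape for c in grouped_gemm_ref(a_list, b_list)])
```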
-
Lei Wang authored
[Refactor] Enhance the MergeSharedMemoryAllocations pass for improved liveness analysis and scope management (#508)
* Introduced a new StmtAttr structure to track the scope level of statements, enhancing the liveness-analysis process.
* Updated the UpdateStmtAttr function to manage statement attributes effectively during memory-allocation visits.
* Modified the VisitStmt_ methods to utilize the new scope-level tracking, ensuring accurate memory-access patterns.
* Refactored the LivenessAnalysis and PlanMemory functions to incorporate statement attributes, improving the handling of gen and kill points in memory management.
* Added a new helper function allow_warp_specialized in phase.py to conditionally enable warp specialization based on the pass context and target, addressing potential bugs in the MergeSharedMemoryAllocations pass.
* Enhanced the OptimizeForTarget function to conditionally apply the MergeSharedMemoryAllocations pass based on the warp-specialization settings, improving robustness in memory-allocation strategies.
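The core idea behind merging shared-memory allocations is interval-based liveness: two buffers can share storage when their live ranges (first definition to last use, the gen/kill points mentioned above) do not overlap. A toy sketch of that check, illustrative only and far simpler than the actual pass:

```python
def can_share_storage(live_ranges: dict[str, tuple[int, int]], a: str, b: str) -> bool:
    """Buffers may alias iff their [first_def, last_use] intervals are disjoint."""
    (a_def, a_kill), (b_def, b_kill) = live_ranges[a], live_ranges[b]
    return a_kill < b_def or b_kill < a_def

ranges = {"smem_A": (0, 4), "smem_B": (2, 6), "smem_C": (5, 9)}
print(can_share_storage(ranges, "smem_A", "smem_B"))  # False: live at the same time
print(can_share_storage(ranges, "smem_A", "smem_C"))  # True: A is dead before C is live
```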
-
- 22 May, 2025 3 commits
-
Lei Wang authored
* Added a new attribute `kPaddingMap` in `builtin.h` for managing padding annotations.
* Enhanced `SafeMemorysRewriter` to utilize an annotated padding map for buffer stores, improving memory-access safety.
* Implemented checks in `layout_inference.cc` to ensure buffers are correctly referenced during layout mapping.
* Introduced a new test file for validating the padding-annotation functionality in TileLang.
-
Lei Wang authored
* [Refactor] Enhance the GEMM warp-partitioning logic for improved performance and flexibility
  - Updated the warp-partitioning logic in `Gemm::ComputeWarpPartition` to better handle the various GEMM policies (FullRow, FullCol, and Square).
  - Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance.
  - Introduced a new `SelectCopy` template to streamline memory-access patterns in the CUDA templates, enhancing compatibility across different architectures.
  - Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability of the warp-allocation strategies.
* [Refactor] Optimize matrix-multiplication parameters and performance in the quickstart example
  - Updated the thread count in the kernel context from 256 to 128 to enhance performance.
  - Increased the block sizes for the matrix dimensions (M, N, block_M, block_N) to 1024 and 128 respectively, improving computational efficiency.
  - Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
  - Cleaned up comments for clarity and corrected a typo in the memory-copy comment.
* [Refactor] Simplify Copy-type selection in OperandTraits for improved clarity
  - Replaced the conditional Copy-type definition with the new SelectCopy template in OperandTraits, streamlining the logic for selecting memory-copy patterns based on matrix dimensions and warp configurations.
-
Lei Wang authored
* Modified `makeBufferWithLayout` to include a `var_remap` parameter for improved variable remapping during buffer creation.
* Enhanced buffer load and store operations to utilize the new variable-remapping logic, ensuring correct buffer references.
* Commented out a check in `ThreadExtent` for clarity, maintaining functionality while improving code readability.
-
- 21 May, 2025 1 commit
-
Lei Wang authored
[Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization (#507)
* [Refactor] Update the reduce functions to support default dimension values and improve dimension handling
  - Added a helper function `_legalize_dim` to handle negative dimension values in reduction operations.
  - Updated the `reduce_max`, `reduce_min`, `reduce_sum`, `reduce_abssum`, and `reduce_absmax` functions to accept a default dimension value of -1, enhancing usability and flexibility in buffer reductions.
  - Ensured consistent dimension handling across all reduction functions for improved clarity and correctness.
* Update submodule `tvm` to latest commit c2921fd, ensuring compatibility with recent changes.
* [Refactor] Enhance ReduceOp and JITKernel for improved dimension handling and initialization
  - Updated ReduceOp to handle 1-D reduction cases and ensure correct dimension checks, improving robustness of the reduction operations.
  - Initialized prim_func in JITKernel to enhance clarity and prevent potential null-reference issues.
  - Added whitespace in reduce.py for better readability.
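Negative-dimension legalization follows the usual NumPy/PyTorch convention, where dim=-1 means the last axis. A sketch of what `_legalize_dim` plausibly does; this is an assumption, and the repo's exact signature (it likely takes a buffer rather than an ndim) may differ:

```python
def _legalize_dim(ndim: int, dim: int) -> int:
    """Map a possibly-negative dim to [0, ndim), NumPy-style: -1 -> last axis."""
    if dim < 0:
        dim += ndim
    if not 0 <= dim < ndim:
        raise ValueError(f"dim {dim} out of range for a {ndim}-d buffer")
    return dim

assert _legalize_dim(3, -1) == 2  # reduce over the last axis by default
assert _legalize_dim(3, 1) == 1
```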
-