- 26 Apr, 2025 3 commits
-
-
Lei Wang authored
[Enhancement] Simplify vectorization process in loop_vectorize.cc and add atomic add test (#436) (#439) * Removed redundant simplification step in vectorization logic to streamline performance. * Introduced a new test for atomic addition in TileLang, validating functionality with a reference implementation using PyTorch.
-
yyttt6 authored
* yes * [Bugfix] fix the unexpected keyword error of autotune * format * test
-
Lei Wang authored
* [Enhancement] Update reduce operations to support clear option in sum and abs sum (#436) * Modified reduce_sum and reduce_absmax functions to include a clear parameter, allowing for accumulation on existing values. * Updated ReduceOp::Lower method to handle initialization and buffer duplication based on the clear flag for sum and abs sum operations. * Added new tests for reduce_sum and reduce_max with clear functionality to ensure correctness in various scenarios. * Enhanced documentation for reduce functions to clarify the behavior of the clear parameter. * lint fix * Update tensor type annotations in test_tilelang_transform_annotate_device_regions.py from Buffer to Tensor * Update tensor type in reduce sum tests from float16 to float32 for improved precision
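The effect of the `clear` flag described in this commit can be illustrated with a plain-Python sketch. `reduce_sum_reference` is a hypothetical helper written for this note and is not the TileLang API. It assumes that `clear=True` overwrites the destination and `clear=False` accumulates onto the values already stored there.

```python
from typing import List


def reduce_sum_reference(src: List[List[float]], dst: List[float],
                         clear: bool = True) -> List[float]:
    """Row-wise sum of `src` into `dst` (hypothetical reference, not TileLang code)."""
    if len(src) != len(dst):
        raise ValueError("dst needs exactly one slot per row of src")
    for i, row in enumerate(src):
        # clear=True starts from zero; clear=False keeps the existing value.
        base = 0.0 if clear else dst[i]
        dst[i] = base + sum(row)
    return dst
```

With `clear=False` a caller can chain several reductions into the same destination without an explicit add step in between.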
-
- 25 Apr, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Improve error handling in layout inference and update profiler type in tests * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B. * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy. * lint fix * [Refactor] Update OperandTraits to include num_warp_n parameter * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations. * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios. * lint fix * [Refactor] Update DispatchInstruction templates to include N parameter * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations. * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios. * [Refactor] Simplify store buffer scope checks in pipeline planning * Removed redundant condition for 'local' scope in the store buffer checks, streamlining the logic for identifying global copy patterns. * Enhanced code clarity by reducing complexity in the conditional statements.
-
Lei Wang authored
* [Enhancement] Improve error handling in layout inference and update profiler type in tests * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B. * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy. * lint fix * [Refactor] Update OperandTraits to include num_warp_n parameter * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations. * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios. * lint fix * [Refactor] Update DispatchInstruction templates to include N parameter * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations. * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.
-
Lei Wang authored
* [Refactor] Adjust layout inference calculations in Gemm and ParallelOp * Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference. * Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience. * Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code. * [Refactor] Clean up layout inference logic in Gemm and ParallelOp * Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process. * Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality. * Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts. * lint fix * [Enhancement] Improve cumulative sum functionality and annotations handling * Updated the `cumsum` function to include detailed documentation and error handling for dimension bounds. * Modified the `run_cumsum` test to utilize a random tensor supply type for profiling, enhancing test robustness. * Added annotations to the fused loop in `loop_fusion_utils.h`, ensuring proper metadata is preserved during loop fusion. * lint fix
-
- 24 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Adjust layout inference calculations in Gemm and ParallelOp * Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference. * Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience. * Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code. * [Refactor] Clean up layout inference logic in Gemm and ParallelOp * Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process. * Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality. * Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts. * lint fix
-
- 23 Apr, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Improve layout inference in Copy operation (#426) * Updated the Copy operation to infer layouts at multiple levels (kCommon, kStrict, kFree) for enhanced flexibility in layout optimization. * Added detailed documentation for layout inference levels in ParallelOp, clarifying their purposes and use cases. * Refactored layout inference logic to accommodate new levels, improving overall robustness and performance in parallel operations. * lint fix
-
Lei Wang authored
* Update submodule 'tvm' to latest commit f4a8f9b * lint fix
-
- 22 Apr, 2025 7 commits
-
-
Lei Wang authored
-
Lei Wang authored
* [Feature] Implement CumSum operation in TileLang * Added CumSumOp class for cumulative sum operations, including argument validation and lowering logic. * Introduced CumSum2D template for CUDA, supporting both forward and reverse cumulative sums. * Created tests for CumSum functionality in shared memory and fragment contexts. * Updated language interface to include cumsum operation, enhancing the reduction capabilities of TileLang. * Refactored reduce.py to support cumsum functionality with appropriate memory allocation and copying mechanisms. * lint fix
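As a rough reference for what a forward or reverse cumulative sum over a 2D tile computes, here is a minimal pure-Python sketch. The function name `cumsum_2d` and its parameters are assumptions made for illustration; they do not reproduce the `CumSumOp` class or the `CumSum2D` CUDA template from the commit.

```python
from typing import List


def cumsum_2d(src: List[List[float]], dim: int = 0,
              reverse: bool = False) -> List[List[float]]:
    """Cumulative sum along `dim` of a 2D list (hypothetical reference)."""
    if dim not in (0, 1):
        raise ValueError(f"dim must be 0 or 1 for a 2D input, got {dim}")
    rows = len(src)
    cols = len(src[0]) if rows else 0
    out = [list(row) for row in src]
    if dim == 1:
        # Scan each row left-to-right, or right-to-left when reverse=True.
        for r in range(rows):
            order = range(cols - 1, -1, -1) if reverse else range(cols)
            running = 0
            for c in order:
                running += src[r][c]
                out[r][c] = running
    else:
        # Scan each column top-to-bottom, or bottom-to-top when reverse=True.
        for c in range(cols):
            order = range(rows - 1, -1, -1) if reverse else range(rows)
            running = 0
            for r in order:
                running += src[r][c]
                out[r][c] = running
    return out
```

A reference like this is the kind of function a kernel test could compare against, in the same spirit as the reference programs mentioned elsewhere in this log.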
-
Yu Cheng authored
* Introduced logic to check for TMA+WS enablement based on annotations in the pipeline planning stage. * Enhanced the handling of `order_anno` and `stage_anno` to determine if TMA+WS is activated, improving flexibility in loop processing. * Refactored the existing code to maintain clarity while integrating the new feature.
-
FeiyangChen authored
-
Yu Cheng authored
* Updated the layout inference in ParallelOp to improve the selection of source buffers for layout accuracy. * Introduced logic to choose the read source buffer based on the number of indices, ensuring more precise layout inference. * Refactored the loop handling to maintain clarity and improve the overall robustness of the layout inference process.
-
Zhao Wu authored
* Support finding the Cython path automatically * lint fix --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* [Enhancement] Introduce thread range management in layout and operation handling * Added `SetThreadRange` method to `FragmentNode` for managing thread ranges. * Updated `LayoutNode::Inverse` to provide more informative error messages. * Refactored layout inference and operation lowering to utilize `thread_bounds` instead of `block_size`, enhancing flexibility for thread management. * Introduced new tests for tilelang operations to validate thread range functionality and ensure correctness in parallel execution scenarios. * lint fix * [Refactor] Improve thread variable handling in layout inference and operation lowering * Removed workaround for undefined thread_var in layout inference, ensuring proper handling of thread bounds. * Updated logic to define thread bounds based on the presence of thread_var, enhancing robustness in thread management. * Refactored thread_var initialization in lower_tile_op to maintain consistency across the codebase. * [Refactor] Update thread variable handling in layout inference and operation lowering * Refactored thread variable checks to ensure bounds are only accessed when defined, improving safety and clarity. * Initialized thread_var with a default range to prevent undefined behavior. * Updated logic in lower_tile_op to align with new thread variable handling, enhancing consistency across the codebase.
-
- 21 Apr, 2025 3 commits
-
-
alex_xiao authored
-
Lei Wang authored
* Introduced a new function `get_nvcc_compiler` in nvcc.py to obtain the path to the nvcc compiler. * Updated LibraryGenerator to use `get_nvcc_compiler` instead of hardcoding the nvcc command, improving maintainability and flexibility.
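The idea of resolving the compiler path in one place instead of hardcoding `nvcc` can be sketched as follows. `resolve_nvcc_path` and its lookup order (explicit CUDA root, then `PATH`, then the bare command name) are assumptions for illustration; the actual `get_nvcc_compiler` in nvcc.py may behave differently.

```python
import os
import shutil
from typing import Optional


def resolve_nvcc_path(cuda_home: Optional[str] = None) -> str:
    """Return a path to nvcc (hypothetical helper, not the TileLang function)."""
    if cuda_home:
        # Assumed layout: the compiler lives under <cuda_home>/bin.
        return os.path.join(cuda_home, "bin", "nvcc")
    # Fall back to whatever is on PATH, or the bare name as a last resort.
    return shutil.which("nvcc") or "nvcc"
```

Callers that build a command line would then use the returned string as the first argument rather than the literal `"nvcc"`.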
-
Lei Wang authored
* [New Feature] Add FP8 Flash Attention Implementation (#412) * Introduce a new example script for FP8 Flash Attention in `example_mla_decode_kv_fp8.py`, showcasing the use of tilelang for efficient attention computation. * Implement the `flashattn` function with optimized memory management and kernel execution. * Include a reference program for comparison and performance evaluation. * Add command-line argument parsing for batch size, number of heads, and dimensions to facilitate testing and experimentation. * Enhance the overall structure and readability of the code. This addition aims to improve the performance of attention mechanisms in deep learning models by leveraging FP8 precision and optimized kernel execution. * lint fix * optimize quick start * lint fix
-
- 19 Apr, 2025 4 commits
-
-
You Jiacheng authored
* [Language] make linter and type checker happy with mocking * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Refactor BaseTensor class in proxy.py to implement __getitem__ and __setitem__ methods, enhancing type checking and linting compliance. Added method stubs for from_ptr and other subclasses for improved clarity and maintainability. * Refactor type imports in proxy.py to enhance clarity and maintainability by replacing the built-in Self type with the Self type from typing_extensions. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Phase out attr * Remove unused dependencies from requirements files * Update TVM submodule to latest commit 4776d31
-
Andy Luo authored
* Update Installation.md * Update installation prerequisites in documentation * cu128 Docker Image --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: Andy <andyluo918@gmail.com>
-
Lei Wang authored
* Update TVM submodule and enhance vectorization logic in loop_vectorize.cc - Updated the TVM submodule to the latest commit. - Simplified the vectorization process by ensuring that the vectorized expression is simplified after vectorization, improving expression handling. - Added checks in loop_fusion_utils.h to prevent fusion of loops with non-power-of-2 extents, enhancing robustness in loop transformations. * lint fix
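The power-of-2 guard mentioned for loop fusion can be expressed with the usual bit trick. This is a minimal sketch under the assumption that the check is applied to each loop extent before fusing; `is_power_of_two` and `can_fuse` are hypothetical names and do not come from loop_fusion_utils.h.

```python
from typing import Iterable


def is_power_of_two(extent: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero, negatives, and other values."""
    return extent > 0 and (extent & (extent - 1)) == 0


def can_fuse(extents: Iterable[int]) -> bool:
    """Allow fusion only when every loop extent is a power of two (assumed rule)."""
    return all(is_power_of_two(e) for e in extents)
```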
-
- 18 Apr, 2025 2 commits
-
-
Lei Wang authored
-
Andy Luo authored
* Update Installation.md * Update installation prerequisites in documentation --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 17 Apr, 2025 2 commits
-
-
Lei Wang authored
* Update CI configuration to run pytest with automatic parallelization using the '-n auto' option. * Enhance Cython JIT Adapter Compilation Logic - Improved the locking mechanism during the compilation of the Cython JIT adapter to prevent race conditions. - Added checks to determine if another process has already compiled the library, reducing unnecessary recompilation. - Cleaned up the code by removing redundant imports and ensuring proper handling of temporary files during compilation failures. - Updated vectorization logic in loop_vectorize.cc to allow optional simplification of vectorized expressions. This update enhances performance and reliability in the JIT compilation process. * lint fix * Update CI configuration to run pytest with 4 parallel jobs instead of auto-detection * Add pytest markers for serial execution in MHA tests - Added @pytest.mark.serial to multiple MHA test functions to ensure they run sequentially. - This change improves test reliability by preventing potential race conditions during execution. * Update TVM submodule and enhance vectorization logic in loop_vectorize.cc - Updated the TVM submodule to the latest commit. - Modified the vectorization logic to include optional simplification of vectorized expressions and added checks to ensure the usage of vectorized variables, improving performance and reliability in expression handling. * Remove @pytest.mark.serial from multiple MHA test functions to allow parallel execution. This change enhances test performance by enabling concurrent test runs while maintaining reliability. * Remove tvm_simplify_test.py file, eliminating the test for expression simplification in TVM. This cleanup helps streamline the codebase by removing unused test cases. * Remove unused pytest import from test_tilelang_kernel_mha.py to streamline the test file. * lint fix * Update TVM submodule and refine vectorization logic in loop_vectorize.cc - Updated the TVM submodule to the latest commit. 
- Adjusted the return statements in loop_vectorize.cc to improve expression handling and ensure consistency in the visitor pattern. * Refactor vectorization logic in loop_vectorize.cc - Removed the check for the usage of the vectorized variable in the vectorization logic, simplifying the expression handling. - This change enhances the clarity and efficiency of the vectorization process. * Enhance vectorization checks in loop_vectorize.cc - Added a check to ensure the vectorized expression uses the vectorized variable, improving the robustness of the vectorization logic. - This change refines the expression handling and ensures that only valid vectorized expressions are processed. * Implement non-local buffer checks for loop vectorization in layout_inference.cc - Added logic to check for non-local buffer loads and stores before applying vectorization to loops. This enhancement ensures that vectorization is only applied when appropriate, improving the correctness of the loop transformations. * Refactor buffer handling in pipeline planning and layout inference - Renamed GlobalCopyPatternDetector to BufferRegionCollector for clarity and updated its logic to collect buffer read/write regions. - Enhanced the handling of conditional expressions in pipeline planning, allowing for better management of stages related to conditional statements. - Improved the processing of buffer regions during read/write operations, ensuring accurate tracking of buffer usage across different stages. * Refactor vectorization checks in loop_vectorize.cc - Removed the check for the usage of the vectorized variable in the vectorization logic, simplifying the expression handling. - This change enhances the clarity and efficiency of the vectorization process, ensuring that valid vectorized expressions are processed without unnecessary checks.
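The locking pattern described for the Cython JIT adapter (take a lock, check whether another process already produced the library, clean up temporary files on failure) might look roughly like the sketch below. It uses an exclusive lock file created with `os.open`; the function `compile_once`, the `.lock` and `.tmp` suffixes, and the polling interval are all assumptions for illustration and are not taken from the TileLang sources.

```python
import os
import time
from typing import Callable


def compile_once(lib_path: str, build: Callable[[str], None],
                 timeout_s: float = 60.0) -> bool:
    """Build `lib_path` unless it already exists. Returns True if this call built it."""
    lock_path = lib_path + ".lock"
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(lib_path):
            return False  # someone else already compiled the library
        try:
            # O_EXCL makes creation fail if another process holds the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"timed out waiting for {lock_path}")
            time.sleep(0.05)
            continue
        try:
            if os.path.exists(lib_path):  # re-check now that we hold the lock
                return False
            tmp_path = lib_path + ".tmp"
            try:
                build(tmp_path)                 # write to a temporary file first
                os.replace(tmp_path, lib_path)  # then publish it atomically
            finally:
                if os.path.exists(tmp_path):    # left behind only if the build failed
                    os.remove(tmp_path)
            return True
        finally:
            os.close(fd)
            os.remove(lock_path)
```

Building into a temporary file and renaming it means a reader never observes a half-written library. A lock file left behind by a crashed process would still need separate handling, which this sketch does not attempt.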
-
Zhengju Tang authored
-
- 16 Apr, 2025 6 commits
-
-
Oscar Savolainen authored
* Add bf16 support for AMD in quickstart example * Reduced git diff * Move bf16 vector definition into common.h * Added unit tests for basic AMD bf16 matmul * lint fix --------- Co-authored-by: OscarSavNS <oscar.savolainen@nscale.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
Cunxiao Ni authored
* [Enhancement] Move T.any_of and T.all_of op registration from python into cpp * format * add license
-
Zhengju Tang authored
* [BugFix] Address should be aligned with access size in tail split * Lint * Lint
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.40 to 4.48.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.40.0...v4.48.0) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.48.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
Lei Wang authored
* Update copyright notice in example_mha_bwd_wgmma_pipelined.py to reflect Tile-AI Corporation ownership. * lint fix
-
Lei Wang authored
* make it python 3.8- happy * [Enhancement] Improve loop partitioning and vectorization logic in layout inference and loop vectorization - Enhanced the VisitStmt_ method to support local buffer handling in parallel loops, allowing for register usage without explicit thread binding. - Updated loop vectorization logic to simplify expressions and ensure accurate vector size calculations, improving performance and clarity in the vectorization process. * lint fix * [Refactor] Update warp size checks and enhance warp partitioning logic in GEMM - Changed warp_n size check from 16 to 8 in gemm_layouts.cc to improve compatibility with specific configurations. - Refactored warp partitioning logic in gemm.cc to prioritize N dimension for better performance based on aspect ratio. - Introduced a new CompileArgs dataclass in autotuner to streamline compile argument management and improve code clarity. * lint fix * [Enhancement] Initialize jit_compile in AutoTuner class - Added initialization for jit_compile attribute in the AutoTuner class to ensure it is set to None by default. - Updated the assignment logic for jit_compile to prevent overwriting an existing compile function, enhancing the flexibility of the AutoTuner's compilation process.
-
- 15 Apr, 2025 2 commits
-
-
Lei Wang authored
* make it python 3.8- happy * [Enhancement] Improve loop partitioning and vectorization logic in layout inference and loop vectorization - Enhanced the VisitStmt_ method to support local buffer handling in parallel loops, allowing for register usage without explicit thread binding. - Updated loop vectorization logic to simplify expressions and ensure accurate vector size calculations, improving performance and clarity in the vectorization process. * lint fix
-
Yu Cheng authored
Added detailed error messages in the InferLayout method to provide better context when layout conflicts occur. This includes the body of the operation that triggered the error, aiding in debugging and layout validation.
-
- 14 Apr, 2025 3 commits
-
-
Yu Cheng authored
Updated SyncPatternMap to use vectors for acquire and release, enhancing flexibility in handling synchronization patterns. Improved barrier handling logic for both producer and consumer cases, ensuring accurate synchronization in the pipeline.
-
Lei Wang authored
* [Enhancement][Pipeline] Improve pipeline stage information handling and copy stage detection - Added detailed documentation for the PipelineStageInfo structure to clarify its parameters. - Enhanced the VisitStmt_ method to handle annotations for pipeline order and stage more effectively. - Implemented logic to determine if a stage is used by a copy operation, adjusting the stage assignment accordingly. - Processed the tail copy stage to ensure correct ordering and stage assignment in the pipeline planning process. * lint fix
-
Lei Wang authored
* Update README.md for deepseek_mla: Refine performance comparison details and add acknowledgment section. Adjusted performance metrics for TileLang, highlighting its efficiency over Triton and assembly kernels. Included gratitude to the AMD ROCm team for their contributions. * Update README.md for deepseek_mla: Clarify performance metrics for TileLang, specifying the range of performance parity with hand-optimized assembly kernels. This adjustment enhances the accuracy of the comparative analysis against Triton implementations.
-
- 13 Apr, 2025 2 commits
-
-
Zhengju Tang authored
[Dynamic Symbolic] Add pass_config to customize vectorization and tail split [Pytest Fix] Wrap tests in dynamic benchmark
-
Zhengju Tang authored
* [Dynamic Symbolic] Add pass_config to customize vectorization and tail split * Lint * Only check for vectorized dimension. Add docs. * Lint * Update comment for cache directory in .gitignore * Use CUTLASS convention to represent dynamic alignment. Fix bugs * Add benchmark examples * Add more benchmarks. Fix accumulate type bug. * Lint * Lint * Test Lint * Lint * Test Lint * Lint * Fix typo * Lint * Lint --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-