"src/vscode:/vscode.git/clone" did not exist on "ba410ae3525f8506468325c70e7936b7c4f0a225"
- 23 Mar, 2025 3 commits
-
-
Lei Wang authored
* Bump version to 0.1.3 * Refactor Docker script to streamline installation commands - Removed the installation of the Python environment and CMake from the Docker run command, simplifying the execution process. - Updated the command to focus on pip installation and running tox for testing across multiple Python versions.
-
Lei Wang authored
- Updated `ref_program` in `benchmark_matmul.py` to remove the unused parameter `C`, simplifying the function signature. - Changed logging level in `autotuner/__init__.py` from `INFO` to `DEBUG` for more detailed logging during autotuning. - Modified the error handling in the autotuner to provide clearer messages and log errors at the debug level. - Enhanced error reporting in the JIT adapter by adding detailed context to error messages in `cython_wrapper.pyx` when kernel calls fail.
-
Lei Wang authored
* [Enhancement] Introduce caching control and frame management in TileLang - Added cache control functions (`enable_cache`, `disable_cache`, `is_cache_enabled`) in `env.py` to manage kernel caching behavior. - Updated `kernel_cache.py` to utilize the cache state, preventing unnecessary kernel compilation when caching is disabled. - Introduced a new `frame.py` module to manage LetFrame instances, including a stack for variable-value mapping and enhanced frame management. - Updated imports in various modules to accommodate new caching and frame functionalities, improving overall organization and clarity. * [Refactor] Clean up and enhance caching and frame management in TileLang - Added spacing for improved readability in `env.py` and `frame.py`. - Refactored `LetFrame` class to enhance clarity in buffer region assignment. - Ensured consistent formatting and organization across caching control and frame management functions. * [Feature] Add matrix multiplication functionality in TileLang - Introduced a new test file `test_tilelang_language_alias.py` that implements a matrix multiplication function using TileLang's primitives. - The `matmul` function defines a kernel for performing tile-level GEMM operations, with support for customizable block sizes and data types. - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation. - Updated `gemm.py` to allow `tir.Buffer` or `tir.Var` as valid argument types for the `gemm` function, enhancing flexibility in argument handling. * [Refactor] Improve formatting and readability in test_tilelang_language_alias.py - Adjusted spacing and alignment in the `matmul` and `run_matmul` functions for better readability. - Cleaned up unnecessary blank lines and ensured consistent formatting throughout the file. - Enhanced overall code clarity without altering functionality.
-
- 22 Mar, 2025 5 commits
-
-
Chaofan Lin authored
* fix tune args * lint * Refactor gemm example and autotuner logging - Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter. - Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity. - Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility. - Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds. * Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, simplifying the implementation by removing the unused parameter `C`. --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Yichen Yan authored
* use auditwheel to get correct manylinux wheels * fix * make py3.8 happy * trivial updates * Add typing.Tuple import and update annotations * fmt * Remove unused import and update type hints * lint fix --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
You Jiacheng authored
* move compilation outside critical section * lint fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Add GPU kernel for 2D continuous cumulative sum in TileLang example - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum. - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations. - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations. - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function. * Refactor TileLang examples and enhance kernel compilation - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking. - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability. - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management. - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames. - Cleaned up import statements in `__init__.py` for better organization and clarity. * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example - Added additional spacing for improved readability in `example_tilelang_cumsum.py`. - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations. * Refactor CUDA post-processing callback registration in TileLang - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility. - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability. - Added debug prints to the CUDA code generation process for better traceability during development. - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation. - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling. * lint fix * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio. - Implemented a function to generate various kernel configurations for autotuning. - Refactored the main execution block to support both autotuned and default configurations. - Improved the block mask generation to accommodate specified sparsity levels. - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
-
Lei Wang authored
* Add GPU kernel for 2D continuous cumulative sum in TileLang example - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum. - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations. - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations. - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function. * Refactor TileLang examples and enhance kernel compilation - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking. - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability. - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management. - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames. - Cleaned up import statements in `__init__.py` for better organization and clarity. * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example - Added additional spacing for improved readability in `example_tilelang_cumsum.py`. - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
-
- 21 Mar, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT - Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters. - Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing. - Updated test cases to validate the new matrix multiplication implementations. - Enhanced error handling in library initialization and compilation processes across various modules. - Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting. * lint fix * optimize * Support var defiine * lint fix * Update TVM submodule and add alloc_variable function to allocate local variables in TileLang - Updated the TVM submodule to the latest commit. - Introduced `alloc_variable` function in `allocate.py` to support local variable allocation with specified data types and scopes. * lint fix * Refactor variable allocation functions for consistency - Renamed `alloc_variable` to `alloc_var` across multiple files for improved consistency. - Updated corresponding test functions to reflect the new naming convention. - Adjusted imports in `__init__.py` to align with the changes.
-
yyttt6 authored
* add autotune to example_gemm.py * add autotune to example_gemm.py * add autotune to example_gemm.py * add autotune to example_gemm.py
-
- 20 Mar, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT - Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters. - Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing. - Updated test cases to validate the new matrix multiplication implementations. - Enhanced error handling in library initialization and compilation processes across various modules. - Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting. * lint fix * optimize
-
Lei Wang authored
-
Lei Wang authored
* remove llvm build * [Refactor] Update kernel compilation and profiling in examples - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation. - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency. - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations. * lint fix * License Update * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields. - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability. * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files - Improved comment alignment and readability in `cuda.h`. - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability. * lint fix * lint fix * lint fix * lint fix * fix * License update * [Enhancement] Update JITKernel to use artifact for kernel source - Assigned the generated artifact to `self.artifact` for better management. - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling. * lint fix * Add @tilelang.testing.requires_llvm decorator to vectorization tests * Enhance setup.py and env.py for library management - Added functionality to remove original files after copying in CMakeBuild. - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration. * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py * Refactor CMakeBuild file handling in setup.py - Added a check to ensure the target library directory exists before copying .so files. - Improved the logic for creating the target directory and copying files to enhance robustness. * bugfix * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement. * lint fix * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility. * lint fix * Add support for C target in device code generation - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function. * [Enhancement] Implement auto-clear cache feature based on environment variable * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing. * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing. * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true. * [Refactor] Update kernel invocation and import paths in tests and cache * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result. * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`. * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability. * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class. * Enhanced overall code formatting to align with project standards. * [Enhancement] Add bfloat16 test case and improve kernel caching logic * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`. * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading. * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management. * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database. * Improved code formatting and readability across several files. * lint fix * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
-
- 19 Mar, 2025 5 commits
-
-
Chenghua authored
* [Example] Modify tuning configurations for FlashAttention example * [Examples] formatting example_gqa_fwd_bshd.py * [Examples] Implement elementwise add kernel * [Doc] Update ElementWise Operators document * [Examples] Replace the example of elementwise add.
-
alex_xiao authored
* [Dev] Add database mechanism to cache * [Dev] Fix database cache and test for it * [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment. * [Refactor] Improve code formatting and readability in multiple files * [Enhancement] Add execution backend options and improve kernel adapter initialization * [Refactor] Rename cached function to cached_kernel and update related references * [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case * [Enhancement] Update kernel compilation to specify execution backend as "cython" * [Refactor] Rename cached_kernel to cached and update references in the codebase * [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code * [Refactor] Clean up code formatting and improve readability in cache and adapter modules * [Refactor] Remove unused imports * [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety * [Refactor] Update cached function calls to use PrimFunc and improve parameter handling * [Refactor] Clean up import statements and improve code formatting in cache and kernel test files * Update tilelang/jit/kernel.py --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yuxi Chi authored
[Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to `minBlocksPerMultiprocesor ` (#248)
-
Yu Cheng authored
* [Enhancement] Add zero initialization option to GEMM operations - Introduced a new `zero_init` parameter to the GEMM function, allowing for optional zero initialization of the accumulator. - Updated the GEMM implementation across various CUDA architectures to support the new parameter. - Modified the Python interface for GEMM to include the `zero_init` argument, enhancing flexibility in kernel execution. - Ensured compatibility with existing functionality while improving initialization control for performance optimization. * rename zero_init to clear_accum * lint
-
Wenhao Xie authored
* [Typo] Fix formatting in installation instructions in README.md * [Enhancement] Improve CUDA path detection and update configuration handling * fix typo * remove IS_WINDOWS constant * lint fix * Improve error messages for CUDA detection failure * lint fix * lint fix * Fix .gitignore to correctly include venv directory * [Doc] Add instructions for installing nightly version of TileLang * update installation instructions * update install instruction * fix bug of mismatching dtype in testing and set the default value of check_dtype in torch_assert_close to true * lint fix * fix bug * use map_torch_type
-
- 18 Mar, 2025 4 commits
-
-
Yu Cheng authored
* [BugFix] Fix bug of missing MBarrierExpectTX * [Dev] Implement FlashAttention3 Backward - Added a new example for Flash Attention using pipelined WGMMA, including forward and backward pass implementations. - Introduced functions for forward and backward processing, leveraging tilelang for optimized tensor operations. - Enhanced the attention mechanism with support for both causal and non-causal configurations. - Included command-line arguments for batch size, number of heads, context size, and head dimension for flexibility in testing. - Updated GEMM operations to support a new `wg_wait` parameter for improved synchronization in kernel execution.
-
Wenhao Xie authored
* [Typo] Fix formatting in installation instructions in README.md * [Enhancement] Improve CUDA path detection and update configuration handling * fix typo * remove IS_WINDOWS constant * lint fix * Improve error messages for CUDA detection failure * lint fix * lint fix * Fix .gitignore to correctly include venv directory * [Doc] Add instructions for installing nightly version of TileLang * update installation instructions * update install instruction * [Refactor] Enhance tensor comparison by integrating attribute checks and equalization in torch_assert_close * add mismatch info * set the default value of check_dtype to false * fix import * set equal_nan to true by default in torch_assert_close
-
Lei Wang authored
* [Feature] Add reduce_max functionality and corresponding tests * Introduced a new test file for the reduce_max operation in the tilelang language module. * Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying. * Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation. * Enhanced profiling assertions to validate the output against reference implementations. * Fix whitespace issues in reduce_max test file for improved readability * [Refactor] Update DebugOutput methods to return strings instead of void * Modified DebugOutput methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to return std::string instead of void, enhancing usability for logging and debugging. * Updated corresponding header files to reflect the new return types. * Improved layout inference error messages by incorporating DebugOutput for better clarity in layout conflicts. * lint fix * Fix typo in matmul function: changed loop from T.Parallel to T.grid for correct parallel execution in webgpu code generation tests. * [Enhancement] Improve layout inference conflict handling in ParallelOp * Updated the layout inference logic in ParallelOp to better handle conflicts for local.fragment buffers. * Added checks to ensure that layout conflicts are reported only when both source and destination buffers are defined, improving clarity in error messages. * Enhanced the overall robustness of the layout inference process by addressing specific cases where conflicts may arise. * [Feature] Add IsEqual methods for layout comparison * Introduced IsEqual methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to facilitate structural equality checks, allowing for optional index comparison. * Enhanced layout inference logic in Copy and ParallelOp to utilize the new IsEqual methods for better conflict detection in local.fragment layouts. * Improved error messages for layout conflicts to provide clearer guidance on potential issues.houm * [Refactor] Update profiler usage in benchmark_nsa_fwd.py and improve layout inference in elem.cc and parallel.cc * Modified the profiler call in benchmark_nsa_fwd.py to streamline latency measurement. * Updated layout inference logic in elem.cc and parallel.cc to use const pointers for FragmentNode, enhancing type safety and clarity. * Improved error messages in layout conflict checks to provide better guidance on potential issues. * [Refactor] Clean up pointer formatting in layout inference files * Standardized pointer formatting for FragmentNode in elem.cc and parallel.cc to improve code readability. * Minor adjustments to error message formatting in layout conflict checks for better clarity.
-
Yu Cheng authored
-
- 17 Mar, 2025 7 commits
-
-
Lei Wang authored
* [Feature] Add reduce_max functionality and corresponding tests * Introduced a new test file for the reduce_max operation in the tilelang language module. * Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying. * Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation. * Enhanced profiling assertions to validate the output against reference implementations. * Fix whitespace issues in reduce_max test file for improved readability
-
Lei Wang authored
* [Enhancement] Simplify kernel source extraction in JIT adapters - Updated CtypesKernelAdapter and CythonKernelAdapter to include kernel_global_source for improved source code retrieval. - Modified the _legalize_result_idx method in BaseKernelAdapter to accept Optional[List[int]] for better flexibility. - Added comments to clarify the purpose of kernel_global_source in both adapters, enhancing code readability and maintainability. * [Refactor] Update parameter type in CythonKernelAdapter constructor - Changed the parameter type from List[TensorType] to List[KernelParam] in the CythonKernelAdapter's __init__ method to enhance type consistency and align with recent refactoring efforts across modules.
-
Lei Wang authored
- Renamed `clamp` to `clamp_within_bounds` and `clamp_v2` to `clamp_value_range` for improved clarity. - Updated dtype handling in `clamp_value_range` to use the correct tensor dtype. - Modified test cases to reflect the new function names and ensure proper dtype conversion using `map_torch_type`. - Enhanced the reference program for clamping to utilize the updated tensor dtype, improving accuracy in tests.
-
Lei Wang authored
* [Refactor] Enhance argument handling in TLCUDASourceWrapper and TLCPUSourceWrapper - Updated `func_call_args` to accept an optional `desc_name_map` parameter for improved descriptor handling. - Modified `generate_tma_descriptor_args` to utilize the `desc_name_map`, ensuring correct mapping of descriptor names. - Cleaned up the logic for generating function call arguments and TMA descriptor initialization, enhancing code clarity and maintainability. * lint fix
-
Lei Wang authored
* Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support. - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types. - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture. - Include necessary header files for built-in operations and layout inference improvements. * Refactor parameter formatting in CUDA matrix load functions for consistency - Adjusted parameter alignment in `ptx_ldmatrix_x1`, `ptx_ldmatrix_x2`, `ptx_ldmatrix_x4`, and their transposed counterparts for improved readability. - Added a blank line in `get_tensor_supply` function in `tensor.py` to enhance code clarity. * Enhance tensor supply generation in `get_tensor_supply` function - Introduced handling for unsigned integer and float8 tensor types, allowing for specific random tensor generation based on data type. - Updated logic to return appropriate random tensors for different data types, improving flexibility and functionality of tensor supply generation. - Refactored existing conditions for clarity and maintainability. * Fix tensor supply generation logic in `get_tensor_supply` function - Updated the variable reference from `tensor` to `param` to ensure correct handling of tensor data types. - Improved the accuracy of unsigned integer and float8 checks for tensor supply generation, enhancing functionality and reliability. * Enhance tensor supply checks in `get_tensor_supply` function - Updated the logic for identifying unsigned integers and float8 types by using `removeprefix` on the dtype string, improving accuracy in tensor supply generation. - Ensured better handling of tensor data types for more reliable random tensor generation based on the updated checks. * Enhance KernelParam functionality and improve tensor supply checks - Added methods `is_unsigned` and `is_float8` to the `KernelParam` class for better type identification of parameters. - Updated the `get_tensor_supply` function to utilize the new methods, improving clarity and accuracy in tensor supply generation based on parameter types.
-
Wenhao Xie authored
* [Typo] Fix formatting in installation instructions in README.md * [Enhancement] Improve CUDA path detection and update configuration handling * fix typo * remove IS_WINDOWS constant * lint fix * Improve error messages for CUDA detection failure * lint fix * lint fix * Fix .gitignore to correctly include venv directory * [Doc] Add instructions for installing nightly version of TileLang * update installation instructions * update install instruction
-
Yuxi Chi authored
* add fp8 gemm 2xAcc and deepgemm example. * format deepgemm example. * fix the fotmat lint. * format with the updated format.sh
-
- 16 Mar, 2025 3 commits
-
-
Yu Cheng authored
- Replaced instances of `tilelang.lower` and `tilelang.Profiler` with `tilelang.compile` and the new profiler interface in multiple example files. - Enhanced the kernel compilation process to utilize the updated API, improving consistency and maintainability. - Adjusted benchmarking logic to use the new profiler methods for better clarity and functionality in performance testing. - Cleaned up whitespace and improved formatting for better readability across the modified files.
-
zqh-wz authored
* add test for issue 101 * use ss_smem_selector from cutlass * fix mismatch between smem layout and mma * only fix for sm90 * Add CUDA requirements to GEMM thread tests * lint fix --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Lei Wang authored
* [Refactor] Update KernelParam integration across modules - Replaced instances of TensorType with KernelParam in various modules to standardize parameter handling. - Updated JITKernel, BaseKernelAdapter, and CythonKernelAdapter to utilize KernelParam for improved type consistency. - Enhanced Profiler class to include KernelParam in its parameters, ensuring better integration with the new parameter structure. - Adjusted tensor handling in utility functions to accommodate the new KernelParam type, improving overall code clarity and maintainability. - Updated copyright headers to reflect the correct organization. * [Refactor] Clean up whitespace in kernel, profiler, and tensor modules - Added blank lines for improved readability in kernel.py, __init__.py, and tensor.py. - Enhanced code clarity by ensuring consistent formatting across these modules. * [Enhancement] Add detailed docstrings to KernelParam and Profiler classes - Enhanced KernelParam class with comprehensive docstrings for better understanding of its purpose and methods. - Updated Profiler class to include detailed docstrings for its attributes and methods, improving code documentation and usability. - Removed unused do_bench function to streamline the profiler module and improve clarity. * [Refactor] Update type hints in do_bench function and clean up whitespace in profiler module - Changed type hints for grad_to_none and quantiles parameters in do_bench function to use Optional for better clarity. - Added a blank line in __init__.py for improved readability and consistency in the profiler module. * [Refactor] Update type hint in do_bench function for consistency - Changed the return type hint in the do_bench function from a union type to a more explicit List type for better clarity and consistency in type annotations. * [Refactor] Update return type hint in do_bench function for clarity - Changed the return type hint in the do_bench function from a union type to Union[float, List[float]] for improved clarity and consistency in type annotations. * [Enhancement] Add func property to Profiler class for adapter access - Introduced a new property `func` in the Profiler class to provide access to the adapter, ensuring that the adapter is set before retrieval. This enhancement improves the usability of the Profiler class by allowing easier access to the adapter functionality. * [Refactor] Update kernel compilation and profiling in tests - Replaced instances of `TL.lower` and `TL.Profiler` with `tilelang.compile` and the new profiler interface across multiple test files. - Enhanced the kernel compilation process to utilize the updated API, improving consistency and maintainability in the testing framework. - Updated assertions to use the new profiler methods for better clarity and functionality in performance testing. * [Refactor] Simplify kernel invocation and remove unused parameters in tests - Updated the kernel invocation in `test_tilelang_dynamic_symbolic.py` to directly assign the result to `C`, improving clarity. - Removed the `execution_backend` parameter from `tilelang.compile` calls in `test_tilelang_jit_callback.py` and `test_tilelang_jit_gemm.py` for consistency with the updated API. - Commented out the call to `tilelang.testing.main()` in `test_tilelang_jit_callback.py` and replaced it with a direct call to `test_gemm_jit_kernel()` to streamline test execution. - Adjusted the dtype mapping in `TorchDLPackKernelAdapter` to use the parameter's dtype directly, enhancing code simplicity. * [Refactor] Remove unused imports in test files for cleaner code - Eliminated unnecessary imports of `tilelang` as `TL` in various test files to enhance code clarity and maintainability. - Updated multiple test files to streamline the codebase and reduce potential confusion from unused references. * [Refactor] Simplify kernel invocation in tilelang kernel test - Updated the kernel invocation in `test_tilelang_kernel_bf16_gemm_mma.py` to directly assign the result to `C`, enhancing code clarity and consistency with recent changes in the API. * [Refactor] Simplify kernel invocation in tilelang kernel tests - Updated kernel invocations in multiple test files to directly assign the result to `C`, improving code clarity and consistency with the updated API. - Removed unnecessary initialization of `C` as a zero tensor, streamlining the code further. * [Refactor] Update kernel invocation in tilelang transform tests - Replaced the use of `TL.Profiler` with `tilelang.compile` in `test_tilelang_transform_simplify.py`, enhancing code clarity and consistency with the updated API. - Streamlined the kernel invocation process by directly assigning the result to `C`, improving readability and maintainability of the test code.
-
- 15 Mar, 2025 2 commits
-
-
Lei Wang authored
- Modified `format.sh` to first check for the upstream repository before falling back to the local `origin/main` branch for determining the merge base. - Improved logic to handle cases where the upstream repository is not available, ensuring robust file formatting across branches.
-
Lei Wang authored
* [Enhancement] Improve device handling in Cython kernel adapter and wrapper - Updated `CythonKernelAdapter` to support dynamic device assignment based on target type (CUDA, HIP, or CPU). - Enhanced `CythonKernelWrapper` to include device management, ensuring tensors are allocated on the correct device. - Added error handling for unsupported target types to improve robustness. * [Enhancement] Add buffer device mapping in Cython kernel adapter and wrapper - Introduced `buffer_device_map` in `CythonKernelAdapter` to associate buffer variables with their respective devices. - Updated `CythonKernelWrapper` to utilize the new buffer device mapping for device checks during tensor allocation. - Enhanced error handling for device mismatches to ensure tensors are allocated on the correct device, improving robustness and flexibility in device management.
-
- 14 Mar, 2025 6 commits
-
-
Lei Wang authored
* [Feature] Add reshape and view functionalities to tilelang language module - Introduced new test files for reshape and view operations in the tilelang language. - Implemented reshape and view functions in the customize module, enhancing buffer manipulation capabilities. - Updated the language initialization to include the new functionalities. - Removed unnecessary import from test_tilelang_language_clamp.py for cleaner code. * Update copyright to Tile-AI Corporation * [Refactor] Clean up whitespace in test files for reshape and view functionalities - Removed unnecessary blank lines in `test_tilelang_language_reshape.py` and `test_tilelang_language_view.py` for improved readability. - Ensured consistent formatting across test files to enhance code clarity.
-
Yu Cheng authored
* [Dev] Implement IfStmtBinding and MergeIfStmt transformations - Add IfStmtBinding to bind If statements to each statement in SeqStmt, enhancing the handling of conditional statements. - Introduce MergeIfStmt to merge consecutive If statements within SeqStmt, optimizing the structure of conditional logic. - Update phase.py to apply IfStmtBinding and MergeIfStmt transformations for the "sm_90" target. - Enhance __init__.py with new functions for IfStmtBinding and MergeIfStmt, providing a clear interface for these transformations. * Update license header in if_stmt_binding.cc * Update license header in merge_if_stmt.cc --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yuxuan Hu authored
-
Lei Wang authored
* Optimize CMake build process with dynamic job count calculation - Modify build_csrc function to use 90% of available CPU cores - Ensure at least one job is used during compilation - Improve build performance by dynamically adjusting parallel job count * Optimize build_csrc function with multiprocessing module - Replace os.cpu_count() with multiprocessing.cpu_count() - Maintain existing 90% CPU utilization logic - Improve CPU core count calculation for build process * Add dynamic shape support with out_idx in Cython JIT kernel compilation - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation - Add support for resolving dynamic shape dimensions using input tensor references - Enhance flexibility of JIT kernel compilation with symbolic shape handling * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map - Improve debugging by providing context about missing symbolic dimensions - Maintain existing dynamic shape resolution logic * Fix Copy operation handling for scalar and multi-dimensional tensors - Add special handling for scalar tensor copy operations - Enhance error reporting in MakeIndices method with more detailed diagnostic information - Improve SIMT loop generation to support zero-dimensional tensors - Add explicit check and handling for scalar tensor scenarios * Refactor Copy operation code formatting and improve readability - Improve code formatting in MakeIndices and MakeSIMTLoop methods - Add line breaks to enhance readability of complex ICHECK statements - Simplify code structure in scalar tensor handling - Remove unnecessary whitespace and improve code alignment * Simplify GEMM example with direct kernel compilation - Update copyright header to Tile-AI Corporation - Remove Profiler import and usage - Replace tilelang.lower() with tilelang.compile() - Simplify kernel execution workflow - Update kernel source retrieval method * Enhance block sparse attention implementation - Update `blocksparse_flashattn` to use 2 stages for improved performance. - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency. - Modify condition checks in the kernel to utilize boolean values. - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention. - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling. * Refactor and clean up code formatting across multiple files - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`. - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`. - Updated comments and documentation for clarity in `__init__.py` and `phase.py`. - Ensured consistent formatting and style across the codebase. * Add kernel source printing in example_tilelang_nsa_fwd.py and implement IfThenElse node replacement in inject_pipeline.cc - Added a print statement to output the kernel source in `example_tilelang_nsa_fwd.py` for debugging purposes. - Introduced a new function `replace_if_then_else` in `inject_pipeline.cc` to transform IfThenElse nodes while preserving attributes, enhancing the handling of conditional statements in the pipeline. * Refactor condition handling in inject_pipeline.cc - Change the data structure for mapping conditions to statements from a Map to an Array for improved performance and simplicity. - Update condition comparison logic to use StructuralEqual for better accuracy. - Enhance logging to provide detailed insights into condition changes and statement processing. - Adjust final statement construction to utilize the new data structure, ensuring correct handling of conditions and statements. * Improve logging and formatting in inject_pipeline.cc - Enhance logging statements for better clarity on condition changes and statement processing. - Adjust formatting for improved readability, including line breaks and consistent spacing. - Ensure accurate condition comparison and handling in the pipeline logic. * Refactor logging and clean up inject_pipeline.cc - Remove excessive logging statements to streamline the code and improve performance. - Simplify condition handling by eliminating unnecessary log outputs related to condition changes and statement processing. - Maintain the core functionality while enhancing code readability and maintainability. * Update Dockerfiles to specify exact version of libstdcxx-ng - Change installation command in multiple Dockerfiles to use `libstdcxx-ng=12` instead of `libstdcxx-ng-12` for consistency and to avoid potential issues with package resolution. - Ensure all Dockerfiles from cu118 to cu126 reflect this change for uniformity across builds. * Refactor and enhance examples and kernel handling - Adjusted the pipeline stages in `example_blocksparse_gemm.py` from 2 to 1 for improved performance. - Added kernel source printing in `benchmark_nsa_fwd.py` for better debugging and profiling insights. - Updated tensor allocation and parameter handling in `CtypesKernelAdapter` and `CythonKernelWrapper` to cache parameter dtypes and shapes, improving efficiency and clarity. - Enhanced the handling of dynamic shapes in the Cython JIT kernel compilation process. - Modified the benchmark script to accommodate new tensor output parameters and improved batch size defaults for testing. * Update copyright header in Cython wrapper to reflect Tile-AI Corporation * revert change
-
Chenghua authored
* [Example] Modify tuning configurations for FlashAttention example * [Examples] formatting example_gqa_fwd_bshd.py
-
Lei Wang authored
* Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging. * Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support. - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types. - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture. - Include necessary header files for built-in operations and layout inference improvements. * lint fix * Remove unused builtin.h include directive * Update include path for builtin.h
-