- 13 Jun, 2025 1 commit
-
-
Lei Wang authored
- Modified the serialization of function scripts in both KernelCache and AutoTunerCache to include metadata by setting `show_meta=True` in `cloudpickle.dumps()`. This change enhances the hash key generation for kernel configurations, improving cache accuracy and consistency.
-
- 11 Jun, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation - Modified AutoTuner to store cache in a dedicated "autotuner" directory. - Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing. - Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation. * lint fix
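The None checks described in this commit can be sketched as below. This is only an illustration: `save_kernel_source` and `outputs_match` are hypothetical helper names rather than TileLang functions, plain Python values stand in for tensors, and skipping a None output during comparison is an assumption about what "gracefully" means here.

```python
import tempfile
from pathlib import Path
from typing import Optional, Sequence


def save_kernel_source(directory: Path, source: Optional[str],
                       filename: str = "kernel.cu") -> bool:
    """Write the kernel source only when there is one; report whether a file was written."""
    if source is None:
        return False
    directory.mkdir(parents=True, exist_ok=True)
    (directory / filename).write_text(source)
    return True


def outputs_match(actual: Sequence, expected: Sequence) -> bool:
    """Compare outputs pairwise, skipping positions where either side is None."""
    for a, e in zip(actual, expected):
        if a is None or e is None:
            continue  # nothing to compare for this output (assumed behaviour)
        if a != e:
            return False
    return True
```

Returning a boolean from the save helper lets a caller log a skipped write without treating it as an error.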
-
- 09 Jun, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Update AutoTuner and JIT compilation arguments * Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities. * Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling. * Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes. * Update tilelang/autotuner/param.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * remove redundant comments * Refactor kernel source retrieval and logging levels * Updated `AutotuneResult` to use `kernel.get_kernel_source()` instead of `kernel.adapter.get_kernel_source()`. * Changed logging level in `KernelCache` from `ERROR` to `DEBUG` for improved verbosity during kernel caching operations. * Removed unnecessary verbose logging in JIT compilation process to streamline output. * Merge branch 'main' of https://github.com/tile-ai/tilelang into bugfix_autotune_0604 * lint fix --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 05 Jun, 2025 1 commit
-
-
Gabriel Wu authored
* [wip] feat: add nvrtc backend * [wip] fix: handle out_idx * [wip] refactor: move lib logic to libgen * feat: cache for nvrtc backend * fmt: run format * fix: handle cuda bindings import error * fix: handle cuda bindings import error * fix: handle cuda bindings import error * fix: handle cuda bindings import error * fix: get kernel source * refactor: speedup pyimport * Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning. * Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity. * Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates. * fix tl_templates * ensure CUDA context --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
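The availability check described in this commit (probe for the optional dependency, then raise `ImportError` with install instructions at initialization) could look roughly like this. `module_available` and `OptionalBackend` are hypothetical names for illustration, and the module name is a parameter so the sketch does not assume how cuda-python is imported.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


class OptionalBackend:
    """Backend that fails fast, with install instructions, when its dependency is missing."""

    def __init__(self, module_name: str, install_hint: str):
        if not module_available(module_name):
            # Raise instead of logging a warning so the failure is seen immediately.
            raise ImportError(
                f"The '{module_name}' module is required for this backend. {install_hint}")
        self.module_name = module_name
```

Raising at construction time, rather than at import time, keeps the rest of the package importable on machines that never use this backend.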
-
- 28 May, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add commit ID to versioning and improve logging initialization * Updated `get_tilelang_version` to include an optional commit ID in the version string. * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process. * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash. * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity. * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging. * [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching * Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability. * Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations. * Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking. * Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process. * Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity. * Minor adjustments in related modules to support the new caching and profiling features. * [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules * Consolidated import statements and removed unnecessary line breaks for better readability. * Standardized function argument formatting across the AutoTuner and CompileArgs classes. * Enhanced consistency in the use of whitespace and indentation throughout the codebase. * Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability. * Ensured that all changes adhere to the project's coding style guidelines. 
* [Refactor] Remove redundant type hints in AutoTuner modules * Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`. * Improved code readability and maintainability by streamlining type imports across the AutoTuner module. * [Refactor] Update AutoTuner configuration for improved profiling and target detection * Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings. * Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection. * Removed redundant target specifications in certain instances to streamline the configuration process. * Improved overall clarity and maintainability of the autotuning logic in various example scripts. * [Refactor] Simplify code formatting and improve readability in example scripts * Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity. * Removed unnecessary line breaks and standardized argument placement across multiple files. * Enhanced overall code readability and maintainability in autotuning examples and performance scripts. * [Refactor] Update JIT decorator usage across multiple files * Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code. * Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase. * Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions. * [Refactor] Standardize JIT decorator formatting across benchmark and example scripts * Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks. * Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts. 
* Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
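The tuning-result caching described in this commit (cache keys plus saving and loading results on disk) can be sketched as follows. The function names, the JSON layout, and the `best.json` filename are all assumptions made for illustration; the log does not say how TileLang stores its results.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


def make_cache_key(kernel_script: str, compile_args: dict, profile_args: dict) -> str:
    """Hash everything that can change the tuning outcome into one hex key."""
    payload = json.dumps(
        {"script": kernel_script, "compile": compile_args, "profile": profile_args},
        sort_keys=True)  # sorted keys make the key independent of dict ordering
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_result(cache_dir: Path, key: str, latency_ms: float, config: dict) -> None:
    """Persist the best latency and configuration under the key's directory."""
    target = cache_dir / key
    target.mkdir(parents=True, exist_ok=True)
    (target / "best.json").write_text(
        json.dumps({"latency": latency_ms, "config": config}))


def load_result(cache_dir: Path, key: str) -> Optional[dict]:
    """Return the stored result, or None on a cache miss."""
    path = cache_dir / key / "best.json"
    return json.loads(path.read_text()) if path.exists() else None
```

A tuner built this way would call `load_result` first and only run the search when it returns None.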
-
- 26 May, 2025 1 commit
-
-
Lei Wang authored
* Updated `get_tilelang_version` to include an optional commit ID in the version string. * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process. * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash. * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity. * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
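A minimal sketch of the three ideas in this commit: reading the commit hash, appending it to the version string, and attaching logger handlers only once. `get_git_commit_id` is named in the log, but the body here is a guess; `format_version`, `get_logger`, the `+` separator, and the eight-character truncation are assumptions.

```python
import logging
import subprocess
from typing import Optional


def get_git_commit_id() -> Optional[str]:
    """Return the current commit hash, or None outside a git checkout or without git."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, OSError):
        return None
    return out.decode("utf-8").strip()


def format_version(base: str, commit_id: Optional[str] = None) -> str:
    """Append a short commit ID when one is available (separator is an assumption)."""
    return f"{base}+{commit_id[:8]}" if commit_id else base


def get_logger(name: str = "autotuner_sketch") -> logging.Logger:
    """Attach a handler only on first use so repeated calls do not duplicate output."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s: %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Without the `if not logger.handlers` guard, every re-initialization would add another handler and each message would be printed several times.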
-
- 06 May, 2025 2 commits
-
-
Lei Wang authored
* [Feature] Add cache directory management functions in tilelang.cache * Introduced `get_cache_dir` and `set_cache_dir` functions to manage the kernel cache directory. * Updated `KernelCache` class to store cache directory as a `Path` object for improved path handling. * Enhanced documentation with examples for new cache directory functions. * [Refactor] Update cache imports in tilelang.__init__.py * Added `set_cache_dir` and `get_cache_dir` functions to the import statement for improved cache directory management. * This change enhances the accessibility of cache directory management functions within the module.
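The `get_cache_dir` and `set_cache_dir` pair named in this commit might be structured roughly as below. The function names come from the log; the `KernelCacheSketch` class, the module-level `_CACHE` object, and the default location are assumptions for illustration only.

```python
import tempfile
from pathlib import Path
from typing import Union


class KernelCacheSketch:
    """Holds the cache directory as a Path, as the commit describes for KernelCache."""

    def __init__(self, cache_dir: Union[str, Path]):
        self.cache_dir = Path(cache_dir)


# Hypothetical default location; the real default is not stated in the log.
_CACHE = KernelCacheSketch(Path(tempfile.gettempdir()) / "kernel_cache_sketch")


def set_cache_dir(path: Union[str, Path]) -> None:
    """Point the shared cache object at a new directory."""
    _CACHE.cache_dir = Path(path)


def get_cache_dir() -> Path:
    """Return the directory currently used by the shared cache object."""
    return _CACHE.cache_dir
```

Storing a `Path` rather than a string means callers can build file names with the `/` operator instead of string concatenation.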
-
Lei Wang authored
* [Enhancement] Introduce pass_configs parameter for kernel compilation * Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations. * Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase. * Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output. * Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution. * lint fix
-
- 27 Apr, 2025 1 commit
-
-
Gabriel Wu authored
* Fix typo * bugfix --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
- 31 Mar, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add support for CUDA architecture 8.9 in GEMM template - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware. - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs. * lintfix * [Refactor] Clean up includes in gemm_sm89.h - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization. - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order. * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access. - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk. - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability. - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management. * lint fix
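The in-memory layer described for KernelCache (check memory first, then disk, and clear both in `clear_cache`) follows a common two-level pattern. The sketch below uses a hypothetical `TwoLevelCache` class with text files standing in for compiled kernels; it does not reproduce TileLang's implementation.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


class TwoLevelCache:
    """Dictionary in front of a directory: memory hit, then disk hit, then build."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self._memory: Dict[str, str] = {}

    def get(self, key: str, build: Callable[[], str]) -> str:
        if key in self._memory:          # 1. already loaded in this process
            return self._memory[key]
        path = self.cache_dir / f"{key}.txt"
        if path.exists():                # 2. produced by an earlier run
            value = path.read_text()
        else:                            # 3. miss: build and persist
            value = build()
            self.cache_dir.mkdir(parents=True, exist_ok=True)
            path.write_text(value)
        self._memory[key] = value
        return value

    def clear_cache(self) -> None:
        """Drop both the in-memory entries and the files on disk."""
        self._memory.clear()
        if self.cache_dir.exists():
            for path in self.cache_dir.glob("*.txt"):
                path.unlink()
```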
-
- 28 Mar, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5. - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding. * [Enhancement] Improve documentation and add detailed docstrings across multiple modules - Updated the `__init__.py` file to enhance module documentation, providing clarity on auto-tuning functionalities. - Added comprehensive docstrings to the `JITContext`, `AutotuneResult`, and `AutoTuner` classes, detailing their attributes and methods. - Enhanced memory allocation utilities in `allocate.py` with detailed descriptions for each allocation function. - Improved documentation for various intrinsic operations in `builtin.py`, `copy.py`, `customize.py`, `frame.py`, `gemm.py`, `memscope.py`, and `reduce.py`, ensuring clear explanations of parameters and return values. - Refactored the `KernelCache` class to improve clarity and maintainability, including detailed comments and docstrings for methods. - Overall, these changes aim to enhance code readability and provide better guidance for future developers and users of the Tile-AI framework.
-
- 25 Mar, 2025 1 commit
-
-
Lei Wang authored
- Changed the cache key generation to use the serialized script of the function instead of the function object itself, improving the uniqueness of cache keys.
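Why the serialized script gives a more reliable key than the function object can be seen in a small sketch. `ScriptedFunc` is a hypothetical stand-in for a kernel function whose `script()` method returns its text form, and SHA-256 is an assumed choice of hash.

```python
import hashlib


class ScriptedFunc:
    """Stand-in for a kernel function object; `script()` returns its text form."""

    def __init__(self, body: str):
        self._body = body

    def script(self) -> str:
        return self._body


def key_from_object(func: ScriptedFunc) -> str:
    """Key based on the default repr, which embeds the object's memory address."""
    return hashlib.sha256(repr(func).encode("utf-8")).hexdigest()


def key_from_script(func: ScriptedFunc) -> str:
    """Key based on the function's content, stable across objects and processes."""
    return hashlib.sha256(func.script().encode("utf-8")).hexdigest()
```

Two objects with identical bodies get the same `key_from_script` value but different `key_from_object` values, so only the script-based key allows a cache hit for an equivalent kernel.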
-
- 23 Mar, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Introduce caching control and frame management in TileLang - Added cache control functions (`enable_cache`, `disable_cache`, `is_cache_enabled`) in `env.py` to manage kernel caching behavior. - Updated `kernel_cache.py` to utilize the cache state, preventing unnecessary kernel compilation when caching is disabled. - Introduced a new `frame.py` module to manage LetFrame instances, including a stack for variable-value mapping and enhanced frame management. - Updated imports in various modules to accommodate new caching and frame functionalities, improving overall organization and clarity. * [Refactor] Clean up and enhance caching and frame management in TileLang - Added spacing for improved readability in `env.py` and `frame.py`. - Refactored `LetFrame` class to enhance clarity in buffer region assignment. - Ensured consistent formatting and organization across caching control and frame management functions. * [Feature] Add matrix multiplication functionality in TileLang - Introduced a new test file `test_tilelang_language_alias.py` that implements a matrix multiplication function using TileLang's primitives. - The `matmul` function defines a kernel for performing tile-level GEMM operations, with support for customizable block sizes and data types. - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation. - Updated `gemm.py` to allow `tir.Buffer` or `tir.Var` as valid argument types for the `gemm` function, enhancing flexibility in argument handling. * [Refactor] Improve formatting and readability in test_tilelang_language_alias.py - Adjusted spacing and alignment in the `matmul` and `run_matmul` functions for better readability. - Cleaned up unnecessary blank lines and ensured consistent formatting throughout the file. - Enhanced overall code clarity without altering functionality.
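The cache switch named in this commit (`enable_cache`, `disable_cache`, `is_cache_enabled`) can be sketched with a module-level flag. The three function names come from the log; the `cached` helper, the in-memory `_STORE`, and the choice to bypass the cache entirely when it is disabled are assumptions.

```python
from typing import Callable, Dict

_CACHE_ENABLED = True
_STORE: Dict[str, object] = {}


def enable_cache() -> None:
    global _CACHE_ENABLED
    _CACHE_ENABLED = True


def disable_cache() -> None:
    global _CACHE_ENABLED
    _CACHE_ENABLED = False


def is_cache_enabled() -> bool:
    return _CACHE_ENABLED


def cached(key: str, compile_fn: Callable[[], object]) -> object:
    """Consult the store only when caching is on; otherwise compile directly."""
    if not is_cache_enabled():
        return compile_fn()  # no lookup and no store while disabled
    if key not in _STORE:
        _STORE[key] = compile_fn()
    return _STORE[key]
```

Entries stored before `disable_cache()` are kept, so re-enabling the cache makes them available again.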
-
- 22 Mar, 2025 2 commits
-
-
You Jiacheng authored
* move compilation outside critical section * lint fix --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
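Moving compilation outside the critical section generally means holding the lock only for the lookup and for publishing the result. The sketch below shows that pattern with hypothetical names (`get_or_compile`, `_KERNELS`); it is not taken from the TileLang source.

```python
import threading
from typing import Callable, Dict

_LOCK = threading.Lock()
_KERNELS: Dict[str, object] = {}


def get_or_compile(key: str, compile_fn: Callable[[], object]) -> object:
    with _LOCK:  # short critical section: lookup only
        if key in _KERNELS:
            return _KERNELS[key]
    kernel = compile_fn()  # slow work runs without holding the lock
    with _LOCK:  # short critical section: publish the result
        # If another thread finished first, keep its kernel so all callers agree.
        return _KERNELS.setdefault(key, kernel)
```

The trade-off is that two threads may compile the same kernel at once and one result is discarded, in exchange for never blocking unrelated lookups behind a long compile.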
-
Lei Wang authored
* Add GPU kernel for 2D continuous cumulative sum in TileLang example - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum. - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations. - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations. - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function. * Refactor TileLang examples and enhance kernel compilation - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking. - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability. - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management. - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames. - Cleaned up import statements in `__init__.py` for better organization and clarity. * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example - Added additional spacing for improved readability in `example_tilelang_cumsum.py`. - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
-
- 20 Mar, 2025 1 commit
-
-
Lei Wang authored
* remove llvm build * [Refactor] Update kernel compilation and profiling in examples - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation. - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency. - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations. * lint fix * License Update * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields. - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability. * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files - Improved comment alignment and readability in `cuda.h`. - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability. * lint fix * lint fix * lint fix * lint fix * fix * License update * [Enhancement] Update JITKernel to use artifact for kernel source - Assigned the generated artifact to `self.artifact` for better management. - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling. * lint fix * Add @tilelang.testing.requires_llvm decorator to vectorization tests * Enhance setup.py and env.py for library management - Added functionality to remove original files after copying in CMakeBuild. - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration. * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py * Refactor CMakeBuild file handling in setup.py - Added a check to ensure the target library directory exists before copying .so files. - Improved the logic for creating the target directory and copying files to enhance robustness. 
* bugfix * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement. * lint fix * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility. * lint fix * Add support for C target in device code generation - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function. * [Enhancement] Implement auto-clear cache feature based on environment variable * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing. * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing. * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true. * [Refactor] Update kernel invocation and import paths in tests and cache * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result. * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`. * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability. * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class. * Enhanced overall code formatting to align with project standards. * [Enhancement] Add bfloat16 test case and improve kernel caching logic * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`. * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading. 
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management. * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database. * Improved code formatting and readability across several files. * lint fix * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
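The auto-clear behaviour controlled by `TILELANG_CLEAR_CACHE` could be implemented along these lines. Only the variable name comes from the log; `should_clear_cache`, `init_cache`, and the set of values treated as true are assumptions.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Mapping


def should_clear_cache(env: Mapping[str, str] = os.environ) -> bool:
    """Treat a few common spellings as true; the accepted set is an assumption."""
    value = env.get("TILELANG_CLEAR_CACHE", "0")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def init_cache(cache_dir: Path, env: Mapping[str, str] = os.environ) -> Path:
    """Create the cache directory, wiping any previous contents when requested."""
    if should_clear_cache(env) and cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

Passing the environment as a parameter keeps the sketch testable; a CI job would simply export the variable before running the test suite.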
-
- 19 Mar, 2025 1 commit
-
-
alex_xiao authored
* [Dev] Add database mechanism to cache * [Dev] Fix database cache and test for it * [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment. * [Refactor] Improve code formatting and readability in multiple files * [Enhancement] Add execution backend options and improve kernel adapter initialization * [Refactor] Rename cached function to cached_kernel and update related references * [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case * [Enhancement] Update kernel compilation to specify execution backend as "cython" * [Refactor] Rename cached_kernel to cached and update references in the codebase * [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code * [Refactor] Clean up code formatting and improve readability in cache and adapter modules * [Refactor] Remove unused imports * [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety * [Refactor] Update cached function calls to use PrimFunc and improve parameter handling * [Refactor] Clean up import statements and improve code formatting in cache and kernel test files * Update tilelang/jit/kernel.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-