- 23 Mar, 2025 1 commit
Lei Wang authored
  - Updated `ref_program` in `benchmark_matmul.py` to remove the unused parameter `C`, simplifying the function signature.
  - Changed logging level in `autotuner/__init__.py` from `INFO` to `DEBUG` for more detailed logging during autotuning.
  - Modified the error handling in the autotuner to provide clearer messages and log errors at the debug level.
  - Enhanced error reporting in the JIT adapter by adding detailed context to error messages in `cython_wrapper.pyx` when kernel calls fail.
-
- 22 Mar, 2025 1 commit
Chaofan Lin authored
* fix tune args
* lint
* Refactor gemm example and autotuner logging
  - Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter.
  - Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity.
  - Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility.
  - Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds.
* Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, simplifying the implementation by removing the unused parameter `C`.

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
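For readers unfamiliar with the `result_idx` change, the bounds check described above could look roughly like the sketch below. It is not the code from `BaseKernelAdapter`: the helper name `normalize_result_idx`, its `num_params` argument, and the handling of negative indices are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Union


def normalize_result_idx(result_idx: Optional[Union[int, Sequence[int]]],
                         num_params: int) -> List[int]:
    """Return result indices as non-negative ints, or raise ValueError.

    Hypothetical helper: the name and signature are not taken from tilelang.
    """
    if result_idx is None:
        return []
    # Accept either a single index or a sequence of indices.
    indices = [result_idx] if isinstance(result_idx, int) else list(result_idx)
    normalized = []
    for idx in indices:
        # Valid range mirrors Python indexing: -num_params <= idx < num_params.
        if not -num_params <= idx < num_params:
            raise ValueError(
                f"result_idx {idx} is out of range for {num_params} parameters")
        # Map negative indices (counted from the end) to their positive form.
        normalized.append(idx % num_params)
    return normalized
```

With three kernel parameters, `normalize_result_idx(-1, 3)` yields `[2]`, while an index of `3` is rejected before any kernel is launched.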
-
- 20 Mar, 2025 1 commit
Lei Wang authored
* remove llvm build
* [Refactor] Update kernel compilation and profiling in examples
  - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
  - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
  - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
* lint fix
* License Update
* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
  - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
  - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
  - Improved comment alignment and readability in `cuda.h`.
  - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
* lint fix
* lint fix
* lint fix
* lint fix
* fix
* License update
* [Enhancement] Update JITKernel to use artifact for kernel source
  - Assigned the generated artifact to `self.artifact` for better management.
  - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
* lint fix
* Add @tilelang.testing.requires_llvm decorator to vectorization tests
* Enhance setup.py and env.py for library management
  - Added functionality to remove original files after copying in CMakeBuild.
  - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
* Refactor CMakeBuild file handling in setup.py
  - Added a check to ensure the target library directory exists before copying .so files.
  - Improved the logic for creating the target directory and copying files to enhance robustness.
* bugfix
* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
* lint fix
* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
* lint fix
* Add support for C target in device code generation
  - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
* [Enhancement] Implement auto-clear cache feature based on environment variable
  * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
  * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
  * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
* [Refactor] Update kernel invocation and import paths in tests and cache
  * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
  * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
  * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
  * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
  * Enhanced overall code formatting to align with project standards.
* [Enhancement] Add bfloat16 test case and improve kernel caching logic
  * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
  * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
  * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
  * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
  * Improved code formatting and readability across several files.
* lint fix
* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
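The auto-clear cache behaviour driven by TILELANG_CLEAR_CACHE can be pictured with the minimal sketch below. Only the environment variable name comes from the commit message; the function names `should_clear_cache` and `init_cache`, the set of accepted truthy strings, and the directory layout are assumptions, not the project's implementation.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Assumption: which strings count as "true" is not stated in the commit.
_TRUTHY = {"1", "true", "yes", "on"}


def should_clear_cache(environ=os.environ) -> bool:
    """Report whether TILELANG_CLEAR_CACHE asks for a fresh cache."""
    return environ.get("TILELANG_CLEAR_CACHE", "0").strip().lower() in _TRUTHY


def init_cache(cache_dir: Path, environ=os.environ) -> Path:
    """Create the cache directory, wiping old contents first if requested."""
    if should_clear_cache(environ) and cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


if __name__ == "__main__":
    # Demonstration in a throwaway directory, so no real cache is touched.
    with tempfile.TemporaryDirectory() as tmp:
        cache = init_cache(Path(tmp) / "cache", {})
        (cache / "stale_kernel.so").write_text("placeholder")
        init_cache(cache, {"TILELANG_CLEAR_CACHE": "true"})
        print(sorted(p.name for p in cache.iterdir()))
```

Passing the environment as an argument keeps the check easy to exercise in tests, which matches the CI use case mentioned in the commit.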
-
- 06 Mar, 2025 2 commits
Chaofan Lin authored
* [Carver] Multi-Threads Compilation for Fast Auto Tuning
* Add progress bar for compilation
* lint
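As a rough illustration of multi-threaded compilation with progress reporting, the sketch below compiles a list of candidate configurations in a thread pool and prints a running counter. It uses only the standard library; `compile_all`, `compile_fn`, and the counter-style progress line are hypothetical stand-ins, and the actual commit may use a different pool or progress-bar library.

```python
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List, Sequence, Tuple


def compile_all(configs: Sequence[Any],
                compile_fn: Callable[[Any], Any],
                max_workers: int = 4) -> Tuple[Dict[int, Any], List[Tuple[int, Exception]]]:
    """Compile every config concurrently; return (results, failures) keyed by index."""
    results: Dict[int, Any] = {}
    failures: List[Tuple[int, Exception]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which config index each future belongs to.
        futures = {pool.submit(compile_fn, cfg): i for i, cfg in enumerate(configs)}
        for done, future in enumerate(as_completed(futures), start=1):
            idx = futures[future]
            try:
                results[idx] = future.result()
            except Exception as err:  # one bad config should not stop the sweep
                failures.append((idx, err))
            # Minimal progress indicator written to stderr.
            print(f"\rcompiled {done}/{len(futures)}", end="", file=sys.stderr)
    if futures:
        print(file=sys.stderr)
    return results, failures
```

Threads help here only if the compile step spends most of its time outside the Python interpreter, for example in a compiler subprocess; that is an assumption about the workload rather than something the commit states.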
-
Lei Wang authored
* [Refactor] Consolidate GemmWarpPolicy Enum and Add Utility Method
  - Move GemmWarpPolicy from copy.py and gemm.py to primitives/gemm/base.py
  - Implement from_warp_partition class method to determine warp policy
  - Add docstring with examples for policy determination
  - Remove duplicate GemmWarpPolicy class definitions
* [Enhancement] Add TensorCore Intrinsic Matrix Multiplication Benchmarks
  - Implement two new matrix multiplication benchmark scripts:
    1. `benchmark_matmul_intrinsic.py`: Uses TensorCore intrinsics with advanced configuration
    2. `benchmark_matmul.py`: Provides a more generic matrix multiplication benchmark
  - Add support for roller-based configuration generation in both benchmarks
  - Enhance MMA macro generator to handle 2D and 4D output buffer layouts
  - Implement flexible autotuning configurations with multiple parameters
  - Support different data types and accumulation modes
  - Add command-line arguments for matrix dimensions and roller configuration
* lint fix
* Fix roller hints generation in get_roller_hints_from_func
  - Simplify roller hints generation logic
  - Ensure policy-based configuration is always emitted when a policy is available
  - Remove redundant None check for roller hints
* Add shared memory for matrix multiplication in benchmark and quickstart examples
  - Modify benchmark_matmul.py and quickstart.py to include C_shared allocation
  - Change accumulation dtype from float16 to float in benchmark_matmul.py
  - Update matrix multiplication kernels to use shared memory for result storage
  - Enable CUDA kernel source printing in quickstart example
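To make the `from_warp_partition` idea concrete, here is a minimal sketch of an enum with a class method that infers a policy from a warp partition. Only the names `GemmWarpPolicy` and `from_warp_partition` appear in the commit message; the member names, their values, and the decision rule below are assumptions and may differ from what lives in primitives/gemm/base.py.

```python
from enum import IntEnum


class GemmWarpPolicy(IntEnum):
    # Member names and values are assumptions made for this sketch.
    Square = 0
    FullRow = 1
    FullCol = 2

    @classmethod
    def from_warp_partition(cls, m_warp: int, n_warp: int) -> "GemmWarpPolicy":
        """Pick a policy from the number of warps along M and N.

        Assumed rule: all warps along M means FullRow, all warps along N
        means FullCol, and anything else falls back to Square.
        """
        if m_warp < 1 or n_warp < 1:
            raise ValueError("m_warp and n_warp must be positive")
        if n_warp == 1 and m_warp > 1:
            return cls.FullRow
        if m_warp == 1 and n_warp > 1:
            return cls.FullCol
        return cls.Square
```

Keeping the enum and its helper in a single module is what lets the duplicate definitions in copy.py and gemm.py be dropped, as the commit describes.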
-