- 12 Apr, 2025 3 commits
-
-
Lei Wang authored
* Remove debug print statement from OptimizeForTarget function and enhance library loading mechanism in Cython adapter. Implemented file locking during cache access and added checks for library size before loading. Introduced temporary file handling for safer compilation of the Cython JIT adapter.
* Update comments in Cython adapter for clarity and consistency. Changed Chinese comments to English for better readability and understanding of the code's functionality, specifically regarding file handling and compilation processes.
* Refactor comments in Cython adapter for improved clarity. Updated the comment on cache file deletion for consistency and removed unnecessary whitespace in the file handling section.
-
Lei Wang authored
Remove debug print statement from OptimizeForTarget function and enhance library loading mechanism in Cython adapter. Implemented file locking during cache access and added checks for library size before loading. Introduced temporary file handling for safer compilation of the Cython JIT adapter. (#377)
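A minimal sketch of the locking, size-check, and temp-file pattern described above, assuming POSIX `fcntl` locks; `build_fn` stands in for the real Cython build step and all names here are illustrative, not the adapter's actual API.

```python
import ctypes
import fcntl
import os
import shutil
import tempfile

def load_cached_lib(lib_path: str, lock_path: str):
    """Load a cached shared library under an exclusive file lock."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # serialize cache access
        try:
            # Size check: skip truncated artifacts left by an interrupted build.
            if os.path.exists(lib_path) and os.path.getsize(lib_path) > 0:
                return ctypes.CDLL(lib_path)
            return None
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

def compile_to_cache(build_fn, lib_path: str) -> None:
    """Build into a temporary file, then move it into place so concurrent
    readers never observe a half-written library."""
    fd, tmp_path = tempfile.mkstemp(suffix=".so")
    os.close(fd)
    build_fn(tmp_path)   # stand-in for the Cython JIT compilation step
    shutil.move(tmp_path, lib_path)
```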
-
Lei Wang authored
* [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang
  - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
  - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
  - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
  - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.
* doc update
* [Add] Enhance AMD FlashMLA implementation and documentation
  - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation.
  - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
  - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.
* lint fix
* [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies
  - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
  - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling.
  - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
  - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
-
- 11 Apr, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Add variable check in GlobalMemChecker for safe memory access validation
  - Introduced a check in the GlobalMemChecker to determine if the index used in memory access has any variable components, enhancing the safety of memory access validation.
  - Updated the condition handling in store operations to ensure that only boolean conditions are processed, improving type safety and error handling in memory operations.
* [Refactor] Rename VecAllocAccess to TLVecAllocAccess and enhance buffer access handling
  - Renamed the `VecAllocAccess` class to `TLVecAllocAccess` for clarity in its purpose.
  - Improved the handling of buffer access by mutating extents and rewriting access in the body, ensuring compatibility with vectorized operations.
  - Added a TODO comment to suggest moving this pass to occur before StorageFlatten/FlattenBuffer for better optimization.
  - Introduced a print statement in the phase optimization process for debugging purposes.
* lint fix
-
pigKiller authored
* add hip setup support
* add env.find_hip func
* Delete install_hip.sh as we already have install_rocm.sh
* modify hip to rocm

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
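A hedged sketch of what an `env.find_hip` helper might look like, preferring explicit environment variables with a `hipcc`-based fallback; the actual implementation in TileLang's env module may differ.

```python
import os
import shutil
from typing import Optional

def find_hip() -> Optional[str]:
    """Locate a ROCm/HIP installation root, or return None if absent."""
    for var in ("ROCM_PATH", "HIP_PATH"):
        path = os.environ.get(var)
        if path and os.path.isdir(path):
            return path
    # Fall back to the compiler location: /opt/rocm/bin/hipcc -> /opt/rocm
    hipcc = shutil.which("hipcc")
    if hipcc:
        return os.path.dirname(os.path.dirname(hipcc))
    return "/opt/rocm" if os.path.isdir("/opt/rocm") else None
```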
-
Lei Wang authored
* [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks
  - Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements.
  - Implemented corresponding intrinsic calls for CUDA, enhancing the functionality of the TileLang framework.
  - Updated `allocate.py` to handle boolean types correctly in shared memory allocations.
  - Introduced tests for the new logical operations to ensure correctness and performance.
* lint fix

Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
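A hypothetical usage sketch: the `any_of`/`all_of` names come from the commit message, but the surrounding TileLang scaffolding and exact signatures shown here are assumptions.

```python
# Hypothetical sketch only: everything except any_of/all_of (taken from the
# commit) is an assumption about TileLang's scripting API at this point.
import tilelang.language as T

@T.prim_func
def kernel(mask: T.Buffer((128,), "bool"), out: T.Buffer((1,), "int32")):
    with T.Kernel(1, threads=128):
        mask_shared = T.alloc_shared((128,), "bool")
        T.copy(mask, mask_shared)
        if T.any_of(mask_shared):   # true if any element is set
            out[0] = 1
        if T.all_of(mask_shared):   # true only if every element is set
            out[0] = 2
```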
-
- 10 Apr, 2025 3 commits
-
-
Haodong Tian authored
* [Bugfix] Adjust Autotuner threadpool `max_workers` limit to available CPUs
* [Example] Small fix on example_blocksparse_gemm.py
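A minimal sketch of the cap described above; the names are illustrative and the Autotuner's actual pool construction differs in detail.

```python
import os
from concurrent.futures import ThreadPoolExecutor

requested = 64
max_workers = min(requested, os.cpu_count() or 1)  # never exceed available CPUs
pool = ThreadPoolExecutor(max_workers=max_workers)
```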
-
Lei Wang authored
* [Add] Introduce benchmark scripts for MLA decoding with AMD support
  - Added three new benchmark scripts: `benchmark_mla_decode_amd_tilelang.py`, `benchmark_mla_decode_amd_torch.py`, and `benchmark_mla_decode_amd_triton.py` to evaluate the performance of the MLA decoding mechanism across different frameworks.
  - Each script includes implementations for attention calculation, performance profiling, and output validation against reference implementations.
  - Enhanced command-line argument parsing for customizable input parameters, including batch size, number of heads, and dimensions.
  - Integrated performance comparison functionality to facilitate benchmarking between different implementations.
* lint fix
* lint fix

Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
-
Lei Wang authored
[Enhancement] Update kernel declaration pattern to support launch bounds in match_declare_kernel function (#366)
- Modified the regex pattern in `match_declare_kernel` to accommodate optional `__launch_bounds__` specifications, enhancing the function's ability to match kernel declarations accurately.
- This change improves the flexibility of kernel matching in the source code, allowing for more complex kernel definitions.

Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
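An illustrative reconstruction of the idea, making `__launch_bounds__` an optional group in the kernel-declaration pattern; the real regex in `match_declare_kernel` may differ.

```python
import re

pattern = re.compile(
    r"__global__\s+void\s+"
    r"(?:__launch_bounds__\(\s*\d+\s*(?:,\s*\d+\s*)?\)\s+)?"  # optional qualifier
    r"(?P<name>\w+)\s*\(")

for src in (
    "__global__ void main_kernel(float* A) {",
    "__global__ void __launch_bounds__(256) main_kernel(float* A) {",
):
    print(pattern.search(src).group("name"))  # -> main_kernel, both times
```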
-
- 09 Apr, 2025 5 commits
-
-
Lei Wang authored
* [Refactor] Update AutoTuner run method and timeout handling
  - Modified the `run` method to reduce the default timeout from 100 to 30 seconds for improved responsiveness.
  - Changed the `get_input_tensors_supply` call to disable output generation, enhancing performance during tensor supply retrieval.
  - Refactored the latency measurement to streamline the benchmarking process, ensuring proper timeout handling with `ThreadPoolExecutor`.
  - Added logging for timeout occurrences to aid in debugging and performance analysis.
* bug fix
* lint fix
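A minimal sketch of timeout-guarded benchmarking with `ThreadPoolExecutor`, assuming a profiler callable; the AutoTuner's actual structure differs in detail.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

logger = logging.getLogger(__name__)

def measure_with_timeout(bench_fn, timeout: float = 30.0):
    """Run bench_fn in a worker thread; give up after `timeout` seconds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(bench_fn)
        try:
            return future.result(timeout=timeout)
        except FuturesTimeout:
            # Note: the worker thread itself cannot be killed, only abandoned.
            logger.warning("benchmark timed out after %.1fs", timeout)
            return None
```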
-
Lei Wang authored
* [Bugfix] Correct dynamic shared memory size error handling in HIP wrapper
  - Updated the error handling logic in `PREDEF_ATTRIBUTE_SET_DYNAMIC_MEMORY_HIP` to check if the dynamic shared memory size exceeds the maximum limit of 65536.
  - Improved error message clarity by specifying the function name and the attempted size, ensuring better debugging information.
  - Ensured the function returns 0 upon successful setting of the dynamic shared memory size.
* [Add] Implement example for MLA decoding with AMD support
  - Introduced a new example script `example_mla_decode_amd.py` demonstrating the use of the flash attention mechanism with AMD hardware.
  - Implemented functions for attention calculation, including support for split processing and combining outputs.
  - Added command-line argument parsing for customizable input parameters such as batch size, number of heads, and dimensions.
  - Included a reference implementation for validation against the Tile-AI output, ensuring correctness of the implementation.
  - Enhanced performance profiling and output comparison for debugging and optimization purposes.
* lint fix
-
Lei Wang authored
- Updated the error handling logic in `PREDEF_ATTRIBUTE_SET_DYNAMIC_MEMORY_HIP` to check if the dynamic shared memory size exceeds the maximum limit of 65536.
- Improved error message clarity by specifying the function name and the attempted size, ensuring better debugging information.
- Ensured the function returns 0 upon successful setting of the dynamic shared memory size.
-
Yuqing Xia authored
[Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention (#361)
* Fix issue where threadblock with only invalid blocks produces incorrect output.
* fix score scale
* format
-
Yu Cheng authored
* Added a new example script `example_gemm_autotune.py` to demonstrate autotuning for matrix multiplication (GEMM) using TileLang.
* Implemented functions for generating configurations, selecting the best configuration, and benchmarking performance.
* Refactored the existing `matmul` function to support dynamic configuration parameters and improved kernel compilation.
* Updated the main execution block to include command-line argument parsing for matrix dimensions and autotuning options.
* Enhanced the example to validate results against a reference implementation, ensuring correctness in matrix multiplication operations.
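A sketch of the config-generation idea from this example, with illustrative parameter names and values; the script's actual search space may differ.

```python
import itertools

def gen_configs():
    """Enumerate candidate tile configurations for the autotuner."""
    block_Ms = [64, 128, 256]
    block_Ns = [64, 128, 256]
    block_Ks = [32, 64]
    num_stages = [1, 2, 3]
    return [
        dict(block_M=m, block_N=n, block_K=k, num_stages=s)
        for m, n, k, s in itertools.product(block_Ms, block_Ns, block_Ks, num_stages)
    ]

print(len(gen_configs()))  # 54 candidate configurations
```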
-
- 08 Apr, 2025 5 commits
-
-
Lei Wang authored
[Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on hopper (#357)
* [Enhancement] Add warp specialization configuration option and update related functionality
  - [Add] Introduced a new pass configuration option `kDisableWarpSpecialized` to control warp specialization behavior.
  - [Refactor] Updated `WarpSpecializedRewriter` and `WSCodeEmitter` to utilize the new configuration option, allowing for more flexible optimization strategies.
  - [Update] Modified the optimization pipeline in `phase.py` to include pipeline planning when warp specialization is disabled, enhancing performance with async copy.
  - [Documentation] Updated JIT compilation parameters to reflect the new configuration option for better clarity.
* lint fix
* [Add] Implement test for GEMM with warp specialization configuration
  - Introduced a new test file `test_tilelang_pass_config_disable_warp_specialized.py` to validate the functionality of the warp specialization configuration option.
  - Added a `run_gemm` function to execute matrix multiplication tests with and without warp specialization, ensuring correctness through profiling against reference results.
  - Included a specific test case for GEMM with float16 data types, enhancing test coverage for the new configuration feature.
* [Refactor] Improve formatting in test_tilelang_pass_config_disable_warp_specialized.py
  - Reformatted the `tilelang.compile` call in the `run_gemm` function for better readability by breaking it into multiple lines.
  - Added a blank line for improved code structure and clarity in the `test_gemm_f16f16f16_nn` function.
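A hypothetical usage sketch for the option introduced above; the Python-side key spelling is an assumption inferred from `kDisableWarpSpecialized` and may not match the codebase exactly.

```python
# Hypothetical usage; the pass-config key spelling is an assumption.
import tilelang

kernel = tilelang.compile(
    gemm_func,  # a TileLang PrimFunc, e.g. the GEMM exercised by run_gemm
    pass_configs={"tl.disable_warp_specialized": True},
)
```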
-
Yu Cheng authored
- Modified the `group_per_split_token_cast_to_fp8` function to support `bfloat16`, `float`, and `float16` data types.
- Updated local fragment allocations to use the new `accum_dtype` for consistency.
- Enhanced the main execution block to handle different tensor data types based on the specified `dtype`, improving flexibility in tensor operations.
-
Lei Wang authored
* [Add] Create Dockerfile for ROCm environment setup
  - Introduced a new Dockerfile for setting up a ROCm environment with PyTorch.
  - Configured the working directory and installed necessary packages including Miniconda, Python, and development tools.
  - Cloned the tilelang repository and executed the ROCm installation script.
  - Set environment variables for compatibility and performance optimization.
* [Remove] Delete Dockerfile for ROCm environment setup
  - Removed the Dockerfile used for setting up a ROCm environment with PyTorch.
  - Updated README to reflect changes in Docker image naming conventions for AMD GPU support.
-
Lei Wang authored
* [Refactor] Update import structure in benchmark_mla.py
  - Moved the import of `flash_mla` functions to the `run_flash_mla` function for better encapsulation.
  - Added a comment for `flashinfer` installation to clarify dependencies.
  - Cleaned up unused imports to enhance code readability.
* lint fix
-
Lei Wang authored
* [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py
  - Replaced global FrameStack instances with thread-local storage to prevent cross-thread interference.
  - Introduced `_get_let_stack` and `_get_current_stack` functions to manage thread-local FrameStack instances in LetFrame and KernelLaunchFrame classes.
  - Updated all relevant methods to utilize the new thread-local stacks, ensuring thread safety in frame management.
* lint fix
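A minimal sketch of the thread-local pattern described above; the real `_get_current_stack` manages FrameStack instances rather than plain lists.

```python
import threading

_local = threading.local()

def _get_current_stack():
    # Each thread lazily gets its own stack, so frames pushed on one
    # thread can never leak into, or be popped by, another thread.
    if not hasattr(_local, "stack"):
        _local.stack = []
    return _local.stack
```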
-
- 07 Apr, 2025 3 commits
-
-
alex_xiao authored
* [Dev] Add database mechanism to cache
* [Dev] Fix database cache and test for it
* [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment.
* [Refactor] Improve code formatting and readability in multiple files
* [Enhancement] Add execution backend options and improve kernel adapter initialization
* [Refactor] Rename cached function to cached_kernel and update related references
* [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case
* [Enhancement] Update kernel compilation to specify execution backend as "cython"
* [Refactor] Rename cached_kernel to cached and update references in the codebase
* [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code
* [Refactor] Clean up code formatting and improve readability in cache and adapter modules
* [Refactor] Remove unused imports
* [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety
* [Refactor] Update cached function calls to use PrimFunc and improve parameter handling
* [Refactor] Clean up import statements and improve code formatting in cache and kernel test files
* [Refactor] Update cache key generation to use function source code for hashing
* [Update] Update subproject commit for TVM
* [Update] Import inspect module in kernel_cache.py
* [Update] Change default execution backend to 'cython' in JITKernel
* redo tvm
* [Update] Add SHA256 hash for function parameters in KernelCache
* [Bugfix] fix merge error
* [Feat] Rearrange script for key generation
* [Bugfix] Delete extra files
* [Refactor] Improve code readability and formatting in kernel_cache.py
* [Refactor] Remove unused sorting function from KernelCache and simplify binary serialization
* Update submodule tvm
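A sketch of the source-based cache keying described above, combining the function source (via `inspect`) with a SHA256 over its parameters; the real KernelCache key likely includes more (e.g. target information), per the commits listed.

```python
import hashlib
import inspect

def make_cache_key(func, args) -> str:
    """Derive a stable cache key from a function's source and its arguments."""
    payload = inspect.getsource(func) + repr(args)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```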
-
Lei Wang authored
* [Enhancement] Update GEMM examples and autotuner for improved performance
  - Modified `example_gemm_intrinsics.py` to enhance matrix multiplication configurations, increasing warp sizes and adjusting data types for better performance.
  - Updated the kernel compilation process to utilize the new `tilelang.compile` method and improved latency measurement with the profiler.
  - Refactored `example_gemm.py` to include a new autotuning configuration and ensure consistency in latency checks against reference results.
  - Adjusted tensor supply generation in `tilelang/utils/tensor.py` to use `torch.randn` for better randomness in tensor initialization.
  - Enhanced the `JITContext` in `tilelang/autotuner/__init__.py` to replace the profiler with a kernel instance for performance measurement, improving the overall structure of the autotuner.
* bug fix
* fix
* [Enhancement] Update convolution tests and profiling assertions
  - Added a random seed setting for reproducibility in convolution tests.
  - Removed several redundant convolution test cases to streamline the testing process.
  - Updated the assertion in the matrix multiplication profiling to include a maximum mismatched ratio for improved accuracy in results.
  - Enabled the main testing function for better test execution.
* lint fix
-
Lei Wang authored
* [Refactor] Update GEMM Fragment Layout and Improve Matrix Multiplication Functionality
  - Adjusted the layout configuration in `gemm_layouts.cc` to correct the repetition parameters for warp and block layouts, enhancing the efficiency of the GEMM fragment generation.
  - Refactored the `matmul_rs` function in `test_tilelang_test_amd.py` to improve readability by restructuring the function signature and ensuring consistent formatting.
  - Updated the test execution call to run the new `test_gemm_rs_f16f32f32_nt` function, enhancing test coverage for the GEMM functionality.
* lint fix
* bugfix
-
- 06 Apr, 2025 4 commits
-
-
Yu Cheng authored
- Modified the `group_per_split_token_cast_to_fp8` function to include a conditional check for batch sizes, ensuring that the scaling factor is applied only when within the valid range. This change enhances the robustness of the FP8 conversion process for grouped per-split tokens.
-
Lei Wang authored
* [Refactor] Clean up whitespace in CUDA-related files
  - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
  - This change enhances the overall organization of the codebase without altering functionality.
* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
  - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
  - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
  - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
  - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
* lint fix
* Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation
  - Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
  - Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
  - Integrated the new transformation into the optimization pipeline in `phase.py`.
  - Documented the new transformation in `__init__.py` for clarity.
* lint fix
* Add configuration option for index bitwidth and remove ExpandIndexDataType transformation
  - Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
  - Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
  - Documented the new configuration option in the JIT compilation function's parameters for clarity.
  - Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.
* lint fix
* Refactor index bitwidth configuration handling
  - Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
  - Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.
* lint fix
* lint fix

Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
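A hypothetical usage sketch for the index-bitwidth option: the key `tl.config_index_bitwidth` is named in the commit message above, while the value 64 and the surrounding call are illustrative assumptions.

```python
# Hypothetical usage; 64 here is an illustrative value promoting index
# arithmetic to int64 for buffers whose flattened indices may overflow int32.
import tilelang

kernel = tilelang.compile(
    func,  # a TileLang PrimFunc
    pass_configs={"tl.config_index_bitwidth": 64},
)
```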
-
YizhaoGao authored
* [Example] Add triton block sparse gqa decode
* lint fix

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Enhance error checking in RegionOp and buffer_load_to_tile_region
  - Added detailed error messages to the index size check in `RegionOp` to aid debugging.
  - Implemented a check in `buffer_load_to_tile_region` to ensure the length of indices matches extents, with a fallback to expand extents if necessary. This improves robustness in handling buffer loads with mismatched dimensions.
* lint fix
-
- 05 Apr, 2025 4 commits
-
-
Yu Cheng authored
Implements FP8 type conversion functionality for grouped per-split tokens. The script includes several helper functions for handling tensor TMA alignment and FP8 conversion, enhancing support for FP8 data types and providing performance benchmarks. This change provides users with more flexible examples of FP8 operations.
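An illustrative per-tensor FP8 cast with amax-based scaling in PyTorch; the example's `group_per_split_token_cast_to_fp8` works per group and additionally handles TMA alignment. `torch.float8_e4m3fn` requires PyTorch 2.1 or newer.

```python
import torch

def cast_to_fp8(x: torch.Tensor):
    """Scale a tensor into the representable FP8 range and cast it."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max      # 448.0 for e4m3fn
    scale = x.abs().amax().clamp(min=1e-4) / fp8_max    # per-tensor scale
    return (x / scale).to(torch.float8_e4m3fn), scale
```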
-
yeh-sudo authored
This pull request includes a change to the `gemv.md` file. The change adjusts the title's heading level so the document's heading hierarchy is correct.
-
Lei Wang authored
* [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
  - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
  - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
  - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
  - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
* [Refactor] Clean up whitespace in CUDA-related files
  - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
  - This change enhances the overall organization of the codebase without altering functionality.
* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
  - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
  - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
  - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
  - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
* lint fix
* Update submodule and enhance FP8 type handling in CUDA codegen
  - Updated the TVM submodule to the latest commit.
  - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
  - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
  - Added error handling for missing configuration keys in the AutoTuner class.
* lint fix
* Remove print statement from example script
* lint fix
* fix

Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
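A rough sketch of querying device properties through the CUDA driver API with `ctypes` (Linux library name assumed); the `cuda_driver` module's real implementation may differ.

```python
import ctypes

libcuda = ctypes.CDLL("libcuda.so.1")  # on Windows this would be nvcuda.dll
libcuda.cuInit(0)

def get_device_name(device: int = 0) -> str:
    buf = ctypes.create_string_buffer(256)
    libcuda.cuDeviceGetName(buf, 256, device)
    return buf.value.decode()

def get_shared_memory_per_block(device: int = 0) -> int:
    value = ctypes.c_int()
    # 8 == CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK
    libcuda.cuDeviceGetAttribute(ctypes.byref(value), 8, device)
    return value.value
```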
-
Yuqing Xia authored
* add example gqa decode wgmma pipelined
* add sparse gqa
* support num split
* support num split
* add if condition
* add heuristic num split
* clean code
* add ref
* fix bug
* add torch ref
* fix bug
* integrate to torch
* symbolic
* clean mask
* rm actual_num_blocks
* clean code
* get num_sm via torch
* add sparse gqa decode example
* format
* rm example_gqa_decode_wgmma_pipelined.py
* Add license headers to example scripts
* format
* Remove commented-out cache disabling lines

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 04 Apr, 2025 6 commits
-
-
Lei Wang authored
* [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options
  - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
  - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
  - Expanded test coverage for GEMM with additional cases for transposition configurations.
  - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
  These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
* [Enhancement] Add Composable Kernel Path Handling in Environment Setup
  - Introduced support for the Composable Kernel by adding a new environment variable `TL_COMPOSABLE_KERNEL_PATH`.
  - Updated the environment setup to check for the existence of the Composable Kernel and log a warning if not found.
  - Modified the `LibraryGenerator` to include the Composable Kernel include directory during compilation for HIP targets.
  These changes improve the integration of the Composable Kernel into the TileLang environment, enhancing flexibility for users.
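A sketch of the environment-variable lookup described above, using the variable name from the commit message; the logger usage and exact messages are illustrative, and the real env-setup logic may differ.

```python
import logging
import os

logger = logging.getLogger(__name__)

COMPOSABLE_KERNEL_PATH = os.environ.get("TL_COMPOSABLE_KERNEL_PATH", "")
if not os.path.isdir(COMPOSABLE_KERNEL_PATH):
    logger.warning(
        "Composable Kernel not found; set TL_COMPOSABLE_KERNEL_PATH so HIP "
        "targets can add its include directory during compilation.")
```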
-
Yu Cheng authored
- Introduced a new local fragment for squared values to improve performance.
- Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency.
- Refactored the final multiplication step to operate on the local fragment instead of shared memory.
- Added a configuration option to the kernel compilation for better control over TMA lowering.

These changes enhance the efficiency and clarity of the RMS normalization implementation.
-
Lei Wang authored
[Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options (#331)
- Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
- Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
- Expanded test coverage for GEMM with additional cases for transposition configurations.
- Corrected index usage in `gemm.h` to ensure proper matrix layout handling.

These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
-
Lei Wang authored
- Added a manual seed for reproducibility in PyTorch.
- Refactored local variable allocations for better memory management.
- Enhanced parallel processing in the flashattn function to improve performance.
- Updated layout annotations for clarity and efficiency.

These changes optimize the flash attention mechanism and ensure consistent behavior across runs.
-
Zhengju Tang authored
* [Dynamic Symbolic] Adaptively vectorize with different condition expressions
* Format
* Format
* Format
* Format
* Add MIT License headers to Python files
* Simplify return statement in loop vectorization

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Lei Wang authored
* [Enhancement] Update GEMM and ROCm Integration
  - Removed the restriction on transposing matrix B for CDNA in `gemm.cc`, allowing for more flexible matrix operations.
  - Added a new debug header file `debug.h` for enhanced debugging capabilities in ROCm kernels.
  - Updated `codegen_hip.cc` to include the new debug header and improved handling of float16 and bfloat16 types in vector element stores.
  - Refactored `rt_mod_hip.cc` to return a ROCM module directly from `BuildTileLangHIPWithoutCompile`, enhancing the module creation process.
  - Introduced a new ROCm utility in `rocm.py` for linking and managing ROCm paths, improving the build process for ROCm applications.
  - Updated tests to reflect changes in GEMM configurations and ensure compatibility with the new features.
  These changes enhance the flexibility and debugging capabilities of the GEMM operations and improve the integration with the ROCm backend.
* [Fix] Corrected syntax error in pyproject.toml and improved error message formatting in rocm.py
  - Added missing quotation mark for "HSA" in the `select` section of `pyproject.toml`.
  - Simplified the error message formatting in `get_rocm_arch` function of `rocm.py` for better readability and consistency.
* lint fix
* Update tilelang/jit/adapter/wrapper.py
* lint fix

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
-
- 03 Apr, 2025 4 commits
-
-
botbw authored
* [bug] fix T.abs on float16
* [lint] lint
-
Lei Wang authored
* [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
  - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
  - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
  - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
  - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
* [Refactor] Clean up whitespace in CUDA-related files
  - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
  - This change enhances the overall organization of the codebase without altering functionality.
* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
  - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
  - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
  - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
  - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
* lint fix

Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
-
Chunan Zeng authored
-
yyttt6 authored
* refactor autotune
* refactor autotune
* refactor autotune
* refactor autotune
* format init.py
* add tutorial for autotune
* merge
* merge
* format analyzer
* add readme for analyzer
* format
* [Tools] Summarize TFLOPS Information from a tilelang program
* Summarize TFLOPS Information from a tilelang program
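A minimal sketch of how a TFLOPS summary for a GEMM can be computed: a matmul performs 2 * M * N * K floating-point operations (one multiply and one add per inner-product term), so throughput follows from the measured latency. The function name is illustrative, not the tool's actual API.

```python
def gemm_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
    """Throughput in TFLOPS for an M x K @ K x N matmul at a given latency."""
    return 2 * M * N * K / (latency_ms * 1e-3) / 1e12

print(gemm_tflops(4096, 4096, 4096, latency_ms=1.0))  # ~137.4 TFLOPS
```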
-