1. 08 Apr, 2025 5 commits
    • Lei Wang's avatar
      [Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on Hopper (#357) · 7fdcedd0
      Lei Wang authored
      * [Enhancement] Add warp specialization configuration option and update related functionality
      
      * [Add] Introduced a new pass configuration option `kDisableWarpSpecialized` to control warp specialization behavior.
      * [Refactor] Updated `WarpSpecializedRewriter` and `WSCodeEmitter` to utilize the new configuration option, allowing for more flexible optimization strategies.
      * [Update] Modified the optimization pipeline in `phase.py` to include pipeline planning when warp specialization is disabled, enhancing performance with async copy.
      * [Documentation] Updated JIT compilation parameters to reflect the new configuration option for better clarity.
      
      * lint fix
      
      * [Add] Implement test for GEMM with warp specialization configuration
      
      * Introduced a new test file `test_tilelang_pass_config_disable_warp_specialized.py` to validate the functionality of the warp specialization configuration option.
      * Added a `run_gemm` function to execute matrix multiplication tests with and without warp specialization, ensuring correctness through profiling against reference results.
      * Included a specific test case for GEMM with float16 data types, enhancing test coverage for the new configuration feature.
      
      * [Refactor] Improve formatting in test_tilelang_pass_config_disable_warp_specialized.py
      
      * Reformatted the `tilelang.compile` call in the `run_gemm` function for better readability by breaking it into multiple lines.
      * Added a blank line for improved code structure and clarity in the `test_gemm_f16f16f16_nn` function.
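      A minimal sketch of using the new option: the kernel below follows TileLang's standard GEMM example, and the exact config key string `tl.disable_warp_specialized` as well as the `pass_configs`/`out_idx` arguments to `tilelang.compile` are assumptions based on the description above, not taken verbatim from the patch.

      ```python
      import tilelang
      import tilelang.language as T

      def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):
          @T.prim_func
          def main(
              A: T.Buffer((M, K), dtype),
              B: T.Buffer((K, N), dtype),
              C: T.Buffer((M, N), dtype),
          ):
              with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
                  A_shared = T.alloc_shared((block_M, block_K), dtype)
                  B_shared = T.alloc_shared((block_K, block_N), dtype)
                  C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
                  T.clear(C_local)
                  for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                      T.copy(A[by * block_M, k * block_K], A_shared)
                      T.copy(B[k * block_K, bx * block_N], B_shared)
                      T.gemm(A_shared, B_shared, C_local)
                  T.copy(C_local, C[by * block_M, bx * block_N])
          return main

      # Assumed key name derived from kDisableWarpSpecialized; on Hopper this keeps
      # the software-pipelined (async-copy) schedule instead of the auto
      # warp-specialized one.
      kernel = tilelang.compile(
          matmul(1024, 1024, 1024, 128, 128, 32),
          out_idx=-1,  # return C as the kernel output
          pass_configs={"tl.disable_warp_specialized": True},
      )
      ```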
      7fdcedd0
    • Yu Cheng's avatar
      [Enhancement] Update group_per_split_token_cast_to_fp8 to support multiple data types (#356) · a686f0f1
      Yu Cheng authored
      - Modified the `group_per_split_token_cast_to_fp8` function to support `bfloat16`, `float`, and `float16` data types.
      - Updated local fragment allocations to use the new `accum_dtype` for consistency.
      - Enhanced the main execution block to handle different tensor data types based on the specified `dtype`, improving flexibility in tensor operations.
      a686f0f1
    • Lei Wang's avatar
      [AMD][Docker] Create Dockerfile for ROCm environment setup (#355) · c58cbfbb
      Lei Wang authored
      * [Add] Create Dockerfile for ROCm environment setup
      
      * Introduced a new Dockerfile for setting up a ROCm environment with PyTorch.
      * Configured the working directory and installed necessary packages including Miniconda, Python, and development tools.
      * Cloned the tilelang repository and executed the ROCm installation script.
      * Set environment variables for compatibility and performance optimization.
      
      * [Remove] Delete Dockerfile for ROCm environment setup
      
      * Removed the Dockerfile used for setting up a ROCm environment with PyTorch.
      * Updated README to reflect changes in Docker image naming conventions for AMD GPU support.
      c58cbfbb
    • Lei Wang's avatar
      [Typo] Replace `kernel.func` with `kernel` in mla benchmark scripts (#354) · 6d44c465
      Lei Wang authored
      * [Refactor] Update import structure in benchmark_mla.py
      
      - Moved the import of `flash_mla` functions to the `run_flash_mla` function for better encapsulation.
      - Added a comment for `flashinfer` installation to clarify dependencies.
      - Cleaned up unused imports to enhance code readability.
      
      * lint fix
      6d44c465
    • Lei Wang's avatar
      [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py (#352) · 7aa34977
      Lei Wang authored
      * [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py
      
      - Replaced global FrameStack instances with thread-local storage to prevent cross-thread interference.
      - Introduced `_get_let_stack` and `_get_current_stack` functions to manage thread-local FrameStack instances in LetFrame and KernelLaunchFrame classes.
      - Updated all relevant methods to utilize the new thread-local stacks, ensuring thread safety in frame management.
      
      * lint fix
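      A minimal sketch of the thread-local stack pattern described above (class and helper names are illustrative, not the exact tilelang code):

      ```python
      import threading
      from collections import deque

      class FrameStack:
          """Simple LIFO stack of frames (stand-in for tilelang's FrameStack)."""

          def __init__(self):
              self._stack = deque()

          def push(self, frame):
              self._stack.append(frame)

          def pop(self):
              return self._stack.pop()

          def top(self):
              return self._stack[-1] if self._stack else None

      # One storage slot per thread instead of a module-level global instance.
      _local = threading.local()

      def _get_current_stack() -> FrameStack:
          # Each thread lazily creates its own FrameStack, so frames pushed by
          # one thread can never be popped or observed by another.
          if not hasattr(_local, "frame_stack"):
              _local.frame_stack = FrameStack()
          return _local.frame_stack
      ```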
      7aa34977
  2. 07 Apr, 2025 3 commits
    • alex_xiao's avatar
      [Bugfix] `compile`/`cached` still not loading cached kernel for the example in example_mha_bwd (#339) · 8dc1d7df
      alex_xiao authored
      * [Dev] Add database mechanism to cache
      
      * [Dev] Fix database cache and test for it
      
      * [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment.
      
      * [Refactor] Improve code formatting and readability in multiple files
      
      * [Enhancement] Add execution backend options and improve kernel adapter initialization
      
      * [Refactor] Rename cached function to cached_kernel and update related references
      
      * [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case
      
      * [Enhancement] Update kernel compilation to specify execution backend as "cython"
      
      * [Refactor] Rename cached_kernel to cached and update references in the codebase
      
      * [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code
      
      * [Refactor] Clean up code formatting and improve readability in cache and adapter modules
      
      * [Refactor] Remove unused imports
      
      * [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety
      
      * [Refactor] Update cached function calls to use PrimFunc and improve parameter handling
      
      * [Refactor] Clean up import statements and improve code formatting in cache and kernel test files
      
      * [Refactor] Update cache key generation to use function source code for hashing
      
      * [Update] Update subproject commit for TVM
      
      * [Update] Import inspect module in kernel_cache.py
      
      * [Update] Change default execution backend to 'cython' in JITKernel
      
      * redo tvm
      
      * [Update] Add SHA256 hash for function parameters in KernelCache
      
      * [Bugfix] fix merge error
      
      * [Feat] Rearrange script for key generation
      
      * [Bugfix] Delete extra files
      
      * [Refactor] Improve code readability and formatting in kernel_cache.py
      
      * [Refactor] Remove unused sorting function from KernelCache and simplify binary serialization
      
      * Update submodule tvm
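      A sketch of the key scheme the bullets above describe, i.e. hashing the function's source together with a SHA256 digest of its parameters (the real `KernelCache` operates on TVM `PrimFunc`s and may fold in additional fields such as target or backend):

      ```python
      import hashlib
      import inspect

      def make_cache_key(func, *args, **kwargs) -> str:
          """Hash the kernel's source code plus its stringified parameters,
          so editing the function body (not just renaming it) invalidates the cache."""
          source = inspect.getsource(func)
          params = repr((args, sorted(kwargs.items())))
          digest = hashlib.sha256()
          digest.update(source.encode("utf-8"))
          digest.update(params.encode("utf-8"))
          return digest.hexdigest()
      ```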
      8dc1d7df
    • Lei Wang's avatar
      [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler (#344) · f005db9f
      Lei Wang authored
      * [Enhancement] Update GEMM examples and autotuner for improved performance
      
      - Modified `example_gemm_intrinsics.py` to enhance matrix multiplication configurations, increasing warp sizes and adjusting data types for better performance.
      - Updated the kernel compilation process to utilize the new `tilelang.compile` method and improved latency measurement with the profiler.
      - Refactored `example_gemm.py` to include a new autotuning configuration and ensure consistency in latency checks against reference results.
      - Adjusted tensor supply generation in `tilelang/utils/tensor.py` to use `torch.randn` for better randomness in tensor initialization.
      - Enhanced the `JITContext` in `tilelang/autotuner/__init__.py` to replace the profiler with a kernel instance for performance measurement, improving the overall structure of the autotuner.
      
      * bug fix
      
      * fix
      
      * [Enhancement] Update convolution tests and profiling assertions
      
      - Added a random seed setting for reproducibility in convolution tests.
      - Removed several redundant convolution test cases to streamline the testing process.
      - Updated the assertion in the matrix multiplication profiling to include a maximum mismatched ratio for improved accuracy in results.
      - Enabled the main testing function for better test execution.
      
      * lint fix
      f005db9f
    • Lei Wang's avatar
      [Bugfix] Fix Transposed Fragment Layout for AMD GEMM_RS matrix core (#346) · 0acb8586
      Lei Wang authored
      * [Refactor] Update GEMM Fragment Layout and Improve Matrix Multiplication Functionality
      
      - Adjusted the layout configuration in `gemm_layouts.cc` to correct the repetition parameters for warp and block layouts, enhancing the efficiency of the GEMM fragment generation.
      - Refactored the `matmul_rs` function in `test_tilelang_test_amd.py` to improve readability by restructuring the function signature and ensuring consistent formatting.
      - Updated the test execution call to run the new `test_gemm_rs_f16f32f32_nt` function, enhancing test coverage for the GEMM functionality.
      
      * lint fix
      
      * bugfix
      0acb8586
  3. 06 Apr, 2025 4 commits
    • Yu Cheng's avatar
      [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 (#345) · 847a461b
      Yu Cheng authored
      - Modified the `group_per_split_token_cast_to_fp8` function to include a conditional check for batch sizes, ensuring that the scaling factor is applied only when within the valid range. This change enhances the robustness of the FP8 conversion process for grouped per-split tokens.
      847a461b
    • Lei Wang's avatar
      [Enhancement] Support index bit width configuration (#343) · 70546adc
      Lei Wang authored
      
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation
      
      - Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
      - Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
      - Integrated the new transformation into the optimization pipeline in `phase.py`.
      - Documented the new transformation in `__init__.py` for clarity.
      
      * lint fix
      
      * Add configuration option for index bitwidth and remove ExpandIndexDataType transformation
      
      - Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
      - Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
      - Documented the new configuration option in the JIT compilation function's parameters for clarity.
      - Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.
      
      * lint fix
      
      * Refactor index bitwidth configuration handling
      
      - Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
      - Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.
      
      * lint fix
      
      * lint fix
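      A minimal sketch of using the new option; the kernel follows TileLang's elementwise-add example pattern, and the `pass_configs`/`out_idx` plumbing below is assumed to match `tilelang.compile`'s interface:

      ```python
      import tilelang
      import tilelang.language as T

      def elementwise_add(M, N, block_M=128, block_N=128, dtype="float16"):
          @T.prim_func
          def main(
              A: T.Buffer((M, N), dtype),
              B: T.Buffer((M, N), dtype),
              C: T.Buffer((M, N), dtype),
          ):
              with T.Kernel(T.ceildiv(M, block_M), T.ceildiv(N, block_N), threads=128) as (bx, by):
                  for i, j in T.Parallel(block_M, block_N):
                      C[bx * block_M + i, by * block_N + j] = (
                          A[bx * block_M + i, by * block_N + j]
                          + B[bx * block_M + i, by * block_N + j])
          return main

      # `tl.config_index_bitwidth` is the key documented above; 64 asks the
      # ConfigIndexBitwidth pass to promote index arithmetic to int64, which
      # matters once a flattened buffer offset can exceed the int32 range
      # (the 32768 x 65536 shape below has exactly 2**31 elements).
      kernel = tilelang.compile(
          elementwise_add(32768, 65536),
          out_idx=-1,
          pass_configs={"tl.config_index_bitwidth": 64},
      )
      ```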
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
      70546adc
    • YizhaoGao's avatar
      [Example] Add triton block sparse gqa decode (#341) · bee5618e
      YizhaoGao authored
      
      
      * [Example] Add triton block sparse gqa decode
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      bee5618e
    • Lei Wang's avatar
      [Enhancement] Support region padding when convert buffer load to buffer region (#342) · 10804a0d
      Lei Wang authored
      * Enhance error checking in RegionOp and buffer_load_to_tile_region
      
      - Added detailed error messages to the index size check in `RegionOp` to aid debugging.
      - Implemented a check in `buffer_load_to_tile_region` to ensure the length of indices matches extents, with a fallback to expand extents if necessary. This improves robustness in handling buffer loads with mismatched dimensions.
      
      * lint fix
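      A toy Python rendition of the fallback described above; the real check lives in C++ (`buffer_load_to_tile_region`), and the exact padding convention (prepending size-1 extents) is an assumption of this sketch:

      ```python
      def pad_extents(indices, extents):
          """Pad a region's extents so their count matches the number of load
          indices, treating the missing leading dimensions as size-1 regions."""
          if len(extents) > len(indices):
              raise ValueError(
                  f"region has {len(extents)} extents but only {len(indices)} indices")
          missing = len(indices) - len(extents)
          return [1] * missing + list(extents)

      # e.g. a pointwise load A[i, j] used where a 1-D tile region is expected:
      print(pad_extents(indices=["i", "j"], extents=[16]))  # -> [1, 16]
      ```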
      10804a0d
  4. 05 Apr, 2025 4 commits
    • Yu Cheng's avatar
      [Dev] Add Group Cast FP8 Example (#338) · 73885cfd
      Yu Cheng authored
      Implements FP8 type conversion functionality for grouped per-split tokens. The script includes several helper functions for handling tensor TMA alignment and FP8 conversion, enhancing support for FP8 data types and providing performance benchmarks. This change provides users with more flexible examples of FP8 operations.
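      For orientation, a plain PyTorch reference of the per-token FP8 cast that the example implements in TileLang; the grouped/per-split variant additionally partitions tokens by group and pads shapes to TMA alignment, and the epsilon clamp below is an assumption of this sketch:

      ```python
      import torch

      def per_token_cast_to_fp8_ref(x: torch.Tensor):
          """Scale each row by its absolute maximum so it fits the
          float8_e4m3 range (max 448), then cast and return (values, scales)."""
          fp8_max = 448.0
          amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
          scale = amax / fp8_max
          x_fp8 = (x / scale).to(torch.float8_e4m3fn)
          return x_fp8, scale

      x = torch.randn(4, 8192, dtype=torch.bfloat16, device="cuda")
      y, s = per_token_cast_to_fp8_ref(x.float())
      ```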
      73885cfd
    • yeh-sudo's avatar
      [Doc] Fix typo and heading level in GEMV tutorial (#337) · 17386d7d
      yeh-sudo authored
      This pull request includes a change to the `gemv.md` file. The change corrects the heading level of the document's title so that the heading hierarchy is right.
      17386d7d
    • Lei Wang's avatar
      [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen (#323) · 89725f7f
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Update submodule and enhance FP8 type handling in CUDA codegen
      
      - Updated the TVM submodule to the latest commit.
      - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
      - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
      - Added error handling for missing configuration keys in the AutoTuner class.
      
      * lint fix
      
      * Remove print statement from example script
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
      89725f7f
    • Yuqing Xia's avatar
      [Example] Add sparse gqa decode example (#332) · 8fdfdf03
      Yuqing Xia authored
      
      
      * add example gqa decode wgmma pipelined
      
      * add sparse gqa
      
      * support num split
      
      * support num split
      
      * add if condition
      
      * add heuristic num split
      
      * clean code
      
      * add ref
      
      * fix bug
      
      * add torch ref
      
      * fix bug
      
      * integrate to torch
      
      * symbolic
      
      * clean mask
      
      * rm actual_num_blocks
      
      * clean code
      
      * get num_sm via torch
      
      * add sparse gqa decode example
      
      * format
      
      * rm example_gqa_decode_wgmma_pipelined.py
      
      * Add license headers to example scripts
      
      * format
      
      * Remove commented-out cache disabling lines
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      8fdfdf03
  5. 04 Apr, 2025 6 commits
    • Lei Wang's avatar
      [AMD] Fix missing Composable Kernel include path when compiling kernels on AMD GPUs (#334) · eb757608
      Lei Wang authored
      * [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options
      
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
      
      * [Enhancement] Add Composable Kernel Path Handling in Environment Setup
      
      - Introduced support for the Composable Kernel by adding a new environment variable `TL_COMPOSABLE_KERNEL_PATH`.
      - Updated the environment setup to check for the existence of the Composable Kernel and log a warning if not found.
      - Modified the `LibraryGenerator` to include the Composable Kernel include directory during compilation for HIP targets.
      
      These changes improve the integration of the Composable Kernel into the TileLang environment, enhancing flexibility for users.
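      A sketch of the environment handling these bullets describe; `TL_COMPOSABLE_KERNEL_PATH` comes from the commit, while the surrounding logic and flag construction are assumptions:

      ```python
      import logging
      import os

      ck_path = os.environ.get("TL_COMPOSABLE_KERNEL_PATH", "")

      if not ck_path or not os.path.isdir(ck_path):
          logging.warning(
              "Composable Kernel not found; set TL_COMPOSABLE_KERNEL_PATH to its "
              "install prefix to make its headers available for HIP targets.")
          ck_include_flags = []
      else:
          # The LibraryGenerator would append this include directory when
          # compiling kernels for HIP targets.
          ck_include_flags = [f"-I{os.path.join(ck_path, 'include')}"]
      ```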
      eb757608
    • Yu Cheng's avatar
      [Refactor] Optimize RMS normalization kernel in rms_norm.py (#333) · 85e411c8
      Yu Cheng authored
      - Introduced a new local fragment for squared values to improve performance.
      - Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency.
      - Refactored the final multiplication step to operate on the local fragment instead of shared memory.
      - Added a configuration option to the kernel compilation for better control over TMA lowering.
      
      These changes enhance the efficiency and clarity of the RMS normalization implementation.
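      As a reference for what the kernel computes, a plain PyTorch RMS normalization (no learnable weight; the epsilon value is an assumption of this sketch):

      ```python
      import torch

      def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
          """Divide each row by the root mean square of its elements; the
          TileLang kernel now accumulates the squared values in a register
          fragment before this final multiply."""
          return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
      ```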
      85e411c8
    • Lei Wang's avatar
      [Enhancement] Add new matrix multiplication functions and tests for GEMM with transpose options (#331) · 9e5a757e
      Lei Wang authored
      - Introduced `matmul_rs` function for flexible matrix multiplication with optional transposition.
      - Added `run_gemm_rs` function to facilitate testing of the new matrix multiplication implementation.
      - Expanded test coverage for GEMM with additional cases for transposition configurations.
      - Corrected index usage in `gemm.h` to ensure proper matrix layout handling.
      
      These changes enhance the GEMM functionality and improve testing capabilities for various matrix configurations.
      9e5a757e
    • Lei Wang's avatar
      [Enhancement] Improve flashattn function in example_gqa_decode.py (#329) · 32060ecd
      Lei Wang authored
      - Added a manual seed for reproducibility in PyTorch.
      - Refactored local variable allocations for better memory management.
      - Enhanced parallel processing in the flashattn function to improve performance.
      - Updated layout annotations for clarity and efficiency.
      
      These changes optimize the flash attention mechanism and ensure consistent behavior across runs.
      32060ecd
    • Zhengju Tang's avatar
      [Dynamic Symbolic] Adaptively vectorize with different condition expressions (#326) · 5ee58ec7
      Zhengju Tang authored
      
      
      * [Dynamic Symbolic] Adaptively vectorize with different condition expressions
      
      * Format
      
      * Format
      
      * Format
      
      * Format
      
      * Add MIT License headers to Python files
      
      * Simplify return statement in loop vectorization
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      5ee58ec7
    • Lei Wang's avatar
      [AMD] Adapt ROCm and support `T.gemm` with transpose_b=False for the AMD backend (#327) · eab47249
      Lei Wang authored
      
      
      * [Enhancement] Update GEMM and ROCm Integration
      
      - Removed the restriction on transposing matrix B for CDNA in `gemm.cc`, allowing for more flexible matrix operations.
      - Added a new debug header file `debug.h` for enhanced debugging capabilities in ROCm kernels.
      - Updated `codegen_hip.cc` to include the new debug header and improved handling of float16 and bfloat16 types in vector element stores.
      - Refactored `rt_mod_hip.cc` to return a ROCM module directly from `BuildTileLangHIPWithoutCompile`, enhancing the module creation process.
      - Introduced a new ROCm utility in `rocm.py` for linking and managing ROCm paths, improving the build process for ROCm applications.
      - Updated tests to reflect changes in GEMM configurations and ensure compatibility with the new features.
      
      These changes enhance the flexibility and debugging capabilities of the GEMM operations and improve the integration with the ROCm backend.
      
      * [Fix] Corrected syntax error in pyproject.toml and improved error message formatting in rocm.py
      
      - Added missing quotation mark for "HSA" in the `select` section of `pyproject.toml`.
      - Simplified the error message formatting in `get_rocm_arch` function of `rocm.py` for better readability and consistency.
      
      * lint fix
      
      * Update tilelang/jit/adapter/wrapper.py
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      
      * lint fix
      
      ---------
      Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
      eab47249
  6. 03 Apr, 2025 5 commits
    • botbw's avatar
      [Bugfix] add a patch to fix T.abs on float16 (#325) · 2cec52aa
      botbw authored
      * [bug] fix T.abs on float16
      
      * [lint] lint
      2cec52aa
    • Lei Wang's avatar
      [Feat] Enhance CUDA Property Handling (#322) · c0378aa9
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
      c0378aa9
    • Chunan Zeng's avatar
    • yyttt6's avatar
      [Tools] Summarize TFLOPS Information from a tilelang program (#321) · 853898a7
      yyttt6 authored
      * refactor autotune
      
      * refactor autotune
      
      * refactor autotune
      
      * refactor autotune
      
      * format init.py
      
      * add tutorial for autotune
      
      * merge
      
      * merge
      
      * format analyzer
      
      * add readme for analyzer
      
      * format
      
      * [Tools] Summarize TFLOPS Information from a tilelang program
      
      * Summarize TFLOPS Information from a tilelang program
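      For context, the core arithmetic behind such a summary is simple; a sketch for a GEMM workload (the analyzer's actual workload detection and report format are its own):

      ```python
      def gemm_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
          """Achieved TFLOPS of an M x N x K GEMM: one multiply and one add
          per MAC gives 2*M*N*K floating-point operations."""
          flops = 2.0 * M * N * K
          return flops / (latency_ms * 1e-3) / 1e12

      # e.g. a 4096^3 GEMM measured at 1.2 ms -> ~114.5 TFLOPS
      print(f"{gemm_tflops(4096, 4096, 4096, 1.2):.1f} TFLOPS")
      ```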
      853898a7
    • Yu Cheng's avatar
      [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support (#320) · 4b705eb2
      Yu Cheng authored
      * [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support
      
      * Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation.
      * Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation.
      * Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h.
      * Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers.
      * Updated tilelang.language module to include the new `reduce_absmax` function.
      
      These changes enhance FP8 quantization capabilities and extend reduction operation support.
      
      * [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization
      
      * Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations.
      * Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy.
      * Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking.
      
      These changes enhance the flexibility and efficiency of the FP8 quantization process.
      
      * lint
      
      * [Dev] Update per_token_cast_fp8.py
      4b705eb2
  7. 02 Apr, 2025 1 commit
    • Lei Wang's avatar
      [CostModel] Introduce cuda driver api to get precise shared memory capacity (#317) · 3b660b67
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
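      A standalone sketch of the kind of query the new module wraps, reading the opt-in shared-memory-per-block limit straight from the CUDA driver via ctypes (tilelang exposes this through `get_shared_memory_per_block`; the ctypes wiring below is illustrative, not the module's actual code):

      ```python
      import ctypes

      # CUdevice_attribute enum value for the opt-in per-block shared memory limit.
      CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN = 97

      def shared_memory_per_block_optin(device_ordinal: int = 0) -> int:
          cuda = ctypes.CDLL("libcuda.so")  # may be "libcuda.so.1" on some systems
          assert cuda.cuInit(0) == 0
          device = ctypes.c_int()
          assert cuda.cuDeviceGet(ctypes.byref(device), device_ordinal) == 0
          value = ctypes.c_int()
          assert cuda.cuDeviceGetAttribute(
              ctypes.byref(value),
              CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN,
              device) == 0
          return value.value  # in bytes

      print(shared_memory_per_block_optin())
      ```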
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
      3b660b67
  8. 01 Apr, 2025 2 commits
  9. 31 Mar, 2025 5 commits
    • Lei Wang's avatar
      [Bugfix] Fix layout conflict issue for gqa decoding examples (#314) · 0fd82ed5
      Lei Wang authored
      * Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.
      
      * Refactor flashattn example to improve CUDA configuration handling
      
      - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
      - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
      - Adjusted the `do_bench` method call to streamline performance profiling.
      
      * lint fix
      0fd82ed5
    • Lei Wang's avatar
      [Bugfix] Fix dynamic axis with variable extent (#311) · c30904ea
      Lei Wang authored
      * [Enhancement] Improve error message for RampNode in CUDA codegen
      
      - Updated the error message in the VisitExpr_ method for RampNode to include the specific Ramp node and lane count when the lane count exceeds the limit of 4. This change enhances debugging by providing clearer context for the error.
      - Refactored the loop vectorization logic in loop_vectorize_dynamic.cc to improve readability and maintainability, ensuring that dynamic vectorization checks are performed correctly and efficiently.
      
      * lint fix
      c30904ea
    • Lei Wang's avatar
      [Bugfix] Updated autotune usage in the examples to align with the latest changes (#309) · 66c7f6a1
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
      
      * lint fix
      
      * typofix
      
      * [Refactor] Update matmul and flashattn function calls to return structured results
      
      - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
      - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
      
      * lint fix
      66c7f6a1
    • Lei Wang's avatar
      [Cache] Implement in-memory cache (#308) · 5802c01b
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
      
      * lint fix
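      A toy sketch of the memory-then-disk lookup described above (names and the pickle-based on-disk format are illustrative, not tilelang's actual KernelCache):

      ```python
      import os
      import pickle

      class KernelCacheSketch:
          def __init__(self, cache_dir: str):
              self.cache_dir = cache_dir
              self._memory_cache = {}

          def get(self, key: str):
              if key in self._memory_cache:          # fast path: no disk access
                  return self._memory_cache[key]
              path = os.path.join(self.cache_dir, key + ".pkl")
              if os.path.exists(path):
                  with open(path, "rb") as f:
                      kernel = pickle.load(f)
                  self._memory_cache[key] = kernel   # warm the in-memory cache
                  return kernel
              return None

          def put(self, key: str, kernel) -> None:
              self._memory_cache[key] = kernel
              os.makedirs(self.cache_dir, exist_ok=True)
              with open(os.path.join(self.cache_dir, key + ".pkl"), "wb") as f:
                  pickle.dump(kernel, f)

          def clear_cache(self) -> None:
              self._memory_cache.clear()             # disk entries would also be removed
      ```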
      5802c01b
    • Wenhao Xie's avatar
  10. 30 Mar, 2025 4 commits
    • Lei Wang's avatar
      [Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      edbb9b6d
    • Leslin's avatar
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError (#305) · 6e294de9
      Leslin authored
      
      
      * Update elementwise_add.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Update rms_norm.py
      
      [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
      
      * Remove adapter argument from do_bench call
      
      * Remove adapter argument from do_bench call
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      6e294de9
    • Haodong Tian's avatar
      [Bugfix] Resolve autotuner bugs for blocksparse GEMM example (#300) · 92e8d5f4
      Haodong Tian authored
      * [Bugfix] Configure an autotuner-specific logger for correct level handling
      - Previously, logging relied on basicConfig, which configured the root logger. This caused the named autotuner logger to ignore DEBUG messages.
      - This commit sets up a dedicated logger for the autotuner, correctly routing DEBUG messages to 'autotuner.log' and INFO+ messages to the console (see the sketch after this commit's change list).
      
      * [Bugfix] Fix tensor_supply for boolean type
      - Previously `get_tensor_supply` used `torch.randint(-2, 3)` as a fallback, which caused an error when the dtype was `torch.bool`.
      - This commit adds an `is_boolean` check in `KernelParam` and updates `get_tensor_supply` to specifically use `torch.randint(0, 2)` for boolean dtypes.
      
      * [Bugfix] Always regenerate JIT inputs during tuning
      - Removes the caching of `self.jit_input_tensors` within `AutoTuner`. Because different autotuning configurations can alter the required input tensor shapes or other properties, reusing cached inputs from a previous configuration could lead to errors or incorrect assessments.
      - This change ensures that `profiler._get_inputs()` is called unconditionally for each configuration evaluation. Since `_get_inputs` is assumed to be relatively inexpensive, the potential overhead is considered acceptable.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Feature] Enable custom tensor supply and input caching control in Autotuner
      - Previously, tensor generation was tied to `supply_type` and input caching behavior across configurations was less explicit/controlled.
      - This commit introduces a `supply_prog` parameter to allow providing a custom function for generating input tensors, overriding the default mechanism.
      - Adds a `cache_input_tensors` flag (default True) to control input tensor caching:
          - If True, tensors are generated once per configuration and reused for repetitions, with a check for potential shape mismatches between configurations.
          - If False, tensors are regenerated for every configuration trial.
      - Refactors internal input tensor handling using supplier functions for clarity.
      - Adds a `check_tensor_list_compatibility` utility for shape comparison.
      
      * [Example] Update example_blocksparse_gemm for autotuner
      
      * Run code formatter
      
      * [Example] Small fix in example_blocksparse_gemm
      
      * [Fix] Raise error if autotuning yields no valid configuration
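      A minimal sketch of the dedicated-logger setup from the first bullet (handler choices and the formatter are assumptions):

      ```python
      import logging

      logger = logging.getLogger("autotuner")
      logger.setLevel(logging.DEBUG)
      logger.propagate = False                      # don't also emit via the root logger

      file_handler = logging.FileHandler("autotuner.log")
      file_handler.setLevel(logging.DEBUG)          # full detail goes to the log file

      console_handler = logging.StreamHandler()
      console_handler.setLevel(logging.INFO)        # only INFO and above reach the console

      formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
      for handler in (file_handler, console_handler):
          handler.setFormatter(formatter)
          logger.addHandler(handler)

      logger.debug("written to autotuner.log only")
      logger.info("written to both the console and autotuner.log")
      ```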
      92e8d5f4
    • yyttt6's avatar
      [Example] Add autotune to conv example (#301) · 1873dc00
      yyttt6 authored
      
      
      * add autotune to example_gemm.py
      
      * add autotune to conv
      
      * still coding ...
      
      * version 0
      
      * version 0
      
      * version 0
      
      * refactor autotune
      
      * refactor autotune
      
      * add autotune to conv example
      
      * add conv template to carver
      
      * add conv template to carver
      
      * add conv template to carver
      
      * Update num_stages configuration values
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      1873dc00
  11. 29 Mar, 2025 1 commit