  1. 25 Jun, 2025 1 commit
      [Example] Update examples to use @tilelang.jit (#597) · 3db18726
      Cunxiao Ni authored
      
      
      * [Example] Update kernel compilation in examples to use @tilelang.jit
      
      - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
      - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
      - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.
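
The bullets above describe moving from an explicit compile step to a decorator that takes output indices. As a rough illustration of that pattern only, here is a pure-Python sketch. `toy_jit`, the list-based "buffers", and the caching policy are hypothetical stand-ins chosen for this example; they are not tilelang's implementation or API.

```python
import functools
import inspect


def toy_jit(out_idx):
    """Hypothetical stand-in for a JIT-style decorator (not tilelang's API).

    The decorated function is a kernel factory. The first call with a given
    configuration "compiles" it (here: just wraps it); later calls reuse it.
    """
    def decorator(factory):
        compiled_cache = {}

        @functools.wraps(factory)
        def wrapper(*config):
            if config not in compiled_cache:
                kernel = factory(*config)
                n_params = len(inspect.signature(kernel).parameters)
                # Normalize negative indices such as -1 to absolute positions.
                outs = sorted(i % n_params for i in out_idx)

                def compiled(*inputs):
                    remaining = list(inputs)
                    args = []
                    for pos in range(n_params):
                        # Output slots get a fresh buffer; other slots take the
                        # caller's inputs in order.
                        args.append([] if pos in outs else remaining.pop(0))
                    kernel(*args)
                    results = [args[i] for i in outs]
                    return results[0] if len(results) == 1 else results

                compiled_cache[config] = compiled
            return compiled_cache[config]

        return wrapper

    return decorator


@toy_jit(out_idx=[-1])
def vector_add(n):
    # The last parameter (c) is the output, so callers pass only a and b.
    def kernel(a, b, c):
        for i in range(n):
            c.append(a[i] + b[i])
    return kernel


add4 = vector_add(4)  # invoked directly, no separate compile call
result = add4([1, 2, 3, 4], [10, 20, 30, 40])
```

In this sketch the caller never allocates the output: the decorator knows from `out_idx` which parameter to create and return, which is the convenience the commit message attributes to the decorator form.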
      
      * format
      
      * Update example_tilelang_sparse_gqa_decode_varlen_indice.py
      
      * Update example_dequant_gemm_fine_grained.py
      
      * Update example_gemm_autotune.py
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  2. 28 May, 2025 1 commit
      [Autotune] Introduce cache mechanism for auto tuner (#527) · 7171aff6
      Lei Wang authored
      * [Enhancement] Add commit ID to versioning and improve logging initialization
      
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
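
A minimal sketch of how a commit ID might be appended to a version string, as the bullets above describe. The name `get_git_commit_id` comes from the commit message, but its body here is an assumption; `format_version`, the `+` separator, and the 8-character truncation are hypothetical choices for illustration, not the project's actual code.

```python
import subprocess
from typing import Optional


def get_git_commit_id(cwd: Optional[str] = None) -> Optional[str]:
    """Return the current commit hash, or None when git is unavailable
    or the directory is not a repository (assumed behavior)."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=cwd,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode("utf-8").strip() or None


def format_version(base: str, commit_id: Optional[str] = None) -> str:
    """Hypothetical helper: append a short commit ID when one is known."""
    if not commit_id:
        return base
    return f"{base}+{commit_id[:8]}"


version = format_version("0.1.0", get_git_commit_id())
```

Returning `None` instead of raising keeps builds from source tarballs (where no `.git` directory exists) working with a plain version string.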
      
      * [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching
      
      * Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
      * Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
      * Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
      * Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
      * Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
      * Minor adjustments in related modules to support the new caching and profiling features.
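
The caching flow described above (derive a key, look on disk, tune only on a miss, save the result) can be sketched roughly as follows. All names here (`make_cache_key`, `save_result`, `load_result`, `tune`) and the JSON file layout are assumptions made for this example, not the AutoTuner's real methods or on-disk format.

```python
import hashlib
import json
import os


def make_cache_key(kernel_name, configs, compile_args):
    """Hash everything that affects the tuning outcome into a stable key."""
    payload = json.dumps(
        {"kernel": kernel_name, "configs": configs, "compile_args": compile_args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_result(cache_dir, key, result):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key + ".json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return path


def load_result(cache_dir, key):
    path = os.path.join(cache_dir, key + ".json")
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def tune(cache_dir, kernel_name, configs, compile_args, measure):
    """Return the best config, reusing a cached result when one exists.

    `measure` is a caller-supplied function mapping a config to a latency.
    """
    key = make_cache_key(kernel_name, configs, compile_args)
    cached = load_result(cache_dir, key)
    if cached is not None:
        return cached
    timed = [(measure(cfg), cfg) for cfg in configs]
    latency, config = min(timed, key=lambda pair: pair[0])
    result = {"latency": latency, "config": config}
    save_result(cache_dir, key, result)
    return result
```

Including the compile arguments in the key means a change of target or flags triggers a fresh tuning run rather than returning a stale result.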
      
      * [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules
      
      * Consolidated import statements and removed unnecessary line breaks for better readability.
      * Standardized function argument formatting across the AutoTuner and CompileArgs classes.
      * Enhanced consistency in the use of whitespace and indentation throughout the codebase.
      * Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
      * Ensured that all changes adhere to the project's coding style guidelines.
      
      * [Refactor] Remove redundant type hints in AutoTuner modules
      
      * Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
      * Improved code readability and maintainability by streamlining type imports across the AutoTuner module.
      
      * [Refactor] Update AutoTuner configuration for improved profiling and target detection
      
      * Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
      * Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
      * Removed redundant target specifications in certain instances to streamline the configuration process.
      * Improved overall clarity and maintainability of the autotuning logic in various example scripts.
      
      * [Refactor] Simplify code formatting and improve readability in example scripts
      
      * Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
      * Removed unnecessary line breaks and standardized argument placement across multiple files.
      * Enhanced overall code readability and maintainability in autotuning examples and performance scripts.
      
      * [Refactor] Update JIT decorator usage across multiple files
      
      * Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
      * Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
      * Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.
      
      * [Refactor] Standardize JIT decorator formatting across benchmark and example scripts
      
      * Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
      * Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
      * Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
  3. 09 May, 2025 1 commit
  4. 31 Mar, 2025 1 commit
      [Bugfix] Updated autotune usage in the examples to align with the latest changes (#309) · 66c7f6a1
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
      
      * [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
      
      - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
      - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
      - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
      - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
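
As an illustration of the two-tier lookup described above (memory first, then disk, with state initialized in `__new__`), here is a small sketch. `ToyKernelCache`, its `get` method, and the JSON storage are hypothetical; only the `__new__` and `clear_cache` names come from the commit message, and the real KernelCache may differ in every detail.

```python
import json
import os
import shutil
import threading


class ToyKernelCache:
    """Hypothetical singleton cache: in-memory dict in front of a disk store."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls, cache_dir):
        with cls._lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst.cache_dir = cache_dir
                inst._memory = {}  # in-memory tier, created exactly once
                cls._instance = inst
        # Later calls return the first instance and ignore their cache_dir.
        return cls._instance

    def get(self, key, build):
        # 1) Memory hit: no disk access at all.
        if key in self._memory:
            return self._memory[key]
        path = os.path.join(self.cache_dir, key + ".json")
        if os.path.exists(path):
            # 2) Disk hit: load and promote into memory.
            with open(path, encoding="utf-8") as f:
                value = json.load(f)
        else:
            # 3) Miss: build, then persist for future processes.
            value = build()
            os.makedirs(self.cache_dir, exist_ok=True)
            with open(path, "w", encoding="utf-8") as f:
                json.dump(value, f)
        self._memory[key] = value
        return value

    def clear_cache(self):
        """Drop both tiers."""
        self._memory.clear()
        shutil.rmtree(self.cache_dir, ignore_errors=True)
```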
      
      * lint fix
      
      * typofix
      
      * [Refactor] Update matmul and flashattn function calls to return structured results
      
      - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
      - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
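
A short sketch of what returning one structured object instead of several values could look like. The class name `TuneResult`, its field names, and the stub function are assumptions for illustration, and the numbers are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TuneResult:
    """Hypothetical container bundling what was previously a tuple."""
    latency: float
    config: Dict[str, Any]
    ref_latency: Optional[float] = None


def tune_matmul_stub() -> TuneResult:
    # Placeholder values only; a real tuner would measure these.
    return TuneResult(latency=1.0, config={"block_M": 128}, ref_latency=2.0)


# Callers access fields by name instead of unpacking positionally.
result = tune_matmul_stub()
best_config = result.config
speedup = result.ref_latency / result.latency
```

Named fields let a later change add a new entry without breaking every call site that unpacks the return value.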
      
      * lint fix
  5. 28 Mar, 2025 1 commit
      [doc/example] add gemv doc and examples (#293) · ff3cfa59
      botbw authored
      * [doc/example] init gemv doc and examples
      
      * [example] add vectorized read
      
      * [example] use local register instead of smem
      
      * [example] add bench
      
      * [doc] update doc
      
      * [doc] refine doc
      
      * [lint] format code
      
      * [doc] add tips
      
      * [doc/example] fix typo
      
      * [example] use tmv_all_reduce
      
      * [doc] update doc accordingly
      
      * [doc] add benchmark table
      
      * [lint] format code