  1. 10 Oct, 2025 1 commit
  2. 15 Aug, 2025 1 commit
  3. 08 Jul, 2025 1 commit
      [Refactor] refactor autotune examples (#617) · d110d087
      Lei Wang authored
      * [Refactor] Update tilelang kernel functions and remove unused imports
      
      - Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance.
      - Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation.
      - Removed unused import of `cached` from `tilelang`, streamlining the code.
      - Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use.
      
      * [Refactor] Simplify configuration generation in benchmark and example scripts
      
      - Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability.
      - Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning.
      - Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized.
      
      * [Refactor] Update configuration handling in benchmark scripts
      
      - Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management.
      - Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning.
      - Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks.
      - Cleaned up redundant code and improved overall readability in the affected files.
      
      * [Refactor] Clean up formatting and update subproject commit
      
      - Updated the subproject commit reference in the TVM directory to indicate a dirty state.
      - Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability.
      - Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability.
      
      * [Refactor] Update AutoTuner configuration handling
      
      - Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling.
      - Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources.
      - Updated comments for clarity and maintainability.
      
      * lint fix
      
      * Update TVM subproject commit to indicate dirty state and modify MHA backward test cases
      
      - Updated the subproject commit reference in the TVM directory to reflect a dirty state.
      - Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256.
      - Uncommented the main testing function call for potential execution.
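The dictionary-based `get_configs` approach described in this commit can be sketched as below. This is a minimal illustration under stated assumptions, not the code from the repository: the parameter names (`block_M`, `block_N`, `num_stages`) and their candidate values are hypothetical and were chosen only for the example.

```python
import itertools


def get_configs(**param_space):
    """Expand {parameter: [candidate values]} into a list of config dicts.

    Each returned dict picks exactly one value per parameter, so the list
    covers the full Cartesian product of the search space.
    """
    keys = list(param_space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*param_space.values())
    ]


# Hypothetical search space; names and values are illustrative only.
configs = get_configs(block_M=[64, 128], block_N=[64, 128], num_stages=[2, 3])
print(len(configs))
print(configs[0])
```

Keeping the search space in one dictionary means adding or removing a tunable parameter is a one-line change, and each config can be passed to a kernel factory as keyword arguments.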
  4. 04 Jul, 2025 1 commit
      [Doc] Phase out legacy documentation (#610) · d9ae74c6
      Lei Wang authored
      - Added a new entry in the README for the introduction of `T.gemm_sp` supporting 2:4 sparse tensor core.
      - Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure.
  5. 12 Apr, 2025 1 commit
      [Docs] Add AMD Flash MLA Documentation to Tutorials Section (#376) · 0997c333
      Lei Wang authored
      * [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang
      
      - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
      - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
      - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
      - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.
      
      * doc update
      
      * [Add] Enhance AMD FlashMLA implementation and documentation
      
      - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation.
      - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
      - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.
      
      * lint fix
      
      * [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies
      
      - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
      - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling.
      - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
      - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
  6. 05 Apr, 2025 1 commit
  7. 28 Mar, 2025 1 commit
      [doc/example] add gemv doc and examples (#293) · ff3cfa59
      botbw authored
      * [doc/example] init gemv doc and examples
      
      * [example] add vectorized read
      
      * [example] use local register instead of smem
      
      * [example] add bench
      
      * [doc] update doc
      
      * [doc] refine doc
      
      * [lint] format code
      
      * [doc] add tips
      
      * [doc/example] fix typo
      
      * [example] use tmv_all_reduce
      
      * [doc] update doc accordingly
      
      * [doc] add benchmark table
      
      * [lint] format code
  8. 26 Mar, 2025 1 commit
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
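The commit above moves a deprecation decorator into a utilities module while renaming `T.Buffer` arguments to `T.Tensor`. A generic decorator of that kind might look like the sketch below. It is an assumption about the general pattern, not the repository's implementation; `legacy_buffer` and the string names passed to the decorator are hypothetical.

```python
import functools
import warnings


def deprecated(old_name, new_name):
    """Wrap a callable so each call emits a DeprecationWarning naming its replacement."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{old_name} is deprecated; use {new_name} instead.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller, not the wrapper
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical legacy entry point kept alive during a rename.
@deprecated("legacy_buffer", "make_tensor")
def legacy_buffer(shape, dtype="float16"):
    return {"shape": tuple(shape), "dtype": dtype}
```

Calling `legacy_buffer((2, 3))` still returns its result, so existing callers keep working while the warning points them to the new name.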
  9. 19 Mar, 2025 1 commit
      [Examples] Implement elementwise add kernel (#219) · 43bd9d3e
      Chenghua authored
      * [Example] Modify tuning configurations for FlashAttention example
      
      * [Examples] formatting example_gqa_fwd_bshd.py
      
      * [Examples] Implement elementwise add kernel
      
      * [Doc] Update ElementWise Operators document
      
      * [Examples] Replace the example of elementwise add.
  10. 21 Feb, 2025 1 commit
      [JIT] Support Cython jit and make cython a default execution backend (#102) · 3471904f
      Lei Wang authored
      * [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution
      
      - Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
      - Add support for multi-stream kernel execution in CTypes backend
      - Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
      - Add symbolic shape utility function in tilelang.language
      - Update profiler to improve flexibility in benchmark selection
      
      * Remove redundant thread binding in GEMM kernel implementations
      
      - Remove unnecessary `thread_binding` line in GEMM kernel functions
      - Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
      - Enhance code readability by removing redundant thread binding annotation
      
      * Fix indentation in int4 GEMM kernel test file
      
      - Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
      - Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
      - Improve code formatting for better readability
      
      * [Feature] Add Cython JIT kernel support for dynamic shapes and multi-stream execution
      
      - Implement CythonKernelAdapter to handle dynamic symbolic shapes
      - Add support for multi-stream kernel execution in Cython backend
      - Create comprehensive test suite for Cython GEMM kernel in test_tilelang_jit_gemm_cython.py
      - Update JITKernel to include "cython" as a valid execution backend
      - Add Cython-specific wrapper and library generation modules
      - Update .gitignore to exclude Cython cache directory
      - Modify setup.py to include Cython source files in package data
      
      * lint fix
      
      * [Refactor] Replace JITKernel with compile() function for kernel compilation
      
      - Add new `compile()` function in tilelang/jit/__init__.py as a wrapper for JITKernel
      - Update multiple test files and examples to use `tilelang.compile()` instead of `tilelang.JITKernel()`
      - Modify kernel adapters to support optional kernel-only source retrieval
      - Update `__init__.py` to import the new `compile()` function
      - Improve kernel source retrieval for different execution backends
      
      * lint fix
      
      * remove debug print
      
      * Add C/C++ compiler utility module and update Cython JIT kernel support
      
      - Introduce new `tilelang/contrib/cc.py` module with cross-platform C/C++ compiler utilities
      - Add functions to detect and retrieve system C/C++ compilers
      - Implement cross-compilation and shared library creation support
      - Update Cython JIT kernel to validate C++ compiler availability
      - Modify Cython adapter to use detected C++ compiler for library generation
      
      * Refactor float8 dtype mapping in tensor utility module
      
      - Move float8_dtype_map inside adapt_torch2tvm function
      - Simplify global scope by localizing the dtype mapping
      - Maintain existing functionality for converting torch float8 tensors to TVM ndarray
      
      * revert
      
      * Enhance Cython JIT adapter with Cython compiler detection
      
      - Add `get_cython_compiler()` function to dynamically locate Cython executable
      - Update Cython adapter to use detected Cython compiler instead of hardcoded command
      - Raise an exception if no Cython compiler is found
      - Update requirements.txt to specify minimum PyTorch version (>=2.2.0)
      
      * Fix Cython kernel wrapper stream handling and type annotations
      
      - Update stream parameter type to int64_t for better compatibility
      - Directly use torch.cuda.current_stream().cuda_stream instead of casting
      - Improve type safety and precision in Cython kernel wrapper
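The compiler-detection steps in this commit (locating a C++ compiler and a Cython executable, and raising an exception when none is found) can be approximated with the standard library alone. The sketch below is not the code in `tilelang/contrib/cc.py`; the helper names `find_executable` and `require_executable`, the candidate lists, and the use of the `CXX` environment variable are assumptions made for illustration.

```python
import os
import shutil


def find_executable(candidates, env_var=None):
    """Return the path of the first executable found on PATH, or None.

    If env_var is given and names a resolvable executable, it takes precedence.
    """
    if env_var:
        override = os.environ.get(env_var)
        if override:
            resolved = shutil.which(override)
            if resolved:
                return resolved
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def require_executable(candidates, env_var=None):
    """Like find_executable, but raise instead of returning None."""
    path = find_executable(candidates, env_var)
    if path is None:
        raise RuntimeError(f"None of {candidates} was found on PATH")
    return path


# Hypothetical candidate lists; adjust to the toolchains you support.
cxx = find_executable(["g++", "clang++", "c++"], env_var="CXX")
cython = find_executable(["cython", "cython3"])
print("C++ compiler:", cxx)
print("Cython:", cython)
```

Resolving the tool once and failing early with a clear message avoids a hardcoded command that breaks later, in the middle of a build.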
  11. 13 Feb, 2025 1 commit
  12. 02 Feb, 2025 1 commit
      [Doc] Add matmul kernel tutorial documentations with tile library (#60) · ea612446
      Lei Wang authored
      * implement jit test case
      
      * [Dev] implement auto tune test case for matrix multiplication
      
      * Implement test for legalize memory access and vectorized loop
      
      * lint fix
      
      * introduce run_once
      
      * Refactor callback function names for consistency and improve code readability
      
      * enhance documentations
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix formatting issues in rt_mod_hip.cc
      
      * add random seed initialization for deterministic testing
      
      * Add documentation images and comprehensive GEMM tutorial for TileLang
      
      * Update MATMUL documentation title to highlight Tile Library
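The "random seed initialization for deterministic testing" item can be illustrated with a small, framework-independent sketch. It uses only Python's `random` module; the helper name `make_inputs` and the value range are hypothetical, and the tests in the repository may seed a different generator.

```python
import random


def make_inputs(n, seed=0):
    """Generate n pseudo-random floats from a generator seeded for reproducibility."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Two runs with the same seed produce identical inputs.
first = make_inputs(4, seed=42)
second = make_inputs(4, seed=42)
print(first == second)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps one test's seed from affecting another's.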
  13. 26 Jan, 2025 1 commit
      [Doc] Add debug-related testing and documentation (#58) · 5e259239
      Lei Wang authored
      * implement jit test case
      
      * [Dev] implement auto tune test case for matrix multiplication
      
      * Implement test for legalize memory access and vectorized loop
      
      * lint fix
      
      * introduce run_once
      
      * Refactor callback function names for consistency and improve code readability
      
      * enhance documentations
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix formatting issues in rt_mod_hip.cc
      
      * add random seed initialization for deterministic testing