1. 28 Sep, 2025 1 commit
    • [SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43
      Zhiwen Mo authored
      * update sm100 related utcmma, tmem, ld/st256 in src
      * update sm100 related utcmma, tmem, ld/st256 in tilelang
      * Remove deprecated GEMM examples and related README documentation for SM100 architecture support
      * Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
      * Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
      * Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
      * Update README and source files to reflect TCGEN5.MMA terminology changes
      * Refactor CUDA GEMM header for improved readability
  2. 15 Sep, 2025 1 commit
    • [feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf
      botbw authored
      
      
      * [feat] add an example mma atom
      
      * [fix] fix typo naming
      
      * [feat] add a template to enable compilation
      
      * [feat] add print util
      
      * [WIP] pass on single block tile
      
      * [feat] add sm80 metadata layout
      
      * [chore] clean codebase
      
      * [CI] format.sh
      
      * [feat] add sm80 compress utils
      
      * [bugfix] fix C fragment layout
      
      * [refactor] use nvcc version instead of str
      
      * [test] add test cases
      
      * [chore] add a param check
      
      * [chore] format a bit
      
      * [chore] rename func to satisfy PEP 8 and appease gemini
      
      * [chore] add check
      
      * [feat] support sm75 layout && add assertion && chore
      
      * [bug] fix illegal memory access when using two warps over N=32
      
      This could be a missing check related to the cutlass 2.x implementation.
      The cutlass example can't trigger it because the issue is bypassed by
      padding the input.
      
      For now it is probably safe to increase the atom size and investigate
      in the future.
      
      * [chore] add example
      
      * [chore] format
      
      * [example] update benchmark
      
      * [bugfix] fix namespace and format
      
      * [bugfix] fix incorrect param passing
      
      * [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp
      
      * [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py
      
      * [CI] fix arch
      
      * [example] add torch sparse benchmark
      
      * [misc] polish && add reference && apply review suggestions && format
      
      * [CI] format with clang-tidy
      
      * [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h
      
      * [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  3. 31 Aug, 2025 1 commit
    • [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757) · 8eab7755
      Lei Wang authored
      
      
      * [Enhancement] Introduce finalize_reducer operator and layout reducer support
      
      - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
      - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
      - Updated `setup.py` to include logging for build directory paths, improving build process visibility.
      - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
      - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
      - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.
      
      * Refactor code formatting and improve readability in multiple files
      
      - Cleaned up whitespace in `setup.py` to enhance logging clarity.
      - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
      - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
      - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.
      
      * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.
      
      * [Enhancement] Disable reuse of small arrays in shared memory allocation
      
      - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.
      
      * Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.
      
      * Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.
      
      * bug fix
      
      * Add thread checks workaround for replicated cases
      
      * Remove the is_one check
      
      * fix lint error
      
      * lint fix
      
      * Update autotune tests to use smaller matrix sizes for improved performance and reliability
      
      * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods
      
      - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
      - Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
      - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
      - Adjusted header inclusions for improved organization and clarity across multiple files.
      
      * [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py
      
      - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
      - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.
      
      * [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution
      
      - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
      - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.
      
      ---------
      Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
      Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>
  4. 15 Aug, 2025 1 commit
  5. 05 Jun, 2025 1 commit
    • [Enhancement] Add nvrtc execution backend (#461) · 17f7394f
      Gabriel Wu authored
      
      
      * [wip] feat: add nvrtc backend
      
      * [wip] fix: handle out_idx
      
      * [wip] refactor: move lib logic to libgen
      
      * feat: cache for nvrtc backend
      
      * fmt: run format
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: get kernel source
      
      * refactor: speedup pyimport
      
      * Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.
      
      * Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
      
      * Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.
      
      * fix tl_templates
      
      * ensure CUDA context
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  6. 26 May, 2025 1 commit
    • [Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) · 6addc509
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      
      * Enhance CUDA FP8 type handling and debug printing
      
      * Updated `cuda_fp8.h` to replace NVIDIA's FP8 types with Cute's FP8 types for better compatibility and structure.
      * Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
      * Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.
      
      * Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state
      
      * Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
      * Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.
      
      * Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting
      
      * Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
      * This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.
      
      * lint fix
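
The index bitwidth selection described for `ConfigIndexBitwidthRewriter` above (use the configured bitwidth when one is present, otherwise auto-check and widen to 64 bits when necessary) can be illustrated with a small sketch. The helper names are hypothetical, and the criterion used for "necessary" here, that the largest flat index does not fit in a signed 32-bit integer, is an assumption rather than something the commit message states.

```python
from typing import Optional, Sequence

INT32_MAX = 2**31 - 1


def flat_index_bound(shape: Sequence[int]) -> int:
    """Largest flat (row-major) index into a buffer with the given shape."""
    bound = 1
    for extent in shape:
        bound *= extent
    return bound - 1


def choose_index_bits(max_index: int, configured_bits: Optional[int] = None) -> int:
    """Pick an index bitwidth (hypothetical helper, not TileLang's API).

    - If a bitwidth is configured, use it as given.
    - Otherwise auto-check: use 64 bits only when the index would
      overflow a signed 32-bit integer (assumed criterion).
    """
    if configured_bits is not None:
        return configured_bits
    return 64 if max_index > INT32_MAX else 32
```

For example, a 1024 x 1024 buffer stays at 32-bit indices under auto-check, while a 65536 x 65536 buffer has a largest flat index of 2^32 - 1 and would be widened to 64 bits.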
  7. 05 Mar, 2025 1 commit
    • [Enhancement] Support debug print for unsigned char datatype (#145) · bb60f6ce
      Lei Wang authored
      * Fix debug print buffer template for unsigned char type
      
      - Update debug_print_buffer_value template specialization for unsigned char
      - Modify test_tilelang_debug_print.py to include additional dtype tests
      - Add test case for uint8 dtype in debug print buffer function
      
      * Refactor debug print buffer template formatting for unsigned char
      
      - Improve code formatting for debug_print_buffer_value template specialization
      - Adjust line breaks and indentation for better readability
      - Maintain consistent code style with other template specializations
  8. 22 Feb, 2025 1 commit
    • [Example] Implement simple block sparse kernel (#106) · c7462abf
      Lei Wang authored
      * Remove Torch CPP backend and update execution backend options
      
      - Remove TorchCPPKernelAdapter and related code from JIT modules
      - Update execution backend options in jit/__init__.py, kernel.py, and adapter/__init__.py
      - Remove "torch_cpp" from supported execution backend literals
      - Simplify backend validation and remove unused torch_cpp-related code
      
      * lint fix
      
      * Add block sparse attention implementations for TileLang and Triton
      
      - Implement block sparse attention kernels for TileLang and Triton
      - Add example scripts for block sparse attention with top-k and threshold-based masking
      - Include utility functions for generating sparse attention masks
      - Demonstrate causal attention with block-level sparsity
      - Add test cases to validate sparse attention implementations against PyTorch reference
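
The two masking strategies mentioned above, top-k and threshold-based, combined with causal block-level sparsity, can be sketched in plain Python. This is an illustration under stated assumptions, not the example scripts from the commit: the function names are hypothetical, the scores are assumed to be one number per (query block, key/value block) pair in a square matrix, and "causal" is taken to mean that a query block may only attend to key/value blocks at or before its own position.

```python
from typing import List

Matrix = List[List[float]]
Mask = List[List[bool]]


def topk_block_mask(scores: Matrix, k: int, causal: bool = True) -> Mask:
    """Keep the k highest-scoring key/value blocks for each query block."""
    n = len(scores)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        # Under causality only blocks 0..q are eligible.
        candidates = range(q + 1) if causal else range(n)
        ranked = sorted(candidates, key=lambda j: scores[q][j], reverse=True)
        for j in ranked[:k]:
            mask[q][j] = True
    return mask


def threshold_block_mask(scores: Matrix, threshold: float, causal: bool = True) -> Mask:
    """Keep every block whose score reaches the threshold."""
    n = len(scores)
    return [
        [scores[q][j] >= threshold and (not causal or j <= q) for j in range(n)]
        for q in range(n)
    ]
```

A kernel would then skip every (query block, key/value block) tile whose mask entry is `False`. Note that with top-k and a causal mask, the first few query blocks keep fewer than k blocks because fewer than k are eligible.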
  9. 24 Jan, 2025 1 commit
    • [Debug] Introduce `T.print` for buffer and variables logging on frontend (#45) · 8cdc185b
      Lei Wang authored
      * [Doc] Update documentation structure and content: add overview section, revise project name, and change theme to Furo
      
      * [Feature] Add device-side debug printing functions and integrate into kernel interface
      
      * lint fix
      
      * remove debug print
      
      * implement test for debug
      
      * lint fix
      
      * add some comments
      
      * Enhance fragment design and assert fragment print
      
      * enhance debug print
      
      * add test for msg
      
      * lint fix