1. 20 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Feature] Add CTypes JIT kernel support (#100) · 7c817d51
      Lei Wang authored
      * [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution
      
      - Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
      - Add support for multi-stream kernel execution in CTypes backend
      - Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
      - Add symbolic shape utility function in tilelang.language
      - Update profiler to improve flexibility in benchmark selection
      
      * Remove redundant thread binding in GEMM kernel implementations
      
      - Remove unnecessary `thread_binding` line in GEMM kernel functions
      - Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
      - Enhance code readability by removing redundant thread binding annotation
      
      * Fix indentation in int4 GEMM kernel test file
      
      - Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
      - Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
      - Improve code formatting for better readability
      7c817d51
  2. 19 Feb, 2025 3 commits
    • Lei Wang's avatar
      [Bugfix] Put `InjectPtxAsyncCopy` Pass behind `ThreadSync` Pass (#97) · 7cd6b3cd
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
      
      * [Feature] Add WebGPU code generation support in TileLang
      
      - Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
      - Add WebGPU target support in lower.py and target.py
      - Update CMakeLists.txt to include WebGPU codegen source files
      - Introduce WebGPU-specific code generation for WGSL shader language
      
      * [Refactor] Improve WebGPU code generation formatting and readability
      
      - Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
      - Standardize pointer type spacing and indentation
      - Improve line breaks and reduce line length for better readability
      - Minor code style improvements in WebGPU code generation
      
      * [Test] Add WebGPU matrix multiplication code generation test
      
      - Implement test_webgpu_codegen.py for WebGPU matrix multiplication
      - Add assert_gemm_codegen function to validate WebGPU code generation
      - Include basic matrix multiplication kernel test case
      
      * Update README with WebGPU codegen support announcement
      
      * Support multi version pypi package build via tox
      
      * Add support for CPU device backend with C code generation
      
      - Introduce `is_cpu_device_backend` function to detect CPU backend with C code generation
      - Modify `lower` function to handle special case of CPU device backend
      - Update host and device call filtering for CPU backend
      - Add conditional source code generation for C host target
      - Extend JITKernel to support optional target_host parameter
      
      * lint fix
      
      * Enhance JIT kernel adapters with CTypes and Torch C++ backends
      
      - Add CtypesKernelAdapter with dynamic library generation and kernel wrapping
      - Implement TorchCPPKernelAdapter for CUDA kernel compilation
      - Refactor BaseKernelAdapter to support more flexible initialization
      - Improve error handling and argument processing in kernel adapters
      - Update adapter initialization to support various execution backends
      
      * Refactor and clean up code style in JIT CTypes adapter modules
      
      - Apply consistent code formatting and whitespace in CTypes adapter files
      - Remove unused imports and improve import organization
      - Enhance readability of code in adapter, libgen, and wrapper modules
      - Add missing whitespace and improve line breaks
      - Minor linting and code style improvements across CTypes adapter files
      
      * Add test for TileLang JIT GEMM with CTypes backend
      
      - Implement comprehensive test for matrix multiplication using CTypes execution backend
      - Create test functions for GEMM with float16 data type
      - Add kernel source verification with custom callback
      - Implement reference implementation using PyTorch for result validation
      - Support various matrix multiplication configurations (transposition, block sizes)
      
      * test fix
      
      * Update TileLang JIT callback registration with override parameter
      
      - Modify tilelang_callback_cuda_postproc to use @tvm.register_func(override=True)
      - Ensure proper function registration with ability to replace existing implementations
      
      * Reorder TileLang lowering passes for Hopper intrinsics and PTX async copy
      
      - Adjust the order of LowerHopperIntrin and InjectPTXAsyncCopy passes
      - Move these passes to ensure correct synchronization and device preparation
      
      * Rebase main
      
      * shared.dyn
      
      * lint fix
      
      * test fix
      
      * Add environment variable handling for TileLang template and CUTLASS paths
      
      - Introduce fallback logic for TL_TEMPLATE_PATH environment variable
      - Add support for optional TL_CUTLASS_PATH configuration
      - Include TODO comment for future environment variable renaming
      7cd6b3cd
    • Lei Wang's avatar
      Update Dockerfile.cu120 (#98) · 93294e61
      Lei Wang authored
      93294e61
    • Lei Wang's avatar
      [Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency (#95) · 2ac51a03
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
      
      * [Feature] Add WebGPU code generation support in TileLang
      
      - Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
      - Add WebGPU target support in lower.py and target.py
      - Update CMakeLists.txt to include WebGPU codegen source files
      - Introduce WebGPU-specific code generation for WGSL shader language
      
      * [Refactor] Improve WebGPU code generation formatting and readability
      
      - Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
      - Standardize pointer type spacing and indentation
      - Improve line breaks and reduce line length for better readability
      - Minor code style improvements in WebGPU code generation
      
      * [Test] Add WebGPU matrix multiplication code generation test
      
      - Implement test_webgpu_codegen.py for WebGPU matrix multiplication
      - Add assert_gemm_codegen function to validate WebGPU code generation
      - Include basic matrix multiplication kernel test case
      
      * Update README with WebGPU codegen support announcement
      
      * Support multi version pypi package build via tox
      
      * Add support for CPU device backend with C code generation
      
      - Introduce `is_cpu_device_backend` function to detect CPU backend with C code generation
      - Modify `lower` function to handle special case of CPU device backend
      - Update host and device call filtering for CPU backend
      - Add conditional source code generation for C host target
      - Extend JITKernel to support optional target_host parameter
      
      * lint fix
      
      * Enhance JIT kernel adapters with CTypes and Torch C++ backends
      
      - Add CtypesKernelAdapter with dynamic library generation and kernel wrapping
      - Implement TorchCPPKernelAdapter for CUDA kernel compilation
      - Refactor BaseKernelAdapter to support more flexible initialization
      - Improve error handling and argument processing in kernel adapters
      - Update adapter initialization to support various execution backends
      
      * Refactor and clean up code style in JIT CTypes adapter modules
      
      - Apply consistent code formatting and whitespace in CTypes adapter files
      - Remove unused imports and improve import organization
      - Enhance readability of code in adapter, libgen, and wrapper modules
      - Add missing whitespace and improve line breaks
      - Minor linting and code style improvements across CTypes adapter files
      
      * Add test for TileLang JIT GEMM with CTypes backend
      
      - Implement comprehensive test for matrix multiplication using CTypes execution backend
      - Create test functions for GEMM with float16 data type
      - Add kernel source verification with custom callback
      - Implement reference implementation using PyTorch for result validation
      - Support various matrix multiplication configurations (transposition, block sizes)
      
      * test fix
      
      * Update TileLang JIT callback registration with override parameter
      
      - Modify tilelang_callback_cuda_postproc to use @tvm.register_func(override=True)
      - Ensure proper function registration with ability to replace existing implementations
      2ac51a03
  3. 18 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Wheel] Support pypi build scripts for different python via tox (#93) · fa9a19b0
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
      
      * [Feature] Add WebGPU code generation support in TileLang
      
      - Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
      - Add WebGPU target support in lower.py and target.py
      - Update CMakeLists.txt to include WebGPU codegen source files
      - Introduce WebGPU-specific code generation for WGSL shader language
      
      * [Refactor] Improve WebGPU code generation formatting and readability
      
      - Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
      - Standardize pointer type spacing and indentation
      - Improve line breaks and reduce line length for better readability
      - Minor code style improvements in WebGPU code generation
      
      * [Test] Add WebGPU matrix multiplication code generation test
      
      - Implement test_webgpu_codegen.py for WebGPU matrix multiplication
      - Add assert_gemm_codegen function to validate WebGPU code generation
      - Include basic matrix multiplication kernel test case
      
      * Update README with WebGPU codegen support announcement
      
      * Support multi version pypi package build via tox
      fa9a19b0
  4. 15 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Backend][WebGPU] Support WebGPU WGSL code generation (#86) · c8fc0cbb
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
      
      * [Feature] Add WebGPU code generation support in TileLang
      
      - Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
      - Add WebGPU target support in lower.py and target.py
      - Update CMakeLists.txt to include WebGPU codegen source files
      - Introduce WebGPU-specific code generation for WGSL shader language
      
      * [Refactor] Improve WebGPU code generation formatting and readability
      
      - Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
      - Standardize pointer type spacing and indentation
      - Improve line breaks and reduce line length for better readability
      - Minor code style improvements in WebGPU code generation
      
      * [Test] Add WebGPU matrix multiplication code generation test
      
      - Implement test_webgpu_codegen.py for WebGPU matrix multiplication
      - Add assert_gemm_codegen function to validate WebGPU code generation
      - Include basic matrix multiplication kernel test case
      
      * Update README with WebGPU codegen support announcement
      c8fc0cbb
  5. 14 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm (#85) · ec84188f
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
      ec84188f
  6. 13 Feb, 2025 3 commits
    • Lei Wang's avatar
      [WHL] Support whl building for different python versions via tox (#83) · f55defac
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      f55defac
    • Lei Wang's avatar
      [Bugfix] Bugfix of installing with develop mode (#81) · b43ab2dd
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      b43ab2dd
    • Wenhao Xie's avatar
      [Doc] Convert docs from rst format to Markdown format. (#82) · d44e291c
      Wenhao Xie authored
      * [CI] Clean up target repository before publishing documentation.
      
      * [Doc] Convert docs from rst format to Markdown format.
      d44e291c
  7. 12 Feb, 2025 2 commits
  8. 11 Feb, 2025 4 commits
    • Yu Cheng's avatar
      [Dev] Add mha backward example (#77) · a6fe61e2
      Yu Cheng authored
      * [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized
      
      * Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests
      
      * [Dev] Add mha backward example
      a6fe61e2
    • Lei Wang's avatar
      [Carver] Remove legacy todo items in carver's readme (#74) · a26a4315
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      
      * Add legend to plot_layout for improved clarity of thread and local IDs
      
      * Remove unnecessary dependencies from requirements files for cleaner setup
      
      * Remove flash_mha.py and add .gitkeep to deepseek_mla directory
      
      * Add build requirements and update installation scripts for improved setup
      
      * Introduce carver
      
      * Refactor imports and improve code formatting for consistency
      
      * Add unit tests for carver recommendation hints
      
      * lint fix
      
      * Enhance ElementwiseTemplate and BaseTemplate with detailed docstrings for improved code documentation and clarity
      
      * Refactor import statements and clean up whitespace in template files for improved readability
      
      * Add README.md for Carver framework with usage examples and architecture support
      
      * Refactor import statement in matmul_analysis.py for consistency
      
      * Refactor TileDict and TensorCorePolicy methods for improved clarity and functionality
      
      * Add tests for general matrix multiplication emit configurations
      
      * Refactor formatting in test_tilelang_carver_generate_hints.py for improved readability
      
      * Add FlashAttentionTemplate and related functionality for hint recommendations
      
      * Refactor whitespace in FlashAttentionTemplate and test_tilelang_carver_recommend_hints for improved readability
      
      * Update README.md to include FlashAttentionTemplate in the carver section
      a26a4315
    • Lei Wang's avatar
      [CostModel][Carver] Support Hint Recommend for Shared memory Kernel Fusion (#73) · 1ef644e7
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      
      * Add legend to plot_layout for improved clarity of thread and local IDs
      
      * Remove unnecessary dependencies from requirements files for cleaner setup
      
      * Remove flash_mha.py and add .gitkeep to deepseek_mla directory
      
      * Add build requirements and update installation scripts for improved setup
      
      * Introduce carver
      
      * Refactor imports and improve code formatting for consistency
      
      * Add unit tests for carver recommendation hints
      
      * lint fix
      
      * Enhance ElementwiseTemplate and BaseTemplate with detailed docstrings for improved code documentation and clarity
      
      * Refactor import statements and clean up whitespace in template files for improved readability
      
      * Add README.md for Carver framework with usage examples and architecture support
      
      * Refactor import statement in matmul_analysis.py for consistency
      
      * Refactor TileDict and TensorCorePolicy methods for improved clarity and functionality
      
      * Add tests for general matrix multiplication emit configurations
      
      * Refactor formatting in test_tilelang_carver_generate_hints.py for improved readability
      
      * Add FlashAttentionTemplate and related functionality for hint recommendations
      
      * Refactor whitespace in FlashAttentionTemplate and test_tilelang_carver_recommend_hints for improved readability
      1ef644e7
    • Yu Cheng's avatar
      [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized (#72) · 465f0107
      Yu Cheng authored
      * [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized
      
      * Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests
      465f0107
  9. 10 Feb, 2025 3 commits
    • Lei Wang's avatar
      [Bugfix] bug fix for bitblas dependency (#71) · be946d02
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      
      * Add legend to plot_layout for improved clarity of thread and local IDs
      
      * Remove unnecessary dependencies from requirements files for cleaner setup
      
      * Remove flash_mha.py and add .gitkeep to deepseek_mla directory
      
      * Add build requirements and update installation scripts for improved setup
      
      * Introduce carver
      
      * Refactor imports and improve code formatting for consistency
      
      * Add unit tests for carver recommendation hints
      
      * lint fix
      
      * Enhance ElementwiseTemplate and BaseTemplate with detailed docstrings for improved code documentation and clarity
      
      * Refactor import statements and clean up whitespace in template files for improved readability
      
      * Add README.md for Carver framework with usage examples and architecture support
      
      * Refactor import statement in matmul_analysis.py for consistency
      be946d02
    • Lei Wang's avatar
      [Carver] Introduce a tile-structure based cost model for auto tuning (#70) · cd191889
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      
      * Add legend to plot_layout for improved clarity of thread and local IDs
      
      * Remove unnecessary dependencies from requirements files for cleaner setup
      
      * Remove flash_mha.py and add .gitkeep to deepseek_mla directory
      
      * Add build requirements and update installation scripts for improved setup
      
      * Introduce carver
      
      * Refactor imports and improve code formatting for consistency
      
      * Add unit tests for carver recommendation hints
      
      * lint fix
      
      * Enhance ElementwiseTemplate and BaseTemplate with detailed docstrings for improved code documentation and clarity
      
      * Refactor import statements and clean up whitespace in template files for improved readability
      
      * Add README.md for Carver framework with usage examples and architecture support
      cd191889
    • Lei Wang's avatar
      [Dev] Remove unnecessary python dependencies (#69) · 2411fa28
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      
      * Add legend to plot_layout for improved clarity of thread and local IDs
      
      * Remove unnecessary dependencies from requirements files for cleaner setup
      
      * Remove flash_mha.py and add .gitkeep to deepseek_mla directory
      
      * Add build requirements and update installation scripts for improved setup
      2411fa28
  10. 09 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Tools] Introduce `plot_layout` to visualize the fragment layout (#68) · f9b6a92e
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      f9b6a92e
  11. 08 Feb, 2025 1 commit
    • Yu Cheng's avatar
      [CI][Test] Add test cases for tilelang transform InjectFenceProxy (#66) · 0677e542
      Yu Cheng authored
      * [Dev] Add FlashDecoding example
      
      * [CI][Test] Add test cases for tilelang kernel convolution
      
      * [CI][Test] Add test cases for tilelang kernel FlashAttention
      
      * Reduce the number of stages to ensure the shared memory allocation is valid
      
      * Temporarily remove the dim128 case
      
      * lint
      
      * update einops in requirements-dev.txt
      
      * update einops in requirements-test.txt
      
      * remove einops in requirements-dev.txt
      
      * [CI][Test] Add test cases for tilelang transform ClusterPlanning
      
      * [CI][Test] Add test cases for tilelang transform LowerHopperIntrin
      
      * [CI][Test] Add test cases for tilelang transform InjectFenceProxy
      0677e542
  12. 06 Feb, 2025 2 commits
    • Lei Wang's avatar
      [Dev] Add test case for bfloat16 and int4 gemm with mma (#65) · 0d19e268
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      0d19e268
    • Lei Wang's avatar
      [Dev] Support FP8 Codegen for cuda backend (#64) · 61de5288
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      61de5288
  13. 03 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Dev] Separate `LoopVectorize` Pass from upstream tvm (#62) · 7111239d
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      7111239d
  14. 02 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Doc] Add matmul kernel tutorial documentations with tile library (#60) · ea612446
      Lei Wang authored
      * implement jit test case
      
      * [Dev] implement auto tune test case for matrix multiplication
      
      * Implement test for legalize memory access and vectorized loop
      
      * lint fix
      
      * introduce run_once
      
      * Refactor callback function names for consistency and improve code readability
      
      * enhance documentations
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix formatting issues in rt_mod_hip.cc
      
      * add random seed initialization for deterministic testing
      
      * Add documentation images and comprehensive GEMM tutorial for TileLang
      
      * Update MATMUL documentation title to highlight Tile Library
      ea612446
  15. 27 Jan, 2025 1 commit
    • Yu Cheng's avatar
      [CI][Test] Add test cases for tilelang transform LowerHopperIntrin (#59) · 7d4156df
      Yu Cheng authored
      * [Dev] Add FlashDecoding example
      
      * [CI][Test] Add test cases for tilelang kernel convolution
      
      * [CI][Test] Add test cases for tilelang kernel FlashAttention
      
      * Reduce the number of stages to ensure the shared memory allocation is valid
      
      * Temporarily remove the dim128 case
      
      * lint
      
      * update einops in requirements-dev.txt
      
      * update einops in requirements-test.txt
      
      * remove einops in requirements-dev.txt
      
      * [CI][Test] Add test cases for tilelang transform ClusterPlanning
      
      * [CI][Test] Add test cases for tilelang transform LowerHopperIntrin
      7d4156df
  16. 26 Jan, 2025 3 commits
    • Lei Wang's avatar
      [Doc] Addd debug relevant testing and documentations (#58) · 5e259239
      Lei Wang authored
      * implement jit test case
      
      * [Dev] implement auto tune test case for matrix multiplication
      
      * Implement test for legalize memory access and vectorized loop
      
      * lint fix
      
      * introduce run_once
      
      * Refactor callback function names for consistency and improve code readability
      
      * enhance documentations
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix formatting issues in rt_mod_hip.cc
      
      * add random seed initialization for deterministic testing
      5e259239
    • Yu Cheng's avatar
      [CI][Test] Add test cases for tilelang transform ClusterPlanning (#57) · 0d8421f1
      Yu Cheng authored
      * [Dev] Add FlashDecoding example
      
      * [CI][Test] Add test cases for tilelang kernel convolution
      
      * [CI][Test] Add test cases for tilelang kernel FlashAttention
      
      * Reduce the number of stages to ensure the shared memory allocation is valid
      
      * Temporarily remove the dim128 case
      
      * lint
      
      * update einops in requirements-dev.txt
      
      * update einops in requirements-test.txt
      
      * remove einops in requirements-dev.txt
      
      * [CI][Test] Add test cases for tilelang transform ClusterPlanning
      0d8421f1
    • Wenhao Xie's avatar
  17. 25 Jan, 2025 8 commits
  18. 24 Jan, 2025 3 commits
    • Lei Wang's avatar
      [Debug] Introduce `T.print` for buffer and variables logging on frontend (#45) · 8cdc185b
      Lei Wang authored
      * [Doc] Update documentation structure and content: add overview section, revise project name, and change theme to Furo
      
      * [Feature] Add device-side debug printing functions and integrate into kernel interface
      
      * lint fix
      
      * remove debug print
      
      * implement test for debug
      
      * lint fix
      
      * add some comments
      
      * Enhance fragment design and assert fragment print
      
      * enhance debug print
      
      * add test for msg
      
      * lint fix
      8cdc185b
    • Zhengju Tang's avatar
      [CI][Test] Add test cases for tilelang transform `LayoutInference` and... · 22246c65
      Zhengju Tang authored
      [CI][Test] Add test cases for tilelang transform `LayoutInference` and `LowerTileOp` on loop tail split functionality (#29)
      
      * [CI][Test] Add test cases for tilelang transform `LayoutInference` and `LowerTileOp` on loop tail split functionality
      
      * format
      
      * rename test script
      22246c65
    • Cunxiao Ni's avatar
      [CI][Test] Add test cases for tilelang transform PipelinePlanning (#44) · 1f4bfe66
      Cunxiao Ni authored
      * [Doc] update installation.md and readme
      
      * solve conflicts
      
      * change readme
      
      * fix installation.rst
      
      * fix readme
      
      * fix installation
      
      * [fix] fix installation.rst
      
      * [Doc] fix installation.rst
      
      * [Doc] fix installation
      
      * [CI][Test] Add test cases for tilelang transform PipelinePlanning
      1f4bfe66