  1. 16 Jul, 2025 1 commit
    • [Warp Specialize] Implicit Warp Specialize Programming Model (#605) · e2d25ba8
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
      
      * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
      
      - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
      - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
      
      * [Refactor] Update barrier and clear operations in warp specialization examples
      
      - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization.
      - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency.
      
      * [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling
      
      - Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management.
      - Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements.
      - Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior.
      - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output.
      - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements.
      
      * [Enhancement] Add support for disabling TMA in Copy operations
      
      - Introduced a new `disable_tma` parameter in the `Copy` class to control whether TMA (Tensor Memory Accelerator) bulk copies are used.
      - Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag.
      - Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations.
      - Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations.
      
      * [Refactor] Clean up whitespace and formatting in multiple files
      
      - Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`.
      - Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase.
      
      * [Enhancement] Refactor flash attention implementation for improved performance and configurability
      
      - Split the shared memory allocations for query and key-value pairs to optimize memory usage.
      - Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example.
      - Updated kernel execution parameters to improve thread management and synchronization.
      - Enhanced the overall structure of the flash attention function for better readability and maintainability.
      
      * fix
      
      * Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget
      
      * Refactor barrier handling and update example configurations
      
      - Replaced commented-out barrier creation with new barrier allocation in GEMM example.
      - Updated kernel configuration in warp specialization example to include async copy settings.
      - Enhanced barrier management in the phase optimization process to improve synchronization handling.
      - Introduced new barrier allocation function for better memory management in shared contexts.
      
      * Refactor barrier handling in LowerAndLegalize and OptimizeForTarget
      
      - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization.
      - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code.
      - Added exit() call in OptimizeForTarget to halt execution after barrier lowering.
      
      * Enhance CMake configuration and clean up example scripts
      
      - Enabled compile command export in CMakeLists.txt for better build integration.
      - Removed unnecessary print statement in the warp specialization example.
      - Cleaned up commented-out code in GEMM example for improved readability.
      - Updated barrier handling in shared memory allocation transformations for better synchronization.
      
      * Refactor barrier handling in warp specialization examples
      
      - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization.
      - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts.
      - Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management.
      
      * Update lower_shared_barrier.cc
      
      * Update phase.py
      
      * Update warp specialization example and Cython wrapper
      
      - Removed commented-out pass configuration options in the warp specialization example for clarity.
      - Added functionality to write the generated kernel source to a file named "kernel.cu".
      - Enhanced Cython wrapper to support boolean type conversion for improved type handling.
      
      * Add storage synchronization call in shared barrier transformation
      
      - Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process.
      
      * remove debug files
      
      * Remove kernel source output to file in warp specialization example
      
      * remove comments
      
      * Refactor tensor handling and update test execution in TileLang
      
      - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency.
      - Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`.
      - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing.
      - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity.
      
      * Update test_tilelang_language_reshape.py
  2. 03 Jun, 2025 1 commit
  3. 20 Mar, 2025 1 commit
    • [Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
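A minimal sketch of the directory check described above, assuming a flat build directory of `.so` files. The helper name `copy_shared_libs` and its arguments are hypothetical and are not taken from the project's `setup.py`.

```python
import shutil
from pathlib import Path


def copy_shared_libs(build_dir: Path, target_dir: Path) -> list[Path]:
    """Copy every .so file from build_dir into target_dir.

    Hypothetical helper: the target directory is created first, so the
    copy does not fail when the directory is missing.
    """
    # Create the target directory (and any missing parents) before copying.
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for lib in sorted(build_dir.glob("*.so")):
        dest = target_dir / lib.name
        shutil.copy2(lib, dest)  # copy2 also preserves file metadata
        copied.append(dest)
    return copied
```

Creating the directory with `exist_ok=True` keeps the step idempotent, so repeated builds do not need a separate existence test.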
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      - Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      - Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      - Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
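The behavior described above can be sketched as follows. This is an illustration only: the function names `env_flag` and `init_cache`, the cache directory argument, and the set of strings treated as "true" are assumptions, not the project's actual implementation. Only the variable name `TILELANG_CLEAR_CACHE` comes from the commit message.

```python
import os
import shutil
from pathlib import Path


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag.

    The accepted spellings are an assumption made for this sketch.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def init_cache(cache_dir: Path) -> Path:
    """Prepare the cache directory, wiping it first when the flag is set."""
    if env_flag("TILELANG_CLEAR_CACHE") and cache_dir.exists():
        shutil.rmtree(cache_dir)  # drop every cached artifact
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

Under these assumptions, a CI job would export `TILELANG_CLEAR_CACHE=true` before running the tests so that each run starts from an empty cache.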
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      - Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      - Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      - Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      - Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      - Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      - Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      - Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      - Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      - Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      - Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
  4. 12 Mar, 2025 1 commit
    • [CMake] Add CUDA Major Version Detection for Conditional Compilation (#197) · 20f19611
      Yu Cheng authored
      * [Feature] Add TMA Store Synchronization Support
      
      - Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
      - Add new builtin operations in op/builtin.cc and op/builtin.h
      - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
      - Update CUDA codegen to support new TMA store synchronization intrinsics
      - Add Python language bindings for new TMA store synchronization operations
      
      * [CMake] Add CUDA Major Version Detection for Conditional Compilation
      
      - Introduce CUDA_MAJOR_VERSION CMake variable to dynamically detect CUDA toolkit version
      - Update runtime and transform files to use CUDA_MAJOR_VERSION for version-specific code paths
      - Replace hardcoded __CUDACC_VER_MAJOR__ with dynamically set CUDA_MAJOR_VERSION
      - Improve cross-version compatibility for CUDA-dependent code sections
  5. 15 Feb, 2025 1 commit
    • [Backend][WebGPU] Support WebGPU WGSL code generation (#86) · c8fc0cbb
      Lei Wang authored
      * bump version to v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
      
      * [Feature] Add WebGPU code generation support in TileLang
      
      - Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
      - Add WebGPU target support in lower.py and target.py
      - Update CMakeLists.txt to include WebGPU codegen source files
      - Introduce WebGPU-specific code generation for WGSL shader language
      
      * [Refactor] Improve WebGPU code generation formatting and readability
      
      - Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
      - Standardize pointer type spacing and indentation
      - Improve line breaks and reduce line length for better readability
      - Minor code style improvements in WebGPU code generation
      
      * [Test] Add WebGPU matrix multiplication code generation test
      
      - Implement test_webgpu_codegen.py for WebGPU matrix multiplication
      - Add assert_gemm_codegen function to validate WebGPU code generation
      - Include basic matrix multiplication kernel test case
      
      * Update README with WebGPU codegen support announcement
  6. 14 Feb, 2025 1 commit
    • [Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm (#85) · ec84188f
      Lei Wang authored
      * bump version to v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
  7. 17 Jan, 2025 1 commit
  8. 11 Jan, 2025 1 commit
    • [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
      
      * annotate lower transform global func with `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resolved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build script
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: Microsoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: Yu Cheng <yu.cheng@pku.edu.cn>