"test/git@developer.sourcefind.cn:gaoqiong/migraphx.git" did not exist on "cc2535e0c5201e4ece51be90699dcc8b79579915"
  1. 05 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen (#323) · 89725f7f
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Update submodule and enhance FP8 type handling in CUDA codegen
      
      - Updated the TVM submodule to the latest commit.
      - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
      - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
      - Added error handling for missing configuration keys in the AutoTuner class.
      
      * lint fix
      
      * Remove print statement from example script
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <wyatuestc@gmail.com>
      89725f7f
  2. 26 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
  3. 20 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
      f2e99180
  4. 17 Mar, 2025 1 commit
  5. 13 Mar, 2025 1 commit
    • zqh-wz's avatar
      [Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5
      zqh-wz authored
      
      
      * upgrade cutlass to upstream v3.8.0
      
      * Implement fp8 gemm and add example script
      
      * Fix dtype retrieval with map_torch_type for fp8 inputs
      
      * Disable vectorization of fp8 values
      
      * Make MMA declaration compatible with cutlass 3.4.0+
      
      * Add test for fp8 T.gemm
      
      * fix indent
      
      * fix indent
      
      * Add copyright and license header
      
      * Add copyright and license header
      
      * lint fix
      
      * Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.
      
      * clang format lint
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      2cccf1f5
  6. 07 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases (#163) · de1ba1e4
      Lei Wang authored
      * [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation
      
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      
      * [Refactor] Update Thread Binding and Import Statements in TileLang Kernels
      
      - Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
      - Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
      - Move map_torch_type utility function to tilelang.utils.tensor
      - Remove unnecessary imports and improve code organization
      de1ba1e4
  7. 05 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Rename gemm fp8 example as we currently lack `T.gemm` support for fp8 (#144) · 37d44f24
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      
      * Update TileLang kernel test with import path changes for MMA layout and macro generator
      
      - Modify import statements in test_tilelang_kernel_dequantize_gemm.py
      - Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
      - Update main function to use tilelang.testing.main()
      
      * Add Block Sparse Attention Examples for TileLang and Triton
      
      - Implement block sparse attention kernels for both TileLang and Triton
      - Add utility functions for generating sparse attention masks using top-k and threshold methods
      - Support causal and variable-length attention scenarios
      - Include test cases for different sequence length configurations
      - Demonstrate block-level sparse attention with configurable parameters
      
      * Refactor Block Sparse Attention Examples with Code Style Improvements
      
      - Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
      - Enhance readability by adjusting line breaks and indentation
      - Simplify kernel and function calls with better formatting
      - Add whitespace and line break improvements for better code clarity
      
      * Enhance Layout Plotting with Multi-Replication and Dynamic Visualization
      
      - Update plot_layout function to support multiple replications in thread and value mapping
      - Improve thread and value mapping to handle replicated layouts
      - Dynamically adjust figure size and legend positioning
      - Add print statements for saved plot file paths
      - Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting
      
      * Refactor AtomicAdd functions in CUDA common header
      
      - Implement a generic template for AtomicAdd function
      - Specialize templates for half_t, bfloat16_t, and pointer types
      - Reorganize and clean up existing AtomicAdd implementations
      - Improve type handling and conversion in atomic operations
      
      * Remove unused import in MHA backward test file
      
      - Remove unnecessary argparse import from test_tilelang_kenrel_mha_bwd.py
      - Add blank line for improved code formatting
      - Minor code cleanup in test file
      
      * Add FP8 GEMM Example with TensorCore Intrinsics
      
      - Implement a new example for FP8 matrix multiplication using TensorCore intrinsics
      - Support E4M3 and E5M2 floating-point 8-bit data types
      - Add README with notes on current FP8 implementation limitations
      - Include correctness test for FP8 GEMM with different configurations
      - Demonstrate swizzle layout and pipeline optimizations for FP8 computation
      37d44f24
  8. 26 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Example] Update GEMM FP8 Example (#123) · 13f4b5c6
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      
      * nas kernel
      
      * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
      
      - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
      - Increased default number of heads and selected blocks in TileLang NSA example
      - Added Ruff linter ignore comments to reference.py
      - Standardized function signatures and improved code readability across NSA implementations
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation
      
      - Update flash attention kernel to support positional embeddings (PE)
      - Modify reference implementation to handle PE and group query attention
      - Increase default batch size and adjust benchmarking parameters
      - Improve kernel performance and readability
      - Add einops and torch operations for more flexible tensor manipulation
      
      * Update README.md with corrected Flash MLA Decoding example path
      
      - Modify the example link for Flash MLA Decoding to point to the correct directory
      - Ensure accurate navigation to the DeepSeek MLA decoding example
      13f4b5c6