1. 19 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency (#95) · 2ac51a03
      Lei Wang authored
      * bump version into v0.1.0
      
      * [Enhancement] Add custom develop command for editable installs and update .gitignore
      
      * [Documentation] Update README to include system dependencies installation instructions
      
      * [Build] Update setup.py to support library file copying for both release and develop modes
      
      * [Build] Refactor library file copying logic in setup.py
      
      * [Documentation] Remove unnecessary install section header in Installation.md
      
      * [Build] Add tox configuration and local distribution script for multi-Python version support
      
      * [Build] Improve git submodule update function with better error handling
      
      * [Build] Update LLVM configuration path in ROCm installation script
      
      * [Build] Add .tox/ to .gitignore for tox testing environment
      
      * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
      
      * [Cleanup] Remove unused TVM runtime error codes header
      
      * [Cleanup] Fix TVM grid constant type reference in CUDA module
      
      * [Cleanup] Remove unused customized_code function from IR module
      
      * [Feature] Add TileLang thread synchronization and storage access analysis passes
      
      * [Build] Reorder DLL search path directories for more flexible library loading
      
      * [Refactor] Improve thread synchronization and library path handling
      
      - Rename ThreadSync and TileLangThreadSync functions in C++ code
      - Update Python docstring for ThreadSync with more detailed description
      - Reorder library path detection in tilelang environment setup
      - Minor comment and code cleanup in CUDA and warp specialization modules
      
      * [Refactor] Improve thread synchronization code style and formatting
      
      - Standardize pointer type spacing in storage_access.h and storage_access.cc
      - Update whitespace and indentation in thread_storage_sync.cc
      - Reorder include statements in thread_partial_sync.cc
      - Minor code formatting improvements across thread synchronization files
      
      * [Refactor] Fix global function registration for ThreadSync
      
      - Correct global function registration to use ThreadSync instead of TileLangThreadSync
      - Update TVM global registration to match recent refactoring efforts
      
      * [Refactor] Simplify ThreadSync global function registration
      
      - Remove unnecessary whitespace in global function registration
      - Compact the TVM global registration line for ThreadSync
      
      * [Feature] Add WebGPU code generation support in TileLang
      
      - Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
      - Add WebGPU target support in lower.py and target.py
      - Update CMakeLists.txt to include WebGPU codegen source files
      - Introduce WebGPU-specific code generation for WGSL shader language
      
      * [Refactor] Improve WebGPU code generation formatting and readability
      
      - Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
      - Standardize pointer type spacing and indentation
      - Improve line breaks and reduce line length for better readability
      - Minor code style improvements in WebGPU code generation
      
      * [Test] Add WebGPU matrix multiplication code generation test
      
      - Implement test_webgpu_codegen.py for WebGPU matrix multiplication
      - Add assert_gemm_codegen function to validate WebGPU code generation
      - Include basic matrix multiplication kernel test case
      
      * Update README with WebGPU codegen support announcement
      
      * Support multi version pypi package build via tox
      
      * Add support for CPU device backend with C code generation
      
      - Introduce `is_cpu_device_backend` function to detect CPU backend with C code generation
      - Modify `lower` function to handle special case of CPU device backend
      - Update host and device call filtering for CPU backend
      - Add conditional source code generation for C host target
      - Extend JITKernel to support optional target_host parameter
      
      * lint fix
      
      * Enhance JIT kernel adapters with CTypes and Torch C++ backends
      
      - Add CtypesKernelAdapter with dynamic library generation and kernel wrapping
      - Implement TorchCPPKernelAdapter for CUDA kernel compilation
      - Refactor BaseKernelAdapter to support more flexible initialization
      - Improve error handling and argument processing in kernel adapters
      - Update adapter initialization to support various execution backends
      
      * Refactor and clean up code style in JIT CTypes adapter modules
      
      - Apply consistent code formatting and whitespace in CTypes adapter files
      - Remove unused imports and improve import organization
      - Enhance readability of code in adapter, libgen, and wrapper modules
      - Add missing whitespace and improve line breaks
      - Minor linting and code style improvements across CTypes adapter files
      
      * Add test for TileLang JIT GEMM with CTypes backend
      
      - Implement comprehensive test for matrix multiplication using CTypes execution backend
      - Create test functions for GEMM with float16 data type
      - Add kernel source verification with custom callback
      - Implement reference implementation using PyTorch for result validation
      - Support various matrix multiplication configurations (transposition, block sizes)
      
      * test fix
      
      * Update TileLang JIT callback registration with override parameter
      
      - Modify tilelang_callback_cuda_postproc to use @tvm.register_func(override=True)
      - Ensure proper function registration with ability to replace existing implementations
      2ac51a03
  2. 06 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Dev] Support FP8 Codegen for cuda backend (#64) · 61de5288
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      61de5288
  3. 26 Jan, 2025 1 commit
    • Lei Wang's avatar
      [Doc] Addd debug relevant testing and documentations (#58) · 5e259239
      Lei Wang authored
      * implement jit test case
      
      * [Dev] implement auto tune test case for matrix multiplication
      
      * Implement test for legalize memory access and vectorized loop
      
      * lint fix
      
      * introduce run_once
      
      * Refactor callback function names for consistency and improve code readability
      
      * enhance documentations
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix formatting issues in rt_mod_hip.cc
      
      * add random seed initialization for deterministic testing
      5e259239
  4. 25 Jan, 2025 1 commit
  5. 20 Jan, 2025 1 commit
    • Lei Wang's avatar
      [Dev][jit] Introduce jit for kernel functions (#12) · 39fc5a6d
      Lei Wang authored
      * instruction update
      
      * replace link with TileLang/tile-lang
      
      * [Dev][Adapter] Implement Torch DLPack Kernel Adapter and related utilities
      
      * lint fix
      
      * Implement JIT Compiler Components
      
      * Documents update
      
      * lint fix
      
      * update logo
      
      * install script fix
      39fc5a6d