    [Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency (#95) · 2ac51a03
    Lei Wang authored
    * Bump version to v0.1.0
    
    * [Enhancement] Add custom develop command for editable installs and update .gitignore
    
    * [Documentation] Update README to include system dependencies installation instructions
    
    * [Build] Update setup.py to support library file copying for both release and develop modes
    
    * [Build] Refactor library file copying logic in setup.py
    
    * [Documentation] Remove unnecessary install section header in Installation.md
    
    * [Build] Add tox configuration and local distribution script for multi-Python version support
    
    * [Build] Improve git submodule update function with better error handling
    
    * [Build] Update LLVM configuration path in ROCm installation script
    
    * [Build] Add .tox/ to .gitignore for tox testing environment
    
    * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt
    
    * [Cleanup] Remove unused TVM runtime error codes header
    
    * [Cleanup] Fix TVM grid constant type reference in CUDA module
    
    * [Cleanup] Remove unused customized_code function from IR module
    
    * [Feature] Add TileLang thread synchronization and storage access analysis passes
    
    * [Build] Reorder DLL search path directories for more flexible library loading
    
    * [Refactor] Improve thread synchronization and library path handling
    
    - Rename ThreadSync and TileLangThreadSync functions in C++ code
    - Update Python docstring for ThreadSync with more detailed description
    - Reorder library path detection in tilelang environment setup
    - Minor comment and code cleanup in CUDA and warp specialization modules
    
    * [Refactor] Improve thread synchronization code style and formatting
    
    - Standardize pointer type spacing in storage_access.h and storage_access.cc
    - Update whitespace and indentation in thread_storage_sync.cc
    - Reorder include statements in thread_partial_sync.cc
    - Minor code formatting improvements across thread synchronization files
    
    * [Refactor] Fix global function registration for ThreadSync
    
    - Correct global function registration to use ThreadSync instead of TileLangThreadSync
    - Update TVM global registration to match recent refactoring efforts
    
    * [Refactor] Simplify ThreadSync global function registration
    
    - Remove unnecessary whitespace in global function registration
    - Compact the TVM global registration line for ThreadSync
    
    * [Feature] Add WebGPU code generation support in TileLang
    
    - Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
    - Add WebGPU target support in lower.py and target.py
    - Update CMakeLists.txt to include WebGPU codegen source files
    - Introduce WebGPU-specific code generation for WGSL shader language
    
    * [Refactor] Improve WebGPU code generation formatting and readability
    
    - Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
    - Standardize pointer type spacing and indentation
    - Improve line breaks and reduce line length for better readability
    - Minor code style improvements in WebGPU code generation
    
    * [Test] Add WebGPU matrix multiplication code generation test
    
    - Implement test_webgpu_codegen.py for WebGPU matrix multiplication
    - Add assert_gemm_codegen function to validate WebGPU code generation
    - Include basic matrix multiplication kernel test case
    
    * Update README with WebGPU codegen support announcement
    
    * Support multi-version PyPI package builds via tox
    
    * Add support for CPU device backend with C code generation
    
    - Introduce `is_cpu_device_backend` function to detect CPU backend with C code generation
    - Modify `lower` function to handle special case of CPU device backend
    - Update host and device call filtering for CPU backend
    - Add conditional source code generation for C host target
    - Extend JITKernel to support optional target_host parameter
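
The CPU backend special case above could look roughly like the sketch below. It is only an illustration under stated assumptions: `TargetStub`, `is_cpu_device_backend_sketch`, and `describe_lowering` are hypothetical stand-ins, and the idea that the C backend is identified by the kind name `"c"` is an assumption, not something this commit message confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetStub:
    """Hypothetical stand-in for a compilation target; only the kind name is modeled."""
    kind: str


def is_cpu_device_backend_sketch(target: TargetStub) -> bool:
    # Assumption: the C code generation backend is identified by the kind name "c".
    return target.kind == "c"


def describe_lowering(target: TargetStub) -> str:
    # Branch on the predicate so the CPU/C case can be handled separately
    # from accelerator targets; the labels here are purely illustrative.
    if is_cpu_device_backend_sketch(target):
        return "cpu-c-path"
    return "default-path"


cpu_plan = describe_lowering(TargetStub("c"))
gpu_plan = describe_lowering(TargetStub("cuda"))
```

Keeping the check in one small predicate means the lowering code and the JIT entry point can share the same definition of "CPU backend with C code generation".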
    
    * lint fix
    
    * Enhance JIT kernel adapters with CTypes and Torch C++ backends
    
    - Add CtypesKernelAdapter with dynamic library generation and kernel wrapping
    - Implement TorchCPPKernelAdapter for CUDA kernel compilation
    - Refactor BaseKernelAdapter to support more flexible initialization
    - Improve error handling and argument processing in kernel adapters
    - Update adapter initialization to support various execution backends
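
As a rough illustration of the ctypes-based wrapping idea, the sketch below declares a C-style kernel signature once and then calls the entry point with plain ctypes buffers. It is not the actual `CtypesKernelAdapter`: `CtypesKernelSketch` and `_vector_add` are hypothetical names, and a Python callback stands in for a function that would normally be loaded from a compiled shared library.

```python
import ctypes

# C-style signature: int kernel(float* a, float* b, float* c, int n)
FloatPtr = ctypes.POINTER(ctypes.c_float)
KernelFn = ctypes.CFUNCTYPE(ctypes.c_int, FloatPtr, FloatPtr, FloatPtr, ctypes.c_int)


def _vector_add(a, b, c, n):
    # Hypothetical stand-in for a compiled kernel entry point.
    for i in range(n):
        c[i] = a[i] + b[i]
    return 0  # zero signals success


class CtypesKernelSketch:
    """Wraps a ctypes function pointer and marshals Python lists into C buffers."""

    def __init__(self, entry):
        # Keep a reference so the underlying callback is not garbage collected.
        self._entry = entry

    def __call__(self, a, b):
        if len(a) != len(b):
            raise ValueError("inputs must have the same length")
        n = len(a)
        buf_type = ctypes.c_float * n
        a_buf, b_buf, c_buf = buf_type(*a), buf_type(*b), buf_type()
        status = self._entry(a_buf, b_buf, c_buf, n)
        if status != 0:
            raise RuntimeError(f"kernel returned error code {status}")
        return list(c_buf)


vector_add = CtypesKernelSketch(KernelFn(_vector_add))
result = vector_add([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

In a real adapter the entry point would presumably come from `ctypes.CDLL` on a generated library, and the buffers would be raw device pointers rather than copied lists; the point here is only that the call signature is fixed once and each call passes raw pointers directly.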
    
    * Refactor and clean up code style in JIT CTypes adapter modules
    
    - Apply consistent code formatting and whitespace in CTypes adapter files
    - Remove unused imports and improve import organization
    - Enhance readability of code in adapter, libgen, and wrapper modules
    - Add missing whitespace and improve line breaks
    - Minor linting and code style improvements across CTypes adapter files
    
    * Add test for TileLang JIT GEMM with CTypes backend
    
    - Implement comprehensive test for matrix multiplication using CTypes execution backend
    - Create test functions for GEMM with float16 data type
    - Add kernel source verification with custom callback
    - Implement reference implementation using PyTorch for result validation
    - Support various matrix multiplication configurations (transposition, block sizes)
    
    * test fix
    
    * Update TileLang JIT callback registration with override parameter
    
    - Modify tilelang_callback_cuda_postproc to use @tvm.register_func(override=True)
    - Ensure proper function registration with ability to replace existing implementations
__init__.py 295 Bytes