"...composable_kernel_onnxruntime.git" did not exist on "78b987fbd6a7897ee9827187a231441794b13490"
  1. 17 Dec, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a
      Lei Wang authored
      * [Enhancement] Update examples and tests for improved type handling and functionality
      
      - Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
      - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
      - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
      - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.
      
      * [Refactor] Update accumulation data type to float32 across examples
      
      - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
      - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
      
      * [Refactor] Standardize data type usage across benchmark scripts
      
      - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
      - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
      - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.
      
      * [Refactor] Standardize data type usage in templates and scripts
      
      - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
      - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
      - This change aims to streamline type handling and improve compatibility with existing workflows.
      
      * [Refactor] Standardize data type usage in examples and benchmarks
      
      - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
      - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.
      
      * [Refactor] Import dtypes from language.v2 module
      
      - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
      - This change aims to streamline data type management and improve overall code clarity.
      
      * fix
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
      - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
      - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.
      
      * [Refactor] Update data type handling for consistency and clarity
      
      - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
      - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
      - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
      - This refactor aims to streamline data type management and improve overall code clarity and maintainability.
      
      * [Enhancement] Improve data type handling and error messaging
      
      - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
      - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
      - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
      
      * [Fix] Correct boolean flag in GEMM SP test case
      
      - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
      - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.
      
      * [Refactor] Standardize data type usage in various modules
      
      - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.
      
      * [Refactor] Update argument parsing for data types in benchmarks
      
      - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
      - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.
      
      * [Refactor] Update data type handling in benchmark and example scripts
      
      - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
      - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
      - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.
      
      * [Refactor] Fix data type conversion in multiple scripts
      
      - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
      - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
      
      * [Refactor] Update float8 data type usage across multiple scripts
      
      - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
      - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.
      
      * [Refactor] Enhance float8 data type handling in CUDA code generation
      
      - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
      - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
      - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Streamline float8 data type handling in CUDA and related modules
      
      - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
      - Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Remove unnecessary cache disabling in float8 example script
      
      - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
      - This change enhances clarity and maintainability of the example script without affecting its functionality.
      
      * [Refactor] Update data type usage in debug print tests
      
      - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
      - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.
      
      * lint fix
      
      * Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples
      
      * Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
      c750fb8a
  2. 01 Dec, 2025 1 commit
    • botbw's avatar
      [Language] support `T.gemm_sp_v2` on sm80 and sm89 (#1056) · 283a9a00
      botbw authored
      * [misc] add a cpp side wrapper for gemm_sp_py
      
      * [misc] typing
      
      * [IR] bind GemmSPWarpPolicy
      
      * [chore] add wrapper code
      
      * [IR] fix GemmSPWarpPolicy
      
      * [codegen] apply ptxas instructions
      
      * [intrinsic] add typical (unused) mma layout
      
      * [template] add uint16 debug func
      
      * [intrinsic] add b matrix layout
      
      * [gemm_sp] enable fp16/bf16 on sm8x
      
      * [layout] refactor fp16/bf16 layout
      
      * [gemm_sp] enable int8
      
      * [chore] update test case dtype
      
      * [gemm_sp] enable fp32
      
      * [layout] refactor layouts
      
      * [intrinsic] enable ldmatrix for mat A
      
      * [layout] enable ldsm for matrix b
      
      * [layout] add ldmatrix for fp32 and fp8
      
      * [chore] refine
      
      * [chore] refactor
      
      * [chore] add fp8 efactor
      
      * [chore] refactor
      
      * [chore] add remove negative zero util
      
      * [example] add a custom compress kernel
      
      * [chore] minor update
      
      * [test] refactor gemm_sp test
      
      * [refactor] make metadata layout func
      
      * [example] add option for using cutlass layout
      
      * [doc] add a gemm_sp doc
      
      * [doc] minor polish
      
      * [chore] remove unused
      
      * [bugfix] fix non replicate b case
      
      * [test] refactor
      
      * [chore] add a check
      
      * [bugfix] fix util bug
      
      * [wip] init a new test case for v2
      
      * [chore] minor refactor
      
      * [chore] minor update
      
      * [bugfix] enable 16bit rs
      
      * [language] enable rs
      
      * [language] enable gemm_sp_sr
      
      * [language] enable gemm_sp_rr
      
      * [test] enable more tests
      
      * [tvm] update ffi binding
      
      * [chore] remove print
      
      * [chore] fix benchmark script
      
      * [lint] precommit lint
      
      * [chore] apply feedback
      
      * [test] use arch 8.0
      
      * [chore] rollback ::ordered_metadata for backward compatibility
      
      * [bugfix] fix captialized
      
      * [example] keep gemm_sp on hopper
      
      * [test] fix no fp8 normal kernel
      
      * [test] reduce matmul size to satisfy accum error
      
      * [test] use cal_diff for assertion
      
      * [bugfix] expand float8 type
      
      * [lib] add make_int4 for short type
      
      * [language] add transpose E
      
      * [bugfix] fix wrong var
      
      * [format] format
      
      * [chore] refactor binding
      
      * [chore] fix wrong passing var
      283a9a00
  3. 20 Oct, 2025 1 commit
  4. 10 Oct, 2025 1 commit
  5. 15 Aug, 2025 1 commit
  6. 08 Jul, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] refactor autotune examples (#617) · d110d087
      Lei Wang authored
      * [Refactor] Update tilelang kernel functions and remove unused imports
      
      - Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance.
      - Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation.
      - Removed unused import of `cached` from `tilelang`, streamlining the code.
      - Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use.
      
      * [Refactor] Simplify configuration generation in benchmark and example scripts
      
      - Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability.
      - Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning.
      - Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized.
      
      * [Refactor] Update configuration handling in benchmark scripts
      
      - Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management.
      - Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning.
      - Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks.
      - Cleaned up redundant code and improved overall readability in the affected files.
      
      * [Refactor] Clean up formatting and update subproject commit
      
      - Updated the subproject commit reference in the TVM directory to indicate a dirty state.
      - Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability.
      - Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability.
      
      * [Refactor] Update AutoTuner configuration handling
      
      - Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling.
      - Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources.
      - Updated comments for clarity and maintainability.
      
      * lint fix
      
      * Update TVM subproject commit to indicate dirty state and modify MHA backward test cases
      
      - Updated the subproject commit reference in the TVM directory to reflect a dirty state.
      - Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256.
      - Uncommented the main testing function call for potential execution.
      d110d087
  7. 04 Jul, 2025 1 commit
    • Lei Wang's avatar
      [Doc] Phaseout Legacy documentations (#610) · d9ae74c6
      Lei Wang authored
      - Added a new entry in the README for the introduction of `T.gemm_sp` supporting 2:4 sparse tensor core.
      - Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure.
      d9ae74c6
  8. 12 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Docs] Add AMD Flash MLA Documentation to Tutorials Section (#376) · 0997c333
      Lei Wang authored
      * [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang
      
      - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
      - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
      - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
      - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.
      
      * doc update
      
      * [Add] Enhance AMD FlashMLA implementation and documentation
      
      - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation.
      - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
      - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.
      
      * lint fix
      
      * [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies
      
      - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
      - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling.
      - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
      - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
      0997c333
  9. 05 Apr, 2025 1 commit
  10. 28 Mar, 2025 1 commit
    • botbw's avatar
      [doc/example] add gemv doc and examples (#293) · ff3cfa59
      botbw authored
      * [doc/example] init gemv doc and examples
      
      * [example] add vectorized read
      
      * [example] use local register instead of smem
      
      * [example] add bench
      
      * [doc] update doc
      
      * [doc] refine doc
      
      * [lint] format code
      
      * [doc] add tips
      
      * [doc/example] fix typo
      
      * [example] use tmv_all_reduce
      
      * [doc] update doc accordingly
      
      * [doc] add benchmark table
      
      * [lint] format code
      ff3cfa59
  11. 26 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
  12. 19 Mar, 2025 1 commit
    • Chenghua's avatar
      [Examples] Implement elementwise add kernel (#219) · 43bd9d3e
      Chenghua authored
      * [Example] Modify tuning configurations for FlashAttention example
      
      * [Examples] formatting example_gqa_fwd_bshd.py
      
      * [Examples] Implement elementwise add kernel
      
      * [Doc] Update ElementWise Operators document
      
      * [Examples] Replace the example of elementwise add.
      43bd9d3e
  13. 21 Feb, 2025 1 commit
    • Lei Wang's avatar
      [JIT] Support Cython jit and make cython a default execution backend (#102) · 3471904f
      Lei Wang authored
      * [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution
      
      - Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
      - Add support for multi-stream kernel execution in CTypes backend
      - Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
      - Add symbolic shape utility function in tilelang.language
      - Update profiler to improve flexibility in benchmark selection
      
      * Remove redundant thread binding in GEMM kernel implementations
      
      - Remove unnecessary `thread_binding` line in GEMM kernel functions
      - Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
      - Enhance code readability by removing redundant thread binding annotation
      
      * Fix indentation in int4 GEMM kernel test file
      
      - Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
      - Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
      - Improve code formatting for better readability
      
      * [Feature] Add Cython JIT kernel support for dynamic shapes and multi-stream execution
      
      - Implement CythonKernelAdapter to handle dynamic symbolic shapes
      - Add support for multi-stream kernel execution in Cython backend
      - Create comprehensive test suite for Cython GEMM kernel in test_tilelang_jit_gemm_cython.py
      - Update JITKernel to include "cython" as a valid execution backend
      - Add Cython-specific wrapper and library generation modules
      - Update .gitignore to exclude Cython cache directory
      - Modify setup.py to include Cython source files in package data
      
      * lint fix
      
      * [Refactor] Replace JITKernel with compile() function for kernel compilation
      
      - Add new `compile()` function in tilelang/jit/__init__.py as a wrapper for JITKernel
      - Update multiple test files and examples to use `tilelang.compile()` instead of `tilelang.JITKernel()`
      - Modify kernel adapters to support optional kernel-only source retrieval
      - Update `__init__.py` to import the new `compile()` function
      - Improve kernel source retrieval for different execution backends
      
      * lint fix
      
      * remove debug print
      
      * Add C/C++ compiler utility module and update Cython JIT kernel support
      
      - Introduce new `tilelang/contrib/cc.py` module with cross-platform C/C++ compiler utilities
      - Add functions to detect and retrieve system C/C++ compilers
      - Implement cross-compilation and shared library creation support
      - Update Cython JIT kernel to validate C++ compiler availability
      - Modify Cython adapter to use detected C++ compiler for library generation
      
      * Refactor float8 dtype mapping in tensor utility module
      
      - Move float8_dtype_map inside adapt_torch2tvm function
      - Simplify global scope by localizing the dtype mapping
      - Maintain existing functionality for converting torch float8 tensors to TVM ndarray
      
      * Refactor float8 dtype mapping in tensor utility module
      
      - Move float8_dtype_map inside adapt_torch2tvm function
      - Simplify global scope by localizing the dtype mapping
      - Maintain existing functionality for converting torch float8 tensors to TVM ndarray
      
      * revert
      
      * Enhance Cython JIT adapter with Cython compiler detection
      
      - Add `get_cython_compiler()` function to dynamically locate Cython executable
      - Update Cython adapter to use detected Cython compiler instead of hardcoded command
      - Raise an exception if no Cython compiler is found
      - Update requirements.txt to specify minimum PyTorch version (>=2.2.0)
      
      * Fix Cython kernel wrapper stream handling and type annotations
      
      - Update stream parameter type to int64_t for better compatibility
      - Directly use torch.cuda.current_stream().cuda_stream instead of casting
      - Improve type safety and precision in Cython kernel wrapper
      3471904f
  14. 13 Feb, 2025 1 commit
  15. 02 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Doc] Add matmul kernel tutorial documentations with tile library (#60) · ea612446
      Lei Wang authored
      * implement jit test case
      
      * [Dev] implement auto tune test case for matrix multiplication
      
      * Implement test for legalize memory access and vectorized loop
      
      * lint fix
      
      * introduce run_once
      
      * Refactor callback function names for consistency and improve code readability
      
      * enhance documentations
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix formatting issues in rt_mod_hip.cc
      
      * add random seed initialization for deterministic testing
      
      * Add documentation images and comprehensive GEMM tutorial for TileLang
      
      * Update MATMUL documentation title to highlight Tile Library
      ea612446
  16. 26 Jan, 2025 1 commit
    • Lei Wang's avatar
      [Doc] Addd debug relevant testing and documentations (#58) · 5e259239
      Lei Wang authored
      * implement jit test case
      
      * [Dev] implement auto tune test case for matrix multiplication
      
      * Implement test for legalize memory access and vectorized loop
      
      * lint fix
      
      * introduce run_once
      
      * Refactor callback function names for consistency and improve code readability
      
      * enhance documentations
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix formatting issues in rt_mod_hip.cc
      
      * add random seed initialization for deterministic testing
      5e259239