"tests/vscode:/vscode.git/clone" did not exist on "9680943352726744eabc02341f25c7c072ec9359"
- 07 May, 2026 2 commits
- 06 May, 2026 2 commits
-
-
wangziyang authored
-
wangziyang authored
-
- 27 Apr, 2026 1 commit
-
-
qisan authored
-
- 22 Apr, 2026 2 commits
- 16 Apr, 2026 1 commit
-
-
wangziyang authored
-
- 09 Apr, 2026 1 commit
-
-
wangziyang authored
-
- 03 Apr, 2026 2 commits
-
-
wangziyang authored
-
wangziyang authored
-
- 17 Mar, 2026 1 commit
-
-
qisan authored
-
- 22 Dec, 2025 2 commits
- 21 Dec, 2025 1 commit
-
-
Lei Wang authored
[Refactor] Phaseout PassConfig `kDisableDynamicTailSplit` and `kDynamicAlignment` as they are legacy (#1486)
* [Cleanup] Remove dynamic shape example and related tests
* Deleted the dynamic shape example script `example_dynamic.py` and its corresponding test file `test_example_dynamic.py` to streamline the codebase.
* Removed unused dynamic tail split and dynamic alignment configurations from `builtin.h` and `pass_config.py`.
* Cleaned up the dynamic shape testing files to eliminate redundancy and improve maintainability.
* build fix
-
- 20 Dec, 2025 1 commit
-
-
Lei Wang authored
* [Feature] Add FullyReplicated Fragment Layout and Enhance Layout Inference
* Introduced a new static method `FullyReplicated` in the `Fragment` class to create fully replicated fragment layouts, ensuring all threads hold identical copies of the buffer.
* Updated `CopyNode` to collect fragment layouts and mark them as fully replicated during layout inference.
* Enhanced `ParallelOpNode` to expand let bindings for fragment buffer accesses, improving layout inference accuracy.
* Added documentation for new methods and updated existing methods to support the new layout features.
* lint fix
* Remove debug logging statements from layout inference process to streamline output and improve performance.
-
- 19 Dec, 2025 1 commit
-
-
Lei Wang authored
* feat(arg_binder): enhance shape variable handling and assertions
  - Implemented special handling for comparing if_then_else expressions to simplify conditions involving NULL checks.
  - Added methods to set shared shape variables and finalize deferred bindings, generating cascading if_then_else expressions and runtime assertions for non-NULL buffers.
  - Updated the binding logic to defer shape variable bindings for shared variables, ensuring proper handling across multiple nullable buffers.
* refactor(arg_binder): clean up shape variable handling and remove unused code
  - Removed deprecated methods for setting shared shape variables and finalizing deferred bindings, streamlining the argument binding process.
  - Simplified the logic for handling shape values in the `BindDLTensor` function, ensuring immediate binding for normal shape variables.
  - Enhanced clarity by eliminating unnecessary comments and code related to cascading if_then_else expressions for shared variables.
* refactor(arg_binder): enhance DLTensor binding with improved shape handling
  - Replaced the single `BindDLTensor` method with `BindDLTensors` to support multiple buffers, improving flexibility in handling DLTensor bindings.
  - Introduced a two-pass approach for shape variable handling, allowing for better management of symbolic dimensions and null checks.
  - Updated the logic to assert non-null conditions at runtime and utilize cascaded if_then_else expressions for shape retrieval, enhancing robustness.
  - Removed deprecated code and streamlined the binding process for clarity and maintainability.
* fix(test_nullable_buffer_params): improve formatting and consistency in test output
  - Updated string formatting for better readability in the `test_nullable_shared_shape` function.
  - Ensured consistent use of double quotes for string literals.
  - Added a missing newline at the end of the file for proper formatting.
* refactor(arg_binder): simplify allocation size calculation in BindDLTensors
  - Streamlined the calculation of allocation size by replacing a lambda function with a direct loop, enhancing readability and maintainability.
  - Improved clarity in the null check message for data pointers, ensuring better understanding of the binding process.
* Remove debug prints from phase.py
  Removed debug print statements after MakePackedAPI transformation.
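The "cascaded if_then_else" idea in the commit above can be pictured with a small plain-Python sketch. This is not the `arg_binder` code: `resolve_shared_dim` is a hypothetical helper, and the sketch only assumes what the message states, namely that a shape variable shared by several nullable buffers is read from a non-NULL buffer and that a runtime check rejects the case where every buffer is NULL.

```python
def resolve_shared_dim(shapes, axis):
    """Return the extent of `axis` from the first buffer that is not NULL.

    `shapes` holds one shape tuple per buffer, or None for a NULL buffer.
    Hypothetical illustration only; the real pass emits TIR expressions.
    """
    # Runtime assertion: at least one buffer must carry the shape.
    if all(shape is None for shape in shapes):
        raise ValueError("all buffers sharing this shape variable are NULL")

    # Cascade: if buffer 0 is non-NULL use it, else try buffer 1, and so on.
    for shape in shapes:
        if shape is not None:
            return shape[axis]
```

For example, `resolve_shared_dim([None, (4, 8)], 0)` reads the extent from the second buffer because the first one is NULL.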
-
- 18 Dec, 2025 4 commits
-
-
qisan authored
-
Gabriel Wu authored
* feat: CuTeDSL backend
* fix: clang-tidy
* fix: clang-format
* fix: ci
* fix: revert example gemm fp8
* fix: remove duplicate code
* fix: switch-case
* fix: fp16 silence
* fix: TVM IR print
* fix: useless tir
* fix: clang-format
* fix: remove tilelang/contrib/cutedsl/.gitignore
* fix: use hexfloat
* fix: gsym guard
* fix: unknown storage sync type
* fix: string literal
* fix: add args guard
* fix: name hint dedup
* fix: better find_kernel_by_pattern
* fix: set libpath for from_database path
* fix: guard buffer.strides
* fix: from guard
* fix: eviction guard
* fix: use thread local tma descs
* fix: ruff
* fix: drop tma_init_cpp
* fix: exc_info
* fix: negative unmatch early return
* fix: rename postproc func and add test
* fix: handle fast math according to pass config
* fix: dyn_sym parse
* fix: wrap_forward
* fix: use tvm_ffi.libinfo instead of cli
* fix: keep signature
* fix: C++ string safety
* fix: mark tma_store_add as unsupported
* fix: tvm version
* resolve ldsm and cpasync issues.
* fix: minor fixes
* fix: parse signature using ast
* fix: guard global_addr
* fix: create tempfile only when necessary
* fix: use logger.execption for exceptions
* fix: guard lib_path and host_func
* fix: remove tma_cpp_init and add timeout for cpp compile
* add timeout for mbarrier_wait.
* fix: _load_kernel_from_disk signature
* resolve codegen issues.
* fix: logger.exception
* add comment for div_by=1
* merge
* fix: reserve cutlass,cute,tl
* fix: guard tma_store
* fix: allow int64 offset in make_tensor_at_offset
* fix: guard barrier
* fix: add comments for div_by=16
* fix: div_by=1 issue
* delete div_by when offset is 0
* use tl.make_tensor when offset is 0
* fix: explicitly check cutedsl target
* fix: use param.torch_dtype()
---------
Co-authored-by: yuxic <yuxic@nvidia.com>
Co-authored-by: Yong <yong@local>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
Jinjie Liu authored
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
-
silentCoder-dev authored
* add curand.{curand_init, curand}
* run format.sh
* add default value for curand_init & add test for curand
* Update testing/python/language/test_rand.py
  Remove unused thread binding
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* remove unused library
* enable tilelang cache for testing
* run format.sh
* Revert "run format.sh"
  This reverts commit 5afaff782f31cdf653e2c45b469da8dead228b8a.
* Revert "enable tilelang cache for testing"
  This reverts commit c277a43e77938bd88d47a108dd1bd65734d4a1ae.
* Revert "remove unused library"
  This reverts commit 568ad20611f039380113937fd131151a2bffd801.
* run format.sh
* ensure FreshName for __philox_state
* ensure FreshName for __philox_state
* change the return type of T.rng_init
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
-
- 17 Dec, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Update examples and tests for improved type handling and functionality
  - Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
  - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
  - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
  - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.
* [Refactor] Update accumulation data type to float32 across examples
  - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
  - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
* [Refactor] Standardize data type usage across benchmark scripts
  - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
  - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
  - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.
* [Refactor] Standardize data type usage in templates and scripts
  - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
  - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
  - This change aims to streamline type handling and improve compatibility with existing workflows.
* [Refactor] Standardize data type usage in examples and benchmarks
  - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
  - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
  - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.
* [Refactor] Import dtypes from language.v2 module
  - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
  - This change aims to streamline data type management and improve overall code clarity.
* fix
* [Refactor] Standardize data type usage across scripts
  - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
  - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
  - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.
* [Refactor] Update data type handling for consistency and clarity
  - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
  - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
  - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
  - This refactor aims to streamline data type management and improve overall code clarity and maintainability.
* [Enhancement] Improve data type handling and error messaging
  - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
  - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
  - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
* [Fix] Correct boolean flag in GEMM SP test case
  - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
  - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.
* [Refactor] Standardize data type usage across scripts
  - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
  - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
  - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.
* [Refactor] Standardize data type usage in various modules
  - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
  - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
  - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.
* [Refactor] Update argument parsing for data types in benchmarks
  - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
  - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.
* [Refactor] Update data type handling in benchmark and example scripts
  - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
  - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
  - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.
* [Refactor] Fix data type conversion in multiple scripts
  - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
  - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
* [Refactor] Update float8 data type usage across multiple scripts
  - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
  - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.
* [Refactor] Enhance float8 data type handling in CUDA code generation
  - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
  - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
  - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
  - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
* [Refactor] Streamline float8 data type handling in CUDA and related modules
  - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
  - Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
  - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
* [Refactor] Remove unnecessary cache disabling in float8 example script
  - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
  - This change enhances clarity and maintainability of the example script without affecting its functionality.
* [Refactor] Update data type usage in debug print tests
  - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
  - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.
* lint fix
* Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples
* Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
Lei Wang authored
* [Enhancement] Introduce non-restrict parameter support in code generation
  - Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
  - Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
  - Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
  - Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
  - Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.
* [Enhancement] Improve HoistNonRestrictParams transformation
  - Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
  - Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
  - Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
  - Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.
* [Refactor] Simplify non-restrict parameter handling in code generation
  - Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
  - Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
  - Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.
* [Dependency] Update TVM subproject to latest commit 68aa8461
  - Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
  - Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
  - Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.
* fix
-
- 16 Dec, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Update KernelParam to use tvm.DataType directly and add torch_dtype conversion method
  - Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions.
  - Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation.
  - Updated various adapters to utilize the new torch_dtype method for parameter type conversion during initialization.
* [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types
  - Renamed functions for clarity: GetFP8Type, GetFP6Type, and GetFP4Type are now GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type respectively.
  - Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64).
  - Updated CUDA code generation to include new FP8 and FP4 types, ensuring proper type handling in PrintType and related functions.
  - Introduced new structures for FP8 types in cuda_fp8.h to facilitate better memory management and type packing.
  - Added methods in KernelParam and tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch.
  - Enhanced logging for debugging purposes in various CUDA functions to track type handling and memory operations more effectively.
* lint fix
* Remove unnecessary logging statements from CUDA code generation and delete obsolete matrix multiplication test file.
* [Enhancement] Add support for FP4 and FP8 types in CUDA code generation
  - Enhanced PrintVecElemLoad and PrintVecElemStore functions to handle new FP4 types.
  - Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch.
  - Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations.
  - Refactored tensor type mapping to support new float4 and float8 types, ensuring correct type handling in tensor operations.
  - Added tests for FP4 and FP8 copy operations to validate functionality and integration with existing workflows.
---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
-
Kuris authored
-
- 15 Dec, 2025 7 commits
-
-
Dayuxiaoshui authored
* [Feature] Support region as input of T.cumsum
  - Extend T.cumsum to accept BufferRegion and BufferLoad inputs in addition to Buffer
  - This enables operations on buffer slices/regions like: T.cumsum(InputG_fragment[i * chunk_size:(i + 1) * chunk_size], dim=0)
  - Update cumsum_fragment to handle region inputs properly
  - Add comprehensive tests for 1D and 2D region inputs including normal and reverse modes
  Fixes #879
* Fix formatting and add docstring for cumsum_fragment
  - Add comprehensive docstring for cumsum_fragment function
  - Format code according to ruff style guidelines
* Fix CodeRabbit review issues
  - Fix negative dimension bounds check (dim < -len(shape) instead of dim <= -len(shape))
  - Add src/dst shape compatibility validation for out-of-place cumsum
  - Update copy() type annotation to accept BufferRegion as dst parameter
  - Fix test in-place mutation issues by using out-of-place cumsum operations
  - Add non-divisible size test cases for tail region coverage
* Fix out-of-bounds access in region tests
  - Add bounds clamping using T.min() for chunk_end calculations
  - Prevents accessing beyond tensor bounds for non-divisible sizes
  - Matches reference implementation behavior
  - Fixes both 1D and 2D region test cases
* Fix region test: use simple slice expressions instead of T.min()
  - Remove T.min() which cannot be used directly in slice indices
  - Use chunk_start + chunk_size form instead
  - Rely on system's automatic bounds checking for non-divisible sizes
  - Update comments to reflect this approach
* Fix cumsum region: use region extents in lowering and update tests for shared memory
* Simplify fragment scope check using is_fragment()
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
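As a reference for what the region-based cumsum and the corrected bounds check mean, here is a minimal pure-Python sketch. It is not TileLang code: `normalize_dim` and `chunked_cumsum` are hypothetical names, and the sketch only assumes that each slice gets its own cumulative sum, that a trailing slice may be shorter when the size is not divisible by the chunk size, and that `dim == -len(shape)` is a valid axis.

```python
from itertools import accumulate


def normalize_dim(dim, ndim):
    """Map a possibly negative axis index into the range [0, ndim).

    -ndim is valid (it names axis 0), so only dim < -ndim is rejected.
    """
    if dim < -ndim or dim >= ndim:
        raise ValueError(f"dim {dim} is out of range for a {ndim}-D buffer")
    return dim + ndim if dim < 0 else dim


def chunked_cumsum(values, chunk_size, reverse=False):
    """Cumulative sum computed independently over each chunk_size slice.

    Python slicing clamps the end index, so the last chunk is simply
    shorter when len(values) is not divisible by chunk_size.
    """
    out = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        if reverse:
            # Accumulate from the right, then restore the original order.
            out.extend(reversed(list(accumulate(reversed(chunk)))))
        else:
            out.extend(accumulate(chunk))
    return out
```

With five elements and a chunk size of two, the final one-element chunk exercises the non-divisible tail case that the commit's tests cover.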
-
Xiangwen Wang authored
-
Lei Wang authored
* Introduced a new function, IsExprInvariantInVectorBoundary, to encapsulate the logic for checking if an expression is invariant within vector boundaries, improving code clarity and reusability.
* Updated the existing vectorization logic to utilize this new function, streamlining the process of determining vectorization feasibility based on boundary conditions.
* Enhanced comments for better understanding of the vectorization criteria and mathematical rationale behind the checks.
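One way to read "invariant within vector boundaries" is that an index expression takes a single value across all lanes of each aligned vector. The brute-force Python check below illustrates that reading only; it is an assumption about the property, not the symbolic analysis the C++ function performs, and `is_invariant_in_vector_boundary` is a hypothetical name.

```python
def is_invariant_in_vector_boundary(expr, extent, vector_size):
    """Check by enumeration that expr(i) is constant inside every aligned
    block of vector_size consecutive indices in range(extent)."""
    for base in range(0, extent, vector_size):
        lanes = range(base, min(base + vector_size, extent))
        # More than one distinct value means the lanes disagree.
        if len({expr(i) for i in lanes}) > 1:
            return False
    return True
```

Under this reading, `i // 4` is invariant for 4-wide vectors but not for 8-wide ones, while `i % 4` changes on every lane.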
-
Chaofan Lin authored
* [Refactor] Refactor InjectAssumes logic and make assumes work after SplitHostDevice
* address comments
* fix
* fix submodule
* fix
* fix 3rdparty
-
Lei Wang authored
* Added detailed logging for data and shape variable parameters during buffer usage detection in the MakePackedAPI function.
* Refactored the UsedBufferDetector to differentiate between used parameters by data and shape variables, enhancing clarity in buffer management.
* Updated logic to ensure minimal carrier buffers are selected for shape symbols, improving the efficiency of parameter handling.
-
Lei Wang authored
* [Enhancement] Add device compile flags support in pass configuration
* Introduced `kDeviceCompileFlags` option in the pass configuration to allow additional device compiler flags for CUDA compilation.
* Updated the `tilelang_callback_cuda_compile` function to merge extra flags from the pass configuration, enhancing flexibility in compiler options.
* Modified the `JITKernel` class to handle device compile flags appropriately, ensuring they are included during compilation.
* Documented the new pass configuration key for clarity on usage and expected input formats.
* lint fix
* [Refactor] Simplify compile_flags handling in JIT functions
* Removed redundant string check for compile_flags in the compile, jit, and lazy_jit functions, ensuring compile_flags is consistently treated as a list.
* Updated the JITKernel class to handle compile_flags as a list when a string is provided, enhancing code clarity and maintainability.
* lint fix
* fix
-
Lei Wang authored
* Phase out primitives
* revert changes
* Refactor GemmWarpPolicy method signature for clarity
  Updated the `from_warp_partition` method in the `GemmWarpPolicy` class to return the type `GemmWarpPolicy` instead of a string, enhancing type safety and clarity in the codebase. Removed an unnecessary blank line for improved readability.
* fix
-
- 13 Dec, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Add read-only parameter annotation for CUDA codegen
* Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen.
* Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads.
* Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process.
* lint fix
* [Dependency] Update apache-tvm-ffi version to >=0.1.3
* Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes.
* Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity.
* lint fix
* [Enhancement] Refactor ReadWriteMarker for improved parameter handling
* Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables.
* Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables.
* Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters.
* Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications.
* lint fix
* Update tma_load function to use const qualifier for global memory pointer
* Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations.
* This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming.
* Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity
* Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.
-
Lei Wang authored
* [Enhancement] Update AtomicAdd function signature to accept pointer to destination
* Modified AtomicAdd in CUDA to take a pointer instead of a reference for the destination argument.
* Updated related code in atomicadd_vectorize.cc to ensure compatibility with the new signature.
* Adjusted Python interface in atomic.py to pass the destination by pointer, aligning with device function requirements.
* [Enhancement] Refactor AtomicAddRet function signature to accept pointer
* Updated AtomicAddRet in both CUDA and HIP to take a pointer instead of a reference for the address argument, improving consistency with the AtomicAdd function.
* Adjusted the implementation to ensure proper reinterpretation of the address type for atomic operations.
* lint fix
* [Enhancement] Refactor AtomicAddNode::MakeSIMTLoop to use destination pointer
* Updated the MakeSIMTLoop function to build a pointer to the destination element using tvm_access_ptr instead of loading the destination value directly.
* Simplified the handling of source and destination predicates, improving clarity and maintainability of the code.
* Ensured compatibility with the new pointer-based approach for atomic operations.
* lint fix
* test fix
* lint fix
-
- 12 Dec, 2025 2 commits
-
-
Xiangwen Wang authored
* Improve loop vectorize
* Improve loop vectorize
* Improve loop vectorize
* Improve loop vectorize
* Improve loop vectorize
* Add some vectorize tests and comments
-
Lei Wang authored
* [Enhancement] Add __ldg intrinsic for CUDA read-only cache loads
* Introduced the __ldg intrinsic to enable explicit read-only cached loads from global memory in CUDA.
* Updated the corresponding documentation and added support in both CUDA and HIP code generation.
* Enhanced the Python interface for __ldg to accept BufferLoad and Buffer types, improving usability.
* [Enhancement] Update formatting and linting rules in pyproject.toml; minor test adjustment
* Added new formatting rules in pyproject.toml to enforce consistent code style, including hanging indents and argument splitting.
* Updated test_tilelang_language_intrinsics_codegen.py to improve readability by adding a blank line before the main execution block.
* Refactored error messages in builtin.py for better clarity and consistency, ensuring proper formatting in function definitions and raising ValueErrors.
* lint fix
-
- 11 Dec, 2025 3 commits
-
-
Cunxiao Ni authored
-
danielhua23 authored
* enable FA2 on AMD MI300X
* make lint happy
-
Lei Wang authored
* [Dependency] Update apache-tvm-ffi version to >=0.1.2 in project files
* [Dependency] Update subproject commit for TVM to latest version afc07935
* [Enhancement] Add support for optional step parameter in loop constructs
  - Updated loop creation functions to accept an optional step parameter, enhancing flexibility in loop definitions.
  - Modified ForFrame implementations to utilize the new step parameter across various loop types including serial, parallel, and pipelined loops.
  - Adjusted related vectorization transformations to accommodate the step parameter, ensuring consistent behavior in loop vectorization processes.
* lint fix
-
- 10 Dec, 2025 1 commit
-
-
Lei Wang authored
* [Build] Update CMake configuration for tilelang_cython_wrapper installation
  - Adjusted output directories for the tilelang_cython_wrapper to ensure that development builds place the extension in build/lib.
  - Updated installation paths to place the extension in tilelang/lib within the wheel, improving organization and avoiding potential conflicts with other modules.
  - Modified the internal library path exposure in env.py to prevent shadowing of common module names, enhancing compatibility and usability in user projects.
* [Build] Standardize output directories for tilelang libraries
  - Set output directories for both tilelang and tilelang_module libraries to "${CMAKE_BINARY_DIR}/lib" for consistency in development builds.
  - This change enhances organization and ensures that all build artifacts are located in a unified directory structure.
* [Refactor] Update TVM subproject and enhance pipeline loop handling
  - Updated the TVM subproject to commit 90581fe9e5287bbcf1844ad14255a1e1e8cdf7f0.
  - Added new fields to `PipelineAnnotation` and `RewrittenBlockInfo` structures to track original statement indices and improve async state management.
  - Refactored `EmitImpl` and `PopulateWaitCounts` methods to enhance clarity and functionality, including better handling of commit groups and wait counts.
  - Simplified access index calculations and strengthened analyzer constraints for loop bounds.
* [Cleanup] Remove license block and unused includes from inject_pipeline.cc
  - Eliminated the Apache license block from the top of the file to streamline the code.
  - Removed unused include directives for memory and stringstream to enhance code clarity and reduce unnecessary dependencies.
* [Refactor] Enhance transformation pipeline and test execution
  - Added an additional Simplify transformation in the InjectSoftwarePipeline to improve optimization.
  - Updated the test file to call `test_trival_pipeline()` directly, commenting out the previous main execution for better test isolation.
-