"git@developer.sourcefind.cn:gaoqiong/migraphx.git" did not exist on "9f046d674b43da0713282ad777652da392e15452"
Commit f2e99180 authored by Lei Wang, committed by LeiWang1999

[Refactor] Phase Out LLVM Dependency by Making It Optional (#247)

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to use the new `get_profiler` method for consistent performance measurement.
- Adjusted assertions and benchmarking calls across the examples to align with the new profiling structure.
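
For reference, the before/after pattern applied across the examples looks roughly like the sketch below; `program`, `ref_program`, and the output index are placeholders standing in for each example's own definitions.

```python
import tilelang

# Before: lower to a runtime module and wrap it in a Profiler manually.
#   mod, params = tilelang.lower(program)
#   mod = tilelang.Profiler(mod, params, [out_idx], tilelang.TensorSupplyType.Normal)
#   latency = mod.do_bench(mod.func, warmup=500)

# After: compile directly to a JIT kernel and ask it for a profiler.
kernel = tilelang.compile(program, out_idx=[out_idx])
profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)  # correctness check
latency = profiler.do_bench(warmup=500)                       # benchmark the compiled kernel
```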

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* lint fix

* lint fix

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
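
A minimal sketch of the resulting flow, using only names that appear in the updated README snippets further down in this diff; `func` stands in for a TileLang kernel function.

```python
import tilelang
from tilelang import Profiler

artifact = tilelang.lower(func)   # lowering now returns an artifact object
print(artifact.kernel_source)     # generated device (CUDA/HIP) kernel source

# The runtime module and parameters remain reachable through the artifact:
profiler = Profiler(artifact.rt_mod, artifact.params, result_idx=[2])
```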

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests
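
A sketch of how the decorator is applied; the test name and body are illustrative, not the repository's actual tests.

```python
import tilelang.testing

@tilelang.testing.requires_llvm
def test_vectorize_add():
    # Skipped when the TVM build has no LLVM support, which is now optional.
    ...
```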

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
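
A hypothetical sketch of what the env.py change might look like; the directory layout and helper names here are assumptions, not the actual code.

```python
import os
import os.path as osp

# Assumed location where setup.py places the built shared libraries for the
# PyPI wheel; the real path used in env.py may differ.
PYPI_BUILD_LIB = osp.join(osp.dirname(osp.abspath(__file__)), "lib")

# Extend the search path so the bundled libtvm can be found after installation.
TVM_LIBRARY_PATH = os.pathsep.join(
    p for p in (os.environ.get("TVM_LIBRARY_PATH", ""), PYPI_BUILD_LIB) if p)
```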

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.
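
A sketch of the copy logic described above; the function name and directory arguments are illustrative.

```python
import os
import shutil

def copy_built_libs(build_dir, target_dir):
    # Ensure the target library directory exists before copying.
    if not os.path.exists(target_dir):
        os.makedirs(target_dir, exist_ok=True)
    for name in os.listdir(build_dir):
        if name.endswith(".so"):
            src = os.path.join(build_dir, name)
            shutil.copy2(src, os.path.join(target_dir, name))
            os.remove(src)  # remove the original after copying, as noted earlier
```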

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update its registration. Add the @tilelang.testing.requires_llvm decorator to the tests that require LLVM.

* lint fix

* [Enhancement] Support device code generation without compilation

- Updated `host_codegen` and `device_codegen` to include new transformations and to register `tilelang_hip_without_compile`.
- Refactored the JIT kernel adapters to accommodate separate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
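
A hypothetical sketch of the target dispatch these two notes describe; only the `tilelang_hip_without_compile` and `tilelang_cpp` names come from this change, while the function shape, the `target.build.` registry prefix, and the CUDA counterpart's name are assumptions.

```python
import tvm

def device_codegen_without_compile(device_mod, target):
    # Pick a source-only build function by target kind (assumed registry keys).
    if target.kind.name == "cuda":
        build = tvm.get_global_func("target.build.tilelang_cuda_without_compile")
    elif target.kind.name == "hip":
        build = tvm.get_global_func("target.build.tilelang_hip_without_compile")
    elif target.kind.name == "c":
        build = tvm.get_global_func("target.build.tilelang_cpp")
    else:
        raise ValueError(f"Unsupported target for source-only codegen: {target.kind.name}")
    return build(device_mod, target)
```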

* [Enhancement] Implement auto-clear cache feature based on environment variable

* Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
* Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
* Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
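
A sketch of the intended behaviour; the truthiness check and the clearing entry point are assumptions about the implementation.

```python
import os

def _clear_cache_requested() -> bool:
    # Treat "1"/"true"/"yes" (case-insensitive) as enabled; assumed convention.
    return os.environ.get("TILELANG_CLEAR_CACHE", "").lower() in ("1", "true", "yes")

if _clear_cache_requested():
    import tilelang
    tilelang.cache.clear_cache()  # assumed clearing helper; the real entry point may differ
```

In CI this can then be enabled per run, e.g. `TILELANG_CLEAR_CACHE=1 pytest ...`.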

* [Refactor] Update kernel invocation and import paths in tests and cache

* Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
* Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
* Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

* Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
* Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

* Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
* Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
* Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
* Improved code formatting and readability across several files.
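
An illustrative invocation of the new case, using the `assert_tl_matmul_correctness` helper visible in the diff below; the shapes and accumulation dtype here are placeholders rather than the committed values.

```python
def test_assert_tl_matmul_bfloat16():
    # bfloat16 inputs accumulated in float32; sizes are illustrative.
    assert_tl_matmul_correctness(256, 256, 256, "bfloat16", "float32", "float32")
```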

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.nn.functional as F
import tilelang
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.nn.functional as F
import tilelang
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.nn.functional as F
import tilelang
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# ruff: noqa
import torch
import tilelang
......
@@ -305,14 +305,14 @@ if __name__ == "__main__":
 BLOCK_N = 64 # if D_HEAD <= 128 else 32
 program = flashattn(BATCH, H, Q_CTX, KV_CTX, D_HEAD, causal, BLOCK_M, BLOCK_N)
 ref_program = partial(ref_program, causal=causal)
-mod, params = tilelang.lower(program)
-mod = tilelang.Profiler(mod, params, [5], tilelang.TensorSupplyType.Normal)
-mod.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+kernel = tilelang.compile(program, out_idx=[5])
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
 print("All checks passed!")
-latency = mod.do_bench(ref_program, warmup=500)
+latency = profiler.do_bench(ref_program, warmup=500)
 print("{:.2f} ms".format(latency))
 print("{:.2f} TFlops".format(total_flops / latency * 1e-9))
-latency = mod.do_bench(mod, n_warmup=10, n_repeat=10, profiler="tvm")
+latency = profiler.do_bench(n_warmup=10, n_repeat=10, profiler="tvm")
 print("{:.4f} ms".format(latency))
 print("{:.2f} TFlops".format(total_flops / latency * 1e-9))
......
@@ -126,9 +126,9 @@ def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="flo
 func = matmul(1024, 1024, 1024, 128, 128, 32)
 print(func) # Prints an IR-like representation of the TileLang kernel
-rt_mod, params = tilelang.lower(func)
+artifact = tilelang.lower(func)
-profiler = Profiler(rt_mod, params, result_idx=[2])
+profiler = Profiler(artifact.rt_mod, artifact.params, result_idx=[2])
 import torch
 a = torch.randn(1024, 1024).cuda().half()
......
@@ -141,7 +141,7 @@ ref_c = a @ b
 torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
 # Get CUDA Kernel Source
-print(rt_mod.imported_modules[0].get_source())
+print(artifact.kernel_source)
 ```
 ---
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.backends
from tilelang import tvm as tvm
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import tilelang
from tilelang import Profiler
import tilelang.language as T
......
@@ -47,9 +44,9 @@ func = matmul(1024, 1024, 1024, 128, 128, 32)
 print(func)
-rt_mod, params = tilelang.lower(func)
+artifact = tilelang.lower(func)
-profiler = Profiler(rt_mod, params, result_idx=[2])
+profiler = Profiler(artifact.rt_mod, artifact.params, result_idx=[2])
 import torch
......
@@ -66,4 +63,4 @@ print(ref_c)
 torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
 # Get CUDA Source
-print(rt_mod.imported_modules[0].get_source())
+print(artifact.kernel_source)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.backends
from tilelang import tvm as tvm
import tilelang.testing
from tvm import DataType
import tilelang as TL
import tilelang.language as T
from tilelang.intrinsics import get_swizzle_layout
from tilelang.intrinsics.mma_macro_generator import (
......
@@ -181,8 +177,8 @@ def tl_matmul(
 def assert_tl_matmul_correctness(M, N, K, in_dtype, out_dtype, accum_dtype):
 matmul = tl_matmul(M, N, K, in_dtype, out_dtype, accum_dtype)
-mod, params = TL.lower(matmul)
-src_code = mod.imported_modules[0].get_source()
+kernel = tilelang.compile(matmul, out_idx=[2])
+src_code = kernel.get_kernel_source()
 print(src_code)
 # src_code is the generated cuda source
 assert src_code is not None
......
@@ -203,11 +199,11 @@ def assert_tl_matmul_correctness(M, N, K, in_dtype, out_dtype, accum_dtype):
 C = torch.zeros(M, N, device="cuda", dtype=accum_dtype)
-mod = TL.Profiler(mod, params, [], TL.TensorSupplyType.Integer)
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Integer)
-mod(A, B, C)
+profiler(A, B, C)
-latency = mod.do_bench(mod.func, warmup=25)
+latency = profiler.do_bench(warmup=25)
 # Ensure that the latency is not None
 assert latency is not None
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import tilelang
import tilelang.language as T
from tvm import DataType
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import argparse
import torch
import tilelang
from tilelang import Profiler
from tilelang.autotuner import *
import tilelang.language as T
from einops import rearrange, repeat
......
@@ -244,14 +240,14 @@ if __name__ == "__main__":
 program = chunk_scan_fwd(
 batch, seq_len, chunk_size, groups, heads, dim, dstate, tune=args.tune)(
 block_M=64, block_N=64, block_K=64, block_Dstate=128, num_stages=2, threads=128)
-mod, params = tilelang.lower(program)
-mod = Profiler(mod, params, [7], tilelang.TensorSupplyType.Normal)
-mod.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+kernel = tilelang.compile(program, out_idx=[7])
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
 print("All checks pass.")
-latency = mod.do_bench(ref_program, warmup=500)
+latency = profiler.do_bench(ref_program, warmup=500)
 print("Ref: {:.2f} ms".format(latency))
 print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
-latency = mod.do_bench(mod.func, warmup=500)
+latency = profiler.do_bench(warmup=500)
 print("Tile-lang: {:.2f} ms".format(latency))
 print("Tile-lang: {:.2f} TFlops".format(total_flops / latency * 1e-9))
 else:
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import argparse
import torch
import torch.nn.functional as F
import tilelang
from tilelang import Profiler
from tilelang.autotuner import *
import tilelang.language as T
from einops import rearrange, repeat
......
@@ -181,14 +177,14 @@ if __name__ == "__main__":
 program = chunk_state_fwd(
 batch, seq_len, chunk_size, groups, heads, dim, dstate, tune=args.tune)(
 block_M=64, block_N=128, block_K=64, num_stages=4, threads=128)
-mod, params = tilelang.lower(program)
-mod = Profiler(mod, params, [4], tilelang.TensorSupplyType.Normal)
-mod.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+kernel = tilelang.compile(program, out_idx=[4])
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
 print("All checks pass.")
-latency = mod.do_bench(ref_program, warmup=500)
+latency = profiler.do_bench(ref_program, warmup=500)
 print("Ref: {:.2f} ms".format(latency))
 print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
-latency = mod.do_bench(mod.func, warmup=500)
+latency = profiler.do_bench(warmup=500)
 print("Tile-lang: {:.2f} ms".format(latency))
 print("Tile-lang: {:.2f} TFlops".format(total_flops / latency * 1e-9))
 else:
......
@@ -180,18 +180,18 @@ if __name__ == "__main__":
 BLOCK_M = 64
 BLOCK_N = 64
 program = retnet(BATCH, H, N_CTX, dim_qk, dim_v, BLOCK_M, BLOCK_N)
-mod, params = tilelang.lower(program)
-mod = tilelang.Profiler(mod, params, [4], tilelang.TensorSupplyType.Normal)
+kernel = tilelang.compile(program, out_idx=[4])
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
 ins = []
-for i in range(len(mod.params)):
-if i not in mod.result_idx:
-shape = [int(x) for x in mod.params[i].shape]
+for i in range(len(kernel.params)):
+if i not in kernel.result_idx:
+shape = [int(x) for x in kernel.params[i].shape]
 ins.append(torch.empty(shape, device="cuda", dtype=torch.float16).normal_(-0.1, 0.1))
 ref_outs = ref_program(*ins)
 torch.cuda.synchronize()
-lib_outs = mod.func(*ins)
+lib_outs = kernel(*ins)
 torch.cuda.synchronize()
 if isinstance(lib_outs, torch.Tensor):
......
@@ -210,7 +210,7 @@ if __name__ == "__main__":
 max_mismatched_ratio=0.01,
 )
-mod.assert_allclose(ref_program, rtol=0.01, atol=0.01)
-latency = mod.do_bench(mod, n_warmup=10, n_repeat=10, profiler="torch")
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+latency = profiler.do_bench(n_warmup=10, n_repeat=10, profiler="torch")
 print("tilelang: {:.2f} ms".format(latency))
 print("tilelang: {:.2f} TFlops".format(total_flops / latency * 1e-9))
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# ruff: noqa
import torch
from reference import naive_nsa_simple_inference
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import tilelang.language as T
from typing import Literal, Callable
from tvm import DataType
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import tilelang
import tilelang.language as T
# `make_mma_swizzle_layout` is a python defined layout function
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import math
import torch
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# ruff: noqa: E712
import math
import torch
......
#!/usr/bin/env bash
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# Usage:
# # Do work and commit your work.
......
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
echo "Starting installation script..."
# Step 1: Install Python requirements
......