"git@developer.sourcefind.cn:gaoqiong/migraphx.git" did not exist on "9f046d674b43da0713282ad777652da392e15452"
Commit f2e99180 authored by Lei Wang, committed by LeiWang1999

[Refactor] Phase Out LLVM Dependency by Making It Optional (#247)

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to use the new `get_profiler` method for consistent performance measurement.
- Adjusted assertions and benchmarking calls across the examples to align with the new profiling structure.
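
For reference, the before/after pattern applied across the examples looks roughly like the sketch below; `program`, `ref_program`, and the output index are placeholders standing in for each example's own definitions.

```python
import tilelang

# Before: lower to a runtime module and wrap it in a Profiler manually.
#   mod, params = tilelang.lower(program)
#   mod = tilelang.Profiler(mod, params, [out_idx], tilelang.TensorSupplyType.Normal)
#   latency = mod.do_bench(mod.func, warmup=500)

# After: compile directly to a JIT kernel and ask it for a profiler.
kernel = tilelang.compile(program, out_idx=[out_idx])
profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)  # correctness check
latency = profiler.do_bench(warmup=500)                       # benchmark the compiled kernel
```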

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* lint fix

* lint fix

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
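
A minimal sketch of the resulting flow, using only names that appear in the updated README snippets further down in this diff; `func` stands in for a TileLang kernel function.

```python
import tilelang
from tilelang import Profiler

artifact = tilelang.lower(func)   # lowering now returns an artifact object
print(artifact.kernel_source)     # generated device (CUDA/HIP) kernel source

# The runtime module and parameters remain reachable through the artifact:
profiler = Profiler(artifact.rt_mod, artifact.params, result_idx=[2])
```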

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests
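
A sketch of how the decorator is applied; the test name and body are illustrative, not the repository's actual tests.

```python
import tilelang.testing

@tilelang.testing.requires_llvm
def test_vectorize_add():
    # Skipped when the TVM build has no LLVM support, which is now optional.
    ...
```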

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
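
A hypothetical sketch of what the env.py change might look like; the directory layout and helper names here are assumptions, not the actual code.

```python
import os
import os.path as osp

# Assumed location where setup.py places the built shared libraries for the
# PyPI wheel; the real path used in env.py may differ.
PYPI_BUILD_LIB = osp.join(osp.dirname(osp.abspath(__file__)), "lib")

# Extend the search path so the bundled libtvm can be found after installation.
TVM_LIBRARY_PATH = os.pathsep.join(
    p for p in (os.environ.get("TVM_LIBRARY_PATH", ""), PYPI_BUILD_LIB) if p)
```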

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.
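
A sketch of the copy logic described above; the function name and directory arguments are illustrative.

```python
import os
import shutil

def copy_built_libs(build_dir, target_dir):
    # Ensure the target library directory exists before copying.
    if not os.path.exists(target_dir):
        os.makedirs(target_dir, exist_ok=True)
    for name in os.listdir(build_dir):
        if name.endswith(".so"):
            src = os.path.join(build_dir, name)
            shutil.copy2(src, os.path.join(target_dir, name))
            os.remove(src)  # remove the original after copying, as noted earlier
```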

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update its registration. Add the @tilelang.testing.requires_llvm decorator to the tests that require LLVM.

* lint fix

* [Enhancement] Support device code generation without compilation

- Updated `host_codegen` and `device_codegen` to include new transformations and to register `tilelang_hip_without_compile`.
- Refactored the JIT kernel adapters to accommodate separate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
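
A hypothetical sketch of the target dispatch these two notes describe; only the `tilelang_hip_without_compile` and `tilelang_cpp` names come from this change, while the function shape, the `target.build.` registry prefix, and the CUDA counterpart's name are assumptions.

```python
import tvm

def device_codegen_without_compile(device_mod, target):
    # Pick a source-only build function by target kind (assumed registry keys).
    if target.kind.name == "cuda":
        build = tvm.get_global_func("target.build.tilelang_cuda_without_compile")
    elif target.kind.name == "hip":
        build = tvm.get_global_func("target.build.tilelang_hip_without_compile")
    elif target.kind.name == "c":
        build = tvm.get_global_func("target.build.tilelang_cpp")
    else:
        raise ValueError(f"Unsupported target for source-only codegen: {target.kind.name}")
    return build(device_mod, target)
```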

* [Enhancement] Implement auto-clear cache feature based on environment variable

* Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
* Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
* Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
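
A sketch of the intended behaviour; the truthiness check and the clearing entry point are assumptions about the implementation.

```python
import os

def _clear_cache_requested() -> bool:
    # Treat "1"/"true"/"yes" (case-insensitive) as enabled; assumed convention.
    return os.environ.get("TILELANG_CLEAR_CACHE", "").lower() in ("1", "true", "yes")

if _clear_cache_requested():
    import tilelang
    tilelang.cache.clear_cache()  # assumed clearing helper; the real entry point may differ
```

In CI this can then be enabled per run, e.g. `TILELANG_CLEAR_CACHE=1 pytest ...`.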

* [Refactor] Update kernel invocation and import paths in tests and cache

* Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
* Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
* Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

* Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
* Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

* Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
* Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
* Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
* Improved code formatting and readability across several files.
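
An illustrative invocation of the new case, using the `assert_tl_matmul_correctness` helper visible in the diff below; the shapes and accumulation dtype here are placeholders rather than the committed values.

```python
def test_assert_tl_matmul_bfloat16():
    # bfloat16 inputs accumulated in float32; sizes are illustrative.
    assert_tl_matmul_correctness(256, 256, 256, "bfloat16", "float32", "float32")
```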

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.nn.functional as F
import tilelang
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.nn.functional as F
import tilelang
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.nn.functional as F
import tilelang
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# ruff: noqa
import torch
import tilelang
......
@@ -305,14 +305,14 @@ if __name__ == "__main__":
 BLOCK_N = 64 # if D_HEAD <= 128 else 32
 program = flashattn(BATCH, H, Q_CTX, KV_CTX, D_HEAD, causal, BLOCK_M, BLOCK_N)
 ref_program = partial(ref_program, causal=causal)
-mod, params = tilelang.lower(program)
-mod = tilelang.Profiler(mod, params, [5], tilelang.TensorSupplyType.Normal)
-mod.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+kernel = tilelang.compile(program, out_idx=[5])
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
 print("All checks passed!")
-latency = mod.do_bench(ref_program, warmup=500)
+latency = profiler.do_bench(ref_program, warmup=500)
 print("{:.2f} ms".format(latency))
 print("{:.2f} TFlops".format(total_flops / latency * 1e-9))
-latency = mod.do_bench(mod, n_warmup=10, n_repeat=10, profiler="tvm")
+latency = profiler.do_bench(n_warmup=10, n_repeat=10, profiler="tvm")
 print("{:.4f} ms".format(latency))
 print("{:.2f} TFlops".format(total_flops / latency * 1e-9))
......
@@ -126,9 +126,9 @@ def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="flo
 func = matmul(1024, 1024, 1024, 128, 128, 32)
 print(func) # Prints an IR-like representation of the TileLang kernel
-rt_mod, params = tilelang.lower(func)
+artifact = tilelang.lower(func)
-profiler = Profiler(rt_mod, params, result_idx=[2])
+profiler = Profiler(artifact.rt_mod, artifact.params, result_idx=[2])
 import torch
 a = torch.randn(1024, 1024).cuda().half()
......
@@ -141,7 +141,7 @@ ref_c = a @ b
 torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
 # Get CUDA Kernel Source
-print(rt_mod.imported_modules[0].get_source())
+print(artifact.kernel_source)
 ```
 ---
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.backends
from tilelang import tvm as tvm
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import tilelang
from tilelang import Profiler
import tilelang.language as T
......
@@ -47,9 +44,9 @@ func = matmul(1024, 1024, 1024, 128, 128, 32)
 print(func)
-rt_mod, params = tilelang.lower(func)
+artifact = tilelang.lower(func)
-profiler = Profiler(rt_mod, params, result_idx=[2])
+profiler = Profiler(artifact.rt_mod, artifact.params, result_idx=[2])
 import torch
......
@@ -66,4 +63,4 @@ print(ref_c)
 torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
 # Get CUDA Source
-print(rt_mod.imported_modules[0].get_source())
+print(artifact.kernel_source)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.backends
from tilelang import tvm as tvm
import tilelang.testing
from tvm import DataType
import tilelang as TL
import tilelang.language as T
from tilelang.intrinsics import get_swizzle_layout
from tilelang.intrinsics.mma_macro_generator import (
......
@@ -181,8 +177,8 @@ def tl_matmul(
 def assert_tl_matmul_correctness(M, N, K, in_dtype, out_dtype, accum_dtype):
 matmul = tl_matmul(M, N, K, in_dtype, out_dtype, accum_dtype)
-mod, params = TL.lower(matmul)
-src_code = mod.imported_modules[0].get_source()
+kernel = tilelang.compile(matmul, out_idx=[2])
+src_code = kernel.get_kernel_source()
 print(src_code)
 # src_code is the generated cuda source
 assert src_code is not None
......
@@ -203,11 +199,11 @@ def assert_tl_matmul_correctness(M, N, K, in_dtype, out_dtype, accum_dtype):
 C = torch.zeros(M, N, device="cuda", dtype=accum_dtype)
-mod = TL.Profiler(mod, params, [], TL.TensorSupplyType.Integer)
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Integer)
-mod(A, B, C)
+profiler(A, B, C)
-latency = mod.do_bench(mod.func, warmup=25)
+latency = profiler.do_bench(warmup=25)
 # Ensure that the latency is not None
 assert latency is not None
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import tilelang
import tilelang.language as T
from tvm import DataType
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import argparse
import torch
import tilelang
from tilelang import Profiler
from tilelang.autotuner import *
import tilelang.language as T
from einops import rearrange, repeat
......
@@ -244,14 +240,14 @@ if __name__ == "__main__":
 program = chunk_scan_fwd(
 batch, seq_len, chunk_size, groups, heads, dim, dstate, tune=args.tune)(
 block_M=64, block_N=64, block_K=64, block_Dstate=128, num_stages=2, threads=128)
-mod, params = tilelang.lower(program)
-mod = Profiler(mod, params, [7], tilelang.TensorSupplyType.Normal)
-mod.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+kernel = tilelang.compile(program, out_idx=[7])
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
 print("All checks pass.")
-latency = mod.do_bench(ref_program, warmup=500)
+latency = profiler.do_bench(ref_program, warmup=500)
 print("Ref: {:.2f} ms".format(latency))
 print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
-latency = mod.do_bench(mod.func, warmup=500)
+latency = profiler.do_bench(warmup=500)
 print("Tile-lang: {:.2f} ms".format(latency))
 print("Tile-lang: {:.2f} TFlops".format(total_flops / latency * 1e-9))
 else:
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import argparse
import torch
import torch.nn.functional as F
import tilelang
from tilelang import Profiler
from tilelang.autotuner import *
import tilelang.language as T
from einops import rearrange, repeat
......
@@ -181,14 +177,14 @@ if __name__ == "__main__":
 program = chunk_state_fwd(
 batch, seq_len, chunk_size, groups, heads, dim, dstate, tune=args.tune)(
 block_M=64, block_N=128, block_K=64, num_stages=4, threads=128)
-mod, params = tilelang.lower(program)
-mod = Profiler(mod, params, [4], tilelang.TensorSupplyType.Normal)
-mod.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+kernel = tilelang.compile(program, out_idx=[4])
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
 print("All checks pass.")
-latency = mod.do_bench(ref_program, warmup=500)
+latency = profiler.do_bench(ref_program, warmup=500)
 print("Ref: {:.2f} ms".format(latency))
 print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
-latency = mod.do_bench(mod.func, warmup=500)
+latency = profiler.do_bench(warmup=500)
 print("Tile-lang: {:.2f} ms".format(latency))
 print("Tile-lang: {:.2f} TFlops".format(total_flops / latency * 1e-9))
 else:
......
@@ -180,18 +180,18 @@ if __name__ == "__main__":
 BLOCK_M = 64
 BLOCK_N = 64
 program = retnet(BATCH, H, N_CTX, dim_qk, dim_v, BLOCK_M, BLOCK_N)
-mod, params = tilelang.lower(program)
-mod = tilelang.Profiler(mod, params, [4], tilelang.TensorSupplyType.Normal)
+kernel = tilelang.compile(program, out_idx=[4])
+profiler = kernel.get_profiler(tilelang.TensorSupplyType.Normal)
 ins = []
-for i in range(len(mod.params)):
-if i not in mod.result_idx:
-shape = [int(x) for x in mod.params[i].shape]
+for i in range(len(kernel.params)):
+if i not in kernel.result_idx:
+shape = [int(x) for x in kernel.params[i].shape]
 ins.append(torch.empty(shape, device="cuda", dtype=torch.float16).normal_(-0.1, 0.1))
 ref_outs = ref_program(*ins)
 torch.cuda.synchronize()
-lib_outs = mod.func(*ins)
+lib_outs = kernel(*ins)
 torch.cuda.synchronize()
 if isinstance(lib_outs, torch.Tensor):
......
@@ -210,7 +210,7 @@ if __name__ == "__main__":
 max_mismatched_ratio=0.01,
 )
-mod.assert_allclose(ref_program, rtol=0.01, atol=0.01)
-latency = mod.do_bench(mod, n_warmup=10, n_repeat=10, profiler="torch")
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+latency = profiler.do_bench(n_warmup=10, n_repeat=10, profiler="torch")
 print("tilelang: {:.2f} ms".format(latency))
 print("tilelang: {:.2f} TFlops".format(total_flops / latency * 1e-9))
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# ruff: noqa
import torch
from reference import naive_nsa_simple_inference
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import tilelang.language as T
from typing import Literal, Callable
from tvm import DataType
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import tilelang
import tilelang.language as T
# `make_mma_swizzle_layout` is a python defined layout function
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import math
import torch
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# ruff: noqa: E712
import math
import torch
......
#!/usr/bin/env bash
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# Usage:
# # Do work and commit your work.
......
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
echo "Starting installation script..."
# Step 1: Install Python requirements
......