Unverified commit 37bbfc76, authored by Tim Moon, committed by GitHub

Refactor build system (#235)



* Refactor Setuptools build system

Successfully launches CMake install, but installs CMake extensions in temp dir.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug JAX build

Fix pybind11 import. Distinguish between build-time and run-time dependencies.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add helper function to determine dependencies
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add missing license
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug case where system CMake is too old
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add missing license
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Simplify sanity import tests

Just importing modules provides richer error messages.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Properly install submodules
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Install helper library for TensorFlow
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update documentation
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Do not install Ninja by default
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Include Git commit hash in version string
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Override build_ext.build_extensions instead of build_ext.run
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix incorrect include path

Restore Ninja dependency. Restore overriding build_ext.run func.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestions from @nouiz
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable parallel Ninja jobs in GitHub actions
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Properly install userbuffers lib
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak install docs

Review suggestion from @ksivaman
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add examples for specifying framework in docs
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
parent 215dfe7e
@@ -22,7 +22,7 @@ jobs:
       - name: 'Build'
         run: |
           mkdir -p wheelhouse && \
-          NVTE_FRAMEWORK=pytorch pip wheel -w wheelhouse . -v
+          NVTE_FRAMEWORK=pytorch MAX_JOBS=1 pip wheel -w wheelhouse . -v
       - name: 'Upload wheel'
         uses: actions/upload-artifact@v3
         with:
@@ -47,7 +47,6 @@ jobs:
           submodules: recursive
       - name: 'Build'
         run: |
-          pip install ninja pybind11 && \
           pip install --upgrade "jax[cuda12_local]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html && \
           mkdir -p wheelhouse && \
           NVTE_FRAMEWORK=jax pip wheel -w wheelhouse . -v
@@ -74,7 +73,6 @@ jobs:
           submodules: recursive
       - name: 'Build'
         run: |
-          pip install ninja pybind11 && \
           mkdir -p wheelhouse && \
           NVTE_FRAMEWORK=tensorflow pip wheel -w wheelhouse . -v
       - name: 'Upload wheel'
...
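The MAX_JOBS=1 added to the PyTorch wheel job caps compile parallelism, matching the commit note about disabling parallel Ninja jobs in GitHub Actions; PyTorch's C++ extension builder reads MAX_JOBS when launching compile jobs, which keeps memory-limited CI runners from being oversubscribed. A minimal local reproduction of the job, assuming a checkout with submodules and a working CUDA toolchain:

    # Serialize compilation to bound peak memory use
    mkdir -p wheelhouse
    NVTE_FRAMEWORK=pytorch MAX_JOBS=1 pip wheel -w wheelhouse . -v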
@@ -34,12 +34,9 @@ pip - from GitHub

 Additional Prerequisites
 ^^^^^^^^^^^^^^^^^^^^^^^^

-1. `CMake <https://cmake.org/>`__ version 3.18 or later: `pip install cmake`.
-2. [For pyTorch support] `pyTorch <https://pytorch.org/>`__ with GPU support.
-3. [For JAX support] `JAX <https://github.com/google/jax/>`__ with GPU support, version >= 0.4.7.
-4. [For TensorFlow support] `TensorFlow <https://www.tensorflow.org/>`__ with GPU support.
-5. `pybind11`: `pip install pybind11`.
-6. [Optional] `Ninja <https://ninja-build.org/>`__: `pip install ninja`.
+1. [For PyTorch support] `PyTorch <https://pytorch.org/>`__ with GPU support.
+2. [For JAX support] `JAX <https://github.com/google/jax/>`__ with GPU support, version >= 0.4.7.
+3. [For TensorFlow support] `TensorFlow <https://www.tensorflow.org/>`__ with GPU support.

 Installation (stable release)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -48,11 +45,9 @@ Execute the following command to install the latest stable version of Transformer Engine:

 .. code-block:: bash

-    # Execute one of the following commands
-    NVTE_FRAMEWORK=pytorch pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable     # Build TE for PyTorch only. The default.
-    NVTE_FRAMEWORK=jax pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable         # Build TE for JAX only.
-    NVTE_FRAMEWORK=tensorflow pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable  # Build TE for TensorFlow only.
-    NVTE_FRAMEWORK=all pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable         # Build TE for all supported frameworks.
+    pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
+
+This will automatically detect if any supported deep learning frameworks are installed and build Transformer Engine support for them. To explicitly specify frameworks, set the environment variable `NVTE_FRAMEWORK` to a comma-separated list (e.g. `NVTE_FRAMEWORK=jax,tensorflow`).

 Installation (development build)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -67,12 +62,10 @@ Execute the following command to install the latest development build of Transformer Engine:

 .. code-block:: bash

-    # Execute one of the following commands
-    NVTE_FRAMEWORK=pytorch pip install git+https://github.com/NVIDIA/TransformerEngine.git@main     # Build TE for PyTorch only. The default.
-    NVTE_FRAMEWORK=jax pip install git+https://github.com/NVIDIA/TransformerEngine.git@main         # Build TE for JAX only.
-    NVTE_FRAMEWORK=tensorflow pip install git+https://github.com/NVIDIA/TransformerEngine.git@main  # Build TE for TensorFlow only.
-    NVTE_FRAMEWORK=all pip install git+https://github.com/NVIDIA/TransformerEngine.git@main         # Build TE for all supported frameworks.
+    pip install git+https://github.com/NVIDIA/TransformerEngine.git@main
+
+This will automatically detect if any supported deep learning frameworks are installed and build Transformer Engine support for them. To explicitly specify frameworks, set the environment variable `NVTE_FRAMEWORK` to a comma-separated list (e.g. `NVTE_FRAMEWORK=jax,tensorflow`).

 Installation (from source)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -80,14 +73,27 @@ Execute the following commands to install Transformer Engine from source:

 .. code-block:: bash

-    git clone --recursive https://github.com/NVIDIA/TransformerEngine.git  # Clone the repository/fork and checkout all submodules recursively.
-    cd TransformerEngine           # Enter TE directory.
-    git checkout stable            # Checkout the correct branch.
-    export NVTE_FRAMEWORK=pytorch  # Optionally set the framework.
-    pip install .                  # Build and install
+    # Clone repository, checkout stable branch, clone submodules
+    git clone --branch stable --recursive https://github.com/NVIDIA/TransformerEngine.git
+
+    cd TransformerEngine
+    export NVTE_FRAMEWORK=pytorch   # Optionally set framework
+    pip install .                   # Build and install

-For already cloned repos, run the following command in TE directory:
+If the Git repository has already been cloned, make sure to also clone the submodules:

 .. code-block:: bash

-    git submodule update --init --recursive  # Checkout all submodules recursively.
+    git submodule update --init --recursive
+
+Extra dependencies for testing can be installed by setting the "test" option:
+
+.. code-block:: bash
+
+    pip install .[test]
+
+To build the C++ extensions with debug symbols, e.g. with the `-g` flag:
+
+.. code-block:: bash
+
+    pip install . --global-option=--debug
...
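Following the NVTE_FRAMEWORK convention described above, an explicit multi-framework build and the new auto-detecting default look like this:

    # Build for an explicit set of frameworks (comma-separated list)
    NVTE_FRAMEWORK=jax,tensorflow pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

    # Or build for whichever supported frameworks are installed (the new default)
    pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable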
@@ -2,433 +2,517 @@
 #
 # See LICENSE for license information.

-import atexit
-import os
-import sys
-import subprocess
-import io
-import re
-import copy
-import tempfile
-from pkg_resources import packaging
-from setuptools import setup, find_packages, Extension
-from setuptools.command.build_ext import build_ext
-from shutil import copyfile
-
-path = os.path.dirname(os.path.realpath(__file__))
-with open(path + "/VERSION", "r") as f:
-    te_version = f.readline()
-
-CUDA_HOME = os.environ.get("CUDA_HOME", "/usr/local/cuda")
-NVTE_WITH_USERBUFFERS = int(os.environ.get("NVTE_WITH_USERBUFFERS", "0"))
-if NVTE_WITH_USERBUFFERS:
-    MPI_HOME = os.environ.get("MPI_HOME", "")
-    assert MPI_HOME, "MPI_HOME must be set if NVTE_WITH_USERBUFFERS=1"
-
-def get_cuda_bare_metal_version(cuda_dir):
-    raw_output = subprocess.check_output(
-        [cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True
-    )
-    output = raw_output.split()
-    release_idx = output.index("release") + 1
-    release = output[release_idx].split(".")
-    bare_metal_major = release[0]
-    bare_metal_minor = release[1][0]
-    return (int(bare_metal_major), int(bare_metal_minor))
-
-def append_nvcc_threads(nvcc_extra_args):
-    cuda_major, cuda_minor = get_cuda_bare_metal_version(CUDA_HOME)
-    if cuda_major >= 11 and cuda_minor >= 2:
-        return nvcc_extra_args + ["--threads", "4"]
-    return nvcc_extra_args
-
-def extra_gencodes(cc_flag):
-    cuda_bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
-    if cuda_bare_metal_version >= (11, 0):
-        cc_flag.append("-gencode")
-        cc_flag.append("arch=compute_80,code=sm_80")
-    if cuda_bare_metal_version >= (11, 8):
-        cc_flag.append("-gencode")
-        cc_flag.append("arch=compute_90,code=sm_90")
-
-def extra_compiler_flags():
-    extra_flags = [
-        "-O3",
-        "-gencode",
-        "arch=compute_70,code=sm_70",
-        "-U__CUDA_NO_HALF_OPERATORS__",
-        "-U__CUDA_NO_HALF_CONVERSIONS__",
-        "-U__CUDA_NO_BFLOAT16_OPERATORS__",
-        "-U__CUDA_NO_BFLOAT16_CONVERSIONS__",
-        "-U__CUDA_NO_BFLOAT162_OPERATORS__",
-        "-U__CUDA_NO_BFLOAT162_CONVERSIONS__",
-        "-I./transformer_engine/common/layer_norm/",
-        "--expt-relaxed-constexpr",
-        "--expt-extended-lambda",
-        "--use_fast_math",
-    ]
-    if NVTE_WITH_USERBUFFERS:
-        extra_flags.append("-DNVTE_WITH_USERBUFFERS")
-    return extra_flags
-
-cc_flag = []
-extra_gencodes(cc_flag)
-
-def make_abs_path(l):
-    return [os.path.join(path, p) for p in l]
-
-pytorch_sources = [
-    "transformer_engine/pytorch/csrc/extensions.cu",
-    "transformer_engine/pytorch/csrc/common.cu",
-    "transformer_engine/pytorch/csrc/ts_fp8_op.cpp",
-]
-pytorch_sources = make_abs_path(pytorch_sources)
-
-all_sources = pytorch_sources
-
-supported_frameworks = {
-    "all": all_sources,
-    "pytorch": pytorch_sources,
-    "jax": None,  # JAX use transformer_engine/CMakeLists.txt
-    "tensorflow": None,  # tensorflow use transformer_engine/CMakeLists.txt
-}
-
-framework = os.environ.get("NVTE_FRAMEWORK", "pytorch")
-
-include_dirs = [
-    "transformer_engine/common/include",
-    "transformer_engine/pytorch/csrc",
-    "3rdparty/cudnn-frontend/include",
-]
-if NVTE_WITH_USERBUFFERS:
-    if MPI_HOME:
-        include_dirs.append(os.path.join(MPI_HOME, "include"))
-include_dirs = make_abs_path(include_dirs)
-
-args = sys.argv.copy()
-for s in args:
-    if s.startswith("--framework="):
-        framework = s.replace("--framework=", "")
-        sys.argv.remove(s)
-if framework not in supported_frameworks.keys():
-    raise ValueError("Unsupported framework " + framework)
-
-class CMakeExtension(Extension):
-    def __init__(self, name, cmake_path, sources, **kwargs):
-        super(CMakeExtension, self).__init__(name, sources=sources, **kwargs)
-        self.cmake_path = cmake_path
-
-class FrameworkBuilderBase:
-    def __init__(self, *args, **kwargs) -> None:
-        pass
-
-    def cmake_flags(self):
-        return []
-
-    def initialize_options(self):
-        pass
-
-    def finalize_options(self):
-        pass
-
-    def run(self, extensions):
-        pass
-
-    @staticmethod
-    def install_requires():
-        return []
-
-class PyTorchBuilder(FrameworkBuilderBase):
-    def __init__(self, *args, **kwargs) -> None:
-        pytorch_args = copy.deepcopy(args)
-        pytorch_kwargs = copy.deepcopy(kwargs)
-        from torch.utils.cpp_extension import BuildExtension
-        self.pytorch_build_extensions = BuildExtension(*pytorch_args, **pytorch_kwargs)
-
-    def initialize_options(self):
-        self.pytorch_build_extensions.initialize_options()
-
-    def finalize_options(self):
-        self.pytorch_build_extensions.finalize_options()
-
-    def run(self, extensions):
-        other_ext = [
-            ext for ext in extensions if not isinstance(ext, CMakeExtension)
-        ]
-        self.pytorch_build_extensions.extensions = other_ext
-        print("Building pyTorch extensions!")
-        self.pytorch_build_extensions.run()
-
-    def cmake_flags(self):
-        return []
-
-    @staticmethod
-    def install_requires():
-        return ["flash-attn>=1.0.2"]
-
-class TensorFlowBuilder(FrameworkBuilderBase):
-    def cmake_flags(self):
-        p = [d for d in sys.path if 'dist-packages' in d][0]
-        return ["-DENABLE_TENSORFLOW=ON", "-DCMAKE_PREFIX_PATH="+p]
-
-    def run(self, extensions):
-        print("Building TensorFlow extensions!")
-
-class JaxBuilder(FrameworkBuilderBase):
-    def cmake_flags(self):
-        p = [d for d in sys.path if 'dist-packages' in d][0]
-        return ["-DENABLE_JAX=ON", "-DCMAKE_PREFIX_PATH="+p]
-
-    def run(self, extensions):
-        print("Building jax extensions!")
-
-    def install_requires():
-        # TODO: find a way to install pybind11 and ninja directly.
-        return ['cmake', 'flax']
-
-ext_modules = []
-dlfw_builder_funcs = []
-
-ext_modules.append(
-    CMakeExtension(
-        name="transformer_engine",
-        cmake_path=os.path.join(path, "transformer_engine"),
-        sources=[],
-        include_dirs=include_dirs,
-    )
-)
-
-if framework in ("all", "pytorch"):
-    from torch.utils.cpp_extension import CUDAExtension
-    ext_modules.append(
-        CUDAExtension(
-            name="transformer_engine_extensions",
-            sources=supported_frameworks[framework],
-            extra_compile_args={
-                "cxx": ["-O3"],
-                "nvcc": append_nvcc_threads(extra_compiler_flags() + cc_flag),
-            },
-            include_dirs=include_dirs,
-        )
-    )
-    dlfw_builder_funcs.append(PyTorchBuilder)
-
-if framework in ("all", "jax"):
-    dlfw_builder_funcs.append(JaxBuilder)
-    # Trigger a better error when pybind11 isn't present.
-    # Sadly, if pybind11 was installed with `apt -y install pybind11-dev`
-    # This doesn't install a python packages. So the line bellow is too strict.
-    # When it fail, we need to detect if cmake will find pybind11.
-    # import pybind11
-
-if framework in ("all", "tensorflow"):
-    dlfw_builder_funcs.append(TensorFlowBuilder)
-
-dlfw_install_requires = ['pydantic']
-for builder in dlfw_builder_funcs:
-    dlfw_install_requires = dlfw_install_requires + builder.install_requires()
-
-def get_cmake_bin():
-    cmake_bin = "cmake"
-    try:
-        out = subprocess.check_output([cmake_bin, "--version"])
-    except OSError:
-        cmake_installed_version = packaging.version.Version("0.0")
-    else:
-        cmake_installed_version = packaging.version.Version(
-            re.search(r"version\s*([\d.]+)", out.decode()).group(1)
-        )
-
-    if cmake_installed_version < packaging.version.Version("3.18.0"):
-        print(
-            "Could not find a recent CMake to build Transformer Engine. "
-            "Attempting to install CMake 3.18 to a temporary location via pip.",
-            flush=True,
-        )
-        cmake_temp_dir = tempfile.TemporaryDirectory(prefix="nvte-cmake-tmp")
-        atexit.register(cmake_temp_dir.cleanup)
-        try:
-            _ = subprocess.check_output(
-                ["pip", "install", "--target", cmake_temp_dir.name, "cmake~=3.18.0"]
-            )
-        except Exception:
-            raise RuntimeError(
-                "Failed to install temporary CMake. "
-                "Please update your CMake to 3.18+."
-            )
-        cmake_bin = os.path.join(cmake_temp_dir.name, "bin", "run_cmake")
-        with io.open(cmake_bin, "w") as f_run_cmake:
-            f_run_cmake.write(
-                f"#!/bin/sh\nPYTHONPATH={cmake_temp_dir.name} {os.path.join(cmake_temp_dir.name, 'bin', 'cmake')} \"$@\""
-            )
-        os.chmod(cmake_bin, 0o755)
-
-    return cmake_bin
-
-class CMakeBuildExtension(build_ext, object):
-    def __init__(self, *args, **kwargs) -> None:
-        self.dlfw_flags = kwargs["dlfw_flags"]
-        super(CMakeBuildExtension, self).__init__(*args, **kwargs)
-
-    def build_extensions(self) -> None:
-        print("Building CMake extensions!")
-
-        cmake_bin = get_cmake_bin()
-        config = "Debug" if self.debug else "Release"
-
-        ext_name = self.extensions[0].name
-        build_dir = self.get_ext_fullpath(ext_name).replace(
-            self.get_ext_filename(ext_name), ""
-        )
-        build_dir = os.path.abspath(build_dir)
-        cmake_args = [
-            "-DCMAKE_BUILD_TYPE=" + config,
-            "-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_{}={}".format(config.upper(), build_dir),
-        ]
-        try:
-            import ninja
-        except ImportError:
-            pass
-        else:
-            cmake_args.append("-GNinja")
-
-        cmake_args = cmake_args + self.dlfw_flags
-
-        cmake_build_args = ["--config", config]
-
-        cmake_build_dir = os.path.join(self.build_temp, config)
-        if not os.path.exists(cmake_build_dir):
-            os.makedirs(cmake_build_dir)
-
-        config_and_build_commands = [
-            [cmake_bin, self.extensions[0].cmake_path] + cmake_args,
-            [cmake_bin, "--build", "."] + cmake_build_args,
-        ]
-
-        if True:
-            print(f"Running CMake in {cmake_build_dir}:")
-            for command in config_and_build_commands:
-                print(" ".join(command))
-            sys.stdout.flush()
-
-        # Config and build the extension
-        try:
-            for command in config_and_build_commands:
-                subprocess.check_call(command, cwd=cmake_build_dir)
-        except OSError as e:
-            raise RuntimeError("CMake failed: {}".format(str(e)))
-
-class TEBuildExtension(build_ext, object):
-    def __init__(self, *args, **kwargs) -> None:
-        self.dlfw_builder = []
-        for functor in dlfw_builder_funcs:
-            self.dlfw_builder.append(functor(*args, **kwargs))
-
-        flags = []
-        if NVTE_WITH_USERBUFFERS:
-            flags.append('-DNVTE_WITH_USERBUFFERS=ON')
-        for builder in self.dlfw_builder:
-            flags = flags + builder.cmake_flags()
-
-        cmake_args = copy.deepcopy(args)
-        cmake_kwargs = copy.deepcopy(kwargs)
-        cmake_kwargs["dlfw_flags"] = flags
-        self.cmake_build_extensions = CMakeBuildExtension(*cmake_args, **cmake_kwargs)
-
-        self.all_outputs = None
-        super(TEBuildExtension, self).__init__(*args, **kwargs)
-
-    def initialize_options(self):
-        self.cmake_build_extensions.initialize_options()
-        for builder in self.dlfw_builder:
-            builder.initialize_options()
-        super(TEBuildExtension, self).initialize_options()
-
-    def finalize_options(self):
-        self.cmake_build_extensions.finalize_options()
-        for builder in self.dlfw_builder:
-            builder.finalize_options()
-        super(TEBuildExtension, self).finalize_options()
-
-    def run(self) -> None:
-        old_inplace, self.inplace = self.inplace, 0
-        cmake_ext = [ext for ext in self.extensions if isinstance(ext, CMakeExtension)]
-        self.cmake_build_extensions.extensions = cmake_ext
-        self.cmake_build_extensions.run()
-
-        for builder in self.dlfw_builder:
-            builder.run(self.extensions)
-
-        self.all_outputs = []
-        for f in os.scandir(self.build_lib):
-            if f.is_file():
-                self.all_outputs.append(f.path)
-
-        self.inplace = old_inplace
-        if old_inplace:
-            self.copy_extensions_to_source()
-
-    def copy_extensions_to_source(self):
-        ext = self.extensions[0]
-        build_py = self.get_finalized_command("build_py")
-        fullname = self.get_ext_fullname(ext.name)
-        modpath = fullname.split(".")
-        package = ".".join(modpath[:-1])
-        package_dir = build_py.get_package_dir(package)
-
-        for f in os.scandir(self.build_lib):
-            if f.is_file():
-                src_filename = f.path
-                dest_filename = os.path.join(
-                    package_dir, os.path.basename(src_filename)
-                )
-                # Always copy, even if source is older than destination, to ensure
-                # that the right extensions for the current Python/platform are
-                # used.
-                copyfile(src_filename, dest_filename)
-
-    def get_outputs(self):
-        return self.all_outputs
-
-setup(
-    name="transformer_engine",
-    version=te_version,
-    packages=find_packages(
-        exclude=(
-            "build",
-            "csrc",
-            "include",
-            "tests",
-            "dist",
-            "docs",
-            "tests",
-            "examples",
-            "transformer_engine.egg-info",
-        )
-    ),
-    description="Transformer acceleration library",
-    ext_modules=ext_modules,
-    cmdclass={"build_ext": TEBuildExtension},
-    install_requires=dlfw_install_requires,
-    extras_require={
-        'test': ['pytest',
-                 'tensorflow_datasets'],
-        'test_pytest': ['onnxruntime',],
-    },
-    license_files=("LICENSE",),
-)
+
+from functools import lru_cache
+import os
+from pathlib import Path
+import re
+import shutil
+import subprocess
+from subprocess import CalledProcessError
+import sys
+import tempfile
+from typing import List, Optional, Tuple, Union
+
+import setuptools
+from setuptools.command.build_ext import build_ext
+
+# Project directory root
+root_path: Path = Path(__file__).resolve().parent
+
+@lru_cache(maxsize=1)
+def te_version() -> str:
+    """Transformer Engine version string
+
+    Includes Git commit as local version, unless suppressed with
+    NVTE_NO_LOCAL_VERSION environment variable.
+
+    """
+    with open(root_path / "VERSION", "r") as f:
+        version = f.readline().strip()
+    if not int(os.getenv("NVTE_NO_LOCAL_VERSION", "0")):
+        try:
+            output = subprocess.run(
+                ["git", "rev-parse", "--short", "HEAD"],
+                capture_output=True,
+                cwd=root_path,
+                check=True,
+                universal_newlines=True,
+            )
+        except (CalledProcessError, OSError):
+            pass
+        else:
+            commit = output.stdout.strip()
+            version += f"+{commit}"
+    return version
+
+@lru_cache(maxsize=1)
+def with_debug_build() -> bool:
+    """Whether to build with a debug configuration"""
+    for arg in sys.argv:
+        if arg == "--debug":
+            sys.argv.remove(arg)
+            return True
+    if int(os.getenv("NVTE_BUILD_DEBUG", "0")):
+        return True
+    return False
+
+# Call once in global scope since this function manipulates the
+# command-line arguments. Future calls will use a cached value.
+with_debug_build()
+
+def found_cmake() -> bool:
+    """Check if valid CMake is available
+
+    CMake 3.18 or newer is required.
+
+    """
+
+    # Check if CMake is available
+    try:
+        _cmake_bin = cmake_bin()
+    except FileNotFoundError:
+        return False
+
+    # Query CMake for version info
+    output = subprocess.run(
+        [_cmake_bin, "--version"],
+        capture_output=True,
+        check=True,
+        universal_newlines=True,
+    )
+    match = re.search(r"version\s*([\d.]+)", output.stdout)
+    version = match.group(1).split('.')
+    version = tuple(int(v) for v in version)
+    return version >= (3, 18)
+
+def cmake_bin() -> Path:
+    """Get CMake executable
+
+    Throws FileNotFoundError if not found.
+
+    """
+
+    # Search in CMake Python package
+    _cmake_bin: Optional[Path] = None
+    try:
+        import cmake
+    except ImportError:
+        pass
+    else:
+        cmake_dir = Path(cmake.__file__).resolve().parent
+        _cmake_bin = cmake_dir / "data" / "bin" / "cmake"
+        if not _cmake_bin.is_file():
+            _cmake_bin = None
+
+    # Search in path
+    if _cmake_bin is None:
+        _cmake_bin = shutil.which("cmake")
+        if _cmake_bin is not None:
+            _cmake_bin = Path(_cmake_bin).resolve()
+
+    # Return executable if found
+    if _cmake_bin is None:
+        raise FileNotFoundError("Could not find CMake executable")
+    return _cmake_bin
+
+def found_ninja() -> bool:
+    """Check if Ninja is available"""
+    return shutil.which("ninja") is not None
+
+def found_pybind11() -> bool:
+    """Check if pybind11 is available"""
+
+    # Check if Python package is installed
+    try:
+        import pybind11
+    except ImportError:
+        pass
+    else:
+        return True
+
+    # Check if CMake can find pybind11
+    if not found_cmake():
+        return False
+    try:
+        subprocess.run(
+            [
+                "cmake",
+                "--find-package",
+                "-DMODE=EXIST",
+                "-DNAME=pybind11",
+                "-DCOMPILER_ID=CXX",
+                "-DLANGUAGE=CXX",
+            ],
+            stdout=subprocess.DEVNULL,
+            stderr=subprocess.DEVNULL,
+            check=True,
+        )
+    except (CalledProcessError, OSError):
+        pass
+    else:
+        return True
+    return False
+
+def cuda_version() -> Tuple[int, ...]:
+    """CUDA Toolkit version as a (major, minor) tuple
+
+    Throws FileNotFoundError if NVCC is not found.
+
+    """
+
+    # Try finding NVCC
+    nvcc_bin: Optional[Path] = None
+    if nvcc_bin is None and os.getenv("CUDA_HOME"):
+        # Check in CUDA_HOME
+        cuda_home = Path(os.getenv("CUDA_HOME"))
+        nvcc_bin = cuda_home / "bin" / "nvcc"
+    if nvcc_bin is None:
+        # Check if nvcc is in path
+        nvcc_bin = shutil.which("nvcc")
+        if nvcc_bin is not None:
+            nvcc_bin = Path(nvcc_bin)
+    if nvcc_bin is None:
+        # Last-ditch guess in /usr/local/cuda
+        cuda_home = Path("/usr/local/cuda")
+        nvcc_bin = cuda_home / "bin" / "nvcc"
+    if not nvcc_bin.is_file():
+        raise FileNotFoundError(f"Could not find NVCC at {nvcc_bin}")
+
+    # Query NVCC for version info
+    output = subprocess.run(
+        [nvcc_bin, "-V"],
+        capture_output=True,
+        check=True,
+        universal_newlines=True,
+    )
+    match = re.search(r"release\s*([\d.]+)", output.stdout)
+    version = match.group(1).split('.')
+    return tuple(int(v) for v in version)
+
+@lru_cache(maxsize=1)
+def with_userbuffers() -> bool:
+    """Check if userbuffers support is enabled"""
+    if int(os.getenv("NVTE_WITH_USERBUFFERS", "0")):
+        assert os.getenv("MPI_HOME"), \
+            "MPI_HOME must be set if NVTE_WITH_USERBUFFERS=1"
+        return True
+    return False
+
+@lru_cache(maxsize=1)
+def frameworks() -> List[str]:
+    """DL frameworks to build support for"""
+    _frameworks: List[str] = []
+    supported_frameworks = ["pytorch", "jax", "tensorflow"]
+
+    # Check environment variable
+    if os.getenv("NVTE_FRAMEWORK"):
+        _frameworks.extend(os.getenv("NVTE_FRAMEWORK").split(","))
+
+    # Check command-line arguments
+    for arg in sys.argv.copy():
+        if arg.startswith("--framework="):
+            _frameworks.extend(arg.replace("--framework=", "").split(","))
+            sys.argv.remove(arg)
+
+    # Detect installed frameworks if not explicitly specified
+    if not _frameworks:
+        try:
+            import torch
+        except ImportError:
+            pass
+        else:
+            _frameworks.append("pytorch")
+        try:
+            import jax
+        except ImportError:
+            pass
+        else:
+            _frameworks.append("jax")
+        try:
+            import tensorflow
+        except ImportError:
+            pass
+        else:
+            _frameworks.append("tensorflow")
+
+    # Special framework names
+    if "all" in _frameworks:
+        _frameworks = supported_frameworks.copy()
+    if "none" in _frameworks:
+        _frameworks = []
+
+    # Check that frameworks are valid
+    _frameworks = [framework.lower() for framework in _frameworks]
+    for framework in _frameworks:
+        if framework not in supported_frameworks:
+            raise ValueError(
+                f"Transformer Engine does not support framework={framework}"
+            )
+
+    return _frameworks
+
+# Call once in global scope since this function manipulates the
+# command-line arguments. Future calls will use a cached value.
+frameworks()
+
+def setup_requirements() -> Tuple[List[str], List[str], List[str]]:
+    """Setup Python dependencies
+
+    Returns dependencies for build, runtime, and testing.
+
+    """
+
+    # Common requirements
+    setup_reqs: List[str] = []
+    install_reqs: List[str] = ["pydantic"]
+    test_reqs: List[str] = ["pytest"]
+
+    def add_unique(l: List[str], vals: Union[str, List[str]]) -> None:
+        """Add entry to list if not already included"""
+        if isinstance(vals, str):
+            vals = [vals]
+        for val in vals:
+            if val not in l:
+                l.append(val)
+
+    # Requirements that may be installed outside of Python
+    if not found_cmake():
+        add_unique(setup_reqs, "cmake>=3.18")
+    if not found_ninja():
+        add_unique(setup_reqs, "ninja")
+
+    # Framework-specific requirements
+    if "pytorch" in frameworks():
+        add_unique(install_reqs, ["torch", "flash-attn>=1.0.2"])
+        add_unique(test_reqs, ["numpy", "onnxruntime", "torchvision"])
+    if "jax" in frameworks():
+        if not found_pybind11():
+            add_unique(setup_reqs, "pybind11")
+        add_unique(install_reqs, ["jax", "flax"])
+        add_unique(test_reqs, ["numpy", "praxis"])
+    if "tensorflow" in frameworks():
+        if not found_pybind11():
+            add_unique(setup_reqs, "pybind11")
+        add_unique(install_reqs, "tensorflow")
+        add_unique(test_reqs, ["keras", "tensorflow_datasets"])
+
+    return setup_reqs, install_reqs, test_reqs
+
+class CMakeExtension(setuptools.Extension):
+    """CMake extension module"""
+
+    def __init__(
+        self,
+        name: str,
+        cmake_path: Path,
+        cmake_flags: Optional[List[str]] = None,
+    ) -> None:
+        super().__init__(name, sources=[])  # No work for base class
+        self.cmake_path: Path = cmake_path
+        self.cmake_flags: List[str] = [] if cmake_flags is None else cmake_flags
+
+    def _build_cmake(self, build_dir: Path, install_dir: Path) -> None:
+
+        # Make sure paths are str
+        _cmake_bin = str(cmake_bin())
+        cmake_path = str(self.cmake_path)
+        build_dir = str(build_dir)
+        install_dir = str(install_dir)
+
+        # CMake configure command
+        build_type = "Debug" if with_debug_build() else "Release"
+        configure_command = [
+            _cmake_bin,
+            "-S",
+            cmake_path,
+            "-B",
+            build_dir,
+            f"-DCMAKE_BUILD_TYPE={build_type}",
+            f"-DCMAKE_INSTALL_PREFIX={install_dir}",
+        ]
+        configure_command += self.cmake_flags
+        if found_ninja():
+            configure_command.append("-GNinja")
+        try:
+            import pybind11
+        except ImportError:
+            pass
+        else:
+            pybind11_dir = Path(pybind11.__file__).resolve().parent
+            pybind11_dir = pybind11_dir / "share" / "cmake" / "pybind11"
+            configure_command.append(f"-Dpybind11_DIR={pybind11_dir}")
+
+        # CMake build and install commands
+        build_command = [_cmake_bin, "--build", build_dir]
+        install_command = [_cmake_bin, "--install", build_dir]
+
+        # Run CMake commands
+        for command in [configure_command, build_command, install_command]:
+            print(f"Running command {' '.join(command)}")
+            try:
+                subprocess.run(command, cwd=build_dir, check=True)
+            except (CalledProcessError, OSError) as e:
+                raise RuntimeError(f"Error when running CMake: {e}")
+
+# PyTorch extension modules require special handling
+if "pytorch" in frameworks():
+    from torch.utils.cpp_extension import BuildExtension
+else:
+    from setuptools.command.build_ext import build_ext as BuildExtension
+
+class CMakeBuildExtension(BuildExtension):
+    """Setuptools command with support for CMake extension modules"""
+
+    def __init__(self, *args, **kwargs) -> None:
+        super().__init__(*args, **kwargs)
+
+    def run(self) -> None:
+
+        # Build CMake extensions
+        for ext in self.extensions:
+            if isinstance(ext, CMakeExtension):
+                print(f"Building CMake extension {ext.name}")
+                with tempfile.TemporaryDirectory() as build_dir:
+                    build_dir = Path(build_dir)
+                    package_path = Path(self.get_ext_fullpath(ext.name))
+                    install_dir = package_path.resolve().parent
+                    ext._build_cmake(
+                        build_dir=build_dir,
+                        install_dir=install_dir,
+                    )
+
+        # Build non-CMake extensions as usual
+        all_extensions = self.extensions
+        self.extensions = [
+            ext for ext in self.extensions
+            if not isinstance(ext, CMakeExtension)
+        ]
+        super().run()
+        self.extensions = all_extensions
+
+def setup_common_extension() -> CMakeExtension:
+    """Setup CMake extension for common library
+
+    Also builds JAX, TensorFlow, and userbuffers support if needed.
+
+    """
+    cmake_flags = []
+    if "jax" in frameworks():
+        cmake_flags.append("-DENABLE_JAX=ON")
+    if "tensorflow" in frameworks():
+        cmake_flags.append("-DENABLE_TENSORFLOW=ON")
+    if with_userbuffers():
+        cmake_flags.append("-DNVTE_WITH_USERBUFFERS=ON")
+    return CMakeExtension(
+        name="transformer_engine",
+        cmake_path=root_path / "transformer_engine",
+        cmake_flags=cmake_flags,
+    )
+
+def setup_pytorch_extension() -> setuptools.Extension:
+    """Setup CUDA extension for PyTorch support"""
+
+    # Source files
+    src_dir = root_path / "transformer_engine" / "pytorch" / "csrc"
+    sources = [
+        src_dir / "extensions.cu",
+        src_dir / "common.cu",
+        src_dir / "ts_fp8_op.cpp",
+    ]
+
+    # Header files
+    include_dirs = [
+        root_path / "transformer_engine" / "common" / "include",
+        root_path / "transformer_engine" / "pytorch" / "csrc",
+        root_path / "3rdparty" / "cudnn-frontend" / "include",
+    ]
+
+    # Compiler flags
+    cxx_flags = ["-O3"]
+    nvcc_flags = [
+        "-O3",
+        "-gencode",
+        "arch=compute_70,code=sm_70",
+        "-U__CUDA_NO_HALF_OPERATORS__",
+        "-U__CUDA_NO_HALF_CONVERSIONS__",
+        "-U__CUDA_NO_BFLOAT16_OPERATORS__",
+        "-U__CUDA_NO_BFLOAT16_CONVERSIONS__",
+        "-U__CUDA_NO_BFLOAT162_OPERATORS__",
+        "-U__CUDA_NO_BFLOAT162_CONVERSIONS__",
+        "--expt-relaxed-constexpr",
+        "--expt-extended-lambda",
+        "--use_fast_math",
+    ]
+
+    # Version-dependent CUDA options
+    try:
+        version = cuda_version()
+    except FileNotFoundError:
+        print("Could not determine CUDA Toolkit version")
+    else:
+        if version >= (11, 2):
+            nvcc_flags.extend(["--threads", "4"])
+        if version >= (11, 0):
+            nvcc_flags.extend(["-gencode", "arch=compute_80,code=sm_80"])
+        if version >= (11, 8):
+            nvcc_flags.extend(["-gencode", "arch=compute_90,code=sm_90"])
+
+    # userbuffers support
+    if with_userbuffers():
+        if os.getenv("MPI_HOME"):
+            mpi_home = Path(os.getenv("MPI_HOME"))
+            include_dirs.append(mpi_home / "include")
+        cxx_flags.append("-DNVTE_WITH_USERBUFFERS")
+        nvcc_flags.append("-DNVTE_WITH_USERBUFFERS")
+
+    # Construct PyTorch CUDA extension
+    sources = [str(path) for path in sources]
+    include_dirs = [str(path) for path in include_dirs]
+    from torch.utils.cpp_extension import CUDAExtension
+    return CUDAExtension(
+        name="transformer_engine_extensions",
+        sources=sources,
+        include_dirs=include_dirs,
+        # libraries=["transformer_engine"],  ### TODO (tmoon) Debug linker errors
+        extra_compile_args={
+            "cxx": cxx_flags,
+            "nvcc": nvcc_flags,
+        },
+    )
+
+def main():
+
+    # Submodules to install
+    packages = setuptools.find_packages(
+        include=["transformer_engine", "transformer_engine.*"],
+    )
+
+    # Dependencies
+    setup_requires, install_requires, test_requires = setup_requirements()
+
+    # Extensions
+    ext_modules = [setup_common_extension()]
+    if "pytorch" in frameworks():
+        ext_modules.append(setup_pytorch_extension())
+
+    # Configure package
+    setuptools.setup(
+        name="transformer_engine",
+        version=te_version(),
+        packages=packages,
+        description="Transformer acceleration library",
+        ext_modules=ext_modules,
+        cmdclass={"build_ext": CMakeBuildExtension},
+        setup_requires=setup_requires,
+        install_requires=install_requires,
+        extras_require={"test": test_requires},
+        license_files=("LICENSE",),
+    )
+
+if __name__ == "__main__":
+    main()
...
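The refactored setup.py is driven entirely by environment variables; collecting the knobs defined in the functions above into one sketch (the MPI path is a placeholder):

    export NVTE_FRAMEWORK=pytorch,jax   # Explicit framework list; "all" and "none" are also recognized
    export NVTE_BUILD_DEBUG=1           # Debug build type, same effect as the --debug flag
    export NVTE_NO_LOCAL_VERSION=1      # Suppress the "+<commit>" local version suffix
    export NVTE_WITH_USERBUFFERS=1      # Userbuffers support; requires MPI_HOME
    export MPI_HOME=/path/to/mpi        # Placeholder path
    pip install -v .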
@@ -2,11 +2,5 @@
 #
 # See LICENSE for license information.

-try:
-    import transformer_engine.jax
-    te_imported = True
-except:
-    te_imported = False
-
-assert te_imported, 'transformer_engine import failed'
+import transformer_engine.jax

 print("OK")
...
@@ -2,11 +2,5 @@
 #
 # See LICENSE for license information.

-try:
-    import transformer_engine.pytorch
-    te_imported = True
-except:
-    te_imported = False
-
-assert te_imported, 'transformer_engine import failed'
+import transformer_engine.pytorch

 print("OK")
...
@@ -2,11 +2,5 @@
 #
 # See LICENSE for license information.

-try:
-    import transformer_engine.tensorflow
-    te_imported = True
-except:
-    te_imported = False
-
-assert te_imported, 'transformer_engine import failed'
+import transformer_engine.tensorflow

 print("OK")
...
@@ -28,16 +28,20 @@ include_directories(${PROJECT_SOURCE_DIR})

 add_subdirectory(common)
 if(NVTE_WITH_USERBUFFERS)
+  message(STATUS "userbuffers support enabled")
   add_subdirectory(pytorch/csrc/userbuffers)
 endif()

 option(ENABLE_JAX "Enable JAX in the building workflow." OFF)
+message(STATUS "JAX support: ${ENABLE_JAX}")
 if(ENABLE_JAX)
   find_package(pybind11 CONFIG REQUIRED)
   add_subdirectory(jax)
 endif()

 option(ENABLE_TENSORFLOW "Enable TensorFlow in the building workflow." OFF)
+message(STATUS "TensorFlow support: ${ENABLE_TENSORFLOW}")
 if(ENABLE_TENSORFLOW)
   find_package(pybind11 CONFIG REQUIRED)
   add_subdirectory(tensorflow)
...
+# Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# See LICENSE for license information.

 add_library(CUDNN::cudnn_all INTERFACE IMPORTED)
 find_path(
@@ -14,7 +18,7 @@ function(find_cudnn_library NAME)
     HINTS $ENV{CUDNN_PATH} ${CUDNN_PATH} ${CUDAToolkit_LIBRARY_DIR}
     PATH_SUFFIXES lib64 lib/x64 lib
   )

   if(${UPPERCASE_NAME}_LIBRARY)
     add_library(CUDNN::${NAME} UNKNOWN IMPORTED)
     set_target_properties(
@@ -48,7 +52,7 @@ if(CUDNN_INCLUDE_DIR AND CUDNN_LIBRARY)
   message(STATUS "cuDNN: ${CUDNN_LIBRARY}")
   message(STATUS "cuDNN: ${CUDNN_INCLUDE_DIR}")

   set(CUDNN_FOUND ON CACHE INTERNAL "cuDNN Library Found")
 else()
@@ -73,6 +77,5 @@ target_link_libraries(
     CUDNN::cudnn_adv_infer
     CUDNN::cudnn_cnn_infer
     CUDNN::cudnn_ops_infer
     CUDNN::cudnn
 )
...
@@ -77,3 +77,6 @@ set_source_files_properties(fused_softmax/scaled_masked_softmax.cu
                             COMPILE_OPTIONS "--use_fast_math")
 set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
 set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -O3")
+
+# Install library
+install(TARGETS transformer_engine DESTINATION .)
...
@@ -10,3 +10,4 @@ pybind11_add_module(
 )
 target_link_libraries(transformer_engine_jax PRIVATE CUDA::cudart CUDA::cublas CUDA::cublasLt transformer_engine)
+install(TARGETS transformer_engine_jax DESTINATION .)
...
@@ -31,3 +31,6 @@ set_source_files_properties(userbuffers.cu
                             COMPILE_OPTIONS "$<$<COMPILE_LANGUAGE:CUDA>:-maxrregcount=64>")
 set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
 set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -O3")
+
+# Install library
+install(TARGETS transformer_engine_userbuffers DESTINATION .)
...
@@ -13,9 +13,9 @@ add_library(
 )

 # Includes
 execute_process(COMMAND ${Python_EXECUTABLE} -c "import tensorflow as tf; print(tf.sysconfig.get_include())"
                 OUTPUT_VARIABLE Tensorflow_INCLUDE_DIRS OUTPUT_STRIP_TRAILING_WHITESPACE)
 execute_process(COMMAND ${Python_EXECUTABLE} -c "import numpy as np; print(np.get_include())"
                 OUTPUT_VARIABLE Numpy_INCLUDE_DIRS OUTPUT_STRIP_TRAILING_WHITESPACE)
 target_include_directories(transformer_engine_tensorflow PRIVATE
@@ -25,7 +25,7 @@ target_include_directories(transformer_engine_tensorflow PRIVATE
 target_include_directories(_get_stream PRIVATE ${Tensorflow_INCLUDE_DIRS})

 # Libraries
 execute_process(COMMAND ${Python_EXECUTABLE} -c "import tensorflow as tf; print(tf.__file__)"
                 OUTPUT_VARIABLE Tensorflow_LIB_PATH OUTPUT_STRIP_TRAILING_WHITESPACE)
 get_filename_component(Tensorflow_LIB_PATH ${Tensorflow_LIB_PATH} DIRECTORY)
 list(APPEND TF_LINKER_LIBS "${Tensorflow_LIB_PATH}/libtensorflow_framework.so.2")
@@ -40,3 +40,7 @@ target_link_libraries(_get_stream PRIVATE ${TF_LINKER_LIBS})
 set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
 set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -O3")
+
+# Install library
+install(TARGETS transformer_engine_tensorflow DESTINATION .)
+install(TARGETS _get_stream DESTINATION .)
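The install(TARGETS ... DESTINATION .) rules above are relative to CMAKE_INSTALL_PREFIX, which the new setup.py points at the extension's package directory, so the shared libraries land next to the Python modules in the wheel. Roughly what CMakeBuildExtension ends up running, with illustrative paths and flags (ENABLE_JAX only applies when JAX support is requested):

    cmake -S transformer_engine -B /tmp/te-build -GNinja \
          -DCMAKE_BUILD_TYPE=Release \
          -DCMAKE_INSTALL_PREFIX=/path/to/package/dir \
          -DENABLE_JAX=ON
    cmake --build /tmp/te-build
    cmake --install /tmp/te-build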