Unverified commit 60f738ff, authored by Paweł Gadziński, committed by GitHub

Tests for distributed (#1196)



* Tests for distributed
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* added the test to the qa script
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Changed qa
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix to test_numerics file
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* pr fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update tests/pytorch/distributed/run_numerics.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
parent f8eb799a
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
set -e
pip install pytest==7.2.0
: ${TE_PATH:=/opt/transformerengine}
pytest -v -s $TE_PATH/tests/pytorch/distributed/test_numerics.py
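The `: ${TE_PATH:=/opt/transformerengine}` line in the QA script uses shell parameter expansion to fall back to a default path when `TE_PATH` is not already set. A minimal standalone sketch of the idiom (the `/workspace/te` value is illustrative, not from the script):

```shell
# ':' is the shell no-op builtin; its arguments are still expanded,
# so ${TE_PATH:=/opt/transformerengine} assigns the default only when
# TE_PATH is unset or empty.
unset TE_PATH
: ${TE_PATH:=/opt/transformerengine}
echo "default: $TE_PATH"

TE_PATH=/workspace/te    # a pre-set value survives the expansion
: ${TE_PATH:=/opt/transformerengine}
echo "override kept: $TE_PATH"
```

This lets CI images override the Transformer Engine checkout location via the environment while keeping a sensible default.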
(collapsed diff not shown)
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
import os
import subprocess
from pathlib import Path
import pytest
import torch
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
"""
Distributed numerics tests

These tests check the numerical correctness of the Transformer Engine layers.
Tests are parametrized by layer and FP8 precision. Each test runs multiple
configurations from the file run_numerics.py. This design amortizes the high
per-test initialization cost: every launch must start multiple processes and
load torch and TE, so running several configurations per launch reduces the
overhead.
"""
if torch.cuda.device_count() < 2:
    pytest.skip("Distributed training needs at least 2 GPUs.", allow_module_level=True)
fp8_available, reason_for_no_fp8 = FP8GlobalStateManager.is_fp8_available()
TEST_ROOT = Path(__file__).parent.resolve()
NUM_PROCS: int = min(4, torch.cuda.device_count())
LAUNCH_CMD = ["torchrun", f"--nproc_per_node={NUM_PROCS}"]
def _run_test(fp8):
    test_path = TEST_ROOT / "run_numerics.py"
    test_cmd = LAUNCH_CMD + [str(test_path)]
    if fp8:
        test_cmd += ["--fp8"]
    result = subprocess.run(test_cmd, env=os.environ, capture_output=True, check=False)
    if result.returncode != 0 or "NUMERICAL CHECK FAILED" in result.stderr.decode():
        raise AssertionError(result.stderr.decode())
all_boolean = [True, False]
@pytest.mark.parametrize("fp8", all_boolean)
def test_distributed(fp8):
    if fp8 and not fp8_available:
        pytest.skip(reason_for_no_fp8)
    _run_test(fp8)
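The worker script run_numerics.py is collapsed in this diff, but `_run_test` above implies its command-line contract: it is launched via torchrun and accepts an optional `--fp8` flag, with torchrun supplying rank information through environment variables. A hedged sketch of that contract (function names and output here are illustrative, not taken from the actual file):

```python
# Hypothetical sketch of the worker-side CLI assumed by _run_test.
# Only the argument parsing and torchrun environment contract are shown;
# the real run_numerics.py also builds models and compares results.
import argparse
import os


def parse_worker_args(argv=None):
    parser = argparse.ArgumentParser(description="Distributed numerics worker")
    parser.add_argument(
        "--fp8",
        action="store_true",
        help="run the numerical checks with FP8 execution enabled",
    )
    return parser.parse_args(argv)


def worker_rank():
    # torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK to every worker
    # process it spawns; default to 0 for single-process debugging.
    return int(os.environ.get("RANK", "0"))


if __name__ == "__main__":
    args = parse_worker_args()
    print(f"rank={worker_rank()} fp8={args.fp8}")
```

On failure the real worker is expected to print "NUMERICAL CHECK FAILED" to stderr, which is the sentinel string `_run_test` scans for in addition to the process exit code.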