Unverified commit 60f738ff, authored by Paweł Gadziński, committed by GitHub

Tests for distributed (#1196)



* Tests for distributed
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* added the test to the qa script
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Changed qa
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix to test_numerics file
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* pr fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update tests/pytorch/distributed/run_numerics.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
parent f8eb799a
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
set -e
pip install pytest==7.2.0
: ${TE_PATH:=/opt/transformerengine}
pytest -v -s $TE_PATH/tests/pytorch/distributed/test_numerics.py
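The `: ${TE_PATH:=/opt/transformerengine}` line in the QA script uses shell parameter expansion to fall back to a default path when `TE_PATH` is not already set. A minimal standalone sketch of the idiom (the `/workspace/te` value is illustrative, not from the script):

```shell
# ':' is the shell no-op builtin; its arguments are still expanded,
# so ${TE_PATH:=/opt/transformerengine} assigns the default only when
# TE_PATH is unset or empty.
unset TE_PATH
: ${TE_PATH:=/opt/transformerengine}
echo "default: $TE_PATH"

TE_PATH=/workspace/te    # a pre-set value survives the expansion
: ${TE_PATH:=/opt/transformerengine}
echo "override kept: $TE_PATH"
```

This lets CI images override the Transformer Engine checkout location via the environment while keeping a sensible default.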
(collapsed diff not shown)
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
import os
import subprocess
from pathlib import Path
import pytest
import torch
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
"""
Distributed numerics tests

These tests check the numerical correctness of the Transformer Engine layers.
Tests are parametrized by layer and FP8 precision. Each test runs multiple
configurations from the file run_numerics.py. This design amortizes the high
per-test initialization cost: every launch must start multiple processes and
load torch and TE, so running several configurations per launch reduces the
overhead.
"""
if torch.cuda.device_count() < 2:
    pytest.skip("Distributed training needs at least 2 GPUs.", allow_module_level=True)
fp8_available, reason_for_no_fp8 = FP8GlobalStateManager.is_fp8_available()
TEST_ROOT = Path(__file__).parent.resolve()
NUM_PROCS: int = min(4, torch.cuda.device_count())
LAUNCH_CMD = ["torchrun", f"--nproc_per_node={NUM_PROCS}"]
def _run_test(fp8):
    test_path = TEST_ROOT / "run_numerics.py"
    test_cmd = LAUNCH_CMD + [str(test_path)]
    if fp8:
        test_cmd += ["--fp8"]
    result = subprocess.run(test_cmd, env=os.environ, capture_output=True, check=False)
    if result.returncode != 0 or "NUMERICAL CHECK FAILED" in result.stderr.decode():
        raise AssertionError(result.stderr.decode())
all_boolean = [True, False]
@pytest.mark.parametrize("fp8", all_boolean)
def test_distributed(fp8):
    if fp8 and not fp8_available:
        pytest.skip(reason_for_no_fp8)
    _run_test(fp8)
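The worker script run_numerics.py is collapsed in this diff, but `_run_test` above implies its command-line contract: it is launched via torchrun and accepts an optional `--fp8` flag, with torchrun supplying rank information through environment variables. A hedged sketch of that contract (function names and output here are illustrative, not taken from the actual file):

```python
# Hypothetical sketch of the worker-side CLI assumed by _run_test.
# Only the argument parsing and torchrun environment contract are shown;
# the real run_numerics.py also builds models and compares results.
import argparse
import os


def parse_worker_args(argv=None):
    parser = argparse.ArgumentParser(description="Distributed numerics worker")
    parser.add_argument(
        "--fp8",
        action="store_true",
        help="run the numerical checks with FP8 execution enabled",
    )
    return parser.parse_args(argv)


def worker_rank():
    # torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK to every worker
    # process it spawns; default to 0 for single-process debugging.
    return int(os.environ.get("RANK", "0"))


if __name__ == "__main__":
    args = parse_worker_args()
    print(f"rank={worker_rank()} fp8={args.fp8}")
```

On failure the real worker is expected to print "NUMERICAL CHECK FAILED" to stderr, which is the sentinel string `_run_test` scans for in addition to the process exit code.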