"src/graph/vscode:/vscode.git/clone" did not exist on "eeeb52f464cea18ef41b9108b3823aa91ebce31a"
Unverified commit 438aa017 authored by Shaden Smith, committed by GitHub

Enables NCCL backend in @distributed_test (#13)

* Enables NCCL backend in @distributed_test

* Adds pytest-forked to avoid CUDA re-initialization issue.

* Fixes paste typo

* Fixes transcription typo
parent 188f7d4e
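Background on the pytest-forked change (my inference from the diff below; the commit message only names the symptom): once a process has initialized CUDA, PyTorch refuses to use CUDA in children created with fork(), failing with "Cannot re-initialize CUDA in forked subprocess". Running each test in its own freshly forked child via pytest-forked means every @distributed_test spawns its NCCL workers from a process that has not yet touched CUDA.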
@@ -14,13 +14,14 @@ model convergence tests are found in `tests/model/`.
 ### Unit Tests
 [PyTest](https://docs.pytest.org/en/latest/) is used to execute tests. PyTest can be
-installed from PyPI via `pip install pytest`. Simply invoke `pytest` to run the unit
-tests:
+installed from PyPI via `pip install pytest`. Simply invoke `pytest --forked` to run the
+unit tests:
 
-    pytest tests/unit/
+    pytest --forked tests/unit/
 
 You can also provide the `-v` flag to `pytest` to see additional information about the
-tests.
+tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) and the
+`--forked` flag are required to test CUDA functionality in distributed tests.
 
 ### Model Tests
 To execute model tests, first [install DeepSpeed](#installation). The
......
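Taken together with the requirements change below, the documented invocation becomes `pip install pytest pytest-forked` followed by `pytest --forked -v tests/unit/` (the `-v` flag, as the docs note, is optional extra verbosity).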
@@ -41,8 +41,7 @@ jobs:
     displayName: 'Code linter'
 
   - script: |
-      pip install --user pytest
-      pytest --verbose tests/unit/
+      pytest --forked --verbose tests/unit/
    displayName: 'Unit tests'
 
   - script: |
......
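A note on the pipeline edit (a presumption, not stated in the commit): the explicit `pip install --user pytest` step can be dropped because `pytest`, and now `pytest-forked`, are pinned in the requirements file below, so an earlier pipeline step that installs the requirements presumably covers them.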
@@ -6,4 +6,5 @@ tensorboardX==1.8
 tensorflow-gpu==1.14.0
 nvidia-ml-py3
 pytest
+pytest-forked
 pre-commit
@@ -7,11 +7,11 @@ from torch.multiprocessing import Process
 import pytest
 
-# Worker timeout _after_ the first worker has completed.
-DEEPSPEED_UNIT_WORKER_TIMEOUT = 5
+# Worker timeout *after* the first worker has completed.
+DEEPSPEED_UNIT_WORKER_TIMEOUT = 10
 
-def distributed_test(world_size=2, backend='gloo'):
+def distributed_test(world_size=2, backend='nccl'):
     """A decorator for executing a function (e.g., a unit test) in a distributed manner.
 
     This decorator manages the spawning and joining of processes, initialization of
     torch.distributed, and catching of errors.
@@ -38,9 +38,8 @@ def distributed_test(world_size=2, backend='gloo'):
                                 rank=local_rank,
                                 world_size=num_procs)
 
-        # XXX temporarily disabled due to CUDA runtime error?
-        #if torch.cuda.is_available():
-        #    torch.cuda.set_device(local_rank)
+        if torch.cuda.is_available():
+            torch.cuda.set_device(local_rank)
 
         run_func(*func_args, **func_kwargs)
......
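For readers skimming the hunk above, here is a minimal, self-contained sketch of the mechanics that @distributed_test wraps: spawn one process per rank, rendezvous over localhost environment variables, initialize torch.distributed with the NCCL backend, and pin each rank to its own GPU before running the test body. The helper names (`_worker`, `run_distributed`) and the hard-coded port are illustrative assumptions, not part of the DeepSpeed API.

```python
import os

import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def _worker(local_rank, world_size, fn):
    # Rendezvous over environment variables on localhost; the port is an
    # arbitrary free-port assumption for this sketch.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29503'
    dist.init_process_group(backend='nccl',
                            rank=local_rank,
                            world_size=world_size)
    if torch.cuda.is_available():
        # NCCL collectives run GPU-to-GPU, so pin each rank to its own
        # device (this is the line the commit re-enables).
        torch.cuda.set_device(local_rank)
    fn()


def run_distributed(fn, world_size=2):
    # Spawn one process per rank and fail loudly if any rank dies.
    procs = [Process(target=_worker, args=(rank, world_size, fn))
             for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
        assert p.exitcode == 0
```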
@@ -29,9 +29,10 @@ def test_dist_args(number, color):
     _test_dist_args_helper(number, color=color)
 
-@distributed_test(world_size=2)
+@distributed_test(world_size=[1, 2, 4])
 def test_dist_allreduce():
-    x = torch.ones(1, 3) * (dist.get_rank() + 1)
-    result = torch.ones(1, 3) * 3
+    x = torch.ones(1, 3).cuda() * (dist.get_rank() + 1)
+    sum_of_ranks = (dist.get_world_size() * (dist.get_world_size() + 1)) // 2
+    result = torch.ones(1, 3).cuda() * sum_of_ranks
     dist.all_reduce(x)
     assert torch.all(x == result)
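A quick sanity check of the new expected value: each rank contributes a tensor filled with rank + 1, so at world_size = 4 the all-reduce sums 1 + 2 + 3 + 4 = 10, exactly (4 * 5) // 2 = sum_of_ranks; the old hard-coded 3 was only valid for world_size = 2. Parameterizing world_size as [1, 2, 4] runs the same check at each process count.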