OpenDAS / dgl · Commit 05d9d496 (unverified)

Update (#4277)
Authored Jul 21, 2022 by Mufei Li, committed by GitHub on Jul 21, 2022
Co-authored-by: Ubuntu <ubuntu@ip-172-31-53-142.us-west-2.compute.internal>
Parent: 05aca98d

Showing 1 changed file with 66 additions and 81 deletions:

docs/source/guide/mixed_precision.rst (+66, -81)
@@ -2,59 +2,36 @@
Chapter 8: Mixed Precision Training
===================================

DGL is compatible with the `PyTorch Automatic Mixed Precision (AMP) package
<https://pytorch.org/docs/stable/amp.html>`_
for mixed precision training, which reduces both training time and GPU memory
consumption. This feature requires DGL 0.9+.
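
As a quick sanity check (a minimal sketch, not part of the original guide), you can
verify that the installed PyTorch and DGL builds meet these requirements before
enabling AMP:

.. code::

    # Minimal environment check (illustrative only): mixed precision training in
    # this chapter assumes a CUDA-enabled PyTorch build and DGL 0.9 or newer.
    import dgl
    import torch

    print('DGL version:', dgl.__version__)        # expected >= 0.9.0
    print('PyTorch version:', torch.__version__)
    assert torch.cuda.is_available(), 'AMP training here assumes a CUDA GPU.'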

Message-Passing with Half Precision
-----------------------------------

DGL allows message passing on ``float16 (fp16)`` features for both UDFs
(User Defined Functions) and built-in functions (e.g., ``dgl.function.sum``,
``dgl.function.copy_u``).

The following example shows how to use DGL's message-passing APIs on
half-precision features:

>>> import torch
>>> import dgl
>>> import dgl.function as fn
>>> dev = torch.device('cuda')
>>> g = dgl.rand_graph(30, 100).to(dev)  # Create a graph on GPU w/ 30 nodes and 100 edges.
>>> g.ndata['h'] = torch.rand(30, 16).to(dev).half()  # Create fp16 node features.
>>> g.edata['w'] = torch.rand(100, 1).to(dev).half()  # Create fp16 edge features.
>>> # Use DGL's built-in functions for message passing on fp16 features.
>>> g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'x'))
>>> g.ndata['x'].dtype
torch.float16
>>> g.apply_edges(fn.u_dot_v('h', 'x', 'hx'))
>>> g.edata['hx'].dtype
torch.float16
>>> # Use UDFs for message passing on fp16 features.
>>> def message(edges):
...     return {'m': edges.src['h'] * edges.data['w']}
...

@@ -65,14 +42,11 @@ features:
...     return {'hy': (edges.src['h'] * edges.dst['y']).sum(-1, keepdim=True)}
...
>>> g.update_all(message, reduce)
>>> g.ndata['y'].dtype
torch.float16
>>> g.apply_edges(dot)
>>> g.edata['hy'].dtype
torch.float16
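
In practice, a graph loaded from a dataset usually carries ``float32`` features.
A small helper like the following (``cast_to_half`` is a hypothetical name, not a
DGL API) converts all floating-point node and edge features of a homogeneous graph
to ``float16`` before message passing:

.. code::

    # Hypothetical helper, not part of DGL: cast every floating-point node and
    # edge feature of a (homogeneous) graph to float16 in place.
    def cast_to_half(g):
        for name in list(g.ndata.keys()):
            if g.ndata[name].is_floating_point():
                g.ndata[name] = g.ndata[name].half()
        for name in list(g.edata.keys()):
            if g.edata[name].is_floating_point():
                g.edata[name] = g.edata[name].half()
        return g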

End-to-End Mixed Precision Training
-----------------------------------

@@ -80,33 +54,52 @@ DGL relies on PyTorch's AMP package for mixed precision training,
and the user experience is exactly
the same as `PyTorch's <https://pytorch.org/docs/stable/notes/amp_examples.html>`_.

By wrapping the forward pass with ``torch.cuda.amp.autocast()``, PyTorch automatically
selects the appropriate datatype for each op and tensor. Half precision tensors are
memory efficient, and most operators on half precision tensors are faster as they
leverage GPU tensor cores.

.. code::

    import torch.nn.functional as F
    from torch.cuda.amp import autocast

    def forward(g, feat, label, mask, model, use_fp16):
        with autocast(enabled=use_fp16):
            logit = model(g, feat)
            loss = F.cross_entropy(logit[mask], label[mask])
        return loss

Small gradients in ``float16`` format have underflow problems (flush to zero).
PyTorch provides a ``GradScaler`` module to address this issue. It multiplies
the loss by a factor and invokes the backward pass on the scaled loss to prevent
the underflow problem. It then unscales the computed gradients before the optimizer
updates the parameters. The scale factor is determined automatically.

.. code::

    from torch.cuda.amp import GradScaler

    scaler = GradScaler()

    def backward(scaler, loss, optimizer):
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

The following example trains a 3-layer GAT on the Reddit dataset (w/ 114 million edges).
Pay attention to the differences in the code when ``use_fp16`` is activated or not.

.. code::

    import torch
    import torch.nn as nn
    import dgl
    from dgl.data import RedditDataset
    from dgl.nn import GATConv
    from dgl.transforms import AddSelfLoop

    use_fp16 = True


    class GAT(nn.Module):
        def __init__(self,
                     in_feats,

@@ -129,48 +122,40 @@ note the difference in codes when ``use_fp16`` is activated/not activated:
            return h

    # Data loading
    transform = AddSelfLoop()
    data = RedditDataset(transform)
    dev = torch.device('cuda')
    g = data[0]
    g = g.int().to(dev)
    train_mask = g.ndata['train_mask']
    feat = g.ndata['feat']
    label = g.ndata['label']
    in_feats = feat.shape[1]
    n_hidden = 256
    n_classes = data.num_classes
    heads = [1, 1, 1]
    model = GAT(in_feats, n_hidden, n_classes, heads)
    model = model.to(dev)
    model.train()

    # Create optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

    for epoch in range(100):
        optimizer.zero_grad()
        # forward() and backward() are the helpers defined in the snippets above;
        # scaler is the GradScaler instance created there.
        loss = forward(g, feat, label, train_mask, model, use_fp16)
        if use_fp16:
            # Backprop w/ gradient scaling
            backward(scaler, loss, optimizer)
        else:
            loss.backward()
            optimizer.step()
        print('Epoch {} | Loss {}'.format(epoch, loss.item()))

On an NVIDIA V100 (16GB) machine, training this model without fp16 consumes
15.2GB of GPU memory; with fp16 turned on, training consumes 12.8GB of GPU memory,
and the loss converges to similar values in both settings.
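
To reproduce a memory comparison like this on your own hardware, one rough approach
(a sketch, not part of the original guide) is to query PyTorch's peak-memory
statistics around the training loop:

.. code::

    # Rough sketch (illustrative only): measure peak GPU memory allocated by
    # PyTorch during training, so fp16 and fp32 runs can be compared.
    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... run the training loop above ...
    peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print('Peak GPU memory: {:.1f} GB'.format(peak_gb))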