Commit 05d9d496 (unverified) in OpenDAS / dgl
Authored Jul 21, 2022 by Mufei Li; committed by GitHub on Jul 21, 2022
Parent: 05aca98d

Update (#4277)

Co-authored-by: Ubuntu <ubuntu@ip-172-31-53-142.us-west-2.compute.internal>

Showing 1 changed file with 66 additions and 81 deletions:

docs/source/guide/mixed_precision.rst (+66, -81)
@@ -2,59 +2,36 @@
 Chapter 8: Mixed Precision Training
 ===================================

-DGL is compatible with `PyTorch's automatic mixed precision package
+DGL is compatible with the `PyTorch Automatic Mixed Precision (AMP) package
 <https://pytorch.org/docs/stable/amp.html>`_
 for mixed precision training, thus saving both training time and GPU memory
-consumption. To enable this feature, users need to install PyTorch 1.6+ with python 3.7+ and
-build DGL from source file to support ``float16`` data type (this feature is
-still in its beta stage and we do not provide official pre-built pip wheels).
-
-Installation
-------------
-
-First download DGL's source code from GitHub and build the shared library with flag ``USE_FP16=ON``.
-
-.. code:: bash
-
-    git clone --recurse-submodules https://github.com/dmlc/dgl.git
-    cd dgl
-    mkdir build
-    cd build
-    cmake -DUSE_CUDA=ON -DUSE_FP16=ON ..
-    make -j
-
-Then install the Python binding.
-
-.. code:: bash
-
-    cd ../python
-    python setup.py install
+consumption. This feature requires DGL 0.9+.

 Message-Passing with Half Precision
 -----------------------------------

-DGL with fp16 support allows message-passing on ``float16`` features for both
-UDF (User Defined Function)s and built-in functions (e.g. ``dgl.function.sum``,
+DGL allows message-passing on ``float16 (fp16)`` features for both
+UDFs (User Defined Functions) and built-in functions (e.g., ``dgl.function.sum``,
 ``dgl.function.copy_u``).
-The following examples shows how to use DGL's message-passing API on half-precision
+The following example shows how to use DGL's message-passing APIs on half-precision
 features:

     >>> import torch
     >>> import dgl
     >>> import dgl.function as fn
-    >>> g = dgl.rand_graph(30, 100).to(0)  # Create a graph on GPU w/ 30 nodes and 100 edges.
-    >>> g.ndata['h'] = torch.rand(30, 16).to(0).half()  # Create fp16 node features.
-    >>> g.edata['w'] = torch.rand(100, 1).to(0).half()  # Create fp16 edge features.
+    >>> dev = torch.device('cuda')
+    >>> g = dgl.rand_graph(30, 100).to(dev)  # Create a graph on GPU w/ 30 nodes and 100 edges.
+    >>> g.ndata['h'] = torch.rand(30, 16).to(dev).half()  # Create fp16 node features.
+    >>> g.edata['w'] = torch.rand(100, 1).to(dev).half()  # Create fp16 edge features.
     >>> # Use DGL's built-in functions for message passing on fp16 features.
     >>> g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'x'))
-    >>> g.ndata['x'][0]
-    tensor([0.3391, 0.2208, 0.7163, 0.6655, 0.7031, 0.5854, 0.9404, 0.7720, 0.6562,
-            0.4028, 0.6943, 0.5908, 0.9307, 0.5962, 0.7827, 0.5034], device='cuda:0', dtype=torch.float16)
+    >>> g.ndata['x'].dtype
+    torch.float16
     >>> g.apply_edges(fn.u_dot_v('h', 'x', 'hx'))
-    >>> g.edata['hx'][0]
-    tensor([5.4570], device='cuda:0', dtype=torch.float16)
-    >>> # Use UDF (User Defined Functions) for message passing on fp16 features.
+    >>> g.edata['hx'].dtype
+    torch.float16
+    >>> # Use UDFs for message passing on fp16 features.
     >>> def message(edges):
     ...     return {'m': edges.src['h'] * edges.data['w']}
     ...
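A minimal, self-contained sketch of the fp16 message-passing path in the doctest above, assuming a CUDA build of DGL 0.9+ and a visible GPU (graph sizes and feature shapes are illustrative, not taken from the commit):

    import torch
    import dgl
    import dgl.function as fn

    # Assumption from the added text: DGL 0.9+ with CUDA support and a GPU available.
    assert torch.cuda.is_available(), 'the fp16 examples in this guide run on GPU'
    print(dgl.__version__)  # expect >= 0.9.0 per the updated requirement

    dev = torch.device('cuda')
    g = dgl.rand_graph(30, 100).to(dev)                    # random graph: 30 nodes, 100 edges
    g.ndata['h'] = torch.rand(30, 16, device=dev).half()   # fp16 node features
    g.edata['w'] = torch.rand(100, 1, device=dev).half()   # fp16 edge features

    # Built-in message functions keep the half-precision dtype end to end.
    g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'x'))
    assert g.ndata['x'].dtype == torch.float16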
@@ -65,14 +42,11 @@ features:
     ...     return {'hy': (edges.src['h'] * edges.dst['y']).sum(-1, keepdims=True)}
     ...
     >>> g.update_all(message, reduce)
-    >>> g.ndata['y'][0]
-    tensor([0.3394, 0.2209, 0.7168, 0.6655, 0.7026, 0.5854, 0.9404, 0.7720, 0.6562,
-            0.4028, 0.6943, 0.5908, 0.9307, 0.5967, 0.7827, 0.5039], device='cuda:0', dtype=torch.float16)
+    >>> g.ndata['y'].dtype
+    torch.float16
     >>> g.apply_edges(dot)
-    >>> g.edata['hy'][0]
-    tensor([5.4609], device='cuda:0', dtype=torch.float16)
+    >>> g.edata['hy'].dtype
+    torch.float16

 End-to-End Mixed Precision Training
 -----------------------------------
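The UDF half of the example is split across the two hunks above, and the body of the ``reduce`` function sits in the collapsed lines between them. The following sketch assembles the visible pieces; the mailbox sum used for ``reduce`` is an assumption, not quoted from the file:

    import torch
    import dgl

    dev = torch.device('cuda')
    g = dgl.rand_graph(30, 100).to(dev)
    g.ndata['h'] = torch.rand(30, 16, device=dev).half()
    g.edata['w'] = torch.rand(100, 1, device=dev).half()

    def message(edges):
        # Visible in the diff: source features scaled by the edge weight.
        return {'m': edges.src['h'] * edges.data['w']}

    def reduce(nodes):
        # Assumed body (elided between hunks): sum incoming messages per node.
        return {'y': torch.sum(nodes.mailbox['m'], 1)}

    def dot(edges):
        # Visible in the diff: per-edge dot product of 'h' and 'y'
        # (the guide spells the kwarg keepdims; keepdim is the canonical name).
        return {'hy': (edges.src['h'] * edges.dst['y']).sum(-1, keepdim=True)}

    g.update_all(message, reduce)
    g.apply_edges(dot)
    assert g.ndata['y'].dtype == torch.float16
    assert g.edata['hy'].dtype == torch.float16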
@@ -80,33 +54,52 @@ DGL relies on PyTorch's AMP package for mixed precision training,
 and the user experience is exactly
 the same as `PyTorch's <https://pytorch.org/docs/stable/notes/amp_examples.html>`_.

-By wrapping the forward pass (including loss computation) of your GNN model with
-``torch.cuda.amp.autocast()``, PyTorch automatically selects the appropriate datatype
-for each op and tensor. Half precision tensors are memory efficient, most operators
-on half precision tensors are faster as they leverage GPU tensorcores.
+By wrapping the forward pass with
+``torch.cuda.amp.autocast()``, PyTorch automatically selects the appropriate datatype
+for each op and tensor. Half precision tensors are memory efficient, most operators
+on half precision tensors are faster as they leverage GPU's tensorcores.

-Small gradients in ``float16`` format have underflow problems (flush to zero), and
-PyTorch provides a ``GradScaler`` module to address this issue. ``GradScaler`` multiplies
-loss by a factor and invokes backward pass on scaled loss, and unscales gradients before
-optimizers update the parameters, thus preventing the underflow problem.
-The scale factor is determined automatically.
-
-Following is the training script of 3-layer GAT on Reddit dataset (w/ 114 million edges),
-note the difference in codes when ``use_fp16`` is activated/not activated:
+.. code::
+
+    import torch.nn.functional as F
+    from torch.cuda.amp import autocast
+
+    def forward(g, feat, label, mask, model, use_fp16):
+        with autocast(enabled=use_fp16):
+            logit = model(g, feat)
+            loss = F.cross_entropy(logit[mask], label[mask])
+        return loss
+
+Small gradients in ``float16`` format have underflow problems (flush to zero).
+PyTorch provides a ``GradScaler`` module to address this issue. It multiplies the loss
+by a factor and invokes backward pass on the scaled loss to prevent the underflow problem.
+It then unscales the computed gradients before the optimizer updates the parameters.
+The scale factor is determined automatically.
+
+.. code::
+
+    from torch.cuda.amp import GradScaler
+
+    scaler = GradScaler()
+
+    def backward(scaler, loss, optimizer):
+        scaler.scale(loss).backward()
+        scaler.step(optimizer)
+        scaler.update()
+
+The following example trains a 3-layer GAT on the Reddit dataset (w/ 114 million edges).
+Pay attention to the differences in the code when ``use_fp16`` is activated or not.

 .. code::

     import torch
     import torch.nn as nn
-    import torch.nn.functional as F
-    from torch.cuda.amp import autocast, GradScaler
     import dgl
     from dgl.data import RedditDataset
     from dgl.nn import GATConv
+    from dgl.transforms import AddSelfLoop

     use_fp16 = True


     class GAT(nn.Module):
         def __init__(self,
                      in_feats,
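The ``forward`` and ``backward`` helpers above are exercised by the Reddit-scale script in the next hunk. For a quick illustration of the same autocast plus ``GradScaler`` pattern without downloading a dataset, a toy sketch could look like the following; the model, sizes, and hyperparameters are invented for illustration and assume a GPU plus DGL's ``GraphConv``:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.cuda.amp import autocast, GradScaler

    import dgl
    from dgl.nn import GraphConv

    # Toy stand-in for the guide's GAT-on-Reddit script: a 2-layer GCN on a random graph.
    dev = torch.device('cuda')
    g = dgl.add_self_loop(dgl.rand_graph(1000, 5000)).to(dev)
    feat = torch.randn(1000, 32, device=dev)
    label = torch.randint(0, 7, (1000,), device=dev)

    class ToyGCN(nn.Module):
        def __init__(self, in_feats, n_hidden, n_classes):
            super().__init__()
            self.conv1 = GraphConv(in_feats, n_hidden)
            self.conv2 = GraphConv(n_hidden, n_classes)

        def forward(self, g, x):
            return self.conv2(g, F.relu(self.conv1(g, x)))

    model = ToyGCN(32, 64, 7).to(dev)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scaler = GradScaler()

    for epoch in range(5):
        optimizer.zero_grad()
        with autocast():                  # AMP picks fp16/fp32 per op
            loss = F.cross_entropy(model(g, feat), label)
        scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)            # unscale gradients, then update parameters
        scaler.update()
        print(epoch, loss.item())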
@@ -129,48 +122,40 @@ note the difference in codes when ``use_fp16`` is activated/not activated:
             return h

     # Data loading
-    data = RedditDataset()
-    device = torch.device(0)
+    transform = AddSelfLoop()
+    data = RedditDataset(transform)
+    dev = torch.device('cuda')
     g = data[0]
-    g = dgl.add_self_loop(g)
-    g = g.int().to(device)
+    g = g.int().to(dev)
     train_mask = g.ndata['train_mask']
-    features = g.ndata['feat']
-    labels = g.ndata['label']
-    in_feats = features.shape[1]
+    feat = g.ndata['feat']
+    label = g.ndata['label']
+    in_feats = feat.shape[1]
     n_hidden = 256
     n_classes = data.num_classes
-    n_edges = g.number_of_edges()
     heads = [1, 1, 1]
     model = GAT(in_feats, n_hidden, n_classes, heads)
-    model = model.to(device)
+    model = model.to(dev)
+    model.train()

     # Create optimizer
     optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
-    # Create gradient scaler
-    scaler = GradScaler()

     for epoch in range(100):
-        model.train()
         optimizer.zero_grad()
-        # Wrap forward pass with autocast
-        with autocast(enabled=use_fp16):
-            logits = model(g, features)
-            loss = F.cross_entropy(logits[train_mask], labels[train_mask])
+        loss = forward(g, feat, label, train_mask, model, use_fp16)
         if use_fp16:
             # Backprop w/ gradient scaling
-            scaler.scale(loss).backward()
-            scaler.step(optimizer)
-            scaler.update()
+            backward(scaler, loss, optimizer)
         else:
             loss.backward()
             optimizer.step()

         print('Epoch {} | Loss {}'.format(epoch, loss.item()))

 On a NVIDIA V100 (16GB) machine, training this model without fp16 consumes
 15.2GB GPU memory; with fp16 turned on, the training consumes 12.8G
 GPU memory, the loss converges to similar values in both settings.
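The 15.2GB versus 12.8G figures quoted above come from the guide itself. To reproduce that kind of comparison, a small helper along these lines could wrap any training step; ``train_one_epoch`` is a placeholder for a step such as the GAT loop in the hunk above, not an API from the commit:

    import torch

    def peak_memory_gb(train_one_epoch, dev=torch.device('cuda')):
        # Reset the peak-allocation counter, run one training step, and report the peak in GB.
        torch.cuda.reset_peak_memory_stats(dev)
        train_one_epoch()
        torch.cuda.synchronize(dev)
        return torch.cuda.max_memory_allocated(dev) / 1024 ** 3

Running it once with ``use_fp16=True`` and once with ``use_fp16=False`` gives the two peaks being compared.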