"git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "9f00c617a0bc50527c1498c36fde066f995a79dd"
Unverified Commit 1ca17bfa authored by Rhett Ying, committed by GitHub

[doc] update cpu_best_practices (#6588)

parent 34da58da
...@@ -29,17 +29,30 @@ To take advantage of optimizations *tcmalloc* provides, install it on your system
OpenMP settings
---------------------------
During training on CPU, the training and dataloading parts need to be maintained
simultaneously. As `OpenMP` is the default parallel backend, we can control the
performance of both sampling and training via `dgl.utils.set_num_threads()`.

If the number of OpenMP threads is not set and `num_workers` in the dataloader is
set to 0, the OpenMP runtime typically uses the number of available CPU cores by
default. This works well for most cases, and it is also the default behavior in DGL.

If `num_workers` in the dataloader is set to a value greater than 0, the number of
OpenMP threads will be set to **1** for each worker process. This is the default
behavior in PyTorch. In this case, we can set the number of OpenMP threads to the
number of CPU cores in the main process, as shown in the sketch below.

Performance tuning is highly dependent on the workload and hardware configuration.
We recommend that users try different settings and choose the best one for their
own cases.
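
For instance, a minimal sketch of this setup could look as follows. The worker
count, core count, and toy graph are assumptions made up for illustration, not
values taken from this document.

.. code-block:: python

    import torch
    import dgl
    from dgl.dataloading import DataLoader, NeighborSampler

    # Toy graph and training node IDs, only to make the sketch self-contained.
    g = dgl.rand_graph(1000, 5000)
    train_nids = torch.arange(100)
    sampler = NeighborSampler([10, 10])

    num_workers = 4       # assumed number of dataloading worker processes
    num_cpu_cores = 32    # assumed number of CPU cores on the node

    # With num_workers > 0, each worker process keeps the PyTorch default of
    # 1 OpenMP thread, so give the main (training) process the remaining cores.
    dgl.utils.set_num_threads(num_cpu_cores - num_workers)

    dataloader = DataLoader(
        g, train_nids, sampler,
        batch_size=64,
        shuffle=True,
        num_workers=num_workers,
    )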
**Dataloader CPU affinity**
.. note::
   This feature is available for `dgl.dataloading.DataLoader` only. It is not
   yet available for dataloaders in `dgl.graphbolt`.
If the number of dataloader workers is more than 0, please consider using the
**use_cpu_affinity()** method of the DGL Dataloader class; it will generally
result in a significant performance improvement for training.
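
As a rough sketch only: this text names the method **use_cpu_affinity()**, while
some DGL releases expose the affinity helper as a context manager called
`enable_cpu_affinity()`, so check the API reference of the installed release.
Reusing the `dataloader` built in the previous sketch, it might look like:

.. code-block:: python

    # Sketch only: the affinity helper's name varies across DGL releases
    # (use_cpu_affinity() in this text, enable_cpu_affinity() in others).
    with dataloader.enable_cpu_affinity():
        for input_nodes, output_nodes, blocks in dataloader:
            pass  # run the usual forward/backward training step here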
...