"git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "9f00c617a0bc50527c1498c36fde066f995a79dd"
Unverified Commit 1ca17bfa authored by Rhett Ying, committed by GitHub

[doc] update cpu_best_practices (#6588)

parent 34da58da
...@@ -29,17 +29,30 @@ To take advantage of optimizations *tcmalloc* provides, install it on your system
OpenMP settings
---------------------------
During training on CPU, the training and dataloading parts need to be maintained
simultaneously. As `OpenMP` is the default parallel backend, we can control the
performance of both sampling and training via `dgl.utils.set_num_threads()`.

If the number of OpenMP threads is not set and `num_workers` in the dataloader is
set to 0, the OpenMP runtime typically uses the number of available CPU cores by
default. This works well for most cases, and it is also the default behavior in DGL.

If `num_workers` in the dataloader is set to a value greater than 0, the number of
OpenMP threads will be set to **1** for each worker process. This is the default
behavior in PyTorch. In this case, we can set the number of OpenMP threads to the
number of CPU cores in the main process, as shown in the sketch below.

Performance tuning is highly dependent on the workload and hardware configuration.
We recommend that users try different settings and choose the best one for their
own cases.
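
For instance, a minimal sketch of this setup could look as follows. The worker
count, core count, and toy graph are assumptions made up for illustration, not
values taken from this document.

.. code-block:: python

    import torch
    import dgl
    from dgl.dataloading import DataLoader, NeighborSampler

    # Toy graph and training node IDs, only to make the sketch self-contained.
    g = dgl.rand_graph(1000, 5000)
    train_nids = torch.arange(100)
    sampler = NeighborSampler([10, 10])

    num_workers = 4       # assumed number of dataloading worker processes
    num_cpu_cores = 32    # assumed number of CPU cores on the node

    # With num_workers > 0, each worker process keeps the PyTorch default of
    # 1 OpenMP thread, so give the main (training) process the remaining cores.
    dgl.utils.set_num_threads(num_cpu_cores - num_workers)

    dataloader = DataLoader(
        g, train_nids, sampler,
        batch_size=64,
        shuffle=True,
        num_workers=num_workers,
    )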
**Dataloader CPU affinity**
.. note::
   This feature is available for `dgl.dataloading.DataLoader` only. It is not
   yet available for dataloaders in `dgl.graphbolt`.
If the number of dataloader workers is more than 0, please consider using the
**use_cpu_affinity()** method of the DGL Dataloader class; it will generally
result in a significant performance improvement for training.
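
As a rough sketch only: this text names the method **use_cpu_affinity()**, while
some DGL releases expose the affinity helper as a context manager called
`enable_cpu_affinity()`, so check the API reference of the installed release.
Reusing the `dataloader` built in the previous sketch, it might look like:

.. code-block:: python

    # Sketch only: the affinity helper's name varies across DGL releases
    # (use_cpu_affinity() in this text, enable_cpu_affinity() in others).
    with dataloader.enable_cpu_affinity():
        for input_nodes, output_nodes, blocks in dataloader:
            pass  # run the usual forward/backward training step here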
...