.. _`Efficient Coding`:

Efficient Coding
================

Monitor the GPU Utilization
----------------------------

The rule of thumb is to make full use of the computation power you have. When working with a GPU,
that means maximizing the fraction of time during which GPU kernels are running. This
can be monitored via the GPU utilization (last column) reported by the `nvidia-smi` command:

.. image:: ../_static/images/coding/gpu_util.png
   :align: center

|

If the GPU utilization is less than 100%, some GPU kernels are idling
from time to time (if not all the time), which is of course not good for performance.
Under the hood, PyTorch will try to parallelize the computation across the entire GPU,
but in most cases you won't see 100%. Why is that?

The first reason is simply that there might not be much to parallelize, in which case
there is not much to do other than increasing the batch size. For example:

.. code-block:: python

    # Only 1000 threads are running at the same time to create this tensor,
    # so we see 28% GPU utilization.
    while True: torch.zeros((1000), device="cuda")

    # Now we see 100% GPU utilization.
    while True: torch.zeros((10000000), device="cuda")

The second reason, which is more common, is that there is *CPU-GPU synchronization*
happening in the code. For example:

.. code-block:: python

    data = torch.rand((10000000), device="cuda")
    mask = data > 0.5
    ids = torch.where(mask)[0]
    assert torch.all(data[mask] == data[ids])

    # 100% GPU utilization.
    while True: data[ids]

    # 95% GPU utilization.
    while True: data[mask]

Besides, if there are many CPU operations in the pipeline, such as data loading and
preprocessing, they can also cause the GPU utilization to be low. In this case, you
can use `torch.utils.data.DataLoader` to overlap the data processing time
with the GPU computation time.
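
As a minimal sketch of this overlap (the dataset, its sizes, and the preprocessing here are made up for illustration), background workers prepare batches on the CPU while the GPU consumes them:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset whose preprocessing happens on the CPU."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        # Stand-in for real CPU-side preprocessing work.
        return torch.full((8,), float(idx))


# num_workers > 0 moves preprocessing into background worker processes;
# pin_memory=True enables asynchronous host-to-device copies.
loader = DataLoader(ToyDataset(), batch_size=16, num_workers=2, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    # non_blocking=True lets the copy overlap with ongoing GPU compute
    # when the source tensor is pinned.
    batch = batch.to(device, non_blocking=True)
```

With workers prefetching the next batch while the current one is on the GPU, the GPU no longer stalls waiting for CPU-side preprocessing.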

Avoid CPU-GPU Synchronization
-----------------------------

In the above example, if you time your code, you will see a significant difference:

.. code-block:: python

    # 177 µs ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    %timeit data[ids]

    # 355 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    %timeit data[mask]

Explanation: the mask operation needs to decide the size of the output tensor, which
lives on the CPU, based on the number of `True` values in the mask, which lives on the GPU, so
a synchronization is required. The index-selection operation, on the other hand,
already knows the size of the output tensor from the size of `ids`, so no
synchronization is required.
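
To make the shape argument concrete, here is a small sketch (run on CPU tensors so it works anywhere; the synchronization cost itself only appears with CUDA tensors):

```python
import torch

data = torch.rand(1000)
mask = data > 0.5
ids = torch.where(mask)[0]

# Index selection: the output size is known up front from len(ids),
# so the output can be allocated without inspecting device data.
assert data[ids].shape == ids.shape

# Boolean masking: the output size equals the number of True entries,
# which must be counted (a device-to-host sync on CUDA) before allocation.
assert data[mask].shape[0] == int(mask.sum())
```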

You may argue that in this case we can't really improve things, because the `ids` are computed
from the `mask`, which requires a synchronization anyway (`torch.where`). However,
in many cases we can avoid the synchronization by writing the code carefully. For example:

.. code-block:: python

    # no sync. 67.3 µs ± 5.01 ns per loop
    while True: torch.zeros((10000000), device="cuda")

    # sync. 13.7 ms ± 320 µs per loop
    while True: torch.zeros((10000000)).to("cuda")

Operations that require synchronization include `torch.where`, `tensor.item()`,
`print(tensor)`, `tensor.to(device)`, `torch.nonzero`, etc. Just imagine those functions
having an implicit `torch.cuda.synchronize()` call under the hood. See the
`official guide <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#avoid-unnecessary-cpu-gpu-synchronization>`_
for more details.
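
A common pattern where this matters is a training loop, sketched below with made-up values: calling `tensor.item()` every step forces one synchronization per iteration, whereas keeping the running values on the device and syncing once at the end removes that per-step stall.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

losses = []
for step in range(10):
    loss = (torch.ones(4, device=device) * step).mean()
    # loss.item() here would sync every iteration; instead keep the
    # value on the device and defer the transfer.
    losses.append(loss.detach())

# A single synchronization at the very end.
total = torch.stack(losses).sum().item()
print(total)  # 45.0 = 0 + 1 + ... + 9
```

The deferred version issues all the GPU work first and only waits once, instead of stalling the GPU pipeline at every step.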

In our library, :func:`nerfacc.traverse_grids` is a function that requires synchronization,
because it needs to know the size of the output tensor when traversing the grids. As a result,
sampling with :class:`nerfacc.OccGridEstimator` also requires synchronization. There is
no workaround in this case, so just be aware of it.
Which Estimator to use?
-----------------------

- :class:`nerfacc.OccGridEstimator` is generally more efficient when most of the space in the scene is empty, such as in the case of the `NeRF-Synthetic`_ dataset. But it still places samples within occluded areas that contribute little to the final rendering (e.g., the last sample in the above illustration).
- :class:`nerfacc.PropNetEstimator` generally provides more accurate transmittance estimation, enabling samples to concentrate more on high-contribution areas (e.g., surfaces) and to be more spread out in both empty and occluded regions. This method also works nicely on unbounded scenes, as it does not require a preset bounding box of the scene. Thus datasets like `Mip-NeRF 360`_ are better suited to this estimator.

.. _`SIGGRAPH 2017 Course: Production Volume Rendering`: https://graphics.pixar.com/library/ProductionVolumeRendering/paper.pdf
.. _`Instant-NGP`: https://arxiv.org/abs/2201.05989
.. _`Mip-NeRF 360`: https://arxiv.org/abs/2111.12077