.. _`Efficient Coding`:

Efficient Coding
================

Monitor the GPU Utilization
----------------------------

The rule of thumb is to make full use of the computation power you have. When working with a GPU,
that means maximizing the fraction of time during which GPU kernels are running. This
can be monitored via the GPU utilization (last column) reported by the `nvidia-smi` command:

.. image:: ../_static/images/coding/gpu_util.png
   :align: center

|

If the GPU utilization is less than 100%, some GPU kernels are idling
from time to time (if not all the time), which is of course not good for performance.
Under the hood, PyTorch will try to parallelize the computation across the entire GPU,
but in most cases you won't see 100%. Why is that?

The first reason is simply that there might not be much to parallelize, in which case
there is not much to do other than increasing the batch size. For example:

.. code-block:: python

    # Only 1000 threads are running at the same time to create this tensor,
    # so we see 28% GPU utilization.
    while True: torch.zeros((1000), device="cuda")

    # Now we see 100% GPU utilization.
    while True: torch.zeros((10000000), device="cuda")

The second reason, which is more common, is that there is *CPU-GPU synchronization*
happening in the code. For example:

.. code-block:: python

    data = torch.rand((10000000), device="cuda")
    mask = data > 0.5
    ids = torch.where(mask)[0]
    assert torch.all(data[mask] == data[ids])

    # 100% GPU utilization.
    while True: data[ids]

    # 95% GPU utilization.
    while True: data[mask]

Besides, if there are many CPU operations in the pipeline, such as data loading and
preprocessing, they can also cause the GPU utilization to be low. In this case, you
can use `torch.utils.data.DataLoader` to overlap the data processing time
with the GPU computation time.
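
As a minimal sketch of this overlap (the dataset, its sizes, and the preprocessing here are made up for illustration), background workers prepare batches on the CPU while the GPU consumes them:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset whose preprocessing happens on the CPU."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        # Stand-in for real CPU-side preprocessing work.
        return torch.full((8,), float(idx))


# num_workers > 0 moves preprocessing into background worker processes;
# pin_memory=True enables asynchronous host-to-device copies.
loader = DataLoader(ToyDataset(), batch_size=16, num_workers=2, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    # non_blocking=True lets the copy overlap with ongoing GPU compute
    # when the source tensor is pinned.
    batch = batch.to(device, non_blocking=True)
```

With workers prefetching the next batch while the current one is on the GPU, the GPU no longer stalls waiting for CPU-side preprocessing.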

Avoid CPU-GPU Synchronization
-----------------------------

In the above example, if you time your code, you will see a significant difference:

.. code-block:: python

    # 177 µs ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    %timeit data[ids]

    # 355 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    %timeit data[mask]

Explanation: the mask operation needs to decide the size of the output tensor, which
lives on the CPU, based on the number of `True` values in the mask, which lives on the GPU, so
a synchronization is required. The index-selection operation, on the other hand,
already knows the size of the output tensor from the size of `ids`, so no
synchronization is required.
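
To make the shape argument concrete, here is a small sketch (run on CPU tensors so it works anywhere; the synchronization cost itself only appears with CUDA tensors):

```python
import torch

data = torch.rand(1000)
mask = data > 0.5
ids = torch.where(mask)[0]

# Index selection: the output size is known up front from len(ids),
# so the output can be allocated without inspecting device data.
assert data[ids].shape == ids.shape

# Boolean masking: the output size equals the number of True entries,
# which must be counted (a device-to-host sync on CUDA) before allocation.
assert data[mask].shape[0] == int(mask.sum())
```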

You may argue that in this case we can't really improve things, because the `ids` are computed
from the `mask`, which requires a synchronization anyway (`torch.where`). However,
in many cases we can avoid the synchronization by writing the code carefully. For example:

.. code-block:: python

    # no sync. 67.3 µs ± 5.01 ns per loop
    while True: torch.zeros((10000000), device="cuda")

    # sync. 13.7 ms ± 320 µs per loop
    while True: torch.zeros((10000000)).to("cuda")

Operations that require synchronization include `torch.where`, `tensor.item()`,
`print(tensor)`, `tensor.to(device)`, `torch.nonzero`, etc. Just imagine those functions
having an implicit `torch.cuda.synchronize()` call under the hood. See the
`official guide <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#avoid-unnecessary-cpu-gpu-synchronization>`_
for more details.
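
A common pattern where this matters is a training loop, sketched below with made-up values: calling `tensor.item()` every step forces one synchronization per iteration, whereas keeping the running values on the device and syncing once at the end removes that per-step stall.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

losses = []
for step in range(10):
    loss = (torch.ones(4, device=device) * step).mean()
    # loss.item() here would sync every iteration; instead keep the
    # value on the device and defer the transfer.
    losses.append(loss.detach())

# A single synchronization at the very end.
total = torch.stack(losses).sum().item()
print(total)  # 45.0 = 0 + 1 + ... + 9
```

The deferred version issues all the GPU work first and only waits once, instead of stalling the GPU pipeline at every step.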

In our library, :func:`nerfacc.traverse_grids` is a function that requires synchronization,
because it needs to know the size of the output tensor when traversing the grids. As a result,
sampling with :class:`nerfacc.OccGridEstimator` also requires synchronization. There is
no workaround in this case, so just be aware of it.
Which Estimator to use?
-----------------------

- :class:`nerfacc.OccGridEstimator` is generally more efficient when most of the space in the scene is empty, such as in the case of the `NeRF-Synthetic`_ dataset. But it still places samples within occluded areas that contribute little to the final rendering (e.g., the last sample in the above illustration).
- :class:`nerfacc.PropNetEstimator` generally provides more accurate transmittance estimation, enabling samples to concentrate more on high-contribution areas (e.g., surfaces) and to be more spread out in both empty and occluded regions. This method also works nicely on unbounded scenes, as it does not require a preset bounding box of the scene. Thus datasets like `Mip-NeRF 360`_ are better suited to this estimator.

.. _`SIGGRAPH 2017 Course: Production Volume Rendering`: https://graphics.pixar.com/library/ProductionVolumeRendering/paper.pdf
.. _`Instant-NGP`: https://arxiv.org/abs/2201.05989
.. _`Mip-NeRF 360`: https://arxiv.org/abs/2111.12077