Unverified Commit 5637cc9a authored by Ruilong Li (李瑞龙), committed by GitHub

Optimize examples for better performance (#59)

* zeros -> empty in cuda

* disable cuda hint if it has already been built

* update examples with better perf

* bump version to 0.2.0

* update readme

* update index and readme

* clean up doc
parent 3d958321
......@@ -4,7 +4,20 @@
https://www.nerfacc.com/
NerfAcc is a PyTorch Nerf acceleration toolbox for both training and inference. It focuses on
efficient volumetric rendering of radiance fields, which is universal and plug-and-play for most NeRFs.
Using NerfAcc,
- The `vanilla NeRF` model with 8-layer MLPs can be trained to *better quality* (+~0.5 PSNR) \
in *1 hour* rather than *1~2 days* as in the paper.
- The `Instant-NGP NeRF` model can be trained to *better quality* (+~0.7 PSNR) with *9/10th* of \
the training time (4.5 minutes) compared to the official pure-CUDA implementation.
- The `D-NeRF` model for *dynamic* objects can also be trained in *1 hour* \
rather than *2 days* as in the paper, and with *better quality* (+~0.5 PSNR).
- Both *bounded* and *unbounded* scenes are supported.

**And it is a pure Python interface with flexible APIs!**
## Installation
......@@ -12,31 +25,98 @@ This is a **tiny** toolbox for **accelerating** NeRF training & rendering using
pip install nerfacc
```
## Usage
The idea of NerfAcc is to perform efficient ray marching and volumetric rendering. So NerfAcc can work with any user-defined radiance field. To plug the NerfAcc rendering pipeline into your code and enjoy the acceleration, you only need to define two functions with your radiance field.
- `sigma_fn`: Compute density at each sample. It will be used by `nerfacc.ray_marching()` to skip the empty and occluded space during ray marching, which is where the major speedup comes from.
- `rgb_sigma_fn`: Compute color and density at each sample. It will be used by `nerfacc.rendering()` to conduct differentiable volumetric rendering. This function will receive gradients to update your network.
A simple example:
``` python
import torch
import torch.nn.functional as F
from typing import Tuple
from torch import Tensor

import nerfacc

radiance_field = ...  # network: a NeRF model
optimizer = ...  # network optimizer
rays_o: Tensor = ...  # ray origins. (n_rays, 3)
rays_d: Tensor = ...  # ray normalized directions. (n_rays, 3)
color_gt: Tensor = ...  # ground-truth pixel colors. (n_rays, 3)

def sigma_fn(
    t_starts: Tensor, t_ends: Tensor, ray_indices: Tensor
) -> Tensor:
    """Query density values from a user-defined radiance field.
    :params t_starts: Start of the sample interval along the ray. (n_samples, 1).
    :params t_ends: End of the sample interval along the ray. (n_samples, 1).
    :params ray_indices: Ray indices that each sample belongs to. (n_samples,).
    :returns: The post-activation density values. (n_samples, 1).
    """
    t_origins = rays_o[ray_indices]  # (n_samples, 3)
    t_dirs = rays_d[ray_indices]  # (n_samples, 3)
    # evaluate the field at the midpoint of each sample interval
    positions = t_origins + t_dirs * (t_starts + t_ends) / 2.0
    sigmas = radiance_field.query_density(positions)
    return sigmas  # (n_samples, 1)

def rgb_sigma_fn(
    t_starts: Tensor, t_ends: Tensor, ray_indices: Tensor
) -> Tuple[Tensor, Tensor]:
    """Query rgb and density values from a user-defined radiance field.
    :params t_starts: Start of the sample interval along the ray. (n_samples, 1).
    :params t_ends: End of the sample interval along the ray. (n_samples, 1).
    :params ray_indices: Ray indices that each sample belongs to. (n_samples,).
    :returns: The post-activation rgb and density values.
        (n_samples, 3), (n_samples, 1).
    """
    t_origins = rays_o[ray_indices]  # (n_samples, 3)
    t_dirs = rays_d[ray_indices]  # (n_samples, 3)
    positions = t_origins + t_dirs * (t_starts + t_ends) / 2.0
    rgbs, sigmas = radiance_field(positions, condition=t_dirs)
    return rgbs, sigmas  # (n_samples, 3), (n_samples, 1)

# Efficient Raymarching: Skip empty and occluded space, pack samples from all rays.
# packed_info: (n_rays, 2). t_starts: (n_samples, 1). t_ends: (n_samples, 1).
packed_info, t_starts, t_ends = nerfacc.ray_marching(
    rays_o, rays_d, sigma_fn=sigma_fn, near_plane=0.2, far_plane=1.0,
    early_stop_eps=1e-4, alpha_thre=1e-2,
)

# Differentiable Volumetric Rendering.
# colors: (n_rays, 3). opacity: (n_rays, 1). depth: (n_rays, 1).
color, opacity, depth = nerfacc.rendering(rgb_sigma_fn, packed_info, t_starts, t_ends)

# Optimize the radiance field.
optimizer.zero_grad()
loss = F.mse_loss(color, color_gt)
loss.backward()
optimizer.step()
```
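If you need the per-sample ray indices outside of those two callbacks (say, for a custom per-ray loss), the packed samples can be mapped back to their rays. A minimal sketch using `nerfacc.unpack_info`, the same helper that produces the `ray_indices` passed to your callbacks:

``` python
# packed_info stores (start, count) per ray; unpack_info expands it into
# one ray index per packed sample.
ray_indices = nerfacc.unpack_info(packed_info)  # (n_samples,)
sample_origins = rays_o[ray_indices.long()]     # (n_samples, 3)
```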
## Examples:
Before running the example scripts, please check each script for the dataset it requires, and
download that dataset first.
``` bash
# Instant-NGP NeRF in 4.5 minutes with better performance!
# See results here: https://www.nerfacc.com/en/latest/examples/ngp.html
python examples/train_ngp_nerf.py --train_split trainval --scene lego
```
``` bash
# Vanilla MLP NeRF in 1 hour with better performance!
# See results here: https://www.nerfacc.com/en/latest/examples/vanilla.html
python examples/train_mlp_nerf.py --train_split train --scene lego
```
```bash
# D-NeRF for Dynamic objects in 1 hour with better performance!
# See results here: https://www.nerfacc.com/en/latest/examples/dnerf.html
python examples/train_mlp_dnerf.py --train_split train --scene lego
```
```bash
# Instant-NGP on unbounded scenes in 20 minutes!
# See results here: https://www.nerfacc.com/en/latest/examples/unbounded.html
python examples/train_ngp_nerf.py --train_split train --scene garden --auto_aabb --unbounded --cone_angle=0.004
```
......@@ -9,7 +9,7 @@ Benchmarks
Here we trained an 8-layer MLP for the radiance field and a 4-layer MLP for the warping field
(similar to the T-Nerf model in the `D-Nerf`_ paper) on the `D-Nerf dataset`_. We used the train
split for training and the test split for evaluation. Our experiments are conducted on a
single NVIDIA TITAN RTX GPU. The training memory footprint is about 11GB.
.. note::
......@@ -19,12 +19,12 @@ single NVIDIA TITAN RTX GPU.
It is not optimal but still makes the rendering very efficient.
+----------------------+----------+---------+-------+---------+-------+--------+---------+-------+-------+
| PSNR                 | bouncing | hell    | hook  | jumping | lego  | mutant | standup | trex  | MEAN  |
|                      | balls    | warrior |       | jacks   |       |        |         |       |       |
+======================+==========+=========+=======+=========+=======+========+=========+=======+=======+
| D-Nerf (~ days)      | 38.93    | 25.02   | 29.25 | 32.80   | 21.64 | 31.29  | 32.79   | 31.75 | 30.43 |
+----------------------+----------+---------+-------+---------+-------+--------+---------+-------+-------+
| Ours (~ 50min)       | 39.60    | 22.41   | 30.64 | 29.79   | 24.75 | 35.20  | 34.50   | 31.83 | 31.09 |
+----------------------+----------+---------+-------+---------+-------+--------+---------+-------+-------+
| Ours (Training time) | 45min    | 49min   | 51min | 46min   | 53min | 57min  | 49min   | 46min | 50min |
+----------------------+----------+---------+-------+---------+-------+--------+---------+-------+-------+
......
......@@ -7,6 +7,7 @@ See code `examples/train_ngp_nerf.py` at our `github repository`_ for details.
Benchmarks
------------
*updated on 2022-10-08*
Here we trained an `Instant-NGP Nerf`_ model on the `Nerf-Synthetic dataset`_. We follow the same
settings as the Instant-NGP paper, which uses the trainval split for training and the test split for
......@@ -18,16 +19,16 @@ memory footprint is about 3GB.
The Instant-NGP paper makes use of the alpha channel in the images to apply random background
augmentation during training, whereas we use only the RGB values with a constant white background.
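A rough sketch of the two strategies (the tensor names and the presence of an alpha
channel are assumptions for illustration, not the script's actual variables):

.. code-block:: python

    # rgb: (N, 3) foreground color, alpha: (N, 1) coverage, both in [0, 1].
    # random background augmentation, as in the Instant-NGP paper:
    bkgd = torch.rand(3, device=rgb.device)
    rgb_train = rgb * alpha + bkgd * (1.0 - alpha)
    # constant white background, as used for these benchmarks:
    rgb_train = rgb * alpha + (1.0 - alpha)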
+----------------------+-------+-------+---------+-------+-------+-------+-------+-------+-------+
| PSNR                 | Lego  | Mic   |Materials| Chair |Hotdog | Ficus | Drums | Ship  | MEAN  |
+======================+=======+=======+=========+=======+=======+=======+=======+=======+=======+
| Instant-NGP (5min)   | 36.39 | 36.22 | 29.78   | 35.00 | 37.40 | 33.51 | 26.02 | 31.10 | 33.18 |
+----------------------+-------+-------+---------+-------+-------+-------+-------+-------+-------+
| Ours (~4.5min)       | 36.82 | 37.61 | 30.18   | 36.13 | 38.11 | 34.48 | 26.62 | 31.37 | 33.92 |
+----------------------+-------+-------+---------+-------+-------+-------+-------+-------+-------+
| Ours (Training time) | 288s  | 259s  | 256s    | 324s  | 288s  | 245s  | 262s  | 257s  | 272s  |
+----------------------+-------+-------+---------+-------+-------+-------+-------+-------+-------+
.. _`Instant-NGP Nerf`: https://arxiv.org/abs/2201.05989
.. _`github repository`: https://github.com/KAIR-BAIR/nerfacc/
......
......@@ -5,10 +5,11 @@ See code `examples/train_ngp_nerf.py` at our `github repository`_ for details.
Benchmarks
------------
*updated on 2022-10-08*
Here we trained an `Instant-NGP Nerf`_ on the `MipNerf360`_ dataset. We used the train
split for training and the test split for evaluation. Our experiments are conducted on a
single NVIDIA TITAN RTX GPU. The training memory footprint is about 6-9GB.
The main difference between working with unbounded scenes and bounded scenes is that
a contraction method is needed to map the infinite space into a finite :ref:`Occupancy Grid`.
......@@ -23,18 +24,18 @@ that takes from `MipNerf360`_.
show how to use the library, we didn't want to make it too complicated.
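A minimal sketch of such a contraction, in the spirit of the MipNerf360-style
``contract_to_unisphere`` used by the example code (this standalone helper is an
illustration, not the library's exact function):

.. code-block:: python

    import torch

    def contract_unisphere(x: torch.Tensor) -> torch.Tensor:
        # identity inside the unit sphere; points outside map to
        # (2 - 1/|x|) * x/|x|, so all of space lands in a ball of radius 2.
        norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * (x / norm))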
+----------------------+-------+-------+-------+-------+-------+-------+-------+-------+
| PSNR                 |Garden |Bicycle|Bonsai |Counter|Kitchen| Room  | Stump | MEAN  |
+======================+=======+=======+=======+=======+=======+=======+=======+=======+
| Nerf++ (~days)       | 24.32 | 22.64 | 29.15 | 26.38 | 27.80 | 28.87 | 24.34 | 26.21 |
+----------------------+-------+-------+-------+-------+-------+-------+-------+-------+
| MipNerf360 (~days)   | 26.98 | 24.37 | 33.46 | 29.55 | 32.23 | 31.63 | 28.65 | 29.55 |
+----------------------+-------+-------+-------+-------+-------+-------+-------+-------+
| Ours (~20 mins)      | 25.41 | 22.97 | 30.71 | 27.34 | 30.32 | 31.00 | 23.43 | 27.31 |
+----------------------+-------+-------+-------+-------+-------+-------+-------+-------+
| Ours (Training time) | 25min | 17min | 19min | 23min | 28min | 20min | 17min | 21min |
+----------------------+-------+-------+-------+-------+-------+-------+-------+-------+
.. _`Instant-NGP Nerf`: https://arxiv.org/abs/2201.05989
.. _`MipNerf360`: https://arxiv.org/abs/2111.12077
......
......@@ -8,7 +8,7 @@ Benchmarks
Here we trained an 8-layer MLP for the radiance field as in the `vanilla Nerf`_ paper. We used the
train split for training and the test split for evaluation as in the Nerf paper. Our experiments are
conducted on a single NVIDIA TITAN RTX GPU. The training memory footprint is about 10GB.
.. note::
The vanilla Nerf paper uses two MLPs for coarse-to-fine sampling. Instead here we only use a
......@@ -17,16 +17,16 @@ conducted on a single NVIDIA TITAN RTX GPU.
so we can simply increase the number of samples with a single MLP to achieve the same goal
as coarse-to-fine sampling, without runtime or memory issues.
+----------------------+-------+-------+---------+-------+-------+-------+-------+-------+-------+
| PSNR                 | Lego  | Mic   |Materials| Chair |Hotdog | Ficus | Drums | Ship  | MEAN  |
+======================+=======+=======+=========+=======+=======+=======+=======+=======+=======+
| NeRF (~ days)        | 32.54 | 32.91 | 29.62   | 33.00 | 36.18 | 30.13 | 25.01 | 28.65 | 31.00 |
+----------------------+-------+-------+---------+-------+-------+-------+-------+-------+-------+
| Ours (~ 50min)       | 33.69 | 33.76 | 29.73   | 33.32 | 35.80 | 32.52 | 25.39 | 28.18 | 31.55 |
+----------------------+-------+-------+---------+-------+-------+-------+-------+-------+-------+
| Ours (Training time) | 58min | 53min | 46min   | 62min | 56min | 42min | 52min | 49min | 52min |
+----------------------+-------+-------+---------+-------+-------+-------+-------+-------+-------+
.. _`github repository`: https://github.com/KAIR-BAIR/nerfacc/
.. _`vanilla Nerf`: https://arxiv.org/abs/2003.08934
NerfAcc Documentation
===================================
NerfAcc is a PyTorch Nerf acceleration toolbox for both training and inference. It focuses on
efficient volumetric rendering of radiance fields, which is universal and plug-and-play for most NeRFs.
Using NerfAcc,
- The `vanilla Nerf`_ model with 8-layer MLPs can be trained to *better quality* (+~0.5 PSNR) \
in *1 hour* rather than *1~2 days* as in the paper.
- The `Instant-NGP Nerf`_ model can be trained to *better quality* (+~0.7 PSNR) with *9/10th* of \
the training time (4.5 minutes) compared to the official pure-CUDA implementation.
- The `D-Nerf`_ model for *dynamic* objects can also be trained in *1 hour* \
rather than *2 days* as in the paper, and with *better quality* (+~0.5 PSNR).
- Both *bounded* and *unbounded* scenes are supported.

**And it is a pure Python interface with flexible APIs!**
| Github: https://github.com/KAIR-BAIR/nerfacc
| Authors: `Ruilong Li`_, `Matthew Tancik`_, `Angjoo Kanazawa`_
.. note::
This repo focuses on the single-scene setting. Generalizable Nerfs across
multiple scenes are currently out of the scope of this repo, but you may still find
some useful tricks here. :)
Installation:
-------------
......@@ -28,6 +33,82 @@ Installation:
$ pip install nerfacc
Usage:
-------------
The idea of NerfAcc is to perform efficient ray marching and volumetric rendering.
So NerfAcc can work with any user-defined radiance field. To plug the NerfAcc rendering
pipeline into your code and enjoy the acceleration, you only need to define two functions
with your radiance field.
- `sigma_fn`: Compute density at each sample. It will be used by :func:`nerfacc.ray_marching` to skip the empty and occluded space during ray marching, which is where the major speedup comes from.
- `rgb_sigma_fn`: Compute color and density at each sample. It will be used by :func:`nerfacc.rendering` to conduct differentiable volumetric rendering. This function will receive gradients to update your network.
A simple example:
.. code-block:: python

    import torch
    import torch.nn.functional as F
    from typing import Tuple
    from torch import Tensor

    import nerfacc

    radiance_field = ...  # network: a NeRF model
    optimizer = ...  # network optimizer
    rays_o: Tensor = ...  # ray origins. (n_rays, 3)
    rays_d: Tensor = ...  # ray normalized directions. (n_rays, 3)
    color_gt: Tensor = ...  # ground-truth pixel colors. (n_rays, 3)

    def sigma_fn(
        t_starts: Tensor, t_ends: Tensor, ray_indices: Tensor
    ) -> Tensor:
        """Query density values from a user-defined radiance field.
        :params t_starts: Start of the sample interval along the ray. (n_samples, 1).
        :params t_ends: End of the sample interval along the ray. (n_samples, 1).
        :params ray_indices: Ray indices that each sample belongs to. (n_samples,).
        :returns: The post-activation density values. (n_samples, 1).
        """
        t_origins = rays_o[ray_indices]  # (n_samples, 3)
        t_dirs = rays_d[ray_indices]  # (n_samples, 3)
        # evaluate the field at the midpoint of each sample interval
        positions = t_origins + t_dirs * (t_starts + t_ends) / 2.0
        sigmas = radiance_field.query_density(positions)
        return sigmas  # (n_samples, 1)

    def rgb_sigma_fn(
        t_starts: Tensor, t_ends: Tensor, ray_indices: Tensor
    ) -> Tuple[Tensor, Tensor]:
        """Query rgb and density values from a user-defined radiance field.
        :params t_starts: Start of the sample interval along the ray. (n_samples, 1).
        :params t_ends: End of the sample interval along the ray. (n_samples, 1).
        :params ray_indices: Ray indices that each sample belongs to. (n_samples,).
        :returns: The post-activation rgb and density values.
            (n_samples, 3), (n_samples, 1).
        """
        t_origins = rays_o[ray_indices]  # (n_samples, 3)
        t_dirs = rays_d[ray_indices]  # (n_samples, 3)
        positions = t_origins + t_dirs * (t_starts + t_ends) / 2.0
        rgbs, sigmas = radiance_field(positions, condition=t_dirs)
        return rgbs, sigmas  # (n_samples, 3), (n_samples, 1)

    # Efficient Raymarching: Skip empty and occluded space, pack samples from all rays.
    # packed_info: (n_rays, 2). t_starts: (n_samples, 1). t_ends: (n_samples, 1).
    packed_info, t_starts, t_ends = nerfacc.ray_marching(
        rays_o, rays_d, sigma_fn=sigma_fn, near_plane=0.2, far_plane=1.0,
        early_stop_eps=1e-4, alpha_thre=1e-2,
    )

    # Differentiable Volumetric Rendering.
    # colors: (n_rays, 3). opacity: (n_rays, 1). depth: (n_rays, 1).
    color, opacity, depth = nerfacc.rendering(rgb_sigma_fn, packed_info, t_starts, t_ends)

    # Optimize the radiance field.
    optimizer.zero_grad()
    loss = F.mse_loss(color, color_gt)
    loss.backward()
    optimizer.step()
Links:
-------------
.. toctree::
:glob:
:maxdepth: 1
......@@ -53,4 +134,9 @@ Installation:
.. _`Instant-NGP Nerf`: https://arxiv.org/abs/2201.05989
.. _`D-Nerf`: https://arxiv.org/abs/2011.13961
.. _`MipNerf360`: https://arxiv.org/abs/2111.12077
.. _`pixel-Nerf`: https://arxiv.org/abs/2012.02190
.. _`Nerf++`: https://arxiv.org/abs/2010.07492
.. _`Ruilong Li`: https://www.liruilong.cn/
.. _`Matthew Tancik`: https://www.matthewtancik.com/
.. _`Angjoo Kanazawa`: https://people.eecs.berkeley.edu/~kanazawa/
......@@ -248,8 +248,8 @@ class VanillaNeRFRadianceField(nn.Module):
class DNeRFRadianceField(nn.Module):
def __init__(self) -> None:
super().__init__()
self.posi_encoder = SinusoidalEncoder(3, 0, 4, True)
self.time_encoder = SinusoidalEncoder(1, 0, 4, True)
self.warp = MLP(
input_dim=self.posi_encoder.latent_dim
+ self.time_encoder.latent_dim,
......
......@@ -141,17 +141,6 @@ class NGPradianceField(torch.nn.Module):
},
)
def query_opacity(self, x, step_size):
density = self.query_density(x)
if self.unbounded:
# NOTE: In principle, we should use the following formula to scale
# up the step size, but in practice, it is somehow not helpful.
# derivative = contract_to_unisphere(x, self.aabb, derivative=True)
# step_size = step_size / derivative.norm(dim=-1, keepdim=True)
pass
opacity = density * step_size
return opacity
def query_density(self, x, return_feat: bool = False):
if self.unbounded:
x = contract_to_unisphere(x, self.aabb)
......
......@@ -2,4 +2,5 @@ git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
opencv-python
imageio
numpy
tqdm
scipy
......@@ -76,7 +76,7 @@ if __name__ == "__main__":
).item()
# setup the radiance field we want to train.
max_steps = 30000
grad_scaler = torch.cuda.amp.GradScaler(1)
radiance_field = DNeRFRadianceField().to(device)
optimizer = torch.optim.Adam(radiance_field.parameters(), lr=5e-4)
......@@ -156,9 +156,12 @@ if __name__ == "__main__":
render_step_size=render_step_size,
render_bkgd=render_bkgd,
cone_angle=args.cone_angle,
alpha_thre=0.01 if step > 1000 else 0.00,
# dnerf options
timestamps=timestamps,
)
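# if ray marching produced no samples (e.g. everything was pruned by the
# occupancy grid), skip this iteration: there is nothing to render or optimize.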
if n_rendering_samples == 0:
continue
# dynamic batch size for rays to keep sample batch size constant.
num_rays = len(pixels)
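# a sketch of the rescaling this refers to; target_sample_batch_size and
# update_num_rays are assumed names, not necessarily the script's exact ones:
# num_rays = int(num_rays * target_sample_batch_size / max(n_rendering_samples, 1))
# train_dataset.update_num_rays(num_rays)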
......@@ -213,6 +216,7 @@ if __name__ == "__main__":
render_step_size=render_step_size,
render_bkgd=render_bkgd,
cone_angle=args.cone_angle,
alpha_thre=0.01,
# test options
test_chunk_size=args.test_chunk_size,
# dnerf options
......
......@@ -186,6 +186,8 @@ if __name__ == "__main__":
render_bkgd=render_bkgd,
cone_angle=args.cone_angle,
)
if n_rendering_samples == 0:
continue
# dynamic batch size for rays to keep sample batch size constant.
num_rays = len(pixels)
......
......@@ -140,6 +140,7 @@ if __name__ == "__main__":
near_plane = 0.2
far_plane = 1e4
render_step_size = 1e-2
alpha_thre = 1e-2
else:
contraction_type = ContractionType.AABB
scene_aabb = torch.tensor(args.aabb, dtype=torch.float32, device=device)
......@@ -150,9 +151,10 @@ if __name__ == "__main__":
* math.sqrt(3)
/ render_n_samples
).item()
alpha_thre = 0.0
# setup the radiance field we want to train.
max_steps = 20000
grad_scaler = torch.cuda.amp.GradScaler(2**10)
radiance_field = NGPradianceField(
aabb=args.aabb,
......@@ -185,13 +187,33 @@ if __name__ == "__main__":
rays = data["rays"]
pixels = data["pixels"]
def occ_eval_fn(x):
if args.cone_angle > 0.0:
# randomly sample a camera for computing step size.
camera_ids = torch.randint(
0, len(train_dataset), (x.shape[0],), device=device
)
origins = train_dataset.camtoworlds[camera_ids, :3, -1]
t = (origins - x).norm(dim=-1, keepdim=True)
# compute actual step size used in marching, based on the distance to the camera.
step_size = torch.clamp(
t * args.cone_angle, min=render_step_size
)
# filter out the points that are outside the near/far range.
if (near_plane is not None) and (far_plane is not None):
step_size = torch.where(
(t > near_plane) & (t < far_plane),
step_size,
torch.zeros_like(step_size),
)
else:
step_size = render_step_size
# compute occupancy
density = radiance_field.query_density(x)
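# over a short step, opacity = 1 - exp(-density * step_size) ≈ density * step_size,
# so the product below is used directly as the occupancy estimate.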
return density * step_size
# update occupancy grid
occupancy_grid.every_n_step(step=step, occ_eval_fn=occ_eval_fn)
# render
rgb, acc, depth, n_rendering_samples = render_image(
......@@ -205,7 +227,10 @@ if __name__ == "__main__":
render_step_size=render_step_size,
render_bkgd=render_bkgd,
cone_angle=args.cone_angle,
alpha_thre=alpha_thre,
)
if n_rendering_samples == 0:
continue
# dynamic batch size for rays to keep sample batch size constant.
num_rays = len(pixels)
......@@ -254,11 +279,12 @@ if __name__ == "__main__":
rays,
scene_aabb,
# rendering options
near_plane=near_plane,
far_plane=far_plane,
render_step_size=render_step_size,
render_bkgd=render_bkgd,
cone_angle=args.cone_angle,
alpha_thre=alpha_thre,
# test options
test_chunk_size=args.test_chunk_size,
)
......
......@@ -30,6 +30,7 @@ def render_image(
render_step_size: float = 1e-3,
render_bkgd: Optional[torch.Tensor] = None,
cone_angle: float = 0.0,
alpha_thre: float = 0.0,
# test options
test_chunk_size: int = 8192,
# only useful for dnerf
......@@ -95,6 +96,7 @@ def render_image(
render_step_size=render_step_size,
stratified=radiance_field.training,
cone_angle=cone_angle,
alpha_thre=alpha_thre,
)
rgb, opacity, depth = rendering(
rgb_sigma_fn,
......
......@@ -50,4 +50,5 @@ __all__ = [
"unpack_info",
"ray_resampling",
"loss_distortion",
"unpack_to_ray_indices",
]
......@@ -7,7 +7,7 @@ import os
from subprocess import DEVNULL, call
from rich.console import Console
from torch.utils.cpp_extension import _get_build_directory, load
PATH = os.path.dirname(os.path.abspath(__file__))
......@@ -21,21 +21,32 @@ def cuda_toolkit_available():
return False
def load_extention(name: str):
return load(
name=name,
sources=glob.glob(os.path.join(PATH, "csrc/*.cu")),
extra_cflags=["-O3"],
extra_cuda_cflags=["-O3"],
)
_C = None
name = "nerfacc_cuda"
if os.listdir(_get_build_directory(name, verbose=False)) != []:
# If the build exists, we assume the extension has been built
# and we can load it.
_C = load_extention(name)
else:
# First time to build the extension
if cuda_toolkit_available():
with Console().status(
"[bold yellow]NerfAcc: Setting up CUDA (This may take a few minutes the first time)",
spinner="bouncingBall",
):
_C = load_extention(name)
else:
Console().print(
"[yellow]NerfAcc: No CUDA toolkit found. NerfAcc will be disabled.[/yellow]"
)
__all__ = ["_C"]
......@@ -73,7 +73,7 @@ torch::Tensor contract(
const int threads = 256;
const int blocks = CUDA_N_BLOCKS_NEEDED(n_samples, threads);
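// NOTE: empty() skips zero-initialization (unlike zeros()); this is safe here
// because contract_kernel below writes every one of the n_samples output rows.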
torch::Tensor out_samples = torch::empty({n_samples, 3}, samples.options());
contract_kernel<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
n_samples,
......@@ -99,7 +99,7 @@ torch::Tensor contract_inv(
const int threads = 256;
const int blocks = CUDA_N_BLOCKS_NEEDED(n_samples, threads);
torch::Tensor out_samples = torch::empty({n_samples, 3}, samples.options());
contract_inv_kernel<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
n_samples,
......
......@@ -91,7 +91,7 @@ torch::Tensor unpack_info(const torch::Tensor packed_info)
const int blocks = CUDA_N_BLOCKS_NEEDED(n_rays, threads);
int n_samples = packed_info[n_rays - 1].sum(0).item<int>();
torch::Tensor ray_indices = torch::empty(
{n_samples}, packed_info.options().dtype(torch::kInt32));
unpack_info_kernel<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
......
......@@ -223,7 +223,7 @@ std::vector<torch::Tensor> ray_marching(
const int blocks = CUDA_N_BLOCKS_NEEDED(n_rays, threads);
// helper counter
torch::Tensor num_steps = torch::empty(
{n_rays}, rays_o.options().dtype(torch::kInt32));
// count number of samples per ray
......@@ -253,8 +253,8 @@ std::vector<torch::Tensor> ray_marching(
// output samples starts and ends
int total_steps = cum_steps[cum_steps.size(0) - 1].item<int>();
torch::Tensor t_starts = torch::empty({total_steps, 1}, rays_o.options());
torch::Tensor t_ends = torch::empty({total_steps, 1}, rays_o.options());
ray_marching_kernel<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
// rays
......@@ -328,7 +328,7 @@ torch::Tensor grid_query(
const int threads = 256;
const int blocks = CUDA_N_BLOCKS_NEEDED(n_samples, threads);
torch::Tensor occs = torch::empty({n_samples}, grid_value.options());
AT_DISPATCH_FLOATING_TYPES_AND(
at::ScalarType::Bool,
......
......@@ -187,7 +187,7 @@ def ray_marching(
if sigma_fn is not None:
# Query sigma without gradients
ray_indices = unpack_info(packed_info)
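# unpack_info yields int32 indices; .long() casts to int64, which PyTorch
# advanced indexing (e.g. rays_o[ray_indices] inside sigma_fn) expects.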
sigmas = sigma_fn(t_starts, t_ends, ray_indices.long())
assert (
sigmas.shape == t_starts.shape
), "sigmas must have shape of (N, 1)! Got {}".format(sigmas.shape)
......
......@@ -80,7 +80,7 @@ def rendering(
ray_indices = unpack_info(packed_info)
# Query sigma and color with gradients
rgbs, sigmas = rgb_sigma_fn(t_starts, t_ends, ray_indices.long())
assert rgbs.shape[-1] == 3, "rgbs must have 3 channels, got {}".format(
rgbs.shape
)
......