<li><ahref="#mipmaps-and-texture-dimensions">Mipmaps and texture dimensions</a></li>
<li><ahref="#rasterizing-with-cuda-vs-opengl-new">Rasterizing with CUDA vs OpenGL <spanstyle="color:red">(New!)</span></a></li>
<li><ahref="#running-on-multiple-gpus">Running on multiple GPUs</a>
<ul>
<li><ahref="#note-on-torch.nn.dataparallel">Note on torch.nn.DataParallel</a></li>
...
...
</nav></div>
<h2id="overview">Overview</h2>
<p>Nvdiffrast is a PyTorch/TensorFlow library that provides high-performance primitive operations for rasterization-based differentiable rendering. It is a lower-level library compared to previous ones such as <ahref="https://github.com/BachiLi/redner">redner</a>, <ahref="https://github.com/ShichenLiu/SoftRas">SoftRas</a>, or <ahref="https://github.com/facebookresearch/pytorch3d">PyTorch3D</a>— nvdiffrast has no built-in camera models, lighting/material models, etc. Instead, the provided operations encapsulate only the most graphics-centric steps in the modern hardware graphics pipeline: rasterization, interpolation, texturing, and antialiasing. All of these operations (and their gradients) are GPU-accelerated, either via CUDA or via the hardware graphics pipeline.</p>
This documentation is intended to serve as a user's guide to nvdiffrast. For detailed discussion on the design principles, implementation details, and benchmarks, please see our paper:
<blockquote>
<strong>Modular Primitives for High-Performance Differentiable Rendering</strong><br> Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, Timo Aila<br> ACM Transactions on Graphics 39(6) (proc. SIGGRAPH Asia 2020)
...
...
</div>
</div>
<h2id="installation">Installation</h2>
<p>Minimum requirements:</p>
<ul>
<li>Linux or Windows operating system.</li>
<li>64-bit Python 3.6.</li>
<li>PyTorch (recommended) 1.6 or TensorFlow 1.14. TensorFlow 2.x is currently not supported.</li>
<li>A high-end NVIDIA GPU, NVIDIA drivers, CUDA 10.2 toolkit.</li>
</ul>
<p>To download nvdiffrast, either download the repository at <a href="https://github.com/NVlabs/nvdiffrast">https://github.com/NVlabs/nvdiffrast</a> as a .zip file, or clone the repository using git:</p>
<p>We recommend running nvdiffrast on <a href="https://www.docker.com/">Docker</a>. To build a Docker image with nvdiffrast and PyTorch 1.6 installed, run:</p>
<p>We recommend using Ubuntu, as some Linux distributions might not have all the required packages available — at least CentOS is reportedly problematic.</p>
<p>To try out some of the provided code examples, run:</p>
<p>Alternatively, if you have all the dependencies taken care of (consult the included Dockerfile for reference), you can install nvdiffrast in your local Python site-packages by running</p>
...
...
<p>The rasterization operation takes as inputs a tensor of vertex positions and a tensor of vertex index triplets that specify the triangles. Vertex positions are specified in clip space, i.e., after modelview and projection transformations. Performing these transformations is left as the user's responsibility. In clip space, the view frustum is a cube in homogeneous coordinates where <span class="math inline"><em>x</em>/<em>w</em></span>, <span class="math inline"><em>y</em>/<em>w</em></span>, <span class="math inline"><em>z</em>/<em>w</em></span> are all between -1 and +1.</p>
<p>The output of the rasterization operation is a 4-channel float32 image with tuple (<span class="math inline"><em>u</em></span>, <span class="math inline"><em>v</em></span>, <span class="math inline"><em>z</em>/<em>w</em></span>, <span class="math inline"><em>triangle_id</em></span>) in each pixel. Values <span class="math inline"><em>u</em></span> and <span class="math inline"><em>v</em></span> are the barycentric coordinates within a triangle: the first vertex in the vertex index triplet obtains <span class="math inline">(<em>u</em>, <em>v</em>) = (1, 0)</span>, the second vertex <span class="math inline">(<em>u</em>, <em>v</em>) = (0, 1)</span>, and the third vertex <span class="math inline">(<em>u</em>, <em>v</em>) = (0, 0)</span>. Normalized depth value <span class="math inline"><em>z</em>/<em>w</em></span> is used later by the antialiasing operation to infer occlusion relations between triangles, and it does not propagate gradients to the vertex position input. Field <span class="math inline"><em>triangle_id</em></span> is the triangle index, offset by one. Pixels where no triangle was rasterized will receive a zero in all channels.</p>
<p>Rasterization is point-sampled, i.e., the geometry is not smoothed, blurred, or made partially transparent in any way, in contrast to some previous differentiable rasterizers. The contents of a pixel always represent a single surface point that is on the closest surface visible along the ray through the pixel center.</p>
<p>Point-sampled coverage does not produce vertex position gradients related to occlusion and visibility effects. This is because the motion of vertices does not change the coverage in a continuous way — a triangle is either rasterized into a pixel or not. In nvdiffrast, the occlusion/visibility related gradients are generated in the antialiasing operation that typically occurs towards the end of the rendering pipeline.</p>
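<p>To make the tensor shapes concrete, here is a minimal sketch of calling the rasterizer; it is not part of the official sample code, and the resolution and vertex values are arbitrary.</p>
<pre><code>import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeCudaContext()  # or dr.RasterizeGLContext()

# Clip-space vertex positions [minibatch, num_vertices, 4] and triangle indices [num_triangles, 3].
pos = torch.tensor([[[-0.8, -0.8, 0.0, 1.0],
                     [ 0.8, -0.8, 0.0, 1.0],
                     [ 0.0,  0.8, 0.0, 1.0]]], dtype=torch.float32, device='cuda')
tri = torch.tensor([[0, 1, 2]], dtype=torch.int32, device='cuda')

rast_out, _ = dr.rasterize(glctx, pos, tri, resolution=[256, 256])

# Per-pixel output channels: barycentrics u, v, normalized depth z/w, and triangle_id (offset by one).
u, v, zw, triangle_id = torch.unbind(rast_out, dim=-1)</code></pre>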
<divclass="image-parent">
<divclass="image-row">
<divclass="image-caption">
...
...
<p>where <code>rast_out</code> is the output of the rasterization operation. We simply test if the <span class="math inline"><em>triangle_id</em></span> field, i.e., channel 3 of the rasterizer output, is greater than zero, indicating that a triangle was rendered in that pixel. If so, we take the color from the textured image, and otherwise we take constant 1.0.</p>
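<p>The snippet referred to above is not reproduced here, but in PyTorch the described test could look roughly like the following sketch, where <code>color</code> stands for the textured image:</p>
<pre><code># rast_out[..., 3:] is the triangle_id channel; zero means background.
img = torch.where(rast_out[..., 3:] > 0, color, torch.ones_like(color))</code></pre>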
<h3id="antialiasing">Antialiasing</h3>
<p>The last of the four primitive operations in nvdiffrast is antialiasing. Based on the geometry input (vertex positions and triangles), it will smooth out discontinuties at silhouette edges in a given image. The smoothing is based on a local approximation of coverage— an approximate integral over a pixel is calculated based on the exact location of relevant edges and the point-sampled colors at pixel centers.</p>
<p>In this context, a silhouette is any edge that connects to just one triangle, or connects two triangles so that one folds behind the other. Specifically, this includes both silhouettes against the background and silhouettes against another surface, unlike some previous methods (<a href="https://github.com/nv-tlabs/DIB-R">DIB-R</a>) that only support the former kind.</p>
<p>It is worth discussing why we might want to go through this trouble to improve the image a tiny bit. If we're attempting to, say, match a real-world photograph, a slightly smoother edge probably won't match the captured image much better than a jagged one. However, that is not the point of the antialiasing operation — the real goal is to obtain gradients w.r.t. vertex positions related to occlusion, visibility, and coverage.</p>
<p>Remember that everything up to this point in the rendering pipeline is point-sampled. In particular, the coverage, i.e., which triangle is rasterized to which pixel, changes discontinuously in the rasterization operation.</p>
<p>This is the reason why previous differentiable rasterizers apply a nonstandard image synthesis model with blur and transparency: Something has to make coverage continuous w.r.t. vertex positions if we wish to optimize vertex positions, camera position, etc., based on an image-space loss. In nvdiffrast, we do everything point-sampled so that we know that every pixel corresponds to a single, well-defined surface point. This lets us perform arbitrary shading computations without worrying about things like accidentally blurring texture coordinates across silhouettes, or having attributes mysteriously tend towards background color when getting close to the edge of the object. Only towards the end of the pipeline does the antialiasing operation ensure that the motion of vertex positions results in continuous change on silhouettes.</p>
<p>The antialiasing operation supports any number of channels in the image to be antialiased. Thus, if your rendering pipeline produces an abstract representation that is fed to a neural network for further processing, that is not a problem.</p>
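<p>As a rough usage sketch (the tensor names follow the earlier examples and are not taken from the official samples), antialiasing is applied to the final image using the same geometry tensors that were rasterized:</p>
<pre><code># color: [minibatch, height, width, channels] image; pos, tri: tensors given to dr.rasterize().
color_aa = dr.antialias(color, rast_out, pos, tri)</code></pre>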
...
...
</div>
</div>
</div>
<p>The left image above shows the result image from the last step, after performing antialiasing. The effect is quite small — some boundary pixels become less jagged, as shown in the closeups.</p>
<p>Notably, not all boundary pixels are antialiased as revealed by the left-side image below. This is because the accuracy of the antialiasing operation in nvdiffrast depends on the rendered size of triangles: Because we store knowledge of just one surface point per pixel, antialiasing is possible only when the triangle that contains the actual geometric silhouette edge is visible in the image. The example image is rendered in very low resolution and the triangles are tiny compared to pixels. Thus, triangles get easily lost between the pixels.</p>
<p>This results in incomplete-looking antialiasing, and the gradients provided by antialiasing become noisier when edge triangles are missed. Therefore it is advisable to render images in resolutions where the triangles are large enough to show up in the image at least most of the time.</p>
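<p>One possible way to achieve this, sketched below under the assumption that shading produces a channels-last image tensor <code>color_hi</code>, is to rasterize and shade at a higher resolution and then average-pool the result down to the target size:</p>
<pre><code>ss = 4  # supersampling factor
rast_hi, _ = dr.rasterize(glctx, pos, tri, resolution=[ss * 256, ss * 256])
# ... interpolate, texture, and antialias at the high resolution to get color_hi [N, ss*256, ss*256, C] ...
color = torch.nn.functional.avg_pool2d(color_hi.permute(0, 3, 1, 2), ss).permute(0, 2, 3, 1)</code></pre>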
<divclass="image-parent">
...
...
<h2id="beyond-the-basics">Beyond the basics</h2>
<p>Rendering images is easy with nvdiffrast, but there are a few practical things that you will need to take into account. The topics in this section explain the operation and usage of nvdiffrast in more detail, and hopefully help you avoid any potential misunderstandings and pitfalls.</p>
<p>Nvdiffrast follows OpenGL's coordinate systems and other conventions. This is partially because we support OpenGL to accelerate the rasterization operation, but mostly so that there is a <a href="https://xkcd.com/927/">single standard to follow</a>.</p>
<ul>
<li>
In OpenGL convention, the perspective projection matrix (as implemented in, e.g., <a href="https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/util.py#L16-L20"><code>utils.projection()</code></a> in our samples and <a href="https://www.khronos.org/registry/OpenGL-Refpages/gl2.1/xhtml/glFrustum.xml"><code>glFrustum()</code></a> in OpenGL) treats the view-space <span class="math inline"><em>z</em></span> as increasing towards the viewer. However, <em>after</em> multiplication by the perspective projection matrix, the homogeneous <a href="https://en.wikipedia.org/wiki/Clip_coordinates">clip-space</a> coordinate <span class="math inline"><em>z</em></span>/<span class="math inline"><em>w</em></span> increases away from the viewer. Hence, a larger depth value in the rasterizer output tensor also corresponds to a surface further away from the viewer.
...
...
<p>We skirted around a pretty fundamental question in the description of the texturing operation above. In order to determine the proper amount of prefiltering for sampling a texture, we need to know how densely it is being sampled. But how can we know the sampling density when each pixel knows of just a single surface point?</p>
<p>The solution is to track the image-space derivatives of all things leading up to the texture sampling operation. <em>These are not the same thing as the gradients used in the backward pass</em>, even though they both involve differentiation! Consider the barycentrics <span class="math inline">(<em>u</em>, <em>v</em>)</span> produced by the rasterization operation. They change by some amount when moving horizontally or vertically in the image plane. If we denote the image-space coordinates as <span class="math inline">(<em>X</em>, <em>Y</em>)</span>, the image-space derivatives of the barycentrics would be <span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, and <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>. We can organize these into a 2×2 Jacobian matrix that describes the local relationship between <span class="math inline">(<em>u</em>, <em>v</em>)</span> and <span class="math inline">(<em>X</em>, <em>Y</em>)</span>. This matrix is generally different at every pixel. For the purpose of image-space derivatives, the units of <span class="math inline"><em>X</em></span> and <span class="math inline"><em>Y</em></span> are pixels. Hence, <span class="math inline">∂<em>u</em>/∂<em>X</em></span> is the local approximation of how much <span class="math inline"><em>u</em></span> changes when moving a distance of one pixel in the horizontal direction, and so on.</p>
<p>Once we know how the barycentrics change w.r.t. pixel position, the interpolation operation can use this to determine how the attributes change w.r.t. pixel position. When attributes are used as texture coordinates, we can therefore tell how the texture sampling position (in texture space) changes when moving around within the pixel (up to a local, linear approximation, that is). This <em>texture footprint</em> tells us the scale on which the texture should be prefiltered. In more practical terms, it tells us which mipmap level(s) to use when sampling the texture.</p>
<p>In nvdiffrast, the rasterization operation outputs the image-space derivatives of the barycentrics in an auxiliary 4-channel output tensor, ordered (<span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>) from channel 0 to 3. The interpolation operation can take this auxiliary tensor as input and compute image-space derivatives of any set of attributes being interpolated. Finally, the texture sampling operation can use the image-space derivatives of the texture coordinates to determine the amount of prefiltering.</p>
<p>There is nothing magic about these image-space derivatives. They are tensors just like, e.g., the texture coordinates themselves; they propagate gradients backwards, and so on. For example, if you want to artificially blur or sharpen the texture when sampling it, you can simply multiply the tensor carrying the image-space derivatives of the texture coordinates <span class="math inline">∂{<em>s</em>, <em>t</em>}/∂{<em>X</em>, <em>Y</em>}</span> by a scalar value before feeding it into the texture sampling operation. This scales the texture footprints and thus adjusts the amount of prefiltering. If your loss function prefers a different level of sharpness, this multiplier will receive a nonzero gradient. <em>Update:</em> Since version 0.2.1, the texture sampling operation also supports a separate mip level bias input that would be better suited for this particular task, but the gist is the same nonetheless.</p>
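<p>The following sketch shows how the pieces fit together; the tensors <code>uv_attr</code> (per-vertex texture coordinates) and <code>tex</code> (the texture) are placeholders rather than names from the official samples:</p>
<pre><code># Rasterize; the second output holds the image-space derivatives of the barycentrics.
rast_out, rast_db = dr.rasterize(glctx, pos, tri, resolution=[512, 512])

# Interpolate texture coordinates and propagate their image-space derivatives.
uv, uv_da = dr.interpolate(uv_attr, rast_out, tri, rast_db=rast_db, diff_attrs='all')

# Prefiltered texture sampling; multiplying uv_da enlarges the texture footprints, i.e., blurs the result.
blur = 2.0
color = dr.texture(tex, uv, uv_da * blur, filter_mode='linear-mipmap-linear')</code></pre>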
<p>One might wonder if it would have been easier to determine the texture footprints simply from the texture coordinates in adjacent pixels, and skip all this derivative rubbish? In easy cases the answer is yes, but silhouettes, occlusions, and discontinuous texture parameterizations would make this approach rather unreliable in practice. Computing the image-space derivatives analytically keeps everything point-like, local, and well-behaved.</p>
<p>It should be noted that computing gradients related to image-space derivatives is somewhat involved and requires additional computation. At the same time, they are often not crucial for the convergence of the training/optimization. Because of this, the primitive operations in nvdiffrast offer options to disable the calculation of these gradients. We're talking about things like <span class="math inline">∂<em>Loss</em>/∂(∂{<em>u</em>, <em>v</em>}/∂{<em>X</em>, <em>Y</em>})</span> that may look second-order-ish, but they're not.</p>
...
...
</tr>
</table>
</div>
<p>Scaling the atlas to, say, 256×32 pixels would feel silly because the dimensions of the sub-images are perfectly fine, and downsampling the different sub-images together — which would happen after the 5×1 resolution — would not make sense anyway. For this reason, the texture sampling operation allows the user to specify the maximum number of mipmap levels to be constructed and used. In this case, setting <code>max_mip_level=5</code> would stop at the 5×1 mipmap and prevent the error.</p>
<p>It is a deliberate design choice that nvdiffrast doesn't just stop automatically at a mipmap size it cannot downsample, but requires the user to specify a limit when the texture dimensions are not powers of two. The goal is to avoid bugs where prefiltered texture sampling mysteriously doesn't work due to an oddly sized texture. It would be confusing if a 256×256 texture gave beautifully prefiltered texture samples, a 255×255 texture suddenly had no prefiltering at all, and a 254×254 texture did just a bit of prefiltering (one level) but not more.</p>
<p>If you compute your own mipmaps, their sizes must follow the scheme described above. There is no need to specify mipmaps all the way to 1×1 resolution, but the stack can end at any point and it will work equivalently to an internally constructed mipmap stack with a <code>max_mip_level</code> limit. Importantly, the gradients of user-provided mipmaps are not propagated automatically to the base texture — naturally so, because nvdiffrast knows nothing about the relation between them. Instead, the tensors that specify the mip levels in a user-provided mipmap stack will receive gradients of their own.</p>
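<p>As a brief sketch of the atlas case discussed above (the tensor names are illustrative), the limit is passed directly to the texture sampling operation; a self-computed mipmap stack would instead be supplied through the <code>mip</code> argument:</p>
<pre><code># Atlas from the example above: stop internal mipmap construction at the 5x1 level.
color = dr.texture(tex_atlas, uv, uv_da, filter_mode='linear-mipmap-linear', max_mip_level=5)</code></pre>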
<h3id="rasterizing-with-cuda-vs-opengl-new">Rasterizing with CUDA vs OpenGL <spanstyle="color:red">(New!)</span></h3>
<p>Since version 0.3.0, nvdiffrast on PyTorch supports executing the rasterization operation using either CUDA or OpenGL. Earlier versions and the TensorFlow bindings support OpenGL only.</p>
<p>When rasterization is executed on OpenGL, we use the GPU's graphics pipeline to determine which triangles land on which pixels. GPUs have amazingly efficient hardware for doing this — it is their original <i>raison d'être</i> — and thus it makes sense to exploit it. Unfortunately, some computing environments haven't been designed with this in mind, and it can be difficult to get OpenGL to work correctly and interoperate with CUDA cleanly. On Windows, compatibility is generally good because the GPU drivers required to run CUDA also include OpenGL support. Linux is more complicated, as various drivers can be installed separately and there isn't a standardized way to acquire access to the hardware graphics pipeline.</p>
<p>Rasterizing in CUDA pretty much reverses these considerations. Compatibility is obviously not an issue on any CUDA-enabled platform. On the other hand, implementing the rasterization process correctly and efficiently on a massively data-parallel programming model is non-trivial. The CUDA rasterizer in nvdiffrast follows the approach described in the research paper <em>High-Performance Software Rasterization on GPUs</em> by Laine and Karras, HPG 2011. Our code is based on the paper's publicly released CUDA kernels, with considerable modifications to support current hardware architectures and to match nvdiffrast's needs.</p>
<p>The CUDA rasterizer does not support output resolutions greater than 2048×2048, and both dimensions must be multiples of 8. In addition, the number of triangles that can be rendered in one batch is limited to around 16 million. Subpixel precision is limited to 4 bits and depth peeling is less accurate than with OpenGL. Memory consumption depends on many factors.</p>
<p>It is difficult to predict which rasterizer offers better performance. For complex meshes and high resolutions OpenGL will most likely outperform the CUDA rasterizer, although it has certain overheads that the CUDA rasterizer does not have. For simple meshes and low resolutions the CUDA rasterizer may be faster, but it has its own overheads, too. Measuring the performance on actual data, on the target platform, and in the context of the entire program is the only way to know for sure.</p>
<p>To run rasterization in CUDA, create a <code>RasterizeCudaContext</code> and supply it to the <code>rasterize()</code> operation. For OpenGL, use a <code>RasterizeGLContext</code> instead. Easy!</p>
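<p>A minimal sketch of the context choice (the rest of the pipeline is unchanged); note the CUDA rasterizer's output resolution constraints mentioned above:</p>
<pre><code>glctx = dr.RasterizeCudaContext()   # CUDA rasterization; resolution dimensions must be multiples of 8, at most 2048x2048
# glctx = dr.RasterizeGLContext()   # OpenGL rasterization via the hardware graphics pipeline
rast_out, rast_db = dr.rasterize(glctx, pos, tri, resolution=[512, 512])</code></pre>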
<h3id="running-on-multiple-gpus">Running on multiple GPUs</h3>
<p>Nvdiffrast supports computation on multiple GPUs in both PyTorch and TensorFlow. As is the convention in PyTorch, the operations are always executed on the device on which the input tensors reside. All GPU input tensors must reside on the same device, and the output tensors will unsurprisingly end up on that same device. In addition, the rasterization operation requires that its context was created for the correct device. In TensorFlow, the rasterizer context is automatically created on the device of the rasterization operation when it is executed for the first time.</p>
<p><i>The remainder of this section applies only to OpenGL rasterizer contexts. CUDA rasterizer contexts require no special considerations besides making sure they're on the correct device.</i></p>
<p>On Windows, nvdiffrast implements OpenGL device selection in a way that can be done only once per process — after one context is created, all future ones will end up on the same GPU. Hence you cannot expect to run the rasterization operation on multiple GPUs within the same process using an OpenGL context. Trying to do so will either cause a crash or incur a significant performance penalty. However, with PyTorch it is common to distribute computation across GPUs by launching a separate process for each GPU, so this is not a huge concern. Note that any OpenGL context created within the same process, even for something like a GUI window, will prevent changing the device later. Therefore, if you want to run the rasterization operation on other than the default GPU, be sure to create its OpenGL context before initializing any other OpenGL-powered libraries.</p>
<p>On Linux everything just works, and you can create OpenGL rasterizer contexts on multiple devices within the same process.</p>
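<p>For example, to rasterize on a second GPU, the context and all GPU input tensors are simply placed on that device; this sketch assumes a machine where <code>cuda:1</code> exists:</p>
<pre><code># One rasterizer context per device; inputs, context, and outputs all live on the same GPU.
glctx1 = dr.RasterizeCudaContext(device='cuda:1')
rast1, _ = dr.rasterize(glctx1, pos.to('cuda:1'), tri.to('cuda:1'), resolution=[256, 256])</code></pre>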
<h4id="note-on-torch.nn.dataparallel">Note on torch.nn.DataParallel</h4>
<p>PyTorch offers the <code>torch.nn.DataParallel</code> wrapper class for splitting the execution of a minibatch across multiple threads. Unfortunately, this class is fundamentally incompatible with OpenGL-dependent operations, as it spawns a new set of threads at each call (as of PyTorch 1.9.0, at least). Using previously created OpenGL contexts in these new threads, even if taking care to not use the same context in multiple threads, causes them to be migrated around, and this has resulted in ever-growing GPU memory usage and abysmal GPU utilization. Therefore, we advise against using <code>torch.nn.DataParallel</code> for rasterization operations that depend on OpenGL contexts.</p>
<p>Notably, <code>torch.nn.DistributedDataParallel</code> spawns subprocesses that are much more persistent. The subprocesses must create their own OpenGL contexts as part of initialization, and as such they do not suffer from this problem.</p>
<spanid="cb8-4"><ahref="#cb8-4"aria-hidden="true"tabindex="-1"></a> (process <spanclass="kw">or</span> store the results)</span></code></pre></div>
<p>There is no performance penalty compared to the basic rasterization op if you end up extracting only the first depth layer. In other words, the code above with <code>num_layers=1</code> runs exactly as fast as calling <code>rasterize</code> once.</p>
<p>Depth peeling is only supported in the PyTorch version of nvdiffrast. For implementation reasons, depth peeling reserves the rasterizer context so that other rasterization operations cannot be performed while the peeling is ongoing, i.e., inside the <code>with</code> block. Hence you cannot start a nested depth peeling operation or call <code>rasterize</code> inside the <code>with</code> block unless you use a different context.</p>
<p>For the sake of completeness, let us note the following small caveat: Depth peeling relies on depth values to distinguish surface points from each other. Therefore, culling "previously rendered surface points" actually means culling all surface points at the same or closer depth as those rendered into the pixel in previous passes. This matters only if you have multiple layers of geometry at matching depths — if your geometry consists of, say, nothing but two exactly overlapping triangles, you will see one of them in the first pass but never see the other one in subsequent passes, as it's at the exact depth that is already considered done.</p>
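<p>For reference, a depth peeling loop looks roughly like the sketch below (the elided example above follows the same pattern); <code>num_layers</code> is a placeholder for the number of layers to extract:</p>
<pre><code>with dr.DepthPeeler(glctx, pos, tri, resolution=[256, 256]) as peeler:
    for i in range(num_layers):
        rast, rast_db = peeler.rasterize_next_layer()
        # (process or store the results)</code></pre>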
<h3id="differences-between-pytorch-and-tensorflow">Differences between PyTorch and TensorFlow</h3>
<p>Nvdiffrast can be used from PyTorch and from TensorFlow 1.x; the latter may change to TensorFlow 2.x if there is demand. These frameworks operate somewhat differently and that is reflected in the respective APIs. Simplifying a bit, in TensorFlow 1.x you construct a persistent graph out of persistent nodes, and run many batches of data through it. In PyTorch, there is no persistent graph or nodes, but a new, ephemeral graph is constructed for each batch of data and destroyed immediately afterwards. Therefore, there is also no persistent state for the operations. There is the <code>torch.nn.Module</code> abstraction for festooning operations with persistent state, but we do not use it.</p>
<p>As a consequence, things that would be part of persistent state of an nvdiffrast operation in TensorFlow must be stored by the user in PyTorch, and supplied to the operations as needed. In practice, this is a very small difference and amounts to just a couple of lines of code in most cases.</p>
...
...
<p>In manual mode, the user assumes the responsibility of setting and releasing the OpenGL context. Most of the time, if you don't have any other libraries that would be using OpenGL, you can just set the context once after having created it and keep it set until the program exits. However, keep in mind that the active OpenGL context is a thread-local resource, so it needs to be set in the same CPU thread as it will be used, and it cannot be set simultaneously in multiple CPU threads.</p>
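<p>Assuming the manual-mode API consists of a <code>mode='manual'</code> constructor argument together with <code>set_context()</code> and <code>release_context()</code> methods (check the API reference for the exact names), usage could look like this sketch:</p>
<pre><code>glctx = dr.RasterizeGLContext(mode='manual')
glctx.set_context()       # make the OpenGL context current in this CPU thread
try:
    rast, _ = dr.rasterize(glctx, pos, tri, resolution=[256, 256])
finally:
    glctx.release_context()</code></pre>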
<h2id="samples">Samples</h2>
<p>Nvdiffrast comes with a set of samples that were crafted to support the research paper. Each sample is available in both PyTorch and TensorFlow versions. Details such as command-line parameters, logging format, etc., may not be identical between the versions, and generally the PyTorch versions should be considered definitive. The command-line examples below are for the PyTorch versions.</p>
<p>All PyTorch samples support selecting between CUDA and OpenGL rasterizer contexts. The default is to do rasterization in CUDA, and switching to OpenGL is done by specifying command-line option <code>--opengl</code>.</p>
<p>Enabling interactive display using the <code>--display-interval</code> parameter is likely to fail on Linux when using OpenGL rasterization. This is because the interactive display window is shown using OpenGL, and on Linux this conflicts with the internal OpenGL rasterization in nvdiffrast. Using a CUDA context should work, assuming that OpenGL is correctly installed in the system (for displaying the window). Our Dockerfile is set up to support headless rendering only, and thus cannot show an interactive result window.</p>
<p>This is a minimal sample that renders a triangle and saves the resulting image into a file (<code>tri.png</code>) in the current directory. Running this should be the first step to verify that you have everything set up correctly. Rendering is done using the rasterization and interpolation operations, so getting the correct output image means that both OpenGL (if specified on command line) and CUDA are working as intended under the hood.</p>
<p>This is the only sample where you must specify either <code>--cuda</code> or <code>--opengl</code> on command line. Other samples default to CUDA rasterization and provide only the <code>--opengl</code> option.</p>
<p>Example command lines:</p>
<pre><code>python triangle.py --cuda
python triangle.py --opengl</code></pre>
<divclass="image-parent">
<divclass="image-row">
<divclass="image-caption">
...
...
<p>The interactive view shows, from left to right: target pose, best found pose, and current pose. When viewed live, the two stages of optimization are clearly visible. In the first phase, the best pose updates intermittently when a better initialization is found. In the second phase, the solution converges smoothly to the target via gradient-based optimization.</p>
<h2id="pytorch-api-reference">PyTorch API reference</h2>
<pclass="shortdesc">Create a new Cuda rasterizer context.</p><pclass="longdesc">The context is deleted and internal storage is released when the object is
destroyed.</p><divclass="arguments">Arguments:</div><tableclass="args"><trclass="arg"><tdclass="argname">device</td><tdclass="arg_short">Cuda device on which the context is created. Type can be
<code>torch.device</code>, string (e.g., <code>'cuda:1'</code>), or int. If not
specified, context will be created on currently active Cuda
device.</td></tr></table><divclass="returns">Returns:<divclass="return_description">The newly created Cuda rasterizer context.</div></div></div>
<pclass="shortdesc">Create a new OpenGL rasterizer context.</p><pclass="longdesc">Creating an OpenGL context is a slow operation so you should usually reuse the same
context in all calls to <code>rasterize()</code> on the same CPU thread. The OpenGL context
...
...
<pclass="shortdesc">Rasterize triangles.</p><pclass="longdesc">All input tensors must be contiguous and reside in GPU memory except for
the <code>ranges</code> tensor that, if specified, has to reside in CPU memory. The
output tensors will be contiguous and reside in GPU memory.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">glctx</td><td class="arg_short">Rasterizer context of type <code>RasterizeGLContext</code> or <code>RasterizeCudaContext</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor with dtype <code>torch.float32</code>. To enable range
mode, this tensor should have a 2D shape [num_vertices, 4]. To enable
instanced mode, use a 3D shape [minibatch_size, num_vertices, 4].</td></tr><tr class="arg"><td class="argname">tri</td><td class="arg_short">Triangle tensor with shape [num_triangles, 3] and dtype <code>torch.int32</code>.</td></tr><tr class="arg"><td class="argname">resolution</td><td class="arg_short">Output resolution as integer tuple (height, width).</td></tr><tr class="arg"><td class="argname">ranges</td><td class="arg_short">In range mode, tensor with shape [minibatch_size, 2] and dtype
<code>torch.int32</code>, specifying start indices and counts into <code>tri</code>.
Ignored in instanced mode.</td></tr><tr class="arg"><td class="argname">grad_db</td><td class="arg_short">Propagate gradients of image-space derivatives of barycentrics
into <code>pos</code> in backward pass. Ignored if using an OpenGL context that
was not configured to output image-space derivatives.</td></tr></table><div class="returns">Returns:<div class="return_description">A tuple of two tensors. The first output tensor has shape [minibatch_size,
height, width, 4] and contains the main rasterizer output in order (u, v, z/w,
triangle_id). If the OpenGL context was configured to output image-space
derivatives of barycentrics, the second output tensor will also have shape
...
...
in the <code>mip</code> argument.</div></div></div>
<pclass="shortdesc">Perform antialiasing.</p><pclass="longdesc">All input tensors must be contiguous and reside in GPU memory. The output tensor
will be contiguous and reside in GPU memory.</p><p class="longdesc">Note that silhouette edge determination is based on vertex indices in the triangle
tensor. For it to work properly, a vertex belonging to multiple triangles must be
referred to using the same vertex index in each triangle. Otherwise, nvdiffrast will always
classify the adjacent edges as silhouette edges, which leads to bad performance and
potentially incorrect gradients. If you are unsure whether your data is good, check
which pixels are modified by the antialias operation and compare to the example in the
documentation.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">color</td><td class="arg_short">Input image to antialias with shape [minibatch_size, height, width, num_channels].</td></tr><tr class="arg"><td class="argname">rast</td><td class="arg_short">Main output tensor from <code>rasterize()</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor used in the rasterization operation.</td></tr><tr class="arg"><td class="argname">tri</td><td class="arg_short">Triangle tensor used in the rasterization operation.</td></tr><tr class="arg"><td class="argname">topology_hash</td><td class="arg_short">(Optional) Preconstructed topology hash for the triangle tensor. If not
specified, the topology hash is constructed internally and discarded afterwards.</td></tr><tr class="arg"><td class="argname">pos_gradient_boost</td><td class="arg_short">(Optional) Multiplier for gradients propagated to <code>pos</code>.</td></tr></table><div class="returns">Returns:<div class="return_description">A tensor containing the antialiased image with the same shape as <code>color</code> input tensor.</div></div></div>
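<p>When the mesh topology stays constant over an optimization loop, the hash can be built once with the constructor described below and reused; a rough sketch (assuming the constructor is <code>dr.antialias_construct_topology_hash</code>):</p>
<pre><code># Build the topology hash once for a constant triangle tensor.
topo_hash = dr.antialias_construct_topology_hash(tri)

# Inside the training loop, pass it to antialias() to skip the internal reconstruction.
color_aa = dr.antialias(color, rast_out, pos, tri, topology_hash=topo_hash)</code></pre>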
<pclass="shortdesc">Construct a topology hash for a triangle tensor.</p><pclass="longdesc">This function can be used for constructing a topology hash for a triangle tensor that is
...
...
<p>This work is made available under the <a href="https://github.com/NVlabs/nvdiffrast/blob/main/LICENSE.txt">Nvidia Source Code License</a>.</p>
<p>For business inquiries, please visit our website and submit the form: <a href="https://www.nvidia.com/en-us/research/inquiries/">NVIDIA Research Licensing</a></p>
<p>We do not currently accept outside contributions in the form of pull requests.</p>
RenderModeFlag_EnableDepthPeeling = 1 << 1,  // Enable depth peeling. Must have a peel buffer set.
};

public:
CudaRaster(void);
~CudaRaster(void);

void setViewportSize(int width, int height, int numImages);  // Width and height must be multiples of tile size (8x8).
void setRenderModeFlags(unsigned int renderModeFlags);       // Affects all subsequent calls to drawTriangles(). Defaults to zero.
void deferredClear(unsigned int clearColor);                 // Clears color and depth buffers during next call to drawTriangles().
void setVertexBuffer(void* vertices, int numVertices);       // GPU pointer managed by caller. Vertex positions in clip space as float4 (x, y, z, w).
void setIndexBuffer(void* indices, int numTriangles);        // GPU pointer managed by caller. Triangle index+color quadruplets as uint4 (idx0, idx1, idx2, color).
bool drawTriangles(const int* ranges, cudaStream_t stream);  // Ranges (offsets and counts) as #triangles entries, not as bytes. If NULL, draw all triangles. Returns false in case of internal overflow.
void* getColorBuffer(void);                                  // GPU pointer managed by CudaRaster.
void* getDepthBuffer(void);                                  // GPU pointer managed by CudaRaster.
void swapDepthAndPeel(void);                                 // Swap depth and peeling buffers.
template <class T> __device__ __inline__ void sortShared(T* ptr, int numItems); // Assumes that numItems <= threadsInBlock. Must sync before & after the call.
__device__ __inline__ const CRImageParams& getImageParams(const CRParams& p, int idx)