<li><a href="#differences-between-pytorch-and-tensorflow">Differences between PyTorch and TensorFlow</a><ul>
<li><a href="#manual-opengl-contexts-in-pytorch">Manual OpenGL contexts in PyTorch</a></li>
</ul></li>
...
...
@@ -524,8 +528,8 @@ For 2D textures, the coordinate origin <span class="math inline">(<em>s</em>,
<p>We skirted around a pretty fundamental question in the description of the texturing operation above. In order to determine the proper amount of prefiltering for sampling a texture, we need to know how densely it is being sampled. But how can we know the sampling density when each pixel knows of just a single surface point?</p>
<p>The solution is to track the image-space derivatives of all things leading up to the texture sampling operation. <em>These are not the same thing as the gradients used in the backward pass</em>, even though they both involve differentiation! Consider the barycentrics <span class="math inline">(<em>u</em>, <em>v</em>)</span> produced by the rasterization operation. They change by some amount when moving horizontally or vertically in the image plane. If we denote the image-space coordinates as <span class="math inline">(<em>X</em>, <em>Y</em>)</span>, the image-space derivatives of the barycentrics would be <span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, and <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>. We can organize these into a 2×2 Jacobian matrix that describes the local relationship between <span class="math inline">(<em>u</em>, <em>v</em>)</span> and <span class="math inline">(<em>X</em>, <em>Y</em>)</span>. This matrix is generally different at every pixel. For the purpose of image-space derivatives, the units of <span class="math inline"><em>X</em></span> and <span class="math inline"><em>Y</em></span> are pixels. Hence, <span class="math inline">∂<em>u</em>/∂<em>X</em></span> is the local approximation of how much <span class="math inline"><em>u</em></span> changes when moving a distance of one pixel in the horizontal direction, and so on.</p>
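<p>Written out, the per-pixel Jacobian described above simply collects the four derivatives into one matrix (shown here in LaTeX notation for brevity):</p>
<pre><code>J = \frac{\partial(u, v)}{\partial(X, Y)} =
    \begin{pmatrix}
      \partial u/\partial X &amp; \partial u/\partial Y \\
      \partial v/\partial X &amp; \partial v/\partial Y
    \end{pmatrix}</code></pre>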
<p>Once we know how the barycentrics change w.r.t. pixel position, the interpolation operation can use this to determine how the attributes change w.r.t. pixel position. When attributes are used as texture coordinates, we can therefore tell how the texture sampling position (in texture space) changes when moving around within the pixel (up to a local, linear approximation, that is). This <em>texture footprint</em> tells us the scale on which the texture should be prefiltered. In more practical terms, it tells us which mipmap level(s) to use when sampling the texture.</p>
<p>In nvdiffrast, the rasterization operation can be configured to output the image-space derivatives of the barycentrics in an auxiliary 4-channel output tensor, ordered (<span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>) from channel 0 to 3. The interpolation operation can take this auxiliary tensor as input and compute image-space derivatives of any set of attributes being interpolated. Finally, the texture sampling operation can use the image-space derivatives of the texture coordinates to determine the amount of prefiltering.</p>
<p>There is nothing magic about these image-space derivatives. They are tensors just like, e.g., the texture coordinates themselves; they propagate gradients backwards, and so on. For example, if you want to artificially blur or sharpen the texture when sampling it, you can simply multiply the tensor carrying the image-space derivatives of the texture coordinates <span class="math inline">∂{<em>s</em>, <em>t</em>}/∂{<em>X</em>, <em>Y</em>}</span> by a scalar value before feeding it into the texture sampling operation. This scales the texture footprints and thus adjusts the amount of prefiltering. If your loss function prefers a different level of sharpness, this multiplier will receive a nonzero gradient. <em>Update:</em> Since version 0.2.1, the texture sampling operation also supports a separate mip level bias input that would be better suited for this particular task, but the gist is the same nonetheless.</p>
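<p>To make the data flow concrete, here is a minimal sketch using nvdiffrast's PyTorch API and an existing rasterizer context <code>glctx</code>. The tensor names (<code>pos</code>, <code>tri</code>, <code>uv</code>, <code>uv_tri</code>, <code>tex</code>) and the scalar <code>blur_factor</code> are placeholders, not part of the library:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import torch
import nvdiffrast.torch as dr

# Rasterize; the second output (rast_db) holds (du/dX, du/dY, dv/dX, dv/dY) per pixel.
rast, rast_db = dr.rasterize(glctx, pos, tri, resolution=[512, 512])

# Interpolate texture coordinates and, via rast_db, their image-space derivatives.
uv_pix, uv_da = dr.interpolate(uv, rast, uv_tri, rast_db=rast_db, diff_attrs='all')

# Scale the texture footprints before prefiltered sampling: values above 1 blur,
# values below 1 sharpen. The multiplier receives a gradient in the backward pass.
blur_factor = torch.tensor(1.0, device=pos.device, requires_grad=True)
color = dr.texture(tex, uv_pix, uv_da * blur_factor, filter_mode='linear-mipmap-linear')</code></pre></div>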
<p>One might wonder if it would have been easier to determine the texture footprints simply from the texture coordinates in adjacent pixels, and skip all this derivative rubbish? In easy cases the answer is yes, but silhouettes, occlusions, and discontinuous texture parameterizations would make this approach rather unreliable in practice. Computing the image-space derivatives analytically keeps everything point-like, local, and well-behaved.</p>
<p>It should be noted that computing gradients related to image-space derivatives is somewhat involved and requires additional computation. At the same time, they are often not crucial for the convergence of the training/optimization. Because of this, the primitive operations in nvdiffrast offer options to disable the calculation of these gradients. We're talking about things like <span class="math inline">∂<em>Loss</em>/∂(∂{<em>u</em>, <em>v</em>}/∂{<em>X</em>, <em>Y</em>})</span> that may look second-order-ish, but they're not.</p>
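<p>As a sketch of the kind of options meant here: in the PyTorch API, <code>rasterize()</code> takes a <code>grad_db</code> flag and <code>interpolate()</code> lets you restrict <code>diff_attrs</code> to only the attributes that actually need image-space derivatives. The tensor names below are placeholders:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Do not propagate gradients of the barycentric derivatives into pos.
rast, rast_db = dr.rasterize(glctx, pos, tri, resolution=[512, 512], grad_db=False)

# Compute image-space derivatives only for attribute channels 2 and 3
# (e.g. the texture coordinates) instead of all attributes.
attr_pix, attr_da = dr.interpolate(attr, rast, tri, rast_db=rast_db, diff_attrs=[2, 3])</code></pre></div>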
<h3 id="mipmaps-and-texture-dimensions">Mipmaps and texture dimensions</h3>
...
...
@@ -725,6 +729,39 @@ Mip level 5
<p>Nvdiffrast supports computation on multiple GPUs in both PyTorch and TensorFlow. As is the convention in PyTorch, the operations are always executed on the device on which the input tensors reside. All GPU input tensors must reside on the same device, and the output tensors will unsurprisingly end up on that same device. In addition, the rasterization operation requires that its OpenGL context was created for the correct device. In TensorFlow, the OpenGL context is automatically created on the device of the rasterization operation when it is executed for the first time.</p>
<p>On Windows, nvdiffrast implements OpenGL device selection in a way that can be done only once per process — after one context is created, all future ones will end up on the same GPU. Hence you cannot expect to run the rasterization operation on multiple GPUs within the same process. Trying to do so will either cause a crash or incur a significant performance penalty. However, with PyTorch it is common to distribute computation across GPUs by launching a separate process for each GPU, so this is not a huge concern. Note that any OpenGL context created within the same process, even for something like a GUI window, will prevent changing the device later. Therefore, if you want to run the rasterization operation on other than the default GPU, be sure to create its OpenGL context before initializing any other OpenGL-powered libraries.</p>
<p>On Linux everything just works, and you can create rasterizer OpenGL contexts on multiple devices within the same process.</p>
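<p>For example, in a typical one-process-per-GPU PyTorch setup, the context is created for that process's device before any other OpenGL-using library is initialized. The <code>local_rank</code> variable below is a placeholder for however your launcher communicates the device index:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import torch
import nvdiffrast.torch as dr

device = torch.device('cuda', local_rank)

# Create the rasterizer's OpenGL context for this process's GPU first,
# before any other OpenGL context is created in the process (see above).
glctx = dr.RasterizeGLContext(device=device)

# All GPU input tensors must reside on the same device; outputs land there too.
pos, tri = pos.to(device), tri.to(device)
rast, rast_db = dr.rasterize(glctx, pos, tri, resolution=[512, 512])</code></pre></div>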
<h3 id="rendering-multiple-depth-layers">Rendering multiple depth layers</h3>
<p>Sometimes there is a need to render scenes with partially transparent surfaces. In this case, it is not sufficient to find only the surfaces that are closest to the camera, as you may also need to know what lies behind them. For this purpose, nvdiffrast supports <em>depth peeling</em> that lets you extract multiple closest surfaces for each pixel.</p>
<p>With depth peeling, we start by rasterizing the closest surfaces as usual. We then perform a second rasterization pass with the same geometry, but this time we cull all previously rendered surface points at each pixel, effectively extracting the second-closest depth layer. This can be repeated as many times as desired, so that we can extract as many depth layers as we like. See the images below for example results of depth peeling with each depth layer shaded and antialiased.</p>
<divclass="image-parent">
<divclass="image-row">
<divclass="image-caption">
<imgclass="brd"src="img/spot_aa.png"/>
<divclass="caption">
First depth layer
</div>
</div>
<divclass="image-caption">
<imgclass="brd"src="img/spot_peel1.png"/>
<divclass="caption">
Second depth layer
</div>
</div>
<divclass="image-caption">
<imgclass="brd"src="img/spot_peel2.png"/>
<divclass="caption">
Third depth layer
</div>
</div>
</div>
</div>
<p>The API for depth peeling is based on <code>DepthPeeler</code> object that acts as a <ahref="https://docs.python.org/3/reference/datamodel.html#context-managers">context manager</a>, and its <code>rasterize_next_layer</code> method. The first call to <code>rasterize_next_layer</code> is equivalent to calling the traditional <code>rasterize</code> function, and subsequent calls report further depth layers. The arguments for rasterization are specified when instantiating the <code>DepthPeeler</code> object. Concretely, your code might look something like this:</p>
<aclass="sourceLine"id="cb8-4"data-line-number="4"> (process <spanclass="kw">or</span> store the results)</a></code></pre></div>
<p>There is no performance penalty compared to the basic rasterization op if you end up extracting only the first depth layer. In other words, the code above with <code>num_layers=1</code> runs exactly as fast as calling <code>rasterize</code> once.</p>
<p>Depth peeling is only supported in the PyTorch version of nvdiffrast. For implementation reasons, depth peeling reserves the OpenGL context so that other rasterization operations cannot be performed while the peeling is ongoing, i.e., inside the <code>with</code> block. Hence you cannot start a nested depth peeling operation or call <code>rasterize</code> inside the <code>with</code> block, unless you use a different OpenGL context.</p>
<p>For the sake of completeness, let us note the following small caveat: Depth peeling relies on depth values to distinguish surface points from each other. Therefore, culling "previously rendered surface points" actually means culling all surface points at the same or a closer depth than those rendered into the pixel in previous passes. This matters only if you have multiple layers of geometry at matching depths — if your geometry consists of, say, nothing but two exactly overlapping triangles, you will see one of them in the first pass but never see the other one in subsequent passes, as it lies at the exact depth that is already considered done.</p>
<h3 id="differences-between-pytorch-and-tensorflow">Differences between PyTorch and TensorFlow</h3>
<p>Nvdiffrast can be used from PyTorch and from TensorFlow 1.x; the latter may change to TensorFlow 2.x if there is demand. These frameworks operate somewhat differently and that is reflected in the respective APIs. Simplifying a bit, in TensorFlow 1.x you construct a persistent graph out of persistent nodes, and run many batches of data through it. In PyTorch, there is no persistent graph or nodes, but a new, ephemeral graph is constructed for each batch of data and destroyed immediately afterwards. Therefore, there is also no persistent state for the operations. There is the <code>torch.nn.Module</code> abstraction for festooning operations with persistent state, but we do not use it.</p>
<p>As a consequence, things that would be part of persistent state of an nvdiffrast operation in TensorFlow must be stored by the user in PyTorch, and supplied to the operations as needed. In practice, this is a very small difference and amounts to just a couple of lines of code in most cases.</p>
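<p>In practice this means, e.g., creating the rasterizer's OpenGL context once in user code and passing it to the operations on every iteration. A minimal sketch; the loop and tensor names are placeholders:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import nvdiffrast.torch as dr

# Persistent state lives in user code: create the OpenGL context once...
glctx = dr.RasterizeGLContext()

for pos in pos_batches:
    # ...and hand it to the rasterization op on every iteration.
    rast, rast_db = dr.rasterize(glctx, pos, tri, resolution=[256, 256])</code></pre></div>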
...
...
@@ -859,20 +896,25 @@ specified, context will be created on currently active Cuda
device.</td></tr></table><div class="methods">Methods, only available if context was created in manual mode:</div><table class="args"><tr class="arg"><td class="argname">set_context()</td><td class="arg_short">Set (activate) OpenGL context in the current CPU thread.</td></tr><tr class="arg"><td class="argname">release_context()</td><td class="arg_short">Release (deactivate) currently active OpenGL context.</td></tr></table><div class="returns">Returns:<div class="return_description">The newly created OpenGL rasterizer context.</div></div></div>
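<p>As a sketch of how the manual mode methods above might be used (the surrounding tensors are assumed to exist; in the default automatic mode these calls are not needed):</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Manual mode: the caller activates and releases the OpenGL context explicitly,
# e.g. when rasterizing from a thread of its own choosing.
glctx = dr.RasterizeGLContext(mode='manual')

glctx.set_context()
try:
    rast, rast_db = dr.rasterize(glctx, pos, tri, resolution=[512, 512])
finally:
    glctx.release_context()</code></pre></div>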
<pclass="shortdesc">Rasterize triangles.</p><pclass="longdesc">All input tensors must be contiguous and reside in GPU memory except for
the <code>ranges</code> tensor that, if specified, has to reside in CPU memory. The
output tensors will be contiguous and reside in GPU memory.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">glctx</td><td class="arg_short">OpenGL context of type <code>RasterizeGLContext</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor with dtype <code>torch.float32</code>. To enable range
mode, this tensor should have a 2D shape [num_vertices, 4]. To enable
instanced mode, use a 3D shape [minibatch_size, num_vertices, 4].</td></tr><tr class="arg"><td class="argname">tri</td><td class="arg_short">Triangle tensor with shape [num_triangles, 3] and dtype <code>torch.int32</code>.</td></tr><tr class="arg"><td class="argname">resolution</td><td class="arg_short">Output resolution as integer tuple (height, width).</td></tr><tr class="arg"><td class="argname">ranges</td><td class="arg_short">In range mode, tensor with shape [minibatch_size, 2] and dtype
<code>torch.int32</code>, specifying start indices and counts into <code>tri</code>.
Ignored in instanced mode.</td></tr><tr class="arg"><td class="argname">grad_db</td><td class="arg_short">Propagate gradients of image-space derivatives of barycentrics
into <code>pos</code> in backward pass. Ignored if OpenGL context was
not configured to output image-space derivatives.</td></tr></table><div class="returns">Returns:<div class="return_description">A tuple of two tensors. The first output tensor has shape [minibatch_size,
height, width, 4] and contains the main rasterizer output in order (u, v, z/w,
triangle_id). If the OpenGL context was configured to output image-space
derivatives of barycentrics, the second output tensor will also have shape
[minibatch_size, height, width, 4] and contain said derivatives in order
(du/dX, du/dY, dv/dX, dv/dY). Otherwise it will be an empty tensor with shape
[minibatch_size, height, width, 0].</div></div></div>
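<p>For reference, a minimal range-mode call might look as follows; the vertex counts and the (start, count) pairs in <code>ranges</code> are made-up placeholders:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Range mode: a single 2D position buffer shared by the whole minibatch,
# with per-minibatch-element (start, count) pairs indexing into tri.
pos_2d = pos.reshape(-1, 4)                  # [num_vertices, 4] selects range mode
ranges = torch.tensor([[0, 100], [100, 250]], dtype=torch.int32)  # must stay on the CPU

rast, rast_db = dr.rasterize(glctx, pos_2d, tri, resolution=[256, 256], ranges=ranges)</code></pre></div>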
<pclass="shortdesc">Create a depth peeler object for rasterizing multiple depth layers.</p><pclass="longdesc">Arguments are the same as in <code>rasterize()</code>.</p><divclass="returns">Returns:<divclass="return_description">The newly created depth peeler.</div></div></div>
<pclass="shortdesc">Rasterize next depth layer.</p><pclass="longdesc">Operation is equivalent to <code>rasterize()</code> except that previously reported
surface points are culled away.</p><divclass="returns">Returns:<divclass="return_description">A tuple of two tensors as in <code>rasterize()</code>.</div></div></div>
<pclass="shortdesc">Interpolate vertex attributes.</p><pclass="longdesc">All input tensors must be contiguous and reside in GPU memory. The output tensors
will be contiguous and reside in GPU memory.</p><divclass="arguments">Arguments:</div><tableclass="args"><trclass="arg"><tdclass="argname">attr</td><tdclass="arg_short">Attribute tensor with dtype <code>torch.float32</code>.