"test/verify/test_gru_reverse_3args.cpp" did not exist on "ba33d25cd3c5acd92d9a8a0c28abb45b288af4f2"
Commit a1ec436b authored by Samuli Laine

Add CUDA rasterizer

parent 78528e68
@@ -11,7 +11,7 @@ Please refer to ☞☞ [nvdiffrast documentation](https://nvlabs.githu
## Licenses
Copyright © 2020, NVIDIA Corporation. All rights reserved.
Copyright © 2020–2022, NVIDIA Corporation. All rights reserved.
This work is made available under the [Nvidia Source Code License](https://github.com/NVlabs/nvdiffrast/blob/main/LICENSE.txt).
@@ -321,6 +321,7 @@ div.image-parent {
<li><a href="#geometry-and-minibatches-range-mode-vs-instanced-mode">Geometry and minibatches: Range mode vs Instanced mode</a></li>
<li><a href="#image-space-derivatives">Image-space derivatives</a></li>
<li><a href="#mipmaps-and-texture-dimensions">Mipmaps and texture dimensions</a></li>
<li><a href="#rasterizing-with-cuda-vs-opengl-new">Rasterizing with CUDA vs OpenGL <span style="color:red">(New!)</span></a></li>
<li><a href="#running-on-multiple-gpus">Running on multiple GPUs</a>
<ul>
<li><a href="#note-on-torch.nn.dataparallel">Note on torch.nn.DataParallel</a></li>
@@ -347,7 +348,7 @@ div.image-parent {
</nav></div>
<h2 id="overview">Overview</h2>
<p>Nvdiffrast is a PyTorch/TensorFlow library that provides high-performance primitive operations for rasterization-based differentiable rendering. It is a lower-level library compared to previous ones such as <a href="https://github.com/BachiLi/redner">redner</a>, <a href="https://github.com/ShichenLiu/SoftRas">SoftRas</a>, or <a href="https://github.com/facebookresearch/pytorch3d">PyTorch3D</a> — nvdiffrast has no built-in camera models, lighting/material models, etc. Instead, the provided operations encapsulate only the most graphics-centric steps in the modern hardware graphics pipeline: rasterization, interpolation, texturing, and antialiasing. All of these operations (and their gradients) are GPU-accelerated, either via CUDA or via the hardware graphics pipeline.</p>
This documentation is intended to serve as a user's guide to nvdiffrast. For detailed discussion on the design principles, implementation details, and benchmarks, please see our paper:
<blockquote>
<strong>Modular Primitives for High-Performance Differentiable Rendering</strong><br> Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, Timo Aila<br> ACM Transactions on Graphics 39(6) (proc. SIGGRAPH Asia 2020)
@@ -364,19 +365,19 @@ Examples of things we've done with nvdiffrast
</div>
</div>
<h2 id="installation">Installation</h2>
<p>Requirements:</p>
<p>Minimum requirements:</p>
<ul>
<li>Linux or Windows operating system.</li>
<li>64-bit Python 3.6 or 3.7. We recommend Anaconda3 with numpy 1.14.3 or newer.</li>
<li>PyTorch 1.6 (recommended) or TensorFlow 1.14. TensorFlow 2.x is currently not supported.</li>
<li>A high-end NVIDIA GPU, NVIDIA drivers, CUDA 10.2 toolkit, and cuDNN 7.6.</li>
<li>64-bit Python 3.6.</li>
<li>PyTorch (recommended) 1.6 or TensorFlow 1.14. TensorFlow 2.x is currently not supported.</li>
<li>A high-end NVIDIA GPU, NVIDIA drivers, CUDA 10.2 toolkit.</li>
</ul>
<p>To download nvdiffrast, either download the repository at <a href="https://github.com/NVlabs/nvdiffrast">https://github.com/NVlabs/nvdiffrast</a> as a .zip file, or clone the repository using git:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://github.com/NVlabs/nvdiffrast</span></code></pre></div>
<h3 id="linux">Linux</h3>
<p>We recommend running nvdiffrast on <a href="https://www.docker.com/">Docker</a>. To build a Docker image with nvdiffrast and PyTorch 1.6 installed, run:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="ex">./run_sample.sh</span> --build-container</span></code></pre></div>
<p>We recommend using Ubuntu, as some Linux distributions might not have all the required packages available — at least CentOS is reportedly problematic.</p>
<p>To try out some of the provided code examples, run:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="ex">./run_sample.sh</span> ./samples/torch/cube.py --resolution 32</span></code></pre></div>
<p>Alternatively, if you have all the dependencies taken care of (consult the included Dockerfile for reference), you can install nvdiffrast in your local Python site-packages by running</p>
@@ -401,7 +402,7 @@ Examples of things we've done with nvdiffrast
<p>The rasterization operation takes as inputs a tensor of vertex positions and a tensor of vertex index triplets that specify the triangles. Vertex positions are specified in clip space, i.e., after modelview and projection transformations. Performing these transformations is left as the user's responsibility. In clip space, the view frustum is a cube in homogeneous coordinates where <span class="math inline"><em>x</em>/<em>w</em></span>, <span class="math inline"><em>y</em>/<em>w</em></span>, <span class="math inline"><em>z</em>/<em>w</em></span> are all between -1 and +1.</p>
<p>The output of the rasterization operation is a 4-channel float32 image with tuple (<span class="math inline"><em>u</em></span>, <span class="math inline"><em>v</em></span>, <span class="math inline"><em>z</em>/<em>w</em></span>, <span class="math inline"><em>t</em><em>r</em><em>i</em><em>a</em><em>n</em><em>g</em><em>l</em><em>e</em>_<em>i</em><em>d</em></span>) in each pixel. Values <span class="math inline"><em>u</em></span> and <span class="math inline"><em>v</em></span> are the barycentric coordinates within a triangle: the first vertex in the vertex index triplet obtains <span class="math inline">(<em>u</em>, <em>v</em>) = (1, 0)</span>, the second vertex <span class="math inline">(<em>u</em>, <em>v</em>) = (0, 1)</span> and the third vertex <span class="math inline">(<em>u</em>, <em>v</em>) = (0, 0)</span>. Normalized depth value <span class="math inline"><em>z</em>/<em>w</em></span> is used later by the antialiasing operation to infer occlusion relations between triangles, and it does not propagate gradients to the vertex position input. Field <span class="math inline"><em>t</em><em>r</em><em>i</em><em>a</em><em>n</em><em>g</em><em>l</em><em>e</em>_<em>i</em><em>d</em></span> is the triangle index, offset by one. Pixels where no triangle was rasterized will receive a zero in all channels.</p>
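<p>As a concrete illustration, a minimal PyTorch sketch of this step could look as follows. The tensor names <code>pos_obj</code>, <code>mvp</code>, and <code>tri</code> are placeholders assumed for this example; only the <code>nvdiffrast.torch</code> calls are part of the actual API.</p>
<pre><code>import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeCudaContext()   # or dr.RasterizeGLContext()

# pos_obj: [num_vertices, 3] float32 object-space positions on the GPU,
# mvp: assumed 4x4 modelview-projection matrix. The clip-space transform is the user's job.
pos_homo = torch.cat([pos_obj, torch.ones_like(pos_obj[:, :1])], dim=1)  # homogeneous coordinates
pos_clip = torch.matmul(pos_homo, mvp.t())[None, ...]                    # clip space, add minibatch axis

# tri: [num_triangles, 3] int32 vertex index triplets on the GPU
rast, _ = dr.rasterize(glctx, pos_clip, tri, resolution=(256, 256))
# rast[..., 0:2] = barycentrics (u, v), rast[..., 2] = z/w, rast[..., 3] = triangle_id (0 = empty pixel)</code></pre>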
<p>Rasterization is point-sampled, i.e., the geometry is not smoothed, blurred, or made partially transparent in any way, in contrast to some previous differentiable rasterizers. The contents of a pixel always represent a single surface point that is on the closest surface visible along the ray through the pixel center.</p>
<p>Point-sampled coverage does not produce vertex position gradients related to occlusion and visibility effects. This is because the motion of vertices does not change the coverage in a continuous way — a triangle is either rasterized into a pixel or not. In nvdiffrast, the occlusion/visibility related gradients are generated in the antialiasing operation that typically occurs towards the end of the rendering pipeline.</p>
<div class="image-parent">
<div class="image-row">
<div class="image-caption">
@@ -464,9 +465,9 @@ Background replaced with white
<div class="sourceCode" id="cb7"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>img_right <span class="op">=</span> torch.where(rast_out[..., <span class="dv">3</span>:] <span class="op">&gt;</span> <span class="dv">0</span>, img_left, torch.tensor(<span class="fl">1.0</span>).cuda())</span></code></pre></div>
<p>where <code>rast_out</code> is the output of the rasterization operation. We simply test if the <span class="math inline"><em>t</em><em>r</em><em>i</em><em>a</em><em>n</em><em>g</em><em>l</em><em>e</em>_<em>i</em><em>d</em></span> field, i.e., channel 3 of the rasterizer output, is greater than zero, indicating that a triangle was rendered in that pixel. If so, we take the color from the textured image, and otherwise we take constant 1.0.</p>
<h3 id="antialiasing">Antialiasing</h3>
<p>The last of the four primitive operations in nvdiffrast is antialiasing. Based on the geometry input (vertex positions and triangles), it will smooth out discontinuities at silhouette edges in a given image. The smoothing is based on a local approximation of coverage — an approximate integral over a pixel is calculated based on the exact location of relevant edges and the point-sampled colors at pixel centers.</p>
<p>In this context, a silhouette is any edge that connects to just one triangle, or connects two triangles so that one folds behind the other. Specifically, this includes both silhouettes against the background and silhouettes against another surface, unlike some previous methods (<a href="https://github.com/nv-tlabs/DIB-R">DIB-R</a>) that only support the former kind.</p>
<p>It is worth discussing why we might want to go through this trouble to improve the image a tiny bit. If we're attempting to, say, match a real-world photograph, a slightly smoother edge probably won't match the captured image much better than a jagged one. However, that is not the point of the antialiasing operation — the real goal is to obtain gradients w.r.t. vertex positions related to occlusion, visibility, and coverage.</p>
<p>Remember that everything up to this point in the rendering pipeline is point-sampled. In particular, the coverage, i.e., which triangle is rasterized to which pixel, changes discontinuously in the rasterization operation.</p>
<p>This is the reason why previous differentiable rasterizers apply a nonstandard image synthesis model with blur and transparency: Something has to make coverage continuous w.r.t. vertex positions if we wish to optimize vertex positions, camera position, etc., based on an image-space loss. In nvdiffrast, we do everything point-sampled so that we know that every pixel corresponds to a single, well-defined surface point. This lets us perform arbitrary shading computations without worrying about things like accidentally blurring texture coordinates across silhouettes, or having attributes mysteriously tend towards background color when getting close to the edge of the object. Only towards the end of the pipeline does the antialiasing operation ensure that the motion of vertex positions results in continuous change on silhouettes.</p>
<p>The antialiasing operation supports any number of channels in the image to be antialiased. Thus, if your rendering pipeline produces an abstract representation that is fed to a neural network for further processing, that is not a problem.</p>
@@ -492,7 +493,7 @@ Closeup, after AA
</div>
</div>
</div>
<p>The left image above shows the result image from the last step, after performing antialiasing. The effect is quite small — some boundary pixels become less jagged, as shown in the closeups.</p>
<p>Notably, not all boundary pixels are antialiased as revealed by the left-side image below. This is because the accuracy of the antialiasing operation in nvdiffrast depends on the rendered size of triangles: Because we store knowledge of just one surface point per pixel, antialiasing is possible only when the triangle that contains the actual geometric silhouette edge is visible in the image. The example image is rendered in very low resolution and the triangles are tiny compared to pixels. Thus, triangles get easily lost between the pixels.</p>
<p>This results in incomplete-looking antialiasing, and the gradients provided by antialiasing become noisier when edge triangles are missed. Therefore it is advisable to render images in resolutions where the triangles are large enough to show up in the image at least most of the time.</p>
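<p>A hedged sketch of the workaround mentioned above: render at a higher resolution and average-pool the result down. Here <code>render()</code> stands for the user's own pipeline (rasterize, interpolate, texture, antialias), and the tensors use the usual NHWC layout.</p>
<pre><code>import torch.nn.functional as F

H, W = 128, 128
img_hi = render(glctx, pos_clip, tri, resolution=(4 * H, 4 * W))    # [minibatch, 4H, 4W, channels]
img_lo = F.avg_pool2d(img_hi.permute(0, 3, 1, 2), kernel_size=4)    # NHWC to NCHW, 4x4 box filter
img_lo = img_lo.permute(0, 2, 3, 1)                                 # back to NHWC</code></pre>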
<div class="image-parent">
@@ -516,7 +517,7 @@ Rendered in 4×4 higher resolution and downsampled
<h2 id="beyond-the-basics">Beyond the basics</h2>
<p>Rendering images is easy with nvdiffrast, but there are a few practical things that you will need to take into account. The topics in this section explain the operation and usage of nvdiffrast in more detail, and hopefully help you avoid any potential misunderstandings and pitfalls.</p>
<h3 id="coordinate-systems">Coordinate systems</h3>
<p>Nvdiffrast follows OpenGL's coordinate systems and other conventions. This is partially because we use OpenGL to accelerate the rasterization operation, but mostly so that there is a <a href="https://xkcd.com/927/">single standard to follow</a>.</p>
<p>Nvdiffrast follows OpenGL's coordinate systems and other conventions. This is partially because we support OpenGL to accelerate the rasterization operation, but mostly so that there is a <a href="https://xkcd.com/927/">single standard to follow</a>.</p>
<ul>
<li>
In OpenGL convention, the perspective projection matrix (as implemented in, e.g., <a href="https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/util.py#L16-L20"><code>utils.projection()</code></a> in our samples and <a href="https://www.khronos.org/registry/OpenGL-Refpages/gl2.1/xhtml/glFrustum.xml"><code>glFrustum()</code></a> in OpenGL) treats the view-space <span class="math inline"><em>z</em></span> as increasing towards the viewer. However, <em>after</em> multiplication by perspective projection matrix, the homogeneous <a href="https://en.wikipedia.org/wiki/Clip_coordinates">clip-space</a> coordinate <span class="math inline"><em>z</em></span>/<span class="math inline"><em>w</em></span> increases away from the viewer. Hence, a larger depth value in the rasterizer output tensor also corresponds to a surface further away from the viewer.
@@ -537,7 +538,7 @@ For 2D textures, the coordinate origin <span class="math inline">(<em>s</em>, 
<p>We skirted around a pretty fundamental question in the description of the texturing operation above. In order to determine the proper amount of prefiltering for sampling a texture, we need to know how densely it is being sampled. But how can we know the sampling density when each pixel knows of just a single surface point?</p>
<p>The solution is to track the image-space derivatives of all things leading up to the texture sampling operation. <em>These are not the same thing as the gradients used in the backward pass</em>, even though they both involve differentiation! Consider the barycentrics <span class="math inline">(<em>u</em>, <em>v</em>)</span> produced by the rasterization operation. They change by some amount when moving horizontally or vertically in the image plane. If we denote the image-space coordinates as <span class="math inline">(<em>X</em>, <em>Y</em>)</span>, the image-space derivatives of the barycentrics would be <span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, and <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>. We can organize these into a 2×2 Jacobian matrix that describes the local relationship between <span class="math inline">(<em>u</em>, <em>v</em>)</span> and <span class="math inline">(<em>X</em>, <em>Y</em>)</span>. This matrix is generally different at every pixel. For the purpose of image-space derivatives, the units of <span class="math inline"><em>X</em></span> and <span class="math inline"><em>Y</em></span> are pixels. Hence, <span class="math inline">∂<em>u</em>/∂<em>X</em></span> is the local approximation of how much <span class="math inline"><em>u</em></span> changes when moving a distance of one pixel in the horizontal direction, and so on.</p>
<p>Once we know how the barycentrics change w.r.t. pixel position, the interpolation operation can use this to determine how the attributes change w.r.t. pixel position. When attributes are used as texture coordinates, we can therefore tell how the texture sampling position (in texture space) changes when moving around within the pixel (up to a local, linear approximation, that is). This <em>texture footprint</em> tells us the scale on which the texture should be prefiltered. In more practical terms, it tells us which mipmap level(s) to use when sampling the texture.</p>
<p>In nvdiffrast, the rasterization operation can be configured to output the image-space derivatives of the barycentrics in an auxiliary 4-channel output tensor, ordered (<span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>) from channel 0 to 3. The interpolation operation can take this auxiliary tensor as input and compute image-space derivatives of any set of attributes being interpolated. Finally, the texture sampling operation can use the image-space derivatives of the texture coordinates to determine the amount of prefiltering.</p>
<p>In nvdiffrast, the rasterization operation outputs the image-space derivatives of the barycentrics in an auxiliary 4-channel output tensor, ordered (<span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>) from channel 0 to 3. The interpolation operation can take this auxiliary tensor as input and compute image-space derivatives of any set of attributes being interpolated. Finally, the texture sampling operation can use the image-space derivatives of the texture coordinates to determine the amount of prefiltering.</p>
<p>There is nothing magic about these image-space derivatives. They are tensors just like, e.g., the texture coordinates themselves; they propagate gradients backwards, and so on. For example, if you want to artificially blur or sharpen the texture when sampling it, you can simply multiply the tensor carrying the image-space derivatives of the texture coordinates <span class="math inline">∂{<em>s</em>, <em>t</em>}/∂{<em>X</em>, <em>Y</em>}</span> by a scalar value before feeding it into the texture sampling operation. This scales the texture footprints and thus adjusts the amount of prefiltering. If your loss function prefers a different level of sharpness, this multiplier will receive a nonzero gradient. <em>Update:</em> Since version 0.2.1, the texture sampling operation also supports a separate mip level bias input that would be better suited for this particular task, but the gist is the same nonetheless.</p>
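<p>A short sketch of the full chain, with parameter names as in the <code>nvdiffrast.torch</code> API (the inputs <code>uv</code>, <code>uv_idx</code>, and <code>tex</code> are assumed): the rasterizer outputs the derivative tensor, the interpolation operation turns it into texture-coordinate derivatives, and the texture operation uses the (optionally scaled) derivatives to pick mipmap levels.</p>
<pre><code>rast, rast_db = dr.rasterize(glctx, pos_clip, tri, resolution=(256, 256))
texc, texc_db = dr.interpolate(uv, rast, uv_idx, rast_db=rast_db, diff_attrs='all')
color = dr.texture(tex, texc, uv_da=texc_db * 2.0,               # scaling the footprints blurs the lookup
                   filter_mode='linear-mipmap-linear')</code></pre>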
<p>One might wonder if it would have been easier to determine the texture footprints simply from the texture coordinates in adjacent pixels, and skip all this derivative rubbish? In easy cases the answer is yes, but silhouettes, occlusions, and discontinuous texture parameterizations would make this approach rather unreliable in practice. Computing the image-space derivatives analytically keeps everything point-like, local, and well-behaved.</p>
<p>It should be noted that computing gradients related to image-space derivatives is somewhat involved and requires additional computation. At the same time, they are often not crucial for the convergence of the training/optimization. Because of this, the primitive operations in nvdiffrast offer options to disable the calculation of these gradients. We're talking about things like <span class="math inline">∂<em>L</em><em>o</em><em>s</em><em>s</em>/∂(∂{<em>u</em>, <em>v</em>}/∂{<em>X</em>, <em>Y</em>})</span> that may look second-order-ish, but they're not.</p>
@@ -731,13 +732,21 @@ Mip level 5
</tr>
</table>
</div>
<p>Scaling the atlas to, say, 256×32 pixels would feel silly because the dimensions of the sub-images are perfectly fine, and downsampling the different sub-images together — which would happen after the 5×1 resolution — would not make sense anyway. For this reason, the texture sampling operation allows the user to specify the maximum number of mipmap levels to be constructed and used. In this case, setting <code>max_mip_level=5</code> would stop at the 5×1 mipmap and prevent the error.</p>
<p>It is a deliberate design choice that nvdiffrast doesn't just stop automatically at a mipmap size it cannot downsample, but requires the user to specify a limit when the texture dimensions are not powers of two. The goal is to avoid bugs where prefiltered texture sampling mysteriously doesn't work due to an oddly sized texture. It would be confusing if a 256×256 texture gave beautifully prefiltered texture samples, a 255×255 texture suddenly had no prefiltering at all, and a 254×254 texture did just a bit of prefiltering (one level) but not more.</p>
<p>If you compute your own mipmaps, their sizes must follow the scheme described above. There is no need to specify mipmaps all the way to 1×1 resolution, but the stack can end at any point and it will work equivalently to an internally constructed mipmap stack with a <code>max_mip_level</code> limit. Importantly, the gradients of user-provided mipmaps are not propagated automatically to the base texture — naturally so, because nvdiffrast knows nothing about the relation between them. Instead, the tensors that specify the mip levels in a user-provided mipmap stack will receive gradients of their own.</p>
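<p>For example, if the atlas were a 160×32 texture holding five 32×32 sub-images (dimensions assumed here purely for illustration), limiting the mipmap stack would look like this:</p>
<pre><code>color = dr.texture(tex_atlas, texc, uv_da=texc_db,
                   filter_mode='linear-mipmap-linear', max_mip_level=5)  # stop at the 5x1 mipmap</code></pre>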
<h3 id="rasterizing-with-cuda-vs-opengl-new">Rasterizing with CUDA vs OpenGL <span style="color:red">(New!)</span></h3>
<p>Since version 0.3.0, nvdiffrast on PyTorch supports executing the rasterization operation using either CUDA or OpenGL. Earlier versions and the Tensorflow bindings support OpenGL only.</p>
<p>When rasterization is executed on OpenGL, we use the GPU's graphics pipeline to determine which triangles land on which pixels. GPUs have amazingly efficient hardware for doing this — it is their original <i>raison d'être</i> — and thus it makes sense to exploit it. Unfortunately, some computing environments haven't been designed with this in mind, and it can be difficult to get OpenGL to work correctly and interoperate with CUDA cleanly. On Windows, compatibility is generally good because the GPU drivers required to run CUDA also include OpenGL support. Linux is more complicated, as various drivers can be installed separately and there isn't a standardized way to acquire access to the hardware graphics pipeline.</p>
<p>Rasterizing in CUDA pretty much reverses these considerations. Compatibility is obviously not an issue on any CUDA-enabled platform. On the other hand, implementing the rasterization process correctly and efficiently on a massively data-parallel programming model is non-trivial. The CUDA rasterizer in nvdiffrast follows the approach described in the research paper <em>High-Performance Software Rasterization on GPUs</em> by Laine and Karras, HPG 2011. Our code is based on the paper's publicly released CUDA kernels, with considerable modifications to support current hardware architectures and to match nvdiffrast's needs.</p>
<p>The CUDA rasterizer does not support output resolutions greater than 2048×2048, and both dimensions must be multiples of 8. In addition, the number of triangles that can be rendered in one batch is limited to around 16 million. Subpixel precision is limited to 4 bits and depth peeling is less accurate than with OpenGL. Memory consumption depends on many factors.</p>
<p>It is difficult to predict which rasterizer offers better performance. For complex meshes and high resolutions OpenGL will most likely outperform the CUDA rasterizer, although it has certain overheads that the CUDA rasterizer does not have. For simple meshes and low resolutions the CUDA rasterizer may be faster, but it has its own overheads, too. Measuring the performance on actual data, on the target platform, and in the context of the entire program is the only way to know for sure.</p>
<p>To run rasterization in CUDA, create a <code>RasterizeCudaContext</code> and supply it to the <code>rasterize()</code> operation. For OpenGL, use a <code>RasterizeGLContext</code> instead. Easy!</p>
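<p>A minimal sketch (with <code>pos_clip</code> and <code>tri</code> assumed to be prepared as before):</p>
<pre><code>glctx = dr.RasterizeCudaContext()    # rasterize in CUDA
# glctx = dr.RasterizeGLContext()    # ...or via OpenGL
rast, rast_db = dr.rasterize(glctx, pos_clip, tri, resolution=(512, 512))</code></pre>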
<h3 id="running-on-multiple-gpus">Running on multiple GPUs</h3>
<p>Nvdiffrast supports computation on multiple GPUs in both PyTorch and TensorFlow. As is the convention in PyTorch, the operations are always executed on the device on which the input tensors reside. All GPU input tensors must reside on the same device, and the output tensors will unsurprisingly end up on that same device. In addition, the rasterization operation requires that its OpenGL context was created for the correct device. In TensorFlow, the OpenGL context is automatically created on the device of the rasterization operation when it is executed for the first time.</p>
<p>On Windows, nvdiffrast implements OpenGL device selection in a way that can be done only once per process — after one context is created, all future ones will end up on the same GPU. Hence you cannot expect to run the rasterization operation on multiple GPUs within the same process. Trying to do so will either cause a crash or incur a significant performance penalty. However, with PyTorch it is common to distribute computation across GPUs by launching a separate process for each GPU, so this is not a huge concern. Note that any OpenGL context created within the same process, even for something like a GUI window, will prevent changing the device later. Therefore, if you want to run the rasterization operation on other than the default GPU, be sure to create its OpenGL context before initializing any other OpenGL-powered libraries.</p>
<p>On Linux everything just works, and you can create rasterizer OpenGL contexts on multiple devices within the same process.</p>
<p>Nvdiffrast supports computation on multiple GPUs in both PyTorch and TensorFlow. As is the convention in PyTorch, the operations are always executed on the device on which the input tensors reside. All GPU input tensors must reside on the same device, and the output tensors will unsurprisingly end up on that same device. In addition, the rasterization operation requires that its context was created for the correct device. In TensorFlow, the rasterizer context is automatically created on the device of the rasterization operation when it is executed for the first time.</p>
<p><i>The remainder of this section applies only to OpenGL rasterizer contexts. CUDA rasterizer contexts require no special considerations besides making sure they're on the correct device.</i></p>
<p>On Windows, nvdiffrast implements OpenGL device selection in a way that can be done only once per process — after one context is created, all future ones will end up on the same GPU. Hence you cannot expect to run the rasterization operation on multiple GPUs within the same process using an OpenGL context. Trying to do so will either cause a crash or incur a significant performance penalty. However, with PyTorch it is common to distribute computation across GPUs by launching a separate process for each GPU, so this is not a huge concern. Note that any OpenGL context created within the same process, even for something like a GUI window, will prevent changing the device later. Therefore, if you want to run the rasterization operation on other than the default GPU, be sure to create its OpenGL context before initializing any other OpenGL-powered libraries.</p>
<p>On Linux everything just works, and you can create OpenGL rasterizer contexts on multiple devices within the same process.</p>
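<p>For example, to run on a specific GPU, create the context on that device and keep all input tensors on the same device (device selection as documented in the API reference below; the tensor names are assumed):</p>
<pre><code>glctx = dr.RasterizeCudaContext(device='cuda:1')   # or dr.RasterizeGLContext(device='cuda:1') on Linux
rast, _ = dr.rasterize(glctx, pos_clip.to('cuda:1'), tri.to('cuda:1'), resolution=(256, 256))</code></pre>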
<h4 id="note-on-torch.nn.dataparallel">Note on torch.nn.DataParallel</h4>
<p>PyTorch offers the <code>torch.nn.DataParallel</code> wrapper class for splitting the execution of a minibatch across multiple threads. Unfortunately, this class is fundamentally incompatible with OpenGL-dependent operations, as it spawns a new set of threads at each call (as of PyTorch 1.9.0, at least). Using previously created OpenGL contexts in these new threads, even when taking care not to use the same context in multiple threads, causes the contexts to be migrated around, which has resulted in ever-growing GPU memory usage and abysmal GPU utilization. Therefore, we advise against using <code>torch.nn.DataParallel</code> for rasterization operations that depend on the OpenGL contexts.</p>
<p>Notably, <code>torch.nn.DistributedDataParallel</code> spawns subprocesses that are much more persistent. The subprocesses must create their own OpenGL contexts as part of initialization, and as such they do not suffer from this problem.</p>
@@ -773,8 +782,8 @@ Third depth layer
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> rast, rast_db <span class="op">=</span> peeler.rasterize_next_layer()</span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a> (process <span class="kw">or</span> store the results)</span></code></pre></div>
<p>There is no performance penalty compared to the basic rasterization op if you end up extracting only the first depth layer. In other words, the code above with <code>num_layers=1</code> runs exactly as fast as calling <code>rasterize</code> once.</p>
<p>Depth peeling is only supported in the PyTorch version of nvdiffrast. For implementation reasons, depth peeling reserves the OpenGL context so that other rasterization operations cannot be performed while the peeling is ongoing, i.e., inside the <code>with</code> block. Hence you cannot start a nested depth peeling operation or call <code>rasterize</code> inside the <code>with</code> block, unless you use a different OpenGL context.</p>
<p>For the sake of completeness, let us note the following small caveat: Depth peeling relies on depth values to distinguish surface points from each other. Therefore, culling "previously rendered surface points" actually means culling all surface points at the same or closer depth as those rendered into the pixel in previous passes. This matters only if you have multiple layers of geometry at matching depths — if your geometry consists of, say, nothing but two exactly overlapping triangles, you will see one of them in the first pass but never see the other one in subsequent passes, as it's at the exact depth that is already considered done.</p>
<p>Depth peeling is only supported in the PyTorch version of nvdiffrast. For implementation reasons, depth peeling reserves the rasterizer context so that other rasterization operations cannot be performed while the peeling is ongoing, i.e., inside the <code>with</code> block. Hence you cannot start a nested depth peeling operation or call <code>rasterize</code> inside the <code>with</code> block unless you use a different context.</p>
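<p>For reference, a complete peeling loop might look like the sketch below. The opening lines of the code excerpt above are cut off by the diff, so the class name <code>DepthPeeler</code> and its exact arguments are assumptions here.</p>
<pre><code>num_layers = 3
with dr.DepthPeeler(glctx, pos_clip, tri, resolution=(256, 256), num_layers=num_layers) as peeler:
    for i in range(num_layers):
        rast, rast_db = peeler.rasterize_next_layer()
        # process or store the results of depth layer i</code></pre>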
<h3 id="differences-between-pytorch-and-tensorflow">Differences between PyTorch and TensorFlow</h3>
<p>Nvdiffrast can be used from PyTorch and from TensorFlow 1.x; the latter may change to TensorFlow 2.x if there is demand. These frameworks operate somewhat differently and that is reflected in the respective APIs. Simplifying a bit, in TensorFlow 1.x you construct a persistent graph out of persistent nodes, and run many batches of data through it. In PyTorch, there is no persistent graph or nodes, but a new, ephemeral graph is constructed for each batch of data and destroyed immediately afterwards. Therefore, there is also no persistent state for the operations. There is the <code>torch.nn.Module</code> abstraction for festooning operations with persistent state, but we do not use it.</p>
<p>As a consequence, things that would be part of persistent state of an nvdiffrast operation in TensorFlow must be stored by the user in PyTorch, and supplied to the operations as needed. In practice, this is a very small difference and amounts to just a couple of lines of code in most cases.</p>
@@ -788,11 +797,14 @@ Third depth layer
<p>In manual mode, the user assumes the responsibility of setting and releasing the OpenGL context. Most of the time, if you don't have any other libraries that would be using OpenGL, you can just set the context once after having created it and keep it set until the program exits. However, keep in mind that the active OpenGL context is a thread-local resource, so it needs to be set in the same CPU thread as it will be used, and it cannot be set simultaneously in multiple CPU threads.</p>
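<p>A sketch of manual mode (the method names <code>set_context()</code> and <code>release_context()</code> are assumptions based on the API reference below, which is truncated in this diff):</p>
<pre><code>glctx = dr.RasterizeGLContext(mode='manual')
glctx.set_context()        # make the OpenGL context current in this CPU thread
rast, _ = dr.rasterize(glctx, pos_clip, tri, resolution=(256, 256))
glctx.release_context()    # release it before any other thread sets it</code></pre>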
<h2 id="samples">Samples</h2>
<p>Nvdiffrast comes with a set of samples that were crafted to support the research paper. Each sample is available in both PyTorch and TensorFlow versions. Details such as command-line parameters, logging format, etc., may not be identical between the versions, and generally the PyTorch versions should be considered definitive. The command-line examples below are for the PyTorch versions.</p>
<p>Enabling interactive display using the <code>--display-interval</code> parameter works on Windows but is likely to fail on Linux. Our Dockerfile is set up to support headless rendering only, and thus cannot show an interactive result window.</p>
<p>All PyTorch samples support selecting between CUDA and OpenGL rasterizer contexts. The default is to do rasterization in CUDA, and switching to OpenGL is done by specifying command-line option <code>--opengl</code>.</p>
<p>Enabling interactive display using the <code>--display-interval</code> parameter is likely to fail on Linux when using OpenGL rasterization. This is because the interactive display window is shown using OpenGL, and on Linux this conflicts with the internal OpenGL rasterization in nvdiffrast. Using a CUDA context should work, assuming that OpenGL is correctly installed in the system (for displaying the window). Our Dockerfile is set up to support headless rendering only, and thus cannot show an interactive result window.</p>
<h3 id="triangle.py"><a href="https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/triangle.py">triangle.py</a></h3>
<p>This is a minimal sample that renders a triangle and saves the resulting image into a file (<code>tri.png</code>) in the current directory. Running this should be the first step to verify that you have everything set up correctly. Rendering is done using the rasterization and interpolation operations, so getting the correct output image means that both OpenGL and CUDA are working as intended under the hood.</p>
<p>Example command line:</p>
<pre><code>python triangle.py</code></pre>
<p>This is a minimal sample that renders a triangle and saves the resulting image into a file (<code>tri.png</code>) in the current directory. Running this should be the first step to verify that you have everything set up correctly. Rendering is done using the rasterization and interpolation operations, so getting the correct output image means that both OpenGL (if specified on command line) and CUDA are working as intended under the hood.</p>
<p>This is the only sample where you must specify either <code>--cuda</code> or <code>--opengl</code> on command line. Other samples default to CUDA rasterization and provide only the <code>--opengl</code> option.</p>
<p>Example command lines:</p>
<pre><code>python triangle.py --cuda
python triangle.py --opengl</code></pre>
<div class="image-parent">
<div class="image-row">
<div class="image-caption">
@@ -901,6 +913,12 @@ Interactive view of pose.py
<p>The interactive view shows, from left to right: target pose, best found pose, and current pose. When viewed live, the two stages of optimization are clearly visible. In the first phase, the best pose updates intermittently when a better initialization is found. In the second phase, the solution converges smoothly to the target via gradient-based optimization.</p>
<h2 id="pytorch-api-reference">PyTorch API reference</h2>
<div style="padding-top: 1em;">
<div class="apifunc"><h4><code>nvdiffrast.torch.RasterizeCudaContext(<em>device</em>=<span class="defarg">None</span>)</code>&nbsp;<span class="sym_class">Class</span></h4>
<p class="shortdesc">Create a new Cuda rasterizer context.</p><p class="longdesc">The context is deleted and internal storage is released when the object is
destroyed.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">device</td><td class="arg_short">Cuda device on which the context is created. Type can be
<code>torch.device</code>, string (e.g., <code>'cuda:1'</code>), or int. If not
specified, context will be created on currently active Cuda
device.</td></tr></table><div class="returns">Returns:<div class="return_description">The newly created Cuda rasterizer context.</div></div></div>
<div class="apifunc"><h4><code>nvdiffrast.torch.RasterizeGLContext(<em>output_db</em>=<span class="defarg">True</span>, <em>mode</em>=<span class="defarg">'automatic'</span>, <em>device</em>=<span class="defarg">None</span>)</code>&nbsp;<span class="sym_class">Class</span></h4>
<p class="shortdesc">Create a new OpenGL rasterizer context.</p><p class="longdesc">Creating an OpenGL context is a slow operation so you should usually reuse the same
context in all calls to <code>rasterize()</code> on the same CPU thread. The OpenGL context
@@ -918,13 +936,13 @@ device.</td></tr></table><div class="methods">Methods, only available if context
<div class="apifunc"><h4><code>nvdiffrast.torch.rasterize(<em>glctx</em>, <em>pos</em>, <em>tri</em>, <em>resolution</em>, <em>ranges</em>=<span class="defarg">None</span>, <em>grad_db</em>=<span class="defarg">True</span>)</code>&nbsp;<span class="sym_function">Function</span></h4>
<p class="shortdesc">Rasterize triangles.</p><p class="longdesc">All input tensors must be contiguous and reside in GPU memory except for
the <code>ranges</code> tensor that, if specified, has to reside in CPU memory. The
output tensors will be contiguous and reside in GPU memory.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">glctx</td><td class="arg_short">OpenGL context of type <code>RasterizeGLContext</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor with dtype <code>torch.float32</code>. To enable range
output tensors will be contiguous and reside in GPU memory.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">glctx</td><td class="arg_short">Rasterizer context of type <code>RasterizeGLContext</code> or <code>RasterizeCudaContext</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor with dtype <code>torch.float32</code>. To enable range
mode, this tensor should have a 2D shape [num_vertices, 4]. To enable
instanced mode, use a 3D shape [minibatch_size, num_vertices, 4].</td></tr><tr class="arg"><td class="argname">tri</td><td class="arg_short">Triangle tensor with shape [num_triangles, 3] and dtype <code>torch.int32</code>.</td></tr><tr class="arg"><td class="argname">resolution</td><td class="arg_short">Output resolution as integer tuple (height, width).</td></tr><tr class="arg"><td class="argname">ranges</td><td class="arg_short">In range mode, tensor with shape [minibatch_size, 2] and dtype
<code>torch.int32</code>, specifying start indices and counts into <code>tri</code>.
Ignored in instanced mode.</td></tr><tr class="arg"><td class="argname">grad_db</td><td class="arg_short">Propagate gradients of image-space derivatives of barycentrics
into <code>pos</code> in backward pass. Ignored if OpenGL context was
not configured to output image-space derivatives.</td></tr></table><div class="returns">Returns:<div class="return_description">A tuple of two tensors. The first output tensor has shape [minibatch_size,
into <code>pos</code> in backward pass. Ignored if using an OpenGL context that
was not configured to output image-space derivatives.</td></tr></table><div class="returns">Returns:<div class="return_description">A tuple of two tensors. The first output tensor has shape [minibatch_size,
height, width, 4] and contains the main rasterizer output in order (u, v, z/w,
triangle_id). If the OpenGL context was configured to output image-space
derivatives of barycentrics, the second output tensor will also have shape
@@ -991,7 +1009,13 @@ constant. This avoids reconstructing it every time <code>texture()</code> is cal
in the <code>mip</code> argument.</div></div></div>
<div class="apifunc"><h4><code>nvdiffrast.torch.antialias(<em>color</em>, <em>rast</em>, <em>pos</em>, <em>tri</em>, <em>topology_hash</em>=<span class="defarg">None</span>, <em>pos_gradient_boost</em>=<span class="defarg">1.0</span>)</code>&nbsp;<span class="sym_function">Function</span></h4>
<p class="shortdesc">Perform antialiasing.</p><p class="longdesc">All input tensors must be contiguous and reside in GPU memory. The output tensor
will be contiguous and reside in GPU memory.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">color</td><td class="arg_short">Input image to antialias with shape [minibatch_size, height, width, num_channels].</td></tr><tr class="arg"><td class="argname">rast</td><td class="arg_short">Main output tensor from <code>rasterize()</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor used in the rasterization operation.</td></tr><tr class="arg"><td class="argname">tri</td><td class="arg_short">Triangle tensor used in the rasterization operation.</td></tr><tr class="arg"><td class="argname">topology_hash</td><td class="arg_short">(Optional) Preconstructed topology hash for the triangle tensor. If not
will be contiguous and reside in GPU memory.</p><p class="longdesc">Note that silhouette edge determination is based on vertex indices in the triangle
tensor. For it to work properly, a vertex belonging to multiple triangles must be
referred to using the same vertex index in each triangle. Otherwise, nvdiffrast will always
classify the adjacent edges as silhouette edges, which leads to bad performance and
potentially incorrect gradients. If you are unsure whether your data is good, check
which pixels are modified by the antialias operation and compare to the example in the
documentation.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">color</td><td class="arg_short">Input image to antialias with shape [minibatch_size, height, width, num_channels].</td></tr><tr class="arg"><td class="argname">rast</td><td class="arg_short">Main output tensor from <code>rasterize()</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor used in the rasterization operation.</td></tr><tr class="arg"><td class="argname">tri</td><td class="arg_short">Triangle tensor used in the rasterization operation.</td></tr><tr class="arg"><td class="argname">topology_hash</td><td class="arg_short">(Optional) Preconstructed topology hash for the triangle tensor. If not
specified, the topology hash is constructed internally and discarded afterwards.</td></tr><tr class="arg"><td class="argname">pos_gradient_boost</td><td class="arg_short">(Optional) Multiplier for gradients propagated to <code>pos</code>.</td></tr></table><div class="returns">Returns:<div class="return_description">A tensor containing the antialiased image with the same shape as <code>color</code> input tensor.</div></div></div>
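<p>Usage sketch: when the triangle tensor stays constant over an optimization loop, the topology hash can be constructed once and reused (the rendering code inside the loop is elided):</p>
<pre><code>topology_hash = dr.antialias_construct_topology_hash(tri)
for it in range(num_iterations):
    # ... rasterize, interpolate, and shade to obtain 'color' and 'rast' ...
    color_aa = dr.antialias(color, rast, pos_clip, tri, topology_hash=topology_hash)</code></pre>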
<div class="apifunc"><h4><code>nvdiffrast.torch.antialias_construct_topology_hash(<em>tri</em>)</code>&nbsp;<span class="sym_function">Function</span></h4>
<p class="shortdesc">Construct a topology hash for a triangle tensor.</p><p class="longdesc">This function can be used for constructing a topology hash for a triangle tensor that is
@@ -1012,7 +1036,7 @@ severity will be silent.</td></tr></table></div>
</div>
<h2 id="licenses">Licenses</h2>
<p>Copyright © 2020, NVIDIA Corporation. All rights reserved.</p>
<p>Copyright © 2020–2022, NVIDIA Corporation. All rights reserved.</p>
<p>This work is made available under the <a href="https://github.com/NVlabs/nvdiffrast/blob/main/LICENSE.txt">Nvidia Source Code License</a>.</p>
<p>For business inquiries, please visit our website and submit the form: <a href="https://www.nvidia.com/en-us/research/inquiries/">NVIDIA Research Licensing</a></p>
<p>We do not currently accept outside contributions in the form of pull requests.</p>
@@ -6,4 +6,4 @@
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.
__version__ = '0.2.8'
__version__ = '0.3.0'
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#pragma once
//------------------------------------------------------------------------
// This is a slimmed-down and modernized version of the original
// CudaRaster codebase that accompanied the HPG 2011 paper
// "High-Performance Software Rasterization on GPUs" by Laine and Karras.
// Modifications have been made to accommodate post-Volta execution model
// with warp divergence. Support for shading, blending, quad rendering,
// and supersampling have been removed as unnecessary for nvdiffrast.
//------------------------------------------------------------------------
namespace CR
{
class RasterImpl;
//------------------------------------------------------------------------
// Interface class to isolate user from implementation details.
//------------------------------------------------------------------------
class CudaRaster
{
public:
    enum
    {
        RenderModeFlag_EnableBackfaceCulling = 1 << 0,  // Enable backface culling.
        RenderModeFlag_EnableDepthPeeling    = 1 << 1,  // Enable depth peeling. Must have a peel buffer set.
    };

public:
    CudaRaster          (void);
    ~CudaRaster         (void);

    void    setViewportSize     (int width, int height, int numImages);    // Width and height must be multiples of tile size (8x8).
    void    setRenderModeFlags  (unsigned int renderModeFlags);            // Affects all subsequent calls to drawTriangles(). Defaults to zero.
    void    deferredClear       (unsigned int clearColor);                 // Clears color and depth buffers during next call to drawTriangles().
    void    setVertexBuffer     (void* vertices, int numVertices);         // GPU pointer managed by caller. Vertex positions in clip space as float4 (x, y, z, w).
    void    setIndexBuffer      (void* indices, int numTriangles);         // GPU pointer managed by caller. Triangle index+color quadruplets as uint4 (idx0, idx1, idx2, color).
    bool    drawTriangles       (const int* ranges, cudaStream_t stream);  // Ranges (offsets and counts) as #triangles entries, not as bytes. If NULL, draw all triangles. Returns false in case of internal overflow.
    void*   getColorBuffer      (void);                                    // GPU pointer managed by CudaRaster.
    void*   getDepthBuffer      (void);                                    // GPU pointer managed by CudaRaster.
    void    swapDepthAndPeel    (void);                                    // Swap depth and peeling buffers.

private:
    CudaRaster              (const CudaRaster&);    // forbidden
    CudaRaster& operator=   (const CudaRaster&);    // forbidden

private:
    RasterImpl* m_impl; // Opaque pointer to implementation.
};
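//------------------------------------------------------------------------
// Usage sketch (illustration only, not part of the original sources);
// d_vertices and d_indices stand for caller-managed GPU buffers:
//
//   CR::CudaRaster cr;
//   cr.setViewportSize(width, height, numImages);   // multiples of tile size (8x8)
//   cr.setVertexBuffer(d_vertices, numVertices);    // float4 clip-space positions
//   cr.setIndexBuffer(d_indices, numTriangles);     // uint4 (idx0, idx1, idx2, color)
//   cr.deferredClear(0u);                           // clear on the next draw
//   bool ok = cr.drawTriangles(NULL, stream);       // NULL ranges = draw all triangles
//   void* color = cr.getColorBuffer();              // result buffer owned by CudaRaster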
//------------------------------------------------------------------------
} // namespace CR
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
//------------------------------------------------------------------------
__device__ __inline__ void binRasterImpl(const CRParams p)
{
__shared__ volatile U32 s_broadcast [CR_BIN_WARPS + 16];
__shared__ volatile S32 s_outOfs [CR_MAXBINS_SQR];
__shared__ volatile S32 s_outTotal [CR_MAXBINS_SQR];
__shared__ volatile S32 s_overIndex [CR_MAXBINS_SQR];
__shared__ volatile S32 s_outMask [CR_BIN_WARPS][CR_MAXBINS_SQR + 1]; // +1 to avoid bank collisions
__shared__ volatile S32 s_outCount [CR_BIN_WARPS][CR_MAXBINS_SQR + 1]; // +1 to avoid bank collisions
__shared__ volatile S32 s_triBuf [CR_BIN_WARPS*32*4]; // triangle ring buffer
__shared__ volatile U32 s_batchPos;
__shared__ volatile U32 s_bufCount;
__shared__ volatile U32 s_overTotal;
__shared__ volatile U32 s_allocBase;
const CRImageParams& ip = getImageParams(p, blockIdx.z);
CRAtomics& atomics = p.atomics[blockIdx.z];
const U8* triSubtris = (const U8*)p.triSubtris + p.maxSubtris * blockIdx.z;
const CRTriangleHeader* triHeader = (const CRTriangleHeader*)p.triHeader + p.maxSubtris * blockIdx.z;
S32* binFirstSeg = (S32*)p.binFirstSeg + CR_MAXBINS_SQR * CR_BIN_STREAMS_SIZE * blockIdx.z;
S32* binTotal = (S32*)p.binTotal + CR_MAXBINS_SQR * CR_BIN_STREAMS_SIZE * blockIdx.z;
S32* binSegData = (S32*)p.binSegData + p.maxBinSegs * CR_BIN_SEG_SIZE * blockIdx.z;
S32* binSegNext = (S32*)p.binSegNext + p.maxBinSegs * blockIdx.z;
S32* binSegCount = (S32*)p.binSegCount + p.maxBinSegs * blockIdx.z;
if (atomics.numSubtris > p.maxSubtris)
return;
// per-thread state
int thrInBlock = threadIdx.x + threadIdx.y * 32;
int batchPos = 0;
// first 16 elements of s_broadcast are always zero
if (thrInBlock < 16)
s_broadcast[thrInBlock] = 0;
// initialize output linked lists and offsets
if (thrInBlock < p.numBins)
{
binFirstSeg[(thrInBlock << CR_BIN_STREAMS_LOG2) + blockIdx.x] = -1;
s_outOfs[thrInBlock] = -CR_BIN_SEG_SIZE;
s_outTotal[thrInBlock] = 0;
}
// repeat until done
for(;;)
{
// get batch
if (thrInBlock == 0)
s_batchPos = atomicAdd(&atomics.binCounter, ip.binBatchSize);
__syncthreads();
batchPos = s_batchPos;
// all batches done?
if (batchPos >= ip.triCount)
break;
// per-thread state
int bufIndex = 0;
int bufCount = 0;
int batchEnd = min(batchPos + ip.binBatchSize, ip.triCount);
// loop over batch as long as we have triangles in it
do
{
// read more triangles
while (bufCount < CR_BIN_WARPS*32 && batchPos < batchEnd)
{
// get subtriangle count
int triIdx = batchPos + thrInBlock;
int num = 0;
if (triIdx < batchEnd)
num = triSubtris[triIdx];
// cumulative sum of subtriangles within each warp
U32 myIdx = __popc(__ballot_sync(~0u, num & 1) & getLaneMaskLt());
if (__any_sync(~0u, num > 1))
{
myIdx += __popc(__ballot_sync(~0u, num & 2) & getLaneMaskLt()) * 2;
myIdx += __popc(__ballot_sync(~0u, num & 4) & getLaneMaskLt()) * 4;
}
if (threadIdx.x == 31) // Do not assume that last thread in warp wins the write.
s_broadcast[threadIdx.y + 16] = myIdx + num;
__syncthreads();
// cumulative sum of per-warp subtriangle counts
// Note: cannot have more than 32 warps or this needs to sync between each step.
bool act = (thrInBlock < CR_BIN_WARPS);
U32 actMask = __ballot_sync(~0u, act);
if (threadIdx.y == 0 && act)
{
volatile U32* ptr = &s_broadcast[thrInBlock + 16];
U32 val = *ptr;
#if (CR_BIN_WARPS > 1)
val += ptr[-1]; __syncwarp(actMask);
*ptr = val; __syncwarp(actMask);
#endif
#if (CR_BIN_WARPS > 2)
val += ptr[-2]; __syncwarp(actMask);
*ptr = val; __syncwarp(actMask);
#endif
#if (CR_BIN_WARPS > 4)
val += ptr[-4]; __syncwarp(actMask);
*ptr = val; __syncwarp(actMask);
#endif
#if (CR_BIN_WARPS > 8)
val += ptr[-8]; __syncwarp(actMask);
*ptr = val; __syncwarp(actMask);
#endif
#if (CR_BIN_WARPS > 16)
val += ptr[-16]; __syncwarp(actMask);
*ptr = val; __syncwarp(actMask);
#endif
// initially assume that we consume everything
// only last active thread does the writes
if (threadIdx.x == CR_BIN_WARPS - 1)
{
s_batchPos = batchPos + CR_BIN_WARPS * 32;
s_bufCount = bufCount + val;
}
}
__syncthreads();
// skip if no subtriangles
if (num)
{
// calculate write position for first subtriangle
U32 pos = bufCount + myIdx + s_broadcast[threadIdx.y + 16 - 1];
// only write if entire triangle fits
if (pos + num <= CR_ARRAY_SIZE(s_triBuf))
{
pos += bufIndex; // adjust for current start position
pos &= CR_ARRAY_SIZE(s_triBuf)-1;
if (num == 1)
s_triBuf[pos] = triIdx * 8 + 7; // single triangle
else
{
for (int i=0; i < num; i++)
{
s_triBuf[pos] = triIdx * 8 + i;
pos++;
pos &= CR_ARRAY_SIZE(s_triBuf)-1;
}
}
} else if (pos <= CR_ARRAY_SIZE(s_triBuf))
{
// this triangle is the first that failed, overwrite total count and triangle count
s_batchPos = batchPos + thrInBlock;
s_bufCount = pos;
}
}
// update triangle counts
__syncthreads();
batchPos = s_batchPos;
bufCount = s_bufCount;
}
// make every warp clear its output buffers
for (int i=threadIdx.x; i < p.numBins; i += 32)
s_outMask[threadIdx.y][i] = 0;
__syncwarp();
// choose our triangle
uint4 triData = make_uint4(0, 0, 0, 0);
if (thrInBlock < bufCount)
{
U32 triPos = bufIndex + thrInBlock;
triPos &= CR_ARRAY_SIZE(s_triBuf)-1;
// find triangle
int triIdx = s_triBuf[triPos];
int dataIdx = triIdx >> 3;
int subtriIdx = triIdx & 7;
if (subtriIdx != 7)
dataIdx = triHeader[dataIdx].misc + subtriIdx;
// read triangle
triData = *(((const uint4*)triHeader) + dataIdx);
}
// setup bounding box and edge functions, and rasterize
S32 lox, loy, hix, hiy;
bool hasTri = (thrInBlock < bufCount);
U32 hasTriMask = __ballot_sync(~0u, hasTri);
if (hasTri)
{
S32 v0x = add_s16lo_s16lo(triData.x, p.widthPixels * (CR_SUBPIXEL_SIZE >> 1));
S32 v0y = add_s16hi_s16lo(triData.x, p.heightPixels * (CR_SUBPIXEL_SIZE >> 1));
S32 d01x = sub_s16lo_s16lo(triData.y, triData.x);
S32 d01y = sub_s16hi_s16hi(triData.y, triData.x);
S32 d02x = sub_s16lo_s16lo(triData.z, triData.x);
S32 d02y = sub_s16hi_s16hi(triData.z, triData.x);
int binLog = CR_BIN_LOG2 + CR_TILE_LOG2 + CR_SUBPIXEL_LOG2;
lox = add_clamp_0_x((v0x + min_min(d01x, 0, d02x)) >> binLog, 0, p.widthBins - 1);
loy = add_clamp_0_x((v0y + min_min(d01y, 0, d02y)) >> binLog, 0, p.heightBins - 1);
hix = add_clamp_0_x((v0x + max_max(d01x, 0, d02x)) >> binLog, 0, p.widthBins - 1);
hiy = add_clamp_0_x((v0y + max_max(d01y, 0, d02y)) >> binLog, 0, p.heightBins - 1);
U32 bit = 1 << threadIdx.x;
bool multi = (hix != lox || hiy != loy);
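// Coverage is written into s_outMask using one of three strategies:
// every triangle in the warp touches a single bin (merge via match),
// every triangle touches at most a 2x2 block of bins (up to four atomics
// per triangle), or the general case (edge-function sweep over the
// triangle's bin-space bounding box).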
if (!__any_sync(hasTriMask, multi))
{
int binIdx = lox + p.widthBins * loy;
U32 mask = __match_any_sync(hasTriMask, binIdx);
s_outMask[threadIdx.y][binIdx] = mask;
__syncwarp(hasTriMask);
} else
{
bool complex = (hix > lox+1 || hiy > loy+1);
if (!__any_sync(hasTriMask, complex))
{
int binIdx = lox + p.widthBins * loy;
atomicOr((U32*)&s_outMask[threadIdx.y][binIdx], bit);
if (hix > lox) atomicOr((U32*)&s_outMask[threadIdx.y][binIdx + 1], bit);
if (hiy > loy) atomicOr((U32*)&s_outMask[threadIdx.y][binIdx + p.widthBins], bit);
if (hix > lox && hiy > loy) atomicOr((U32*)&s_outMask[threadIdx.y][binIdx + p.widthBins + 1], bit);
} else
{
S32 d12x = d02x - d01x, d12y = d02y - d01y;
v0x -= lox << binLog, v0y -= loy << binLog;
S32 t01 = v0x * d01y - v0y * d01x;
S32 t02 = v0y * d02x - v0x * d02y;
S32 t12 = d01x * d12y - d01y * d12x - t01 - t02;
S32 b01 = add_sub(t01 >> binLog, max(d01x, 0), min(d01y, 0));
S32 b02 = add_sub(t02 >> binLog, max(d02y, 0), min(d02x, 0));
S32 b12 = add_sub(t12 >> binLog, max(d12x, 0), min(d12y, 0));
int width = hix - lox + 1;
d01x += width * d01y;
d02x += width * d02y;
d12x += width * d12y;
U8* currPtr = (U8*)&s_outMask[threadIdx.y][lox + loy * p.widthBins];
U8* skipPtr = (U8*)&s_outMask[threadIdx.y][(hix + 1) + loy * p.widthBins];
U8* endPtr = (U8*)&s_outMask[threadIdx.y][lox + (hiy + 1) * p.widthBins];
int stride = p.widthBins * 4;
int ptrYInc = stride - width * 4;
do
{
if (b01 >= 0 && b02 >= 0 && b12 >= 0)
atomicOr((U32*)currPtr, bit);
currPtr += 4, b01 -= d01y, b02 += d02y, b12 -= d12y;
if (currPtr == skipPtr)
currPtr += ptrYInc, b01 += d01x, b02 -= d02x, b12 += d12x, skipPtr += stride;
}
while (currPtr != endPtr);
}
}
}
// count per-bin contributions
if (thrInBlock == 0)
s_overTotal = 0; // overflow counter
// ensure that out masks are done
__syncthreads();
int overIndex = -1;
bool act = (thrInBlock < p.numBins);
U32 actMask = __ballot_sync(~0u, act);
if (act)
{
U8* srcPtr = (U8*)&s_outMask[0][thrInBlock];
U8* dstPtr = (U8*)&s_outCount[0][thrInBlock];
int total = 0;
for (int i = 0; i < CR_BIN_WARPS; i++)
{
total += __popc(*(U32*)srcPtr);
*(U32*)dstPtr = total;
srcPtr += (CR_MAXBINS_SQR + 1) * 4;
dstPtr += (CR_MAXBINS_SQR + 1) * 4;
}
// overflow => request a new segment
int ofs = s_outOfs[thrInBlock];
bool ovr = (((ofs - 1) >> CR_BIN_SEG_LOG2) != (((ofs - 1) + total) >> CR_BIN_SEG_LOG2));
U32 ovrMask = __ballot_sync(actMask, ovr);
if (ovr)
{
overIndex = __popc(ovrMask & getLaneMaskLt());
if (overIndex == 0)
s_broadcast[threadIdx.y + 16] = atomicAdd((U32*)&s_overTotal, __popc(ovrMask));
__syncwarp(ovrMask);
overIndex += s_broadcast[threadIdx.y + 16];
s_overIndex[thrInBlock] = overIndex;
}
}
// sync after overTotal is ready
__syncthreads();
// at least one segment overflowed => allocate segments
U32 overTotal = s_overTotal;
U32 allocBase = 0;
if (overTotal > 0)
{
// allocate memory
if (thrInBlock == 0)
{
U32 allocBase = atomicAdd(&atomics.numBinSegs, overTotal);
s_allocBase = (allocBase + overTotal <= p.maxBinSegs) ? allocBase : 0;
}
__syncthreads();
allocBase = s_allocBase;
// did my bin overflow?
if (overIndex != -1)
{
// calculate new segment index
int segIdx = allocBase + overIndex;
// add to linked list
if (s_outOfs[thrInBlock] < 0)
binFirstSeg[(thrInBlock << CR_BIN_STREAMS_LOG2) + blockIdx.x] = segIdx;
else
binSegNext[(s_outOfs[thrInBlock] - 1) >> CR_BIN_SEG_LOG2] = segIdx;
// defaults
binSegNext [segIdx] = -1;
binSegCount[segIdx] = CR_BIN_SEG_SIZE;
}
}
// concurrent emission -- each thread handles its own triangle
if (thrInBlock < bufCount)
{
int triPos = (bufIndex + thrInBlock) & (CR_ARRAY_SIZE(s_triBuf) - 1);
int currBin = lox + loy * p.widthBins;
int skipBin = (hix + 1) + loy * p.widthBins;
int endBin = lox + (hiy + 1) * p.widthBins;
int binYInc = p.widthBins - (hix - lox + 1);
// loop over triangle's bins
do
{
U32 outMask = s_outMask[threadIdx.y][currBin];
if (outMask & (1<<threadIdx.x))
{
int idx = __popc(outMask & getLaneMaskLt());
if (threadIdx.y > 0)
idx += s_outCount[threadIdx.y-1][currBin];
int base = s_outOfs[currBin];
int free = (-base) & (CR_BIN_SEG_SIZE - 1);
if (idx >= free)
idx += ((allocBase + s_overIndex[currBin]) << CR_BIN_SEG_LOG2) - free;
else
idx += base;
binSegData[idx] = s_triBuf[triPos];
}
currBin++;
if (currBin == skipBin)
currBin += binYInc, skipBin += p.widthBins;
}
while (currBin != endBin);
}
// wait for all triangles to finish, then replace overflowed segment offsets
__syncthreads();
if (thrInBlock < p.numBins)
{
U32 total = s_outCount[CR_BIN_WARPS - 1][thrInBlock];
U32 oldOfs = s_outOfs[thrInBlock];
if (overIndex == -1)
s_outOfs[thrInBlock] = oldOfs + total;
else
{
int addr = oldOfs + total;
addr = ((addr - 1) & (CR_BIN_SEG_SIZE - 1)) + 1;
addr += (allocBase + overIndex) << CR_BIN_SEG_LOG2;
s_outOfs[thrInBlock] = addr;
}
s_outTotal[thrInBlock] += total;
}
// these triangles are now done
int count = ::min(bufCount, CR_BIN_WARPS * 32);
bufCount -= count;
bufIndex += count;
bufIndex &= CR_ARRAY_SIZE(s_triBuf)-1;
}
while (bufCount > 0 || batchPos < batchEnd);
// flush all bins
if (thrInBlock < p.numBins)
{
int ofs = s_outOfs[thrInBlock];
if (ofs & (CR_BIN_SEG_SIZE-1))
{
int seg = ofs >> CR_BIN_SEG_LOG2;
binSegCount[seg] = ofs & (CR_BIN_SEG_SIZE-1);
s_outOfs[thrInBlock] = (ofs + CR_BIN_SEG_SIZE - 1) & -CR_BIN_SEG_SIZE;
}
}
}
// output totals
if (thrInBlock < p.numBins)
binTotal[(thrInBlock << CR_BIN_STREAMS_LOG2) + blockIdx.x] = s_outTotal[thrInBlock];
}
//------------------------------------------------------------------------
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#include "../../framework.h"
#include "Buffer.hpp"
using namespace CR;
//------------------------------------------------------------------------
Buffer::Buffer(void)
: m_gpuPtr(NULL),
m_bytes (0)
{
// empty
}
Buffer::~Buffer(void)
{
if (m_gpuPtr)
NVDR_CHECK_CUDA_ERROR(cudaFree(m_gpuPtr));
}
//------------------------------------------------------------------------
void Buffer::reset(size_t bytes)
{
if (bytes == m_bytes)
return;
if (m_gpuPtr)
{
NVDR_CHECK_CUDA_ERROR(cudaFree(m_gpuPtr));
m_gpuPtr = NULL;
}
if (bytes > 0)
NVDR_CHECK_CUDA_ERROR(cudaMalloc(&m_gpuPtr, bytes));
m_bytes = bytes;
}
//------------------------------------------------------------------------
void Buffer::grow(size_t bytes)
{
if (bytes > m_bytes)
reset(bytes);
}
//------------------------------------------------------------------------
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#pragma once
#include "Defs.hpp"
namespace CR
{
//------------------------------------------------------------------------
class Buffer
{
public:
Buffer (void);
~Buffer (void);
void reset (size_t bytes);
void grow (size_t bytes);
void* getPtr (void) { return m_gpuPtr; }
size_t getSize (void) const { return m_bytes; }
void setPtr (void* ptr) { m_gpuPtr = ptr; }
private:
void* m_gpuPtr;
size_t m_bytes;
};
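// Typical usage (see RasterImpl): call grow() with the size needed for the
// current frame; the buffer reallocates only when it has to expand, and
// getPtr() returns the raw device pointer that is handed to the kernels.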
//------------------------------------------------------------------------
}
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
//------------------------------------------------------------------------
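// Convert a tile's (x, y) position within a bin into an offset in global
// tile coordinates; callers add the bin's base tile index.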
__device__ __inline__ int globalTileIdx(int tileInBin, int widthTiles)
{
int tileX = tileInBin & (CR_BIN_SIZE - 1);
int tileY = tileInBin >> CR_BIN_LOG2;
return tileX + tileY * widthTiles;
}
//------------------------------------------------------------------------
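// Coarse rasterizer: persistent blocks each claim one bin at a time
// (heaviest bins first), merge the bin's CR_BIN_STREAMS_SIZE input streams
// back into submission order, rasterize the triangles against the bin's
// tiles, and emit per-tile triangle segment lists plus the list of active
// tiles consumed by the fine rasterizer.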
__device__ __inline__ void coarseRasterImpl(const CRParams p)
{
// Common.
__shared__ volatile U32 s_workCounter;
__shared__ volatile U32 s_scanTemp [CR_COARSE_WARPS][48]; // 3KB
// Input.
__shared__ volatile U32 s_binOrder [CR_MAXBINS_SQR]; // 1KB
__shared__ volatile S32 s_binStreamCurrSeg [CR_BIN_STREAMS_SIZE]; // 0KB
__shared__ volatile S32 s_binStreamFirstTri [CR_BIN_STREAMS_SIZE]; // 0KB
__shared__ volatile S32 s_triQueue [CR_COARSE_QUEUE_SIZE]; // 4KB
__shared__ volatile S32 s_triQueueWritePos;
__shared__ volatile U32 s_binStreamSelectedOfs;
__shared__ volatile U32 s_binStreamSelectedSize;
// Output.
__shared__ volatile U32 s_warpEmitMask [CR_COARSE_WARPS][CR_BIN_SQR + 1]; // 16KB, +1 to avoid bank collisions
__shared__ volatile U32 s_warpEmitPrefixSum [CR_COARSE_WARPS][CR_BIN_SQR + 1]; // 16KB, +1 to avoid bank collisions
__shared__ volatile U32 s_tileEmitPrefixSum [CR_BIN_SQR + 1]; // 1KB, zero at the beginning
__shared__ volatile U32 s_tileAllocPrefixSum[CR_BIN_SQR + 1]; // 1KB, zero at the beginning
__shared__ volatile S32 s_tileStreamCurrOfs [CR_BIN_SQR]; // 1KB
__shared__ volatile U32 s_firstAllocSeg;
__shared__ volatile U32 s_firstActiveIdx;
// Pointers and constants.
CRAtomics& atomics = p.atomics[blockIdx.z];
const CRTriangleHeader* triHeader = (const CRTriangleHeader*)p.triHeader + p.maxSubtris * blockIdx.z;
const S32* binFirstSeg = (const S32*)p.binFirstSeg + CR_MAXBINS_SQR * CR_BIN_STREAMS_SIZE * blockIdx.z;
const S32* binTotal = (const S32*)p.binTotal + CR_MAXBINS_SQR * CR_BIN_STREAMS_SIZE * blockIdx.z;
const S32* binSegData = (const S32*)p.binSegData + p.maxBinSegs * CR_BIN_SEG_SIZE * blockIdx.z;
const S32* binSegNext = (const S32*)p.binSegNext + p.maxBinSegs * blockIdx.z;
const S32* binSegCount = (const S32*)p.binSegCount + p.maxBinSegs * blockIdx.z;
S32* activeTiles = (S32*)p.activeTiles + CR_MAXTILES_SQR * blockIdx.z;
S32* tileFirstSeg = (S32*)p.tileFirstSeg + CR_MAXTILES_SQR * blockIdx.z;
S32* tileSegData = (S32*)p.tileSegData + p.maxTileSegs * CR_TILE_SEG_SIZE * blockIdx.z;
S32* tileSegNext = (S32*)p.tileSegNext + p.maxTileSegs * blockIdx.z;
S32* tileSegCount = (S32*)p.tileSegCount + p.maxTileSegs * blockIdx.z;
int tileLog = CR_TILE_LOG2 + CR_SUBPIXEL_LOG2;
int thrInBlock = threadIdx.x + threadIdx.y * 32;
int emitShift = CR_BIN_LOG2 * 2 + 5; // We scan ((numEmits << emitShift) | numAllocs) over tiles.
if (atomics.numSubtris > p.maxSubtris || atomics.numBinSegs > p.maxBinSegs)
return;
// Initialize sharedmem arrays.
if (thrInBlock == 0)
{
s_tileEmitPrefixSum[0] = 0;
s_tileAllocPrefixSum[0] = 0;
}
s_scanTemp[threadIdx.y][threadIdx.x] = 0;
// Sort bins in descending order of triangle count.
for (int binIdx = thrInBlock; binIdx < p.numBins; binIdx += CR_COARSE_WARPS * 32)
{
int count = 0;
for (int i = 0; i < CR_BIN_STREAMS_SIZE; i++)
count += binTotal[(binIdx << CR_BIN_STREAMS_LOG2) + i];
s_binOrder[binIdx] = (~count << (CR_MAXBINS_LOG2 * 2)) | binIdx;
}
__syncthreads();
sortShared(s_binOrder, p.numBins);
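// Processing the heaviest bins first improves load balance across the
// persistently running blocks.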
// Process each bin by one block.
for (;;)
{
// Pick a bin for the block.
if (thrInBlock == 0)
s_workCounter = atomicAdd(&atomics.coarseCounter, 1);
__syncthreads();
int workCounter = s_workCounter;
if (workCounter >= p.numBins)
break;
U32 binOrder = s_binOrder[workCounter];
bool binEmpty = ((~binOrder >> (CR_MAXBINS_LOG2 * 2)) == 0);
if (binEmpty && !p.deferredClear)
break;
int binIdx = binOrder & (CR_MAXBINS_SQR - 1);
// Initialize input/output streams.
int triQueueWritePos = 0;
int triQueueReadPos = 0;
if (thrInBlock < CR_BIN_STREAMS_SIZE)
{
int segIdx = binFirstSeg[(binIdx << CR_BIN_STREAMS_LOG2) + thrInBlock];
s_binStreamCurrSeg[thrInBlock] = segIdx;
s_binStreamFirstTri[thrInBlock] = (segIdx == -1) ? ~0u : binSegData[segIdx << CR_BIN_SEG_LOG2];
}
for (int tileInBin = CR_COARSE_WARPS * 32 - 1 - thrInBlock; tileInBin < CR_BIN_SQR; tileInBin += CR_COARSE_WARPS * 32)
s_tileStreamCurrOfs[tileInBin] = -CR_TILE_SEG_SIZE;
// Initialize per-bin state.
int binY = idiv_fast(binIdx, p.widthBins);
int binX = binIdx - binY * p.widthBins;
int originX = (binX << (CR_BIN_LOG2 + tileLog)) - (p.widthPixels << (CR_SUBPIXEL_LOG2 - 1));
int originY = (binY << (CR_BIN_LOG2 + tileLog)) - (p.heightPixels << (CR_SUBPIXEL_LOG2 - 1));
int maxTileXInBin = ::min(p.widthTiles - (binX << CR_BIN_LOG2), CR_BIN_SIZE) - 1;
int maxTileYInBin = ::min(p.heightTiles - (binY << CR_BIN_LOG2), CR_BIN_SIZE) - 1;
int binTileIdx = (binX + binY * p.widthTiles) << CR_BIN_LOG2;
// Entire block: Merge input streams and process triangles.
if (!binEmpty)
do
{
//------------------------------------------------------------------------
// Merge.
//------------------------------------------------------------------------
// Entire block: Not enough triangles => merge and queue segments.
// NOTE: The bin exit criterion assumes that we queue more triangles than we actually need.
while (triQueueWritePos - triQueueReadPos <= CR_COARSE_WARPS * 32)
{
// First warp: Choose the segment with the lowest initial triangle index.
bool hasStream = (thrInBlock < CR_BIN_STREAMS_SIZE);
U32 hasStreamMask = __ballot_sync(~0u, hasStream);
if (hasStream)
{
// Find the stream with the lowest triangle index.
U32 firstTri = s_binStreamFirstTri[thrInBlock];
U32 t = firstTri;
volatile U32* v = &s_scanTemp[0][thrInBlock + 16];
#if (CR_BIN_STREAMS_SIZE > 1)
v[0] = t; __syncwarp(hasStreamMask); t = ::min(t, v[-1]); __syncwarp(hasStreamMask);
#endif
#if (CR_BIN_STREAMS_SIZE > 2)
v[0] = t; __syncwarp(hasStreamMask); t = ::min(t, v[-2]); __syncwarp(hasStreamMask);
#endif
#if (CR_BIN_STREAMS_SIZE > 4)
v[0] = t; __syncwarp(hasStreamMask); t = ::min(t, v[-4]); __syncwarp(hasStreamMask);
#endif
#if (CR_BIN_STREAMS_SIZE > 8)
v[0] = t; __syncwarp(hasStreamMask); t = ::min(t, v[-8]); __syncwarp(hasStreamMask);
#endif
#if (CR_BIN_STREAMS_SIZE > 16)
v[0] = t; __syncwarp(hasStreamMask); t = ::min(t, v[-16]); __syncwarp(hasStreamMask);
#endif
v[0] = t; __syncwarp(hasStreamMask);
// Consume and broadcast.
bool first = (s_scanTemp[0][CR_BIN_STREAMS_SIZE - 1 + 16] == firstTri);
U32 firstMask = __ballot_sync(hasStreamMask, first);
if (first && (firstMask >> threadIdx.x) == 1u)
{
int segIdx = s_binStreamCurrSeg[thrInBlock];
s_binStreamSelectedOfs = segIdx << CR_BIN_SEG_LOG2;
if (segIdx != -1)
{
int segSize = binSegCount[segIdx];
int segNext = binSegNext[segIdx];
s_binStreamSelectedSize = segSize;
s_triQueueWritePos = triQueueWritePos + segSize;
s_binStreamCurrSeg[thrInBlock] = segNext;
s_binStreamFirstTri[thrInBlock] = (segNext == -1) ? ~0u : binSegData[segNext << CR_BIN_SEG_LOG2];
}
}
}
// No more segments => break.
__syncthreads();
triQueueWritePos = s_triQueueWritePos;
int segOfs = s_binStreamSelectedOfs;
if (segOfs < 0)
break;
int segSize = s_binStreamSelectedSize;
__syncthreads();
// Fetch triangles into the queue.
for (int idxInSeg = CR_COARSE_WARPS * 32 - 1 - thrInBlock; idxInSeg < segSize; idxInSeg += CR_COARSE_WARPS * 32)
{
S32 triIdx = binSegData[segOfs + idxInSeg];
s_triQueue[(triQueueWritePos - segSize + idxInSeg) & (CR_COARSE_QUEUE_SIZE - 1)] = triIdx;
}
}
// All threads: Clear emit masks.
for (int maskIdx = thrInBlock; maskIdx < CR_COARSE_WARPS * CR_BIN_SQR; maskIdx += CR_COARSE_WARPS * 32)
s_warpEmitMask[maskIdx >> (CR_BIN_LOG2 * 2)][maskIdx & (CR_BIN_SQR - 1)] = 0;
__syncthreads();
//------------------------------------------------------------------------
// Raster.
//------------------------------------------------------------------------
// Triangle per thread: Read from the queue.
int triIdx = -1;
if (triQueueReadPos + thrInBlock < triQueueWritePos)
triIdx = s_triQueue[(triQueueReadPos + thrInBlock) & (CR_COARSE_QUEUE_SIZE - 1)];
uint4 triData = make_uint4(0, 0, 0, 0);
if (triIdx != -1)
{
int dataIdx = triIdx >> 3;
int subtriIdx = triIdx & 7;
if (subtriIdx != 7)
dataIdx = triHeader[dataIdx].misc + subtriIdx;
triData = *((uint4*)triHeader + dataIdx);
}
// 32 triangles per warp: Record emits (= tile intersections).
if (__any_sync(~0u, triIdx != -1))
{
S32 v0x = sub_s16lo_s16lo(triData.x, originX);
S32 v0y = sub_s16hi_s16lo(triData.x, originY);
S32 d01x = sub_s16lo_s16lo(triData.y, triData.x);
S32 d01y = sub_s16hi_s16hi(triData.y, triData.x);
S32 d02x = sub_s16lo_s16lo(triData.z, triData.x);
S32 d02y = sub_s16hi_s16hi(triData.z, triData.x);
// Compute tile-based AABB.
int lox = add_clamp_0_x((v0x + min_min(d01x, 0, d02x)) >> tileLog, 0, maxTileXInBin);
int loy = add_clamp_0_x((v0y + min_min(d01y, 0, d02y)) >> tileLog, 0, maxTileYInBin);
int hix = add_clamp_0_x((v0x + max_max(d01x, 0, d02x)) >> tileLog, 0, maxTileXInBin);
int hiy = add_clamp_0_x((v0y + max_max(d01y, 0, d02y)) >> tileLog, 0, maxTileYInBin);
int sizex = add_sub(hix, 1, lox);
int sizey = add_sub(hiy, 1, loy);
int area = sizex * sizey;
// Miscellaneous init.
U8* currPtr = (U8*)&s_warpEmitMask[threadIdx.y][lox + (loy << CR_BIN_LOG2)];
int ptrYInc = CR_BIN_SIZE * 4 - (sizex << 2);
U32 maskBit = 1 << threadIdx.x;
// Case A: All AABBs are small => record the full AABB using atomics.
if (__all_sync(~0u, sizex <= 2 && sizey <= 2))
{
if (triIdx != -1)
{
atomicOr((U32*)currPtr, maskBit);
if (sizex == 2) atomicOr((U32*)(currPtr + 4), maskBit);
if (sizey == 2) atomicOr((U32*)(currPtr + CR_BIN_SIZE * 4), maskBit);
if (sizex == 2 && sizey == 2) atomicOr((U32*)(currPtr + 4 + CR_BIN_SIZE * 4), maskBit);
}
}
else
{
// Compute warp-AABB (scan-32).
U32 aabbMask = add_sub(2 << hix, 0x20000 << hiy, 1 << lox) - (0x10000 << loy);
if (triIdx == -1)
aabbMask = 0;
volatile U32* v = &s_scanTemp[threadIdx.y][threadIdx.x + 16];
v[0] = aabbMask; __syncwarp(); aabbMask |= v[-1]; __syncwarp();
v[0] = aabbMask; __syncwarp(); aabbMask |= v[-2]; __syncwarp();
v[0] = aabbMask; __syncwarp(); aabbMask |= v[-4]; __syncwarp();
v[0] = aabbMask; __syncwarp(); aabbMask |= v[-8]; __syncwarp();
v[0] = aabbMask; __syncwarp(); aabbMask |= v[-16]; __syncwarp();
v[0] = aabbMask; __syncwarp(); aabbMask = s_scanTemp[threadIdx.y][47];
U32 maskX = aabbMask & 0xFFFF;
U32 maskY = aabbMask >> 16;
int wlox = findLeadingOne(maskX ^ (maskX - 1));
int wloy = findLeadingOne(maskY ^ (maskY - 1));
int whix = findLeadingOne(maskX);
int whiy = findLeadingOne(maskY);
int warea = (add_sub(whix, 1, wlox)) * (add_sub(whiy, 1, wloy));
// Initialize edge functions.
S32 d12x = d02x - d01x;
S32 d12y = d02y - d01y;
v0x -= lox << tileLog;
v0y -= loy << tileLog;
S32 t01 = v0x * d01y - v0y * d01x;
S32 t02 = v0y * d02x - v0x * d02y;
S32 t12 = d01x * d12y - d01y * d12x - t01 - t02;
S32 b01 = add_sub(t01 >> tileLog, ::max(d01x, 0), ::min(d01y, 0));
S32 b02 = add_sub(t02 >> tileLog, ::max(d02y, 0), ::min(d02x, 0));
S32 b12 = add_sub(t12 >> tileLog, ::max(d12x, 0), ::min(d12y, 0));
d01x += sizex * d01y;
d02x += sizex * d02y;
d12x += sizex * d12y;
// Case B: Warp-AABB is not much larger than largest AABB => Check tiles in warp-AABB, record using ballots.
if (__any_sync(~0u, warea * 4 <= area * 8))
{
// Not sure if this is any faster than Case C after all the post-Volta ballot mask tracking.
bool act = (triIdx != -1);
U32 actMask = __ballot_sync(~0u, act);
if (act)
{
for (int y = wloy; y <= whiy; y++)
{
bool yIn = (y >= loy && y <= hiy);
U32 yMask = __ballot_sync(actMask, yIn);
if (yIn)
{
for (int x = wlox; x <= whix; x++)
{
bool xyIn = (x >= lox && x <= hix);
U32 xyMask = __ballot_sync(yMask, xyIn);
if (xyIn)
{
U32 res = __ballot_sync(xyMask, b01 >= 0 && b02 >= 0 && b12 >= 0);
if (threadIdx.x == 31 - __clz(xyMask))
*(U32*)currPtr = res;
currPtr += 4, b01 -= d01y, b02 += d02y, b12 -= d12y;
}
}
currPtr += ptrYInc, b01 += d01x, b02 -= d02x, b12 += d12x;
}
}
}
}
// Case C: General case => Check tiles in AABB, record using atomics.
else
{
if (triIdx != -1)
{
U8* skipPtr = currPtr + (sizex << 2);
U8* endPtr = currPtr + (sizey << (CR_BIN_LOG2 + 2));
do
{
if (b01 >= 0 && b02 >= 0 && b12 >= 0)
atomicOr((U32*)currPtr, maskBit);
currPtr += 4, b01 -= d01y, b02 += d02y, b12 -= d12y;
if (currPtr == skipPtr)
currPtr += ptrYInc, b01 += d01x, b02 -= d02x, b12 += d12x, skipPtr += CR_BIN_SIZE * 4;
}
while (currPtr != endPtr);
}
}
}
}
__syncthreads();
//------------------------------------------------------------------------
// Count.
//------------------------------------------------------------------------
// Tile per thread: Initialize prefix sums.
for (int tileInBin_base = 0; tileInBin_base < CR_BIN_SQR; tileInBin_base += CR_COARSE_WARPS * 32)
{
int tileInBin = tileInBin_base + thrInBlock;
bool act = (tileInBin < CR_BIN_SQR);
U32 actMask = __ballot_sync(~0u, act);
if (act)
{
// Compute prefix sum of emits over warps.
U8* srcPtr = (U8*)&s_warpEmitMask[0][tileInBin];
U8* dstPtr = (U8*)&s_warpEmitPrefixSum[0][tileInBin];
int tileEmits = 0;
for (int i = 0; i < CR_COARSE_WARPS; i++)
{
tileEmits += __popc(*(U32*)srcPtr);
*(U32*)dstPtr = tileEmits;
srcPtr += (CR_BIN_SQR + 1) * 4;
dstPtr += (CR_BIN_SQR + 1) * 4;
}
// Determine the number of segments to allocate.
int spaceLeft = -s_tileStreamCurrOfs[tileInBin] & (CR_TILE_SEG_SIZE - 1);
int tileAllocs = (tileEmits - spaceLeft + CR_TILE_SEG_SIZE - 1) >> CR_TILE_SEG_LOG2;
volatile U32* v = &s_tileEmitPrefixSum[tileInBin + 1];
// All counters within the warp are small => compute prefix sum using ballot.
if (!__any_sync(actMask, tileEmits >= 2))
{
U32 m = getLaneMaskLe();
*v = (__popc(__ballot_sync(actMask, tileEmits & 1) & m) << emitShift) | __popc(__ballot_sync(actMask, tileAllocs & 1) & m);
}
// Otherwise => scan-32 within the warp.
else
{
U32 sum = (tileEmits << emitShift) | tileAllocs;
*v = sum; __syncwarp(actMask); if (threadIdx.x >= 1) sum += v[-1]; __syncwarp(actMask);
*v = sum; __syncwarp(actMask); if (threadIdx.x >= 2) sum += v[-2]; __syncwarp(actMask);
*v = sum; __syncwarp(actMask); if (threadIdx.x >= 4) sum += v[-4]; __syncwarp(actMask);
*v = sum; __syncwarp(actMask); if (threadIdx.x >= 8) sum += v[-8]; __syncwarp(actMask);
*v = sum; __syncwarp(actMask); if (threadIdx.x >= 16) sum += v[-16]; __syncwarp(actMask);
*v = sum; __syncwarp(actMask);
}
}
}
// First warp: Scan-8.
__syncthreads();
bool scan8 = (thrInBlock < CR_BIN_SQR / 32);
U32 scan8Mask = __ballot_sync(~0u, scan8);
if (scan8)
{
int sum = s_tileEmitPrefixSum[(thrInBlock << 5) + 32];
volatile U32* v = &s_scanTemp[0][thrInBlock + 16];
v[0] = sum; __syncwarp(scan8Mask);
#if (CR_BIN_SQR > 1 * 32)
sum += v[-1]; __syncwarp(scan8Mask); v[0] = sum; __syncwarp(scan8Mask);
#endif
#if (CR_BIN_SQR > 2 * 32)
sum += v[-2]; __syncwarp(scan8Mask); v[0] = sum; __syncwarp(scan8Mask);
#endif
#if (CR_BIN_SQR > 4 * 32)
sum += v[-4]; __syncwarp(scan8Mask); v[0] = sum; __syncwarp(scan8Mask);
#endif
}
__syncthreads();
// Tile per thread: Finalize prefix sums.
// Single thread: Allocate segments.
for (int tileInBin = thrInBlock; tileInBin < CR_BIN_SQR; tileInBin += CR_COARSE_WARPS * 32)
{
int sum = s_tileEmitPrefixSum[tileInBin + 1] + s_scanTemp[0][(tileInBin >> 5) + 15];
int numEmits = sum >> emitShift;
int numAllocs = sum & ((1 << emitShift) - 1);
s_tileEmitPrefixSum[tileInBin + 1] = numEmits;
s_tileAllocPrefixSum[tileInBin + 1] = numAllocs;
if (tileInBin == CR_BIN_SQR - 1 && numAllocs != 0)
{
int t = atomicAdd(&atomics.numTileSegs, numAllocs);
s_firstAllocSeg = (t + numAllocs <= p.maxTileSegs) ? t : 0;
}
}
__syncthreads();
int firstAllocSeg = s_firstAllocSeg;
int totalEmits = s_tileEmitPrefixSum[CR_BIN_SQR];
int totalAllocs = s_tileAllocPrefixSum[CR_BIN_SQR];
//------------------------------------------------------------------------
// Emit.
//------------------------------------------------------------------------
// Emit per thread: Write triangle index to globalmem.
for (int emitInBin = thrInBlock; emitInBin < totalEmits; emitInBin += CR_COARSE_WARPS * 32)
{
// Find tile in bin.
U8* tileBase = (U8*)&s_tileEmitPrefixSum[0];
U8* tilePtr = tileBase;
U8* ptr;
#if (CR_BIN_SQR > 128)
ptr = tilePtr + 0x80 * 4; if (emitInBin >= *(U32*)ptr) tilePtr = ptr;
#endif
#if (CR_BIN_SQR > 64)
ptr = tilePtr + 0x40 * 4; if (emitInBin >= *(U32*)ptr) tilePtr = ptr;
#endif
#if (CR_BIN_SQR > 32)
ptr = tilePtr + 0x20 * 4; if (emitInBin >= *(U32*)ptr) tilePtr = ptr;
#endif
#if (CR_BIN_SQR > 16)
ptr = tilePtr + 0x10 * 4; if (emitInBin >= *(U32*)ptr) tilePtr = ptr;
#endif
#if (CR_BIN_SQR > 8)
ptr = tilePtr + 0x08 * 4; if (emitInBin >= *(U32*)ptr) tilePtr = ptr;
#endif
#if (CR_BIN_SQR > 4)
ptr = tilePtr + 0x04 * 4; if (emitInBin >= *(U32*)ptr) tilePtr = ptr;
#endif
#if (CR_BIN_SQR > 2)
ptr = tilePtr + 0x02 * 4; if (emitInBin >= *(U32*)ptr) tilePtr = ptr;
#endif
#if (CR_BIN_SQR > 1)
ptr = tilePtr + 0x01 * 4; if (emitInBin >= *(U32*)ptr) tilePtr = ptr;
#endif
int tileInBin = (tilePtr - tileBase) >> 2;
int emitInTile = emitInBin - *(U32*)tilePtr;
// Find warp in tile.
int warpStep = (CR_BIN_SQR + 1) * 4;
U8* warpBase = (U8*)&s_warpEmitPrefixSum[0][tileInBin] - warpStep;
U8* warpPtr = warpBase;
#if (CR_COARSE_WARPS > 8)
ptr = warpPtr + 0x08 * warpStep; if (emitInTile >= *(U32*)ptr) warpPtr = ptr;
#endif
#if (CR_COARSE_WARPS > 4)
ptr = warpPtr + 0x04 * warpStep; if (emitInTile >= *(U32*)ptr) warpPtr = ptr;
#endif
#if (CR_COARSE_WARPS > 2)
ptr = warpPtr + 0x02 * warpStep; if (emitInTile >= *(U32*)ptr) warpPtr = ptr;
#endif
#if (CR_COARSE_WARPS > 1)
ptr = warpPtr + 0x01 * warpStep; if (emitInTile >= *(U32*)ptr) warpPtr = ptr;
#endif
int warpInTile = (warpPtr - warpBase) >> (CR_BIN_LOG2 * 2 + 2);
U32 emitMask = *(U32*)(warpPtr + warpStep + ((U8*)s_warpEmitMask - (U8*)s_warpEmitPrefixSum));
int emitInWarp = emitInTile - *(U32*)(warpPtr + warpStep) + __popc(emitMask);
// Find thread in warp.
int threadInWarp = 0;
int pop = __popc(emitMask & 0xFFFF);
bool pred = (emitInWarp >= pop);
if (pred) emitInWarp -= pop;
if (pred) emitMask >>= 0x10;
if (pred) threadInWarp += 0x10;
pop = __popc(emitMask & 0xFF);
pred = (emitInWarp >= pop);
if (pred) emitInWarp -= pop;
if (pred) emitMask >>= 0x08;
if (pred) threadInWarp += 0x08;
pop = __popc(emitMask & 0xF);
pred = (emitInWarp >= pop);
if (pred) emitInWarp -= pop;
if (pred) emitMask >>= 0x04;
if (pred) threadInWarp += 0x04;
pop = __popc(emitMask & 0x3);
pred = (emitInWarp >= pop);
if (pred) emitInWarp -= pop;
if (pred) emitMask >>= 0x02;
if (pred) threadInWarp += 0x02;
if (emitInWarp >= (emitMask & 1))
threadInWarp++;
// Figure out where to write.
int currOfs = s_tileStreamCurrOfs[tileInBin];
int spaceLeft = -currOfs & (CR_TILE_SEG_SIZE - 1);
int outOfs = emitInTile;
if (outOfs < spaceLeft)
outOfs += currOfs;
else
{
int allocLo = firstAllocSeg + s_tileAllocPrefixSum[tileInBin];
outOfs += (allocLo << CR_TILE_SEG_LOG2) - spaceLeft;
}
// Write.
int queueIdx = warpInTile * 32 + threadInWarp;
int triIdx = s_triQueue[(triQueueReadPos + queueIdx) & (CR_COARSE_QUEUE_SIZE - 1)];
tileSegData[outOfs] = triIdx;
}
//------------------------------------------------------------------------
// Patch.
//------------------------------------------------------------------------
// Allocated segment per thread: Initialize next-pointer and count.
for (int i = CR_COARSE_WARPS * 32 - 1 - thrInBlock; i < totalAllocs; i += CR_COARSE_WARPS * 32)
{
int segIdx = firstAllocSeg + i;
tileSegNext[segIdx] = segIdx + 1;
tileSegCount[segIdx] = CR_TILE_SEG_SIZE;
}
// Tile per thread: Fix previous segment's next-pointer and update s_tileStreamCurrOfs.
__syncthreads();
for (int tileInBin = CR_COARSE_WARPS * 32 - 1 - thrInBlock; tileInBin < CR_BIN_SQR; tileInBin += CR_COARSE_WARPS * 32)
{
int oldOfs = s_tileStreamCurrOfs[tileInBin];
int newOfs = oldOfs + s_warpEmitPrefixSum[CR_COARSE_WARPS - 1][tileInBin];
int allocLo = s_tileAllocPrefixSum[tileInBin];
int allocHi = s_tileAllocPrefixSum[tileInBin + 1];
if (allocLo != allocHi)
{
S32* nextPtr = &tileSegNext[(oldOfs - 1) >> CR_TILE_SEG_LOG2];
if (oldOfs < 0)
nextPtr = &tileFirstSeg[binTileIdx + globalTileIdx(tileInBin, p.widthTiles)];
*nextPtr = firstAllocSeg + allocLo;
newOfs--;
newOfs &= CR_TILE_SEG_SIZE - 1;
newOfs |= (firstAllocSeg + allocHi - 1) << CR_TILE_SEG_LOG2;
newOfs++;
}
s_tileStreamCurrOfs[tileInBin] = newOfs;
}
// Advance queue read pointer.
// Queue became empty => bin done.
triQueueReadPos += CR_COARSE_WARPS * 32;
}
while (triQueueReadPos < triQueueWritePos);
// Tile per thread: Fix next-pointer and count of the last segment.
// 32 tiles per warp: Count active tiles.
__syncthreads();
for (int tileInBin_base = 0; tileInBin_base < CR_BIN_SQR; tileInBin_base += CR_COARSE_WARPS * 32)
{
int tileInBin = tileInBin_base + thrInBlock;
bool act = (tileInBin < CR_BIN_SQR);
U32 actMask = __ballot_sync(~0u, act);
if (act)
{
int tileX = tileInBin & (CR_BIN_SIZE - 1);
int tileY = tileInBin >> CR_BIN_LOG2;
bool force = (p.deferredClear & (tileX <= maxTileXInBin) & (tileY <= maxTileYInBin));
int ofs = s_tileStreamCurrOfs[tileInBin];
int segIdx = (ofs - 1) >> CR_TILE_SEG_LOG2;
int segCount = ofs & (CR_TILE_SEG_SIZE - 1);
if (ofs >= 0)
tileSegNext[segIdx] = -1;
else if (force)
{
s_tileStreamCurrOfs[tileInBin] = 0;
tileFirstSeg[binTileIdx + tileX + tileY * p.widthTiles] = -1;
}
if (segCount != 0)
tileSegCount[segIdx] = segCount;
U32 res = __ballot_sync(actMask, (ofs >= 0) | force);
if (threadIdx.x == 0)
s_scanTemp[0][(tileInBin >> 5) + 16] = __popc(res);
}
}
// First warp: Scan-8.
// One thread: Allocate space for active tiles.
__syncthreads();
bool scan8 = (thrInBlock < CR_BIN_SQR / 32);
U32 scan8Mask = __ballot_sync(~0u, scan8);
if (scan8)
{
volatile U32* v = &s_scanTemp[0][thrInBlock + 16];
U32 sum = v[0];
#if (CR_BIN_SQR > 1 * 32)
sum += v[-1]; __syncwarp(scan8Mask); v[0] = sum; __syncwarp(scan8Mask);
#endif
#if (CR_BIN_SQR > 2 * 32)
sum += v[-2]; __syncwarp(scan8Mask); v[0] = sum; __syncwarp(scan8Mask);
#endif
#if (CR_BIN_SQR > 4 * 32)
sum += v[-4]; __syncwarp(scan8Mask); v[0] = sum; __syncwarp(scan8Mask);
#endif
if (thrInBlock == CR_BIN_SQR / 32 - 1)
s_firstActiveIdx = atomicAdd(&atomics.numActiveTiles, sum);
}
// Tile per thread: Output active tiles.
__syncthreads();
for (int tileInBin_base = 0; tileInBin_base < CR_BIN_SQR; tileInBin_base += CR_COARSE_WARPS * 32)
{
int tileInBin = tileInBin_base + thrInBlock;
bool act = (tileInBin < CR_BIN_SQR) && (s_tileStreamCurrOfs[tileInBin] >= 0);
U32 actMask = __ballot_sync(~0u, act);
if (act)
{
int activeIdx = s_firstActiveIdx;
activeIdx += s_scanTemp[0][(tileInBin >> 5) + 15];
activeIdx += __popc(actMask & getLaneMaskLt());
activeTiles[activeIdx] = binTileIdx + globalTileIdx(tileInBin, p.widthTiles);
}
}
}
}
//------------------------------------------------------------------------
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#pragma once
//------------------------------------------------------------------------
#define CR_MAXVIEWPORT_LOG2 11 // ViewportSize / PixelSize.
#define CR_SUBPIXEL_LOG2 4 // PixelSize / SubpixelSize.
#define CR_MAXBINS_LOG2 4 // ViewportSize / BinSize.
#define CR_BIN_LOG2 4 // BinSize / TileSize.
#define CR_TILE_LOG2 3 // TileSize / PixelSize.
#define CR_COVER8X8_LUT_SIZE 768 // 64-bit entries.
#define CR_FLIPBIT_FLIP_Y 2
#define CR_FLIPBIT_FLIP_X 3
#define CR_FLIPBIT_SWAP_XY 4
#define CR_FLIPBIT_COMPL 5
#define CR_BIN_STREAMS_LOG2 4
#define CR_BIN_SEG_LOG2 9 // 32-bit entries.
#define CR_TILE_SEG_LOG2 5 // 32-bit entries.
#define CR_MAXSUBTRIS_LOG2 24 // Triangle structs. Dictated by CoarseRaster.
#define CR_COARSE_QUEUE_LOG2 10 // Triangles.
#define CR_SETUP_WARPS 2
#define CR_SETUP_OPT_BLOCKS 8
#define CR_BIN_WARPS 16
#define CR_COARSE_WARPS 16 // Must be a power of two.
#define CR_FINE_MAX_WARPS 20
#define CR_EMBED_IMAGE_PARAMS 32 // Number of per-image parameter structs embedded in kernel launch parameter block.
//------------------------------------------------------------------------
#define CR_MAXVIEWPORT_SIZE (1 << CR_MAXVIEWPORT_LOG2)
#define CR_SUBPIXEL_SIZE (1 << CR_SUBPIXEL_LOG2)
#define CR_SUBPIXEL_SQR (1 << (CR_SUBPIXEL_LOG2 * 2))
#define CR_MAXBINS_SIZE (1 << CR_MAXBINS_LOG2)
#define CR_MAXBINS_SQR (1 << (CR_MAXBINS_LOG2 * 2))
#define CR_BIN_SIZE (1 << CR_BIN_LOG2)
#define CR_BIN_SQR (1 << (CR_BIN_LOG2 * 2))
#define CR_MAXTILES_LOG2 (CR_MAXBINS_LOG2 + CR_BIN_LOG2)
#define CR_MAXTILES_SIZE (1 << CR_MAXTILES_LOG2)
#define CR_MAXTILES_SQR (1 << (CR_MAXTILES_LOG2 * 2))
#define CR_TILE_SIZE (1 << CR_TILE_LOG2)
#define CR_TILE_SQR (1 << (CR_TILE_LOG2 * 2))
#define CR_BIN_STREAMS_SIZE (1 << CR_BIN_STREAMS_LOG2)
#define CR_BIN_SEG_SIZE (1 << CR_BIN_SEG_LOG2)
#define CR_TILE_SEG_SIZE (1 << CR_TILE_SEG_LOG2)
#define CR_MAXSUBTRIS_SIZE (1 << CR_MAXSUBTRIS_LOG2)
#define CR_COARSE_QUEUE_SIZE (1 << CR_COARSE_QUEUE_LOG2)
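// Worked example of the derived quantities above (with the default LOG2
// values): CR_TILE_SIZE = 8 pixels and CR_BIN_SIZE = 16 tiles, so one bin
// covers 128x128 pixels; CR_MAXVIEWPORT_SIZE = 2048 pixels; a bin segment
// holds CR_BIN_SEG_SIZE = 512 triangle indices and a tile segment holds
// CR_TILE_SEG_SIZE = 32.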
//------------------------------------------------------------------------
// When evaluating interpolated Z at pixel centers, we may introduce an error
// of (+-CR_LERP_ERROR) ULPs.
#define CR_LERP_ERROR(SAMPLES_LOG2) (2200u << (SAMPLES_LOG2))
#define CR_DEPTH_MIN CR_LERP_ERROR(3)
#define CR_DEPTH_MAX (CR_U32_MAX - CR_LERP_ERROR(3))
//------------------------------------------------------------------------
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#include "Defs.hpp"
#include "../CudaRaster.hpp"
#include "RasterImpl.hpp"
using namespace CR;
//------------------------------------------------------------------------
// Stub interface implementation.
//------------------------------------------------------------------------
CudaRaster::CudaRaster()
{
m_impl = new RasterImpl();
}
CudaRaster::~CudaRaster()
{
delete m_impl;
}
void CudaRaster::setViewportSize(int width, int height, int numImages)
{
m_impl->setViewportSize(Vec3i(width, height, numImages));
}
void CudaRaster::setRenderModeFlags(U32 flags)
{
m_impl->setRenderModeFlags(flags);
}
void CudaRaster::deferredClear(U32 clearColor)
{
m_impl->deferredClear(clearColor);
}
void CudaRaster::setVertexBuffer(void* vertices, int numVertices)
{
m_impl->setVertexBuffer(vertices, numVertices);
}
void CudaRaster::setIndexBuffer(void* indices, int numTriangles)
{
m_impl->setIndexBuffer(indices, numTriangles);
}
bool CudaRaster::drawTriangles(const int* ranges, cudaStream_t stream)
{
return m_impl->drawTriangles((const Vec2i*)ranges, stream);
}
void* CudaRaster::getColorBuffer(void)
{
return m_impl->getColorBuffer();
}
void* CudaRaster::getDepthBuffer(void)
{
return m_impl->getDepthBuffer();
}
void CudaRaster::swapDepthAndPeel(void)
{
m_impl->swapDepthAndPeel();
}
//------------------------------------------------------------------------
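// Illustrative host-side usage sketch (not part of the original source).
// The function name, buffer pointers, viewport size, and clear color below
// are hypothetical; only the CudaRaster member functions are taken from the
// interface above.
static void exampleDrawOnce(void* d_clipVerts, int numVertices, void* d_triIndices, int numTriangles)
{
    CR::CudaRaster cr;
    cr.setViewportSize(1024, 768, 1);               // Width/height must be multiples of the 8x8 tile size.
    cr.setVertexBuffer(d_clipVerts, numVertices);   // numVertices * float4(x, y, z, w) clip-space positions.
    cr.setIndexBuffer(d_triIndices, numTriangles);  // numTriangles * int3 vertex indices.
    cr.deferredClear(0xff000000u);                  // Clear color/depth lazily during fine rasterization.
    bool ok = cr.drawTriangles(NULL, 0);            // NULL ranges => instanced mode, default CUDA stream.
    void* d_color = cr.getColorBuffer();            // widthPixels * heightPixels * numImages U32 values.
    void* d_depth = cr.getDepthBuffer();
    (void)ok; (void)d_color; (void)d_depth;
}
//------------------------------------------------------------------------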
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#pragma once
#include <cuda_runtime.h>
#include <cstdint>
namespace CR
{
//------------------------------------------------------------------------
#ifndef NULL
# define NULL 0
#endif
#ifdef __CUDACC__
# define CR_CUDA 1
#else
# define CR_CUDA 0
#endif
#if CR_CUDA
# define CR_CUDA_FUNC __device__ __inline__
# define CR_CUDA_CONST __constant__
#else
# define CR_CUDA_FUNC inline
# define CR_CUDA_CONST static const
#endif
#define CR_UNREF(X) ((void)(X))
#define CR_ARRAY_SIZE(X) ((int)(sizeof(X) / sizeof((X)[0])))
//------------------------------------------------------------------------
typedef uint8_t U8;
typedef uint16_t U16;
typedef uint32_t U32;
typedef uint64_t U64;
typedef int8_t S8;
typedef int16_t S16;
typedef int32_t S32;
typedef int64_t S64;
typedef float F32;
typedef double F64;
typedef void (*FuncPtr)(void);
//------------------------------------------------------------------------
#define CR_U32_MAX (0xFFFFFFFFu)
#define CR_S32_MIN (~0x7FFFFFFF)
#define CR_S32_MAX (0x7FFFFFFF)
#define CR_U64_MAX ((U64)(S64)-1)
#define CR_S64_MIN ((S64)-1 << 63)
#define CR_S64_MAX (~((S64)-1 << 63))
#define CR_F32_MIN (1.175494351e-38f)
#define CR_F32_MAX (3.402823466e+38f)
#define CR_F64_MIN (2.2250738585072014e-308)
#define CR_F64_MAX (1.7976931348623158e+308)
//------------------------------------------------------------------------
// Misc types.
class Vec2i
{
public:
Vec2i(int x_, int y_) : x(x_), y(y_) {}
int x, y;
};
class Vec3i
{
public:
Vec3i(int x_, int y_, int z_) : x(x_), y(y_), z(z_) {}
int x, y, z;
};
//------------------------------------------------------------------------
// CUDA utilities.
#if CR_CUDA
# define globalThreadIdx (threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * (blockIdx.x + gridDim.x * blockIdx.y)))
#endif
//------------------------------------------------------------------------
} // namespace CR
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
//------------------------------------------------------------------------
// Utility funcs.
//------------------------------------------------------------------------
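// Each warp keeps one 8x8 tile's color and depth in shared memory.
// tileZMax caches the maximum depth in the tile so whole triangles can be
// rejected early (earlyZCull); tileZUpd marks the cached maximum as stale
// so updateTileZMax recomputes it before the next batch of triangles.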
__device__ __inline__ void initTileZMax(U32& tileZMax, bool& tileZUpd, volatile U32* tileDepth)
{
tileZMax = CR_DEPTH_MAX;
tileZUpd = (::min(tileDepth[threadIdx.x], tileDepth[threadIdx.x + 32]) < tileZMax);
}
__device__ __inline__ void updateTileZMax(U32& tileZMax, bool& tileZUpd, volatile U32* tileDepth, volatile U32* temp)
{
// Entry is warp-coherent.
if (__any_sync(~0u, tileZUpd))
{
U32 z = ::max(tileDepth[threadIdx.x], tileDepth[threadIdx.x + 32]); __syncwarp();
temp[threadIdx.x + 16] = z; __syncwarp();
z = ::max(z, temp[threadIdx.x + 16 - 1]); __syncwarp(); temp[threadIdx.x + 16] = z; __syncwarp();
z = ::max(z, temp[threadIdx.x + 16 - 2]); __syncwarp(); temp[threadIdx.x + 16] = z; __syncwarp();
z = ::max(z, temp[threadIdx.x + 16 - 4]); __syncwarp(); temp[threadIdx.x + 16] = z; __syncwarp();
z = ::max(z, temp[threadIdx.x + 16 - 8]); __syncwarp(); temp[threadIdx.x + 16] = z; __syncwarp();
z = ::max(z, temp[threadIdx.x + 16 - 16]); __syncwarp(); temp[threadIdx.x + 16] = z; __syncwarp();
tileZMax = temp[47];
tileZUpd = false;
}
}
//------------------------------------------------------------------------
__device__ __inline__ void getTriangle(const CRParams& p, S32& triIdx, S32& dataIdx, uint4& triHeader, S32& segment)
{
const CRTriangleHeader* triHeaderPtr = (const CRTriangleHeader*)p.triHeader + blockIdx.z * p.maxSubtris;
const S32* tileSegData = (const S32*)p.tileSegData + p.maxTileSegs * CR_TILE_SEG_SIZE * blockIdx.z;
const S32* tileSegNext = (const S32*)p.tileSegNext + p.maxTileSegs * blockIdx.z;
const S32* tileSegCount = (const S32*)p.tileSegCount + p.maxTileSegs * blockIdx.z;
if (threadIdx.x >= tileSegCount[segment])
{
triIdx = -1;
dataIdx = -1;
}
else
{
int subtriIdx = tileSegData[segment * CR_TILE_SEG_SIZE + threadIdx.x];
triIdx = subtriIdx >> 3;
dataIdx = triIdx;
subtriIdx &= 7;
if (subtriIdx != 7)
dataIdx = triHeaderPtr[triIdx].misc + subtriIdx;
triHeader = *((uint4*)triHeaderPtr + dataIdx);
}
// advance to next segment
segment = tileSegNext[segment];
}
//------------------------------------------------------------------------
__device__ __inline__ bool earlyZCull(uint4 triHeader, U32 tileZMax)
{
U32 zmin = triHeader.w & 0xFFFFF000;
return (zmin > tileZMax);
}
//------------------------------------------------------------------------
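// Compute a 64-bit coverage mask (one bit per pixel) for the triangle
// within the given 8x8 tile by intersecting the three per-edge coverage
// masks produced by cover8x8_exact_fast.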
__device__ __inline__ U64 trianglePixelCoverage(const CRParams& p, const uint4& triHeader, int tileX, int tileY, volatile U64* s_cover8x8_lut)
{
int baseX = (tileX << (CR_TILE_LOG2 + CR_SUBPIXEL_LOG2)) - ((p.widthPixels - 1) << (CR_SUBPIXEL_LOG2 - 1));
int baseY = (tileY << (CR_TILE_LOG2 + CR_SUBPIXEL_LOG2)) - ((p.heightPixels - 1) << (CR_SUBPIXEL_LOG2 - 1));
// extract S16 vertex positions while subtracting tile coordinates
S32 v0x = sub_s16lo_s16lo(triHeader.x, baseX);
S32 v0y = sub_s16hi_s16lo(triHeader.x, baseY);
S32 v01x = sub_s16lo_s16lo(triHeader.y, triHeader.x);
S32 v01y = sub_s16hi_s16hi(triHeader.y, triHeader.x);
S32 v20x = sub_s16lo_s16lo(triHeader.x, triHeader.z);
S32 v20y = sub_s16hi_s16hi(triHeader.x, triHeader.z);
// extract flipbits
U32 f01 = (triHeader.w >> 6) & 0x3C;
U32 f12 = (triHeader.w >> 2) & 0x3C;
U32 f20 = (triHeader.w << 2) & 0x3C;
// compute per-edge coverage masks
U64 c01, c12, c20;
c01 = cover8x8_exact_fast(v0x, v0y, v01x, v01y, f01, s_cover8x8_lut);
c12 = cover8x8_exact_fast(v0x + v01x, v0y + v01y, -v01x - v20x, -v01y - v20y, f12, s_cover8x8_lut);
c20 = cover8x8_exact_fast(v0x, v0y, v20x, v20y, f20, s_cover8x8_lut);
// combine masks
return c01 & c12 & c20;
}
//------------------------------------------------------------------------
__device__ __inline__ U32 scan32_value(U32 value, volatile U32* temp)
{
__syncwarp();
temp[threadIdx.x + 16] = value; __syncwarp();
value += temp[threadIdx.x + 16 - 1]; __syncwarp(); temp[threadIdx.x + 16] = value; __syncwarp();
value += temp[threadIdx.x + 16 - 2]; __syncwarp(); temp[threadIdx.x + 16] = value; __syncwarp();
value += temp[threadIdx.x + 16 - 4]; __syncwarp(); temp[threadIdx.x + 16] = value; __syncwarp();
value += temp[threadIdx.x + 16 - 8]; __syncwarp(); temp[threadIdx.x + 16] = value; __syncwarp();
value += temp[threadIdx.x + 16 - 16]; __syncwarp(); temp[threadIdx.x + 16] = value; __syncwarp();
return value;
}
__device__ __inline__ volatile const U32& scan32_total(volatile U32* temp)
{
return temp[47];
}
//------------------------------------------------------------------------
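// Return the bit position of the idx-th (0-based) set bit of a 64-bit
// coverage mask, used to map a fragment index back to its pixel within
// the 8x8 tile.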
__device__ __inline__ S32 findBit(U64 mask, int idx)
{
U32 x = getLo(mask);
int pop = __popc(x);
bool p = (pop <= idx);
if (p) x = getHi(mask);
if (p) idx -= pop;
int bit = p ? 32 : 0;
pop = __popc(x & 0x0000ffffu);
p = (pop <= idx);
if (p) x >>= 16;
if (p) bit += 16;
if (p) idx -= pop;
U32 tmp = x & 0x000000ffu;
pop = __popc(tmp);
p = (pop <= idx);
if (p) tmp = x & 0x0000ff00u;
if (p) idx -= pop;
return findLeadingOne(tmp) + bit - idx;
}
//------------------------------------------------------------------------
// Single-sample implementation.
//------------------------------------------------------------------------
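// Resolve all fragments in the warp that target the same pixel: atomicMin
// selects the nearest depth, and ties are broken by thread index so that
// exactly one thread writes the final depth/color pair.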
__device__ __inline__ void executeROP(U32 color, U32 depth, volatile U32* pColor, volatile U32* pDepth, U32 ropMask)
{
atomicMin((U32*)pDepth, depth);
__syncwarp(ropMask);
bool act = (depth == *pDepth);
__syncwarp(ropMask);
U32 actMask = __ballot_sync(ropMask, act);
if (act)
{
*pDepth = 0;
__syncwarp(actMask);
atomicMax((U32*)pDepth, threadIdx.x);
__syncwarp(actMask);
if (*pDepth == threadIdx.x)
{
*pDepth = depth;
*pColor = color;
}
__syncwarp(actMask);
}
}
//------------------------------------------------------------------------
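// Fine rasterizer: each warp repeatedly claims an active tile, streams the
// tile's triangle segments, computes per-triangle coverage masks, expands
// them into individual fragments, and performs the depth test and color
// write entirely in shared memory before storing the finished 8x8 tile
// back to the framebuffer.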
__device__ __inline__ void fineRasterImpl(const CRParams p)
{
// for 20 warps:
__shared__ volatile U64 s_cover8x8_lut[CR_COVER8X8_LUT_SIZE]; // 6KB
__shared__ volatile U32 s_tileColor [CR_FINE_MAX_WARPS][CR_TILE_SQR]; // 5KB
__shared__ volatile U32 s_tileDepth [CR_FINE_MAX_WARPS][CR_TILE_SQR]; // 5KB
__shared__ volatile U32 s_tilePeel [CR_FINE_MAX_WARPS][CR_TILE_SQR]; // 5KB
__shared__ volatile U32 s_triDataIdx [CR_FINE_MAX_WARPS][64]; // 5KB CRTriangleData index
__shared__ volatile U64 s_triangleCov [CR_FINE_MAX_WARPS][64]; // 10KB coverage mask
__shared__ volatile U32 s_triangleFrag[CR_FINE_MAX_WARPS][64]; // 5KB fragment index
__shared__ volatile U32 s_temp [CR_FINE_MAX_WARPS][80]; // 6.25KB
// = 47.25KB total
CRAtomics& atomics = p.atomics[blockIdx.z];
const CRTriangleData* triData = (const CRTriangleData*)p.triData + blockIdx.z * p.maxSubtris;
const S32* activeTiles = (const S32*)p.activeTiles + CR_MAXTILES_SQR * blockIdx.z;
const S32* tileFirstSeg = (const S32*)p.tileFirstSeg + CR_MAXTILES_SQR * blockIdx.z;
volatile U32* tileColor = s_tileColor[threadIdx.y];
volatile U32* tileDepth = s_tileDepth[threadIdx.y];
volatile U32* tilePeel = s_tilePeel[threadIdx.y];
volatile U32* triDataIdx = s_triDataIdx[threadIdx.y];
volatile U64* triangleCov = s_triangleCov[threadIdx.y];
volatile U32* triangleFrag = s_triangleFrag[threadIdx.y];
volatile U32* temp = s_temp[threadIdx.y];
if (atomics.numSubtris > p.maxSubtris || atomics.numBinSegs > p.maxBinSegs || atomics.numTileSegs > p.maxTileSegs)
return;
temp[threadIdx.x] = 0; // first 16 elements of temp are always zero
cover8x8_setupLUT(s_cover8x8_lut);
__syncthreads();
// loop over tiles
for (;;)
{
// pick a tile
if (threadIdx.x == 0)
temp[16] = atomicAdd(&atomics.fineCounter, 1);
__syncwarp();
int activeIdx = temp[16];
if (activeIdx >= atomics.numActiveTiles)
break;
int tileIdx = activeTiles[activeIdx];
S32 segment = tileFirstSeg[tileIdx];
int tileY = tileIdx / p.widthTiles;
int tileX = tileIdx - tileY * p.widthTiles;
int px = (tileX << CR_TILE_LOG2) + (threadIdx.x & (CR_TILE_SIZE - 1));
int py = (tileY << CR_TILE_LOG2) + (threadIdx.x >> CR_TILE_LOG2);
// initialize per-tile state
int triRead = 0, triWrite = 0;
int fragRead = 0, fragWrite = 0;
if (threadIdx.x == 0)
triangleFrag[63] = 0; // "previous triangle"
// deferred clear => clear tile
if (p.deferredClear)
{
tileColor[threadIdx.x] = p.clearColor;
tileDepth[threadIdx.x] = p.clearDepth;
tileColor[threadIdx.x + 32] = p.clearColor;
tileDepth[threadIdx.x + 32] = p.clearDepth;
}
else // otherwise => read tile from framebuffer
{
U32* pColor = (U32*)p.colorBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
U32* pDepth = (U32*)p.depthBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
tileColor[threadIdx.x] = pColor[px + p.widthPixels * py];
tileDepth[threadIdx.x] = pDepth[px + p.widthPixels * py];
tileColor[threadIdx.x + 32] = pColor[px + p.widthPixels * (py + 4)];
tileDepth[threadIdx.x + 32] = pDepth[px + p.widthPixels * (py + 4)];
}
// read peeling inputs if enabled
if (p.renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling)
{
U32* pPeel = (U32*)p.peelBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
tilePeel[threadIdx.x] = pPeel[px + p.widthPixels * py];
tilePeel[threadIdx.x + 32] = pPeel[px + p.widthPixels * (py + 4)];
}
U32 tileZMax;
bool tileZUpd;
initTileZMax(tileZMax, tileZUpd, tileDepth);
// process fragments
for(;;)
{
// need to queue more fragments?
if (fragWrite - fragRead < 32 && segment >= 0)
{
// update tile z - coherent over warp
updateTileZMax(tileZMax, tileZUpd, tileDepth, temp);
// read triangles
do
{
// read triangle index and data, advance to next segment
S32 triIdx, dataIdx;
uint4 triHeader;
getTriangle(p, triIdx, dataIdx, triHeader, segment);
// early z cull
if (triIdx >= 0 && earlyZCull(triHeader, tileZMax))
triIdx = -1;
// determine coverage
U64 coverage = trianglePixelCoverage(p, triHeader, tileX, tileY, s_cover8x8_lut);
S32 pop = (triIdx == -1) ? 0 : __popcll(coverage);
// fragment count scan
U32 frag = scan32_value(pop, temp);
frag += fragWrite; // frag now holds cumulative fragment count
fragWrite += scan32_total(temp);
// queue non-empty triangles
U32 goodMask = __ballot_sync(~0u, pop != 0);
if (pop != 0)
{
int idx = (triWrite + __popc(goodMask & getLaneMaskLt())) & 63;
triDataIdx [idx] = dataIdx;
triangleFrag[idx] = frag;
triangleCov [idx] = coverage;
}
triWrite += __popc(goodMask);
}
while (fragWrite - fragRead < 32 && segment >= 0);
}
__syncwarp();
// end of segment?
if (fragRead == fragWrite)
break;
// clear triangle boundaries
temp[threadIdx.x + 16] = 0;
__syncwarp();
// tag triangle boundaries
if (triRead + threadIdx.x < triWrite)
{
int idx = triangleFrag[(triRead + threadIdx.x) & 63] - fragRead;
if (idx <= 32)
temp[idx + 16 - 1] = 1;
}
__syncwarp();
int ropLaneIdx = threadIdx.x;
U32 boundaryMask = __ballot_sync(~0u, temp[ropLaneIdx + 16]);
// distribute fragments
bool hasFragment = (ropLaneIdx < fragWrite - fragRead);
U32 fragmentMask = __ballot_sync(~0u, hasFragment);
if (hasFragment)
{
int triBufIdx = (triRead + __popc(boundaryMask & getLaneMaskLt())) & 63;
int fragIdx = add_sub(fragRead, ropLaneIdx, triangleFrag[(triBufIdx - 1) & 63]);
U64 coverage = triangleCov[triBufIdx];
int pixelInTile = findBit(coverage, fragIdx);
int dataIdx = triDataIdx[triBufIdx];
// determine pixel position
U32 pixelX = (tileX << CR_TILE_LOG2) + (pixelInTile & 7);
U32 pixelY = (tileY << CR_TILE_LOG2) + (pixelInTile >> 3);
// depth test
U32 depth = 0;
uint4 td = *((uint4*)triData + dataIdx * (sizeof(CRTriangleData) >> 4));
depth = td.x * pixelX + td.y * pixelY + td.z;
bool zkill = (p.renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling) && (depth <= tilePeel[pixelInTile]);
if (!zkill)
{
U32 oldDepth = tileDepth[pixelInTile];
if (depth > oldDepth)
zkill = true;
else if (oldDepth == tileZMax)
tileZUpd = true; // we are replacing previous zmax => need to update
}
U32 ropMask = __ballot_sync(fragmentMask, !zkill);
if (!zkill)
executeROP(td.w, depth, &tileColor[pixelInTile], &tileDepth[pixelInTile], ropMask);
}
// no need to sync, as next up is updateTileZMax that does internal warp sync
// update counters
fragRead = ::min(fragRead + 32, fragWrite);
triRead += __popc(boundaryMask);
}
// Write tile back to the framebuffer.
if (true)
{
int px = (tileX << CR_TILE_LOG2) + (threadIdx.x & (CR_TILE_SIZE - 1));
int py = (tileY << CR_TILE_LOG2) + (threadIdx.x >> CR_TILE_LOG2);
U32* pColor = (U32*)p.colorBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
U32* pDepth = (U32*)p.depthBuffer + p.widthPixels * p.heightPixels * blockIdx.z;
pColor[px + p.widthPixels * py] = tileColor[threadIdx.x];
pDepth[px + p.widthPixels * py] = tileDepth[threadIdx.x];
pColor[px + p.widthPixels * (py + 4)] = tileColor[threadIdx.x + 32];
pDepth[px + p.widthPixels * (py + 4)] = tileDepth[threadIdx.x + 32];
}
}
}
//------------------------------------------------------------------------
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#pragma once
#include "Defs.hpp"
#include "Constants.hpp"
namespace CR
{
//------------------------------------------------------------------------
// Projected triangle.
//------------------------------------------------------------------------
struct CRTriangleHeader
{
S16 v0x; // Subpixels relative to viewport center. Valid if triSubtris = 1.
S16 v0y;
S16 v1x;
S16 v1y;
S16 v2x;
S16 v2y;
U32 misc; // triSubtris=1: (zmin:20, f01:4, f12:4, f20:4), triSubtris>=2: (subtriBase)
};
//------------------------------------------------------------------------
struct CRTriangleData
{
U32 zx; // zx * sampleX + zy * sampleY + zb = lerp(CR_DEPTH_MIN, CR_DEPTH_MAX, (clipZ / clipW + 1) / 2)
U32 zy;
U32 zb;
U32 id; // Triangle id.
};
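// The fine rasterizer evaluates depth at pixel (x, y) as zx * x + zy * y + zb
// in fixed point (see fineRasterImpl), i.e. clip-space depth remapped to the
// [CR_DEPTH_MIN, CR_DEPTH_MAX] range.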
//------------------------------------------------------------------------
// Device-side structures.
//------------------------------------------------------------------------
struct CRAtomics
{
// Setup.
S32 numSubtris; // = numTris
// Bin.
S32 binCounter; // = 0
S32 numBinSegs; // = 0
// Coarse.
S32 coarseCounter; // = 0
S32 numTileSegs; // = 0
S32 numActiveTiles; // = 0
// Fine.
S32 fineCounter; // = 0
};
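// These counters double as work queues and overflow detectors: the *Counter
// fields are consumed with atomicAdd to hand out work to the persistently
// running blocks, while numSubtris/numBinSegs/numTileSegs record how much of
// each preallocated buffer was used so that overflows can be detected.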
//------------------------------------------------------------------------
struct CRImageParams
{
S32 triOffset; // First triangle index to draw.
S32 triCount; // Number of triangles to draw.
S32 binBatchSize; // Number of triangles per batch.
};
//------------------------------------------------------------------------
struct CRParams
{
// Common.
CRAtomics* atomics; // Work counters. Per-image.
S32 numImages; // Batch size.
S32 totalCount; // In range mode, total number of triangles to render.
S32 instanceMode; // 0 = range mode, 1 = instance mode.
S32 numVertices; // Number of vertices in input buffer, not counting multiples in instance mode.
S32 numTriangles; // Number of triangles in input buffer.
void* vertexBuffer; // numVertices * float4(x, y, z, w)
void* indexBuffer; // numTriangles * int3(vi0, vi1, vi2)
S32 widthPixels; // Viewport size in pixels. Must be multiple of tile size (8x8).
S32 heightPixels;
S32 widthBins; // widthPixels / CR_BIN_SIZE
S32 heightBins; // heightPixels / CR_BIN_SIZE
S32 numBins; // widthBins * heightBins
S32 widthTiles; // widthPixels / CR_TILE_SIZE
S32 heightTiles; // heightPixels / CR_TILE_SIZE
S32 numTiles; // widthTiles * heightTiles
U32 renderModeFlags;
S32 deferredClear; // 1 = Clear framebuffer before rendering triangles.
U32 clearColor;
U32 clearDepth;
// These are uniform across batch.
S32 maxSubtris;
S32 maxBinSegs;
S32 maxTileSegs;
// Setup output / bin input.
void* triSubtris; // maxSubtris * U8
void* triHeader; // maxSubtris * CRTriangleHeader
void* triData; // maxSubtris * CRTriangleData
// Bin output / coarse input.
void* binSegData; // maxBinSegs * CR_BIN_SEG_SIZE * S32
void* binSegNext; // maxBinSegs * S32
void* binSegCount; // maxBinSegs * S32
void* binFirstSeg; // CR_MAXBINS_SQR * CR_BIN_STREAMS_SIZE * (S32 segIdx), -1 = none
void* binTotal; // CR_MAXBINS_SQR * CR_BIN_STREAMS_SIZE * (S32 numTris)
// Coarse output / fine input.
void* tileSegData; // maxTileSegs * CR_TILE_SEG_SIZE * S32
void* tileSegNext; // maxTileSegs * S32
void* tileSegCount; // maxTileSegs * S32
void* activeTiles; // CR_MAXTILES_SQR * (S32 tileIdx)
void* tileFirstSeg; // CR_MAXTILES_SQR * (S32 segIdx), -1 = none
// Surface buffers.
void* colorBuffer; // sizePixels.x * sizePixels.y * numImages * U32
void* depthBuffer; // sizePixels.x * sizePixels.y * numImages * U32
void* peelBuffer; // sizePixels.x * sizePixels.y * numImages * U32, only if peeling enabled.
// Per-image parameters for the first CR_EMBED_IMAGE_PARAMS images are embedded here to avoid an extra memcpy for small batches.
CRImageParams imageParamsFirst[CR_EMBED_IMAGE_PARAMS];
const CRImageParams* imageParamsExtra; // After CR_EMBED_IMAGE_PARAMS.
};
//------------------------------------------------------------------------
}
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#include "../../framework.h"
#include "PrivateDefs.hpp"
#include "Constants.hpp"
#include "RasterImpl.hpp"
#include <cuda_runtime.h>
using namespace CR;
using std::min;
using std::max;
//------------------------------------------------------------------------
// Kernel prototypes and variables.
void triangleSetupKernel (const CRParams p);
void binRasterKernel (const CRParams p);
void coarseRasterKernel (const CRParams p);
void fineRasterKernel (const CRParams p);
//------------------------------------------------------------------------
RasterImpl::RasterImpl(void)
: m_renderModeFlags (0),
m_deferredClear (false),
m_clearColor (0),
m_vertexPtr (NULL),
m_indexPtr (NULL),
m_numVertices (0),
m_numTriangles (0),
m_bufferSizesReported (0),
m_numImages (0),
m_sizePixels (0, 0),
m_sizeBins (0, 0),
m_numBins (0),
m_sizeTiles (0, 0),
m_numTiles (0),
m_numSMs (1),
m_numCoarseBlocksPerSM (1),
m_numFineBlocksPerSM (1),
m_numFineWarpsPerBlock (1),
m_maxSubtris (1),
m_maxBinSegs (1),
m_maxTileSegs (1)
{
// Query relevant device attributes.
int currentDevice = 0;
NVDR_CHECK_CUDA_ERROR(cudaGetDevice(&currentDevice));
NVDR_CHECK_CUDA_ERROR(cudaDeviceGetAttribute(&m_numSMs, cudaDevAttrMultiProcessorCount, currentDevice));
cudaFuncAttributes attr;
NVDR_CHECK_CUDA_ERROR(cudaFuncGetAttributes(&attr, (void*)fineRasterKernel));
m_numFineWarpsPerBlock = min(attr.maxThreadsPerBlock / 32, CR_FINE_MAX_WARPS);
NVDR_CHECK_CUDA_ERROR(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&m_numCoarseBlocksPerSM, (void*)coarseRasterKernel, 32 * CR_COARSE_WARPS, 0));
NVDR_CHECK_CUDA_ERROR(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&m_numFineBlocksPerSM, (void*)fineRasterKernel, 32 * m_numFineWarpsPerBlock, 0));
// Setup functions.
NVDR_CHECK_CUDA_ERROR(cudaFuncSetCacheConfig((void*)triangleSetupKernel, cudaFuncCachePreferShared));
NVDR_CHECK_CUDA_ERROR(cudaFuncSetCacheConfig((void*)binRasterKernel, cudaFuncCachePreferShared));
NVDR_CHECK_CUDA_ERROR(cudaFuncSetCacheConfig((void*)coarseRasterKernel, cudaFuncCachePreferShared));
NVDR_CHECK_CUDA_ERROR(cudaFuncSetCacheConfig((void*)fineRasterKernel, cudaFuncCachePreferShared));
}
//------------------------------------------------------------------------
RasterImpl::~RasterImpl(void)
{
// Empty.
}
//------------------------------------------------------------------------
void RasterImpl::setViewportSize(Vec3i size)
{
if ((size.x | size.y) & (CR_TILE_SIZE - 1))
return; // Invalid size.
m_numImages = size.z;
m_sizePixels = Vec2i(size.x, size.y);
m_sizeTiles.x = m_sizePixels.x >> CR_TILE_LOG2;
m_sizeTiles.y = m_sizePixels.y >> CR_TILE_LOG2;
m_numTiles = m_sizeTiles.x * m_sizeTiles.y;
m_sizeBins.x = (m_sizeTiles.x + CR_BIN_SIZE - 1) >> CR_BIN_LOG2;
m_sizeBins.y = (m_sizeTiles.y + CR_BIN_SIZE - 1) >> CR_BIN_LOG2;
m_numBins = m_sizeBins.x * m_sizeBins.y;
m_colorBuffer.reset(m_sizePixels.x * m_sizePixels.y * m_numImages * sizeof(U32));
m_depthBuffer.reset(m_sizePixels.x * m_sizePixels.y * m_numImages * sizeof(U32));
}
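// Note on the sizing above: the framebuffer is split into 8x8-pixel tiles
// (CR_TILE_SIZE) and the tiles are further grouped into bins of
// CR_BIN_SIZE x CR_BIN_SIZE tiles. The bin and coarse stages route triangles
// through this two-level hierarchy so that the fine rasterizer only visits
// tiles that can actually contain work.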
void RasterImpl::swapDepthAndPeel(void)
{
m_peelBuffer.reset(m_depthBuffer.getSize()); // Ensure equal size and valid pointer.
void* tmp = m_depthBuffer.getPtr();
m_depthBuffer.setPtr(m_peelBuffer.getPtr());
m_peelBuffer.setPtr(tmp);
}
//------------------------------------------------------------------------
bool RasterImpl::drawTriangles(const Vec2i* ranges, cudaStream_t stream)
{
bool instanceMode = (!ranges);
int maxSubtrisSlack = 4096; // x 81B = 324KB
int maxBinSegsSlack = 256; // x 2137B = 534KB
int maxTileSegsSlack = 4096; // x 136B = 544KB
// Resize atomics as needed.
m_crAtomics .grow(m_numImages * sizeof(CRAtomics));
// Size of these buffers doesn't depend on input.
m_binFirstSeg .grow(m_numImages * CR_MAXBINS_SQR * CR_BIN_STREAMS_SIZE * sizeof(S32));
m_binTotal .grow(m_numImages * CR_MAXBINS_SQR * CR_BIN_STREAMS_SIZE * sizeof(S32));
m_activeTiles .grow(m_numImages * CR_MAXTILES_SQR * sizeof(S32));
m_tileFirstSeg .grow(m_numImages * CR_MAXTILES_SQR * sizeof(S32));
// Construct per-image parameters and determine worst-case buffer sizes.
std::vector<CRImageParams> imageParams(m_numImages);
for (int i=0; i < m_numImages; i++)
{
CRImageParams& ip = imageParams[i];
int roundSize = CR_BIN_WARPS * 32;
int minBatches = CR_BIN_STREAMS_SIZE * 2;
int maxRounds = 32;
ip.triOffset = instanceMode ? 0 : ranges[i].x;
ip.triCount = instanceMode ? m_numTriangles : ranges[i].y;
ip.binBatchSize = min(max(ip.triCount / (roundSize * minBatches), 1), maxRounds) * roundSize;
m_maxSubtris = max(m_maxSubtris, min(ip.triCount + maxSubtrisSlack, CR_MAXSUBTRIS_SIZE));
m_maxBinSegs = max(m_maxBinSegs, max(m_numBins * CR_BIN_STREAMS_SIZE, (ip.triCount - 1) / CR_BIN_SEG_SIZE + 1) + maxBinSegsSlack);
m_maxTileSegs = max(m_maxTileSegs, max(m_numTiles, (ip.triCount - 1) / CR_TILE_SEG_SIZE + 1) + maxTileSegsSlack);
}
// Retry until successful.
for (;;)
{
// Allocate buffers.
m_triSubtris.reset(m_numImages * m_maxSubtris * sizeof(U8));
m_triHeader .reset(m_numImages * m_maxSubtris * sizeof(CRTriangleHeader));
m_triData .reset(m_numImages * m_maxSubtris * sizeof(CRTriangleData));
m_binSegData .reset(m_numImages * m_maxBinSegs * CR_BIN_SEG_SIZE * sizeof(S32));
m_binSegNext .reset(m_numImages * m_maxBinSegs * sizeof(S32));
m_binSegCount.reset(m_numImages * m_maxBinSegs * sizeof(S32));
m_tileSegData .reset(m_numImages * m_maxTileSegs * CR_TILE_SEG_SIZE * sizeof(S32));
m_tileSegNext .reset(m_numImages * m_maxTileSegs * sizeof(S32));
m_tileSegCount.reset(m_numImages * m_maxTileSegs * sizeof(S32));
// Report if buffers grew since last time.
size_t sizesTotal = getTotalBufferSizes();
if (sizesTotal > m_bufferSizesReported)
{
size_t sizesMB = ((sizesTotal - 1) >> 20) + 1; // Round up.
sizesMB = ((sizesMB + 9) / 10) * 10; // 10MB granularity enough in this day and age.
LOG(INFO) << "Internal buffers grown to " << sizesMB << " MB";
m_bufferSizesReported = sizesMB << 20;
}
// Launch stages.
launchStages(&imageParams[0], instanceMode, stream);
// Get atomics.
std::vector<CRAtomics> atomics(m_numImages);
NVDR_CHECK_CUDA_ERROR(cudaMemcpyAsync(&atomics[0], m_crAtomics.getPtr(), sizeof(CRAtomics) * m_numImages, cudaMemcpyDeviceToHost, stream));
NVDR_CHECK_CUDA_ERROR(cudaStreamSynchronize(stream)); // Make sure the counters have landed before inspecting them on the host.
// Success?
bool failed = false;
for (int i=0; i < m_numImages; i++)
{
const CRAtomics& a = atomics[i];
failed = failed || (a.numSubtris > m_maxSubtris) || (a.numBinSegs > m_maxBinSegs) || (a.numTileSegs > m_maxTileSegs);
}
if (!failed)
break; // Success!
// If we were already at maximum capacity, no can do.
if (m_maxSubtris == CR_MAXSUBTRIS_SIZE)
return false;
// Enlarge buffers and try again.
for (int i=0; i < m_numImages; i++)
{
const CRAtomics& a = atomics[i];
m_maxSubtris = max(m_maxSubtris, min(a.numSubtris + maxSubtrisSlack, CR_MAXSUBTRIS_SIZE));
m_maxBinSegs = max(m_maxBinSegs, a.numBinSegs + maxBinSegsSlack);
m_maxTileSegs = max(m_maxTileSegs, a.numTileSegs + maxTileSegsSlack);
}
}
m_deferredClear = false;
return true; // Success.
}
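// Note on the retry loop above: worst-case buffer sizes cannot be known before
// triangle setup has run, so the stages are launched with the current estimates
// and the per-image atomics are inspected afterwards. If any counter overflowed
// its buffer, the estimates are raised to the observed counts plus slack and the
// frame is rendered again from scratch. The only unrecoverable case is hitting
// CR_MAXSUBTRIS_SIZE, in which case drawTriangles() reports failure.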
//------------------------------------------------------------------------
size_t RasterImpl::getTotalBufferSizes(void) const
{
return
m_colorBuffer.getSize() + m_depthBuffer.getSize() + // Don't include atomics and image params.
m_triSubtris.getSize() + m_triHeader.getSize() + m_triData.getSize() +
m_binFirstSeg.getSize() + m_binTotal.getSize() + m_binSegData.getSize() + m_binSegNext.getSize() + m_binSegCount.getSize() +
m_activeTiles.getSize() + m_tileFirstSeg.getSize() + m_tileSegData.getSize() + m_tileSegNext.getSize() + m_tileSegCount.getSize();
}
//------------------------------------------------------------------------
void RasterImpl::launchStages(const CRImageParams* imageParams, bool instanceMode, cudaStream_t stream)
{
// Initialize atomics to mostly zero.
{
std::vector<CRAtomics> atomics(m_numImages);
memset(&atomics[0], 0, m_numImages * sizeof(CRAtomics));
for (int i=0; i < m_numImages; i++)
atomics[i].numSubtris = imageParams[i].triCount;
NVDR_CHECK_CUDA_ERROR(cudaMemcpyAsync(m_crAtomics.getPtr(), &atomics[0], m_numImages * sizeof(CRAtomics), cudaMemcpyHostToDevice, stream));
}
// Copy per-image parameters if there are more than will fit in the launch parameter block.
if (m_numImages > CR_EMBED_IMAGE_PARAMS)
{
int numImageParamsExtra = m_numImages - CR_EMBED_IMAGE_PARAMS;
m_crImageParamsExtra.grow(numImageParamsExtra * sizeof(CRImageParams));
NVDR_CHECK_CUDA_ERROR(cudaMemcpyAsync(m_crImageParamsExtra.getPtr(), imageParams + CR_EMBED_IMAGE_PARAMS, numImageParamsExtra * sizeof(CRImageParams), cudaMemcpyHostToDevice, stream));
}
// Set global parameters.
CRParams p;
{
p.atomics = (CRAtomics*)m_crAtomics.getPtr();
p.numImages = m_numImages;
p.totalCount = 0; // Only relevant in range mode.
p.instanceMode = instanceMode ? 1 : 0;
p.numVertices = m_numVertices;
p.numTriangles = m_numTriangles;
p.vertexBuffer = m_vertexPtr;
p.indexBuffer = m_indexPtr;
p.widthPixels = m_sizePixels.x;
p.heightPixels = m_sizePixels.y;
p.widthBins = m_sizeBins.x;
p.heightBins = m_sizeBins.y;
p.numBins = m_numBins;
p.widthTiles = m_sizeTiles.x;
p.heightTiles = m_sizeTiles.y;
p.numTiles = m_numTiles;
p.renderModeFlags = m_renderModeFlags;
p.deferredClear = m_deferredClear ? 1 : 0;
p.clearColor = m_clearColor;
p.clearDepth = CR_DEPTH_MAX;
p.maxSubtris = m_maxSubtris;
p.maxBinSegs = m_maxBinSegs;
p.maxTileSegs = m_maxTileSegs;
p.triSubtris = m_triSubtris.getPtr();
p.triHeader = m_triHeader.getPtr();
p.triData = m_triData.getPtr();
p.binSegData = m_binSegData.getPtr();
p.binSegNext = m_binSegNext.getPtr();
p.binSegCount = m_binSegCount.getPtr();
p.binFirstSeg = m_binFirstSeg.getPtr();
p.binTotal = m_binTotal.getPtr();
p.tileSegData = m_tileSegData.getPtr();
p.tileSegNext = m_tileSegNext.getPtr();
p.tileSegCount = m_tileSegCount.getPtr();
p.activeTiles = m_activeTiles.getPtr();
p.tileFirstSeg = m_tileFirstSeg.getPtr();
p.colorBuffer = m_colorBuffer.getPtr();
p.depthBuffer = m_depthBuffer.getPtr();
p.peelBuffer = (m_renderModeFlags & CudaRaster::RenderModeFlag_EnableDepthPeeling) ? m_peelBuffer.getPtr() : 0;
memcpy(&p.imageParamsFirst, imageParams, min(m_numImages, CR_EMBED_IMAGE_PARAMS) * sizeof(CRImageParams));
p.imageParamsExtra = (CRImageParams*)m_crImageParamsExtra.getPtr();
}
// Setup block sizes.
dim3 brBlock(32, CR_BIN_WARPS);
dim3 crBlock(32, CR_COARSE_WARPS);
dim3 frBlock(32, m_numFineWarpsPerBlock);
// Launch stages.
void* args[] = {&p};
if (instanceMode)
{
int setupBlocks = (m_numTriangles - 1) / (32 * CR_SETUP_WARPS) + 1;
NVDR_CHECK_CUDA_ERROR(cudaLaunchKernel((void*)triangleSetupKernel, dim3(setupBlocks, 1, m_numImages), dim3(32, CR_SETUP_WARPS), args, 0, stream));
}
else
{
for (int i=0; i < m_numImages; i++)
p.totalCount += imageParams[i].triCount;
int setupBlocks = (p.totalCount - 1) / (32 * CR_SETUP_WARPS) + 1;
NVDR_CHECK_CUDA_ERROR(cudaLaunchKernel((void*)triangleSetupKernel, dim3(setupBlocks, 1, 1), dim3(32, CR_SETUP_WARPS), args, 0, stream));
}
NVDR_CHECK_CUDA_ERROR(cudaLaunchKernel((void*)binRasterKernel, dim3(CR_BIN_STREAMS_SIZE, 1, m_numImages), brBlock, args, 0, stream));
NVDR_CHECK_CUDA_ERROR(cudaLaunchKernel((void*)coarseRasterKernel, dim3(m_numSMs * m_numCoarseBlocksPerSM, 1, m_numImages), crBlock, args, 0, stream));
NVDR_CHECK_CUDA_ERROR(cudaLaunchKernel((void*)fineRasterKernel, dim3(m_numSMs * m_numFineBlocksPerSM, 1, m_numImages), frBlock, args, 0, stream));
}
//------------------------------------------------------------------------
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#include "../CudaRaster.hpp"
#include "PrivateDefs.hpp"
#include "Constants.hpp"
#include "Util.inl"
namespace CR
{
//------------------------------------------------------------------------
// Stage implementations.
//------------------------------------------------------------------------
#include "TriangleSetup.inl"
#include "BinRaster.inl"
#include "CoarseRaster.inl"
#include "FineRaster.inl"
}
//------------------------------------------------------------------------
// Stage entry points.
//------------------------------------------------------------------------
__global__ void __launch_bounds__(CR_SETUP_WARPS * 32, CR_SETUP_OPT_BLOCKS) triangleSetupKernel (const CR::CRParams p) { CR::triangleSetupImpl(p); }
__global__ void __launch_bounds__(CR_BIN_WARPS * 32, 1) binRasterKernel (const CR::CRParams p) { CR::binRasterImpl(p); }
__global__ void __launch_bounds__(CR_COARSE_WARPS * 32, 1) coarseRasterKernel (const CR::CRParams p) { CR::coarseRasterImpl(p); }
__global__ void __launch_bounds__(CR_FINE_MAX_WARPS * 32, 1) fineRasterKernel (const CR::CRParams p) { CR::fineRasterImpl(p); }
//------------------------------------------------------------------------
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#pragma once
#include "PrivateDefs.hpp"
#include "Buffer.hpp"
#include "../CudaRaster.hpp"
namespace CR
{
//------------------------------------------------------------------------
class RasterImpl
{
public:
RasterImpl (void);
~RasterImpl (void);
void setViewportSize (Vec3i size); // Must be multiple of tile size (8x8).
void setRenderModeFlags (U32 flags) { m_renderModeFlags = flags; }
void deferredClear (U32 color) { m_deferredClear = true; m_clearColor = color; }
void setVertexBuffer (void* ptr, int numVertices) { m_vertexPtr = ptr; m_numVertices = numVertices; } // GPU pointer.
void setIndexBuffer (void* ptr, int numTriangles) { m_indexPtr = ptr; m_numTriangles = numTriangles; } // GPU pointer.
bool drawTriangles (const Vec2i* ranges, cudaStream_t stream);
void* getColorBuffer (void) { return m_colorBuffer.getPtr(); } // GPU pointer.
void* getDepthBuffer (void) { return m_depthBuffer.getPtr(); } // GPU pointer.
void swapDepthAndPeel (void);
size_t getTotalBufferSizes (void) const;
private:
void launchStages (const CRImageParams* imageParams, bool instanceMode, cudaStream_t stream);
// State.
unsigned int m_renderModeFlags;
bool m_deferredClear;
unsigned int m_clearColor;
void* m_vertexPtr;
void* m_indexPtr;
int m_numVertices; // Input buffer size.
int m_numTriangles; // Input buffer size.
size_t m_bufferSizesReported; // Previously reported buffer sizes.
// Surfaces.
Buffer m_colorBuffer;
Buffer m_depthBuffer;
Buffer m_peelBuffer;
int m_numImages;
Vec2i m_sizePixels;
Vec2i m_sizeBins;
S32 m_numBins;
Vec2i m_sizeTiles;
S32 m_numTiles;
// Launch sizes etc.
S32 m_numSMs;
S32 m_numCoarseBlocksPerSM;
S32 m_numFineBlocksPerSM;
S32 m_numFineWarpsPerBlock;
// Global intermediate buffers. Individual images have offsets to these.
Buffer m_crAtomics;
Buffer m_crImageParamsExtra;
Buffer m_triSubtris;
Buffer m_triHeader;
Buffer m_triData;
Buffer m_binFirstSeg;
Buffer m_binTotal;
Buffer m_binSegData;
Buffer m_binSegNext;
Buffer m_binSegCount;
Buffer m_activeTiles;
Buffer m_tileFirstSeg;
Buffer m_tileSegData;
Buffer m_tileSegNext;
Buffer m_tileSegCount;
// Actual buffer sizes.
S32 m_maxSubtris;
S32 m_maxBinSegs;
S32 m_maxTileSegs;
};
//------------------------------------------------------------------------
} // namespace CR
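//------------------------------------------------------------------------
// A minimal usage sketch of this class (illustrative only; it assumes d_positions
// holds clip-space float4 vertices, d_triangles holds int3 indices, and stream is
// a valid cudaStream_t):
//
//   CR::RasterImpl raster;
//   raster.setViewportSize(CR::Vec3i(width, height, numImages)); // Width/height must be multiples of 8.
//   raster.setVertexBuffer(d_positions, numVertices);            // GPU pointer.
//   raster.setIndexBuffer(d_triangles, numTriangles);            // GPU pointer.
//   raster.deferredClear(0u);
//   bool ok = raster.drawTriangles(NULL, stream);                // NULL ranges = instance mode.
//   void* d_color = raster.getColorBuffer();                     // U32 per pixel.
//   void* d_depth = raster.getDepthBuffer();                     // U32 per pixel.
//------------------------------------------------------------------------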
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
//------------------------------------------------------------------------
__device__ __inline__ void snapTriangle(
const CRParams& p,
float4 v0, float4 v1, float4 v2,
int2& p0, int2& p1, int2& p2, float3& rcpW, int2& lo, int2& hi)
{
F32 viewScaleX = (F32)(p.widthPixels << (CR_SUBPIXEL_LOG2 - 1));
F32 viewScaleY = (F32)(p.heightPixels << (CR_SUBPIXEL_LOG2 - 1));
rcpW = make_float3(1.0f / v0.w, 1.0f / v1.w, 1.0f / v2.w);
p0 = make_int2(f32_to_s32_sat(v0.x * rcpW.x * viewScaleX), f32_to_s32_sat(v0.y * rcpW.x * viewScaleY));
p1 = make_int2(f32_to_s32_sat(v1.x * rcpW.y * viewScaleX), f32_to_s32_sat(v1.y * rcpW.y * viewScaleY));
p2 = make_int2(f32_to_s32_sat(v2.x * rcpW.z * viewScaleX), f32_to_s32_sat(v2.y * rcpW.z * viewScaleY));
lo = make_int2(min_min(p0.x, p1.x, p2.x), min_min(p0.y, p1.y, p2.y));
hi = make_int2(max_max(p0.x, p1.x, p2.x), max_max(p0.y, p1.y, p2.y));
}
//------------------------------------------------------------------------
__device__ __inline__ U32 cover8x8_selectFlips(S32 dx, S32 dy) // 10 instr
{
U32 flips = 0;
if (dy > 0 || (dy == 0 && dx <= 0))
flips ^= (1 << CR_FLIPBIT_FLIP_X) ^ (1 << CR_FLIPBIT_FLIP_Y) ^ (1 << CR_FLIPBIT_COMPL);
if (dx > 0)
flips ^= (1 << CR_FLIPBIT_FLIP_X) ^ (1 << CR_FLIPBIT_FLIP_Y);
if (::abs(dx) < ::abs(dy))
flips ^= (1 << CR_FLIPBIT_SWAP_XY) ^ (1 << CR_FLIPBIT_FLIP_Y);
return flips;
}
//------------------------------------------------------------------------
__device__ __inline__ bool prepareTriangle(
const CRParams& p,
int2 p0, int2 p1, int2 p2, int2 lo, int2 hi,
int2& d1, int2& d2, S32& area)
{
// Backfacing or degenerate => cull.
d1 = make_int2(p1.x - p0.x, p1.y - p0.y);
d2 = make_int2(p2.x - p0.x, p2.y - p0.y);
area = d1.x * d2.y - d1.y * d2.x;
if (area == 0)
return false; // Degenerate.
if (area < 0 && (p.renderModeFlags & CudaRaster::RenderModeFlag_EnableBackfaceCulling) != 0)
return false; // Backfacing.
// AABB falls between samples => cull.
int sampleSize = 1 << CR_SUBPIXEL_LOG2;
int biasX = (p.widthPixels << (CR_SUBPIXEL_LOG2 - 1)) - (sampleSize >> 1);
int biasY = (p.heightPixels << (CR_SUBPIXEL_LOG2 - 1)) - (sampleSize >> 1);
int lox = (int)add_add(lo.x, sampleSize - 1, biasX) & -sampleSize;
int loy = (int)add_add(lo.y, sampleSize - 1, biasY) & -sampleSize;
int hix = (hi.x + biasX) & -sampleSize;
int hiy = (hi.y + biasY) & -sampleSize;
if (lox > hix || loy > hiy)
return false; // Between pixels.
// AABB covers 1 or 2 samples => cull if they are not covered.
int diff = add_sub(hix, hiy, lox) - loy;
if (diff <= sampleSize)
{
int2 t0 = make_int2(add_sub(p0.x, biasX, lox), add_sub(p0.y, biasY, loy));
int2 t1 = make_int2(add_sub(p1.x, biasX, lox), add_sub(p1.y, biasY, loy));
int2 t2 = make_int2(add_sub(p2.x, biasX, lox), add_sub(p2.y, biasY, loy));
S32 e0 = t0.x * t1.y - t0.y * t1.x;
S32 e1 = t1.x * t2.y - t1.y * t2.x;
S32 e2 = t2.x * t0.y - t2.y * t0.x;
if (e0 < 0 || e1 < 0 || e2 < 0)
{
if (diff == 0)
return false; // Between pixels.
t0 = make_int2(add_sub(p0.x, biasX, hix), add_sub(p0.y, biasY, hiy));
t1 = make_int2(add_sub(p1.x, biasX, hix), add_sub(p1.y, biasY, hiy));
t2 = make_int2(add_sub(p2.x, biasX, hix), add_sub(p2.y, biasY, hiy));
e0 = t0.x * t1.y - t0.y * t1.x;
e1 = t1.x * t2.y - t1.y * t2.x;
e2 = t2.x * t0.y - t2.y * t0.x;
if (e0 < 0 || e1 < 0 || e2 < 0)
return false; // Between pixels.
}
}
// Otherwise => proceed to output the triangle.
return true; // Visible.
}
//------------------------------------------------------------------------
__device__ __inline__ void setupTriangle(
const CRParams& p,
CRTriangleHeader* th, CRTriangleData* td, int triId,
float v0z, float v1z, float v2z,
int2 p0, int2 p1, int2 p2, float3 rcpW,
int2 d1, int2 d2, S32 area)
{
// Swap vertices 1 and 2 if area is negative. Only executed if backface culling is
// disabled (if it is enabled, we never come here with area < 0).
if (area < 0)
{
swap(d1, d2);
swap(p1, p2);
swap(v1z, v2z);
swap(rcpW.y, rcpW.z);
area = -area;
}
int2 wv0;
wv0.x = p0.x + (p.widthPixels << (CR_SUBPIXEL_LOG2 - 1));
wv0.y = p0.y + (p.heightPixels << (CR_SUBPIXEL_LOG2 - 1));
// Setup depth plane equation.
F32 zcoef = (F32)(CR_DEPTH_MAX - CR_DEPTH_MIN) * 0.5f;
F32 zbias = (F32)(CR_DEPTH_MAX + CR_DEPTH_MIN) * 0.5f;
float3 zvert = make_float3(
(v0z * zcoef) * rcpW.x + zbias,
(v1z * zcoef) * rcpW.y + zbias,
(v2z * zcoef) * rcpW.z + zbias
);
int2 zv0 = make_int2(
wv0.x - (1 << (CR_SUBPIXEL_LOG2 - 1)),
wv0.y - (1 << (CR_SUBPIXEL_LOG2 - 1))
);
uint3 zpleq = setupPleq(zvert, zv0, d1, d2, 1.0f / (F32)area);
U32 zmin = f32_to_u32_sat(fminf(fminf(zvert.x, zvert.y), zvert.z) - (F32)CR_LERP_ERROR(0));
// Write CRTriangleData.
*(uint4*)td = make_uint4(zpleq.x, zpleq.y, zpleq.z, triId);
// Determine flipbits.
U32 f01 = cover8x8_selectFlips(d1.x, d1.y);
U32 f12 = cover8x8_selectFlips(d2.x - d1.x, d2.y - d1.y);
U32 f20 = cover8x8_selectFlips(-d2.x, -d2.y);
// Write CRTriangleHeader.
*(uint4*)th = make_uint4(
prmt(p0.x, p0.y, 0x5410),
prmt(p1.x, p1.y, 0x5410),
prmt(p2.x, p2.y, 0x5410),
(zmin & 0xfffff000u) | (f01 << 6) | (f12 << 2) | (f20 >> 2));
}
//------------------------------------------------------------------------
__device__ __inline__ void triangleSetupImpl(const CRParams p)
{
__shared__ F32 s_bary[CR_SETUP_WARPS * 32][18];
F32* bary = s_bary[threadIdx.x + threadIdx.y * 32];
// Compute task and image indices.
int taskIdx = threadIdx.x + 32 * (threadIdx.y + CR_SETUP_WARPS * blockIdx.x);
int imageIdx = 0;
if (p.instanceMode)
{
imageIdx = blockIdx.z;
if (taskIdx >= p.numTriangles)
return;
}
else
{
while (imageIdx < p.numImages)
{
int count = getImageParams(p, imageIdx).triCount;
if (taskIdx < count)
break;
taskIdx -= count;
imageIdx += 1;
}
if (imageIdx == p.numImages)
return;
}
// Per-image data structures.
const CRImageParams& ip = getImageParams(p, imageIdx);
CRAtomics& atomics = p.atomics[imageIdx];
const int* indexBuffer = (const int*)p.indexBuffer;
U8* triSubtris = (U8*)p.triSubtris + imageIdx * p.maxSubtris;
CRTriangleHeader* triHeader = (CRTriangleHeader*)p.triHeader + imageIdx * p.maxSubtris;
CRTriangleData* triData = (CRTriangleData*)p.triData + imageIdx * p.maxSubtris;
// Determine triangle index.
int triIdx = taskIdx;
if (!p.instanceMode)
triIdx += ip.triOffset;
// Read vertex indices.
if ((U32)triIdx >= (U32)p.numTriangles)
{
// Bad triangle index.
triSubtris[taskIdx] = 0;
return;
}
uint4 vidx;
vidx.x = indexBuffer[triIdx * 3 + 0];
vidx.y = indexBuffer[triIdx * 3 + 1];
vidx.z = indexBuffer[triIdx * 3 + 2];
vidx.w = triIdx + 1; // Triangle index.
if (vidx.x >= (U32)p.numVertices ||
vidx.y >= (U32)p.numVertices ||
vidx.z >= (U32)p.numVertices)
{
// Bad vertex index.
triSubtris[taskIdx] = 0;
return;
}
// Read vertex positions.
const float4* vertexBuffer = (const float4*)p.vertexBuffer;
if (p.instanceMode)
vertexBuffer += p.numVertices * imageIdx; // Instance offset.
float4 v0 = vertexBuffer[vidx.x];
float4 v1 = vertexBuffer[vidx.y];
float4 v2 = vertexBuffer[vidx.z];
// Outside view frustum => cull.
if (v0.w < fabsf(v0.x) | v0.w < fabsf(v0.y) | v0.w < fabsf(v0.z))
{
if ((v0.w < +v0.x & v1.w < +v1.x & v2.w < +v2.x) |
(v0.w < -v0.x & v1.w < -v1.x & v2.w < -v2.x) |
(v0.w < +v0.y & v1.w < +v1.y & v2.w < +v2.y) |
(v0.w < -v0.y & v1.w < -v1.y & v2.w < -v2.y) |
(v0.w < +v0.z & v1.w < +v1.z & v2.w < +v2.z) |
(v0.w < -v0.z & v1.w < -v1.z & v2.w < -v2.z))
{
triSubtris[taskIdx] = 0;
return;
}
}
// Inside depth range => try to snap vertices.
if (v0.w >= fabsf(v0.z) & v1.w >= fabsf(v1.z) & v2.w >= fabsf(v2.z))
{
// Inside S16 range and small enough => fast path.
// Note: aabbLimit comes from the fact that cover8x8
// does not support guardband with maximal viewport.
int2 p0, p1, p2, lo, hi;
float3 rcpW;
snapTriangle(p, v0, v1, v2, p0, p1, p2, rcpW, lo, hi);
S32 loxy = ::min(lo.x, lo.y);
S32 hixy = ::max(hi.x, hi.y);
S32 aabbLimit = (1 << (CR_MAXVIEWPORT_LOG2 + CR_SUBPIXEL_LOG2)) - 1;
if (loxy >= -32768 && hixy <= 32767 && hixy - loxy <= aabbLimit)
{
int2 d1, d2;
S32 area;
bool res = prepareTriangle(p, p0, p1, p2, lo, hi, d1, d2, area);
triSubtris[taskIdx] = res ? 1 : 0;
if (res)
setupTriangle(
p,
&triHeader[taskIdx], &triData[taskIdx], vidx.w,
v0.z, v1.z, v2.z,
p0, p1, p2, rcpW,
d1, d2, area);
return;
}
}
// Clip to view frustum.
float4 ov0 = v0;
float4 od1 = make_float4(v1.x - v0.x, v1.y - v0.y, v1.z - v0.z, v1.w - v0.w);
float4 od2 = make_float4(v2.x - v0.x, v2.y - v0.y, v2.z - v0.z, v2.w - v0.w);
int numVerts = clipTriangleWithFrustum(bary, &ov0.x, &v1.x, &v2.x, &od1.x, &od2.x);
// Count non-culled subtriangles.
v0.x = ov0.x + od1.x * bary[0] + od2.x * bary[1];
v0.y = ov0.y + od1.y * bary[0] + od2.y * bary[1];
v0.z = ov0.z + od1.z * bary[0] + od2.z * bary[1];
v0.w = ov0.w + od1.w * bary[0] + od2.w * bary[1];
v1.x = ov0.x + od1.x * bary[2] + od2.x * bary[3];
v1.y = ov0.y + od1.y * bary[2] + od2.y * bary[3];
v1.z = ov0.z + od1.z * bary[2] + od2.z * bary[3];
v1.w = ov0.w + od1.w * bary[2] + od2.w * bary[3];
float4 tv1 = v1;
int numSubtris = 0;
for (int i = 2; i < numVerts; i++)
{
v2.x = ov0.x + od1.x * bary[i * 2 + 0] + od2.x * bary[i * 2 + 1];
v2.y = ov0.y + od1.y * bary[i * 2 + 0] + od2.y * bary[i * 2 + 1];
v2.z = ov0.z + od1.z * bary[i * 2 + 0] + od2.z * bary[i * 2 + 1];
v2.w = ov0.w + od1.w * bary[i * 2 + 0] + od2.w * bary[i * 2 + 1];
int2 p0, p1, p2, lo, hi, d1, d2;
float3 rcpW;
S32 area;
snapTriangle(p, v0, v1, v2, p0, p1, p2, rcpW, lo, hi);
if (prepareTriangle(p, p0, p1, p2, lo, hi, d1, d2, area))
numSubtris++;
v1 = v2;
}
triSubtris[taskIdx] = numSubtris;
// Multiple subtriangles => allocate.
int subtriBase = taskIdx;
if (numSubtris > 1)
{
subtriBase = atomicAdd(&atomics.numSubtris, numSubtris);
triHeader[taskIdx].misc = subtriBase;
if (subtriBase + numSubtris > p.maxSubtris)
numVerts = 0;
}
// Setup subtriangles.
v1 = tv1;
for (int i = 2; i < numVerts; i++)
{
v2.x = ov0.x + od1.x * bary[i * 2 + 0] + od2.x * bary[i * 2 + 1];
v2.y = ov0.y + od1.y * bary[i * 2 + 0] + od2.y * bary[i * 2 + 1];
v2.z = ov0.z + od1.z * bary[i * 2 + 0] + od2.z * bary[i * 2 + 1];
v2.w = ov0.w + od1.w * bary[i * 2 + 0] + od2.w * bary[i * 2 + 1];
int2 p0, p1, p2, lo, hi, d1, d2;
float3 rcpW;
S32 area;
snapTriangle(p, v0, v1, v2, p0, p1, p2, rcpW, lo, hi);
if (prepareTriangle(p, p0, p1, p2, lo, hi, d1, d2, area))
{
setupTriangle(
p,
&triHeader[subtriBase], &triData[subtriBase], vidx.w,
v0.z, v1.z, v2.z,
p0, p1, p2, rcpW,
d1, d2, area);
subtriBase++;
}
v1 = v2;
}
}
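// Subtriangle bookkeeping above: a clipped triangle that yields a single visible
// piece reuses its own slot, while one that yields several pieces atomically
// reserves a contiguous block at the end of the subtriangle arrays and stores the
// base index in triHeader[taskIdx].misc. If that reservation would exceed
// p.maxSubtris the pieces are dropped for now; the oversized counter is what
// later triggers the enlarge-and-retry path in RasterImpl::drawTriangles().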
//------------------------------------------------------------------------
// Copyright (c) 2009-2022, NVIDIA CORPORATION. All rights reserved.
//
// NVIDIA CORPORATION and its licensors retain all intellectual property
// and proprietary rights in and to this software, related documentation
// and any modifications thereto. Any use, reproduction, disclosure or
// distribution of this software and related documentation without an express
// license agreement from NVIDIA CORPORATION is strictly prohibited.
#include "PrivateDefs.hpp"
namespace CR
{
//------------------------------------------------------------------------
template<class T> __device__ __inline__ void swap(T& a, T& b) { T t = a; a = b; b = t; }
__device__ __inline__ U32 getLo (U64 a) { return __double2loint(__longlong_as_double(a)); }
__device__ __inline__ S32 getLo (S64 a) { return __double2loint(__longlong_as_double(a)); }
__device__ __inline__ U32 getHi (U64 a) { return __double2hiint(__longlong_as_double(a)); }
__device__ __inline__ S32 getHi (S64 a) { return __double2hiint(__longlong_as_double(a)); }
__device__ __inline__ U64 combineLoHi (U32 lo, U32 hi) { return __double_as_longlong(__hiloint2double(hi, lo)); }
__device__ __inline__ S64 combineLoHi (S32 lo, S32 hi) { return __double_as_longlong(__hiloint2double(hi, lo)); }
__device__ __inline__ U32 getLaneMaskLt (void) { U32 r; asm("mov.u32 %0, %lanemask_lt;" : "=r"(r)); return r; }
__device__ __inline__ U32 getLaneMaskLe (void) { U32 r; asm("mov.u32 %0, %lanemask_le;" : "=r"(r)); return r; }
__device__ __inline__ U32 getLaneMaskGt (void) { U32 r; asm("mov.u32 %0, %lanemask_gt;" : "=r"(r)); return r; }
__device__ __inline__ U32 getLaneMaskGe (void) { U32 r; asm("mov.u32 %0, %lanemask_ge;" : "=r"(r)); return r; }
__device__ __inline__ int findLeadingOne (U32 v) { U32 r; asm("bfind.u32 %0, %1;" : "=r"(r) : "r"(v)); return r; }
__device__ __inline__ bool singleLane (void) { return ((::__ballot_sync(~0u, true) & getLaneMaskLt()) == 0); }
__device__ __inline__ void add_add_carry (U32& rlo, U32 alo, U32 blo, U32& rhi, U32 ahi, U32 bhi) { U64 r = combineLoHi(alo, ahi) + combineLoHi(blo, bhi); rlo = getLo(r); rhi = getHi(r); }
__device__ __inline__ S32 f32_to_s32_sat (F32 a) { S32 v; asm("cvt.rni.sat.s32.f32 %0, %1;" : "=r"(v) : "f"(a)); return v; }
__device__ __inline__ U32 f32_to_u32_sat (F32 a) { U32 v; asm("cvt.rni.sat.u32.f32 %0, %1;" : "=r"(v) : "f"(a)); return v; }
__device__ __inline__ U32 f32_to_u32_sat_rmi (F32 a) { U32 v; asm("cvt.rmi.sat.u32.f32 %0, %1;" : "=r"(v) : "f"(a)); return v; }
__device__ __inline__ U32 f32_to_u8_sat (F32 a) { U32 v; asm("cvt.rni.sat.u8.f32 %0, %1;" : "=r"(v) : "f"(a)); return v; }
__device__ __inline__ S64 f32_to_s64 (F32 a) { S64 v; asm("cvt.rni.s64.f32 %0, %1;" : "=l"(v) : "f"(a)); return v; }
__device__ __inline__ S32 add_s16lo_s16lo (S32 a, S32 b) { S32 v; asm("vadd.s32.s32.s32 %0, %1.h0, %2.h0;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 add_s16hi_s16lo (S32 a, S32 b) { S32 v; asm("vadd.s32.s32.s32 %0, %1.h1, %2.h0;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 add_s16lo_s16hi (S32 a, S32 b) { S32 v; asm("vadd.s32.s32.s32 %0, %1.h0, %2.h1;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 add_s16hi_s16hi (S32 a, S32 b) { S32 v; asm("vadd.s32.s32.s32 %0, %1.h1, %2.h1;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 sub_s16lo_s16lo (S32 a, S32 b) { S32 v; asm("vsub.s32.s32.s32 %0, %1.h0, %2.h0;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 sub_s16hi_s16lo (S32 a, S32 b) { S32 v; asm("vsub.s32.s32.s32 %0, %1.h1, %2.h0;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 sub_s16lo_s16hi (S32 a, S32 b) { S32 v; asm("vsub.s32.s32.s32 %0, %1.h0, %2.h1;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 sub_s16hi_s16hi (S32 a, S32 b) { S32 v; asm("vsub.s32.s32.s32 %0, %1.h1, %2.h1;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 sub_u16lo_u16lo (U32 a, U32 b) { S32 v; asm("vsub.s32.u32.u32 %0, %1.h0, %2.h0;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 sub_u16hi_u16lo (U32 a, U32 b) { S32 v; asm("vsub.s32.u32.u32 %0, %1.h1, %2.h0;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 sub_u16lo_u16hi (U32 a, U32 b) { S32 v; asm("vsub.s32.u32.u32 %0, %1.h0, %2.h1;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ S32 sub_u16hi_u16hi (U32 a, U32 b) { S32 v; asm("vsub.s32.u32.u32 %0, %1.h1, %2.h1;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ U32 add_b0 (U32 a, U32 b) { U32 v; asm("vadd.u32.u32.u32 %0, %1.b0, %2;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ U32 add_b1 (U32 a, U32 b) { U32 v; asm("vadd.u32.u32.u32 %0, %1.b1, %2;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ U32 add_b2 (U32 a, U32 b) { U32 v; asm("vadd.u32.u32.u32 %0, %1.b2, %2;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ U32 add_b3 (U32 a, U32 b) { U32 v; asm("vadd.u32.u32.u32 %0, %1.b3, %2;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ U32 vmad_b0 (U32 a, U32 b, U32 c) { U32 v; asm("vmad.u32.u32.u32 %0, %1.b0, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 vmad_b1 (U32 a, U32 b, U32 c) { U32 v; asm("vmad.u32.u32.u32 %0, %1.b1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 vmad_b2 (U32 a, U32 b, U32 c) { U32 v; asm("vmad.u32.u32.u32 %0, %1.b2, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 vmad_b3 (U32 a, U32 b, U32 c) { U32 v; asm("vmad.u32.u32.u32 %0, %1.b3, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 vmad_b0_b3 (U32 a, U32 b, U32 c) { U32 v; asm("vmad.u32.u32.u32 %0, %1.b0, %2.b3, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 vmad_b1_b3 (U32 a, U32 b, U32 c) { U32 v; asm("vmad.u32.u32.u32 %0, %1.b1, %2.b3, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 vmad_b2_b3 (U32 a, U32 b, U32 c) { U32 v; asm("vmad.u32.u32.u32 %0, %1.b2, %2.b3, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 vmad_b3_b3 (U32 a, U32 b, U32 c) { U32 v; asm("vmad.u32.u32.u32 %0, %1.b3, %2.b3, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 add_mask8 (U32 a, U32 b) { U32 v; U32 z=0; asm("vadd.u32.u32.u32 %0.b0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(z)); return v; }
__device__ __inline__ U32 sub_mask8 (U32 a, U32 b) { U32 v; U32 z=0; asm("vsub.u32.u32.u32 %0.b0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(z)); return v; }
__device__ __inline__ S32 max_max (S32 a, S32 b, S32 c) { S32 v; asm("vmax.s32.s32.s32.max %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ S32 min_min (S32 a, S32 b, S32 c) { S32 v; asm("vmin.s32.s32.s32.min %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ S32 max_add (S32 a, S32 b, S32 c) { S32 v; asm("vmax.s32.s32.s32.add %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ S32 min_add (S32 a, S32 b, S32 c) { S32 v; asm("vmin.s32.s32.s32.add %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 add_add (U32 a, U32 b, U32 c) { U32 v; asm("vadd.u32.u32.u32.add %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 sub_add (U32 a, U32 b, U32 c) { U32 v; asm("vsub.u32.u32.u32.add %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 add_sub (U32 a, U32 b, U32 c) { U32 v; asm("vsub.u32.u32.u32.add %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(c), "r"(b)); return v; }
__device__ __inline__ S32 add_clamp_0_x (S32 a, S32 b, S32 c) { S32 v; asm("vadd.u32.s32.s32.sat.min %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ S32 add_clamp_b0 (S32 a, S32 b, S32 c) { S32 v; asm("vadd.u32.s32.s32.sat %0.b0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ S32 add_clamp_b2 (S32 a, S32 b, S32 c) { S32 v; asm("vadd.u32.s32.s32.sat %0.b2, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ U32 prmt (U32 a, U32 b, U32 c) { U32 v; asm("prmt.b32 %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ S32 u32lo_sext (U32 a) { U32 v; asm("cvt.s16.u32 %0, %1;" : "=r"(v) : "r"(a)); return v; }
__device__ __inline__ U32 slct (U32 a, U32 b, S32 c) { U32 v; asm("slct.u32.s32 %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ S32 slct (S32 a, S32 b, S32 c) { S32 v; asm("slct.s32.s32 %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__device__ __inline__ F32 slct (F32 a, F32 b, S32 c) { F32 v; asm("slct.f32.s32 %0, %1, %2, %3;" : "=f"(v) : "f"(a), "f"(b), "r"(c)); return v; }
__device__ __inline__ U32 isetge (S32 a, S32 b) { U32 v; asm("set.ge.u32.s32 %0, %1, %2;" : "=r"(v) : "r"(a), "r"(b)); return v; }
__device__ __inline__ F64 rcp_approx (F64 a) { F64 v; asm("rcp.approx.ftz.f64 %0, %1;" : "=d"(v) : "d"(a)); return v; }
__device__ __inline__ F32 fma_rm (F32 a, F32 b, F32 c) { F32 v; asm("fma.rm.f32 %0, %1, %2, %3;" : "=f"(v) : "f"(a), "f"(b), "f"(c)); return v; }
__device__ __inline__ U32 idiv_fast (U32 a, U32 b);
__device__ __inline__ uint3 setupPleq (float3 values, int2 v0, int2 d1, int2 d2, F32 areaRcp);
__device__ __inline__ void cover8x8_setupLUT (volatile U64* lut);
__device__ __inline__ U64 cover8x8_exact_fast (S32 ox, S32 oy, S32 dx, S32 dy, U32 flips, volatile const U64* lut); // Assumes viewport <= 2^11, subpixels <= 2^4, no guardband.
__device__ __inline__ U64 cover8x8_lookupMask (S64 yinit, U32 yinc, U32 flips, volatile const U64* lut);
__device__ __inline__ U64 cover8x8_exact_noLUT (S32 ox, S32 oy, S32 dx, S32 dy); // optimized reference implementation, does not require look-up table
__device__ __inline__ U64 cover8x8_conservative_noLUT (S32 ox, S32 oy, S32 dx, S32 dy);
__device__ __inline__ U64 cover8x8_generateMask_noLUT (S32 curr, S32 dx, S32 dy);
template <class T> __device__ __inline__ void sortShared(T* ptr, int numItems); // Assumes that numItems <= threadsInBlock. Must sync before & after the call.
__device__ __inline__ const CRImageParams& getImageParams(const CRParams& p, int idx)
{
return (idx < CR_EMBED_IMAGE_PARAMS) ? p.imageParamsFirst[idx] : p.imageParamsExtra[idx - CR_EMBED_IMAGE_PARAMS];
}
//------------------------------------------------------------------------
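// The clipper below works in barycentric coordinates: each vertex of the clipped
// polygon is stored as a (b1, b2) pair whose clip-space position is
// v0 + b1 * d1 + b2 * d2, with d1 = v1 - v0 and d2 = v2 - v0. A frustum plane is
// therefore passed in as three scalars (value at v0, value along d1, value along
// d2), and a vertex is inside the plane when v0 + v1 * b1 + v2 * b2 >= 0. Each
// call performs one Sutherland-Hodgman style pass against a single half-space.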
__device__ __inline__ int clipPolygonWithPlane(F32* baryOut, const F32* baryIn, int numIn, F32 v0, F32 v1, F32 v2)
{
int numOut = 0;
if (numIn >= 3)
{
int ai = (numIn - 1) * 2;
F32 av = v0 + v1 * baryIn[ai + 0] + v2 * baryIn[ai + 1];
for (int bi = 0; bi < numIn * 2; bi += 2)
{
F32 bv = v0 + v1 * baryIn[bi + 0] + v2 * baryIn[bi + 1];
if (av * bv < 0.0f)
{
F32 bc = av / (av - bv);
F32 ac = 1.0f - bc;
baryOut[numOut + 0] = baryIn[ai + 0] * ac + baryIn[bi + 0] * bc;
baryOut[numOut + 1] = baryIn[ai + 1] * ac + baryIn[bi + 1] * bc;
numOut += 2;
}
if (bv >= 0.0f)
{
baryOut[numOut + 0] = baryIn[bi + 0];
baryOut[numOut + 1] = baryIn[bi + 1];
numOut += 2;
}
ai = bi;
av = bv;
}
}
return (numOut >> 1);
}
//------------------------------------------------------------------------
__device__ __inline__ int clipTriangleWithFrustum(F32* bary, const F32* v0, const F32* v1, const F32* v2, const F32* d1, const F32* d2)
{
int num = 3;
bary[0] = 0.0f, bary[1] = 0.0f;
bary[2] = 1.0f, bary[3] = 0.0f;
bary[4] = 0.0f, bary[5] = 1.0f;
if ((v0[3] < fabsf(v0[0])) | (v1[3] < fabsf(v1[0])) | (v2[3] < fabsf(v2[0])))
{
F32 temp[18];
num = clipPolygonWithPlane(temp, bary, num, v0[3] + v0[0], d1[3] + d1[0], d2[3] + d2[0]);
num = clipPolygonWithPlane(bary, temp, num, v0[3] - v0[0], d1[3] - d1[0], d2[3] - d2[0]);
}
if ((v0[3] < fabsf(v0[1])) | (v1[3] < fabsf(v1[1])) | (v2[3] < fabsf(v2[1])))
{
F32 temp[18];
num = clipPolygonWithPlane(temp, bary, num, v0[3] + v0[1], d1[3] + d1[1], d2[3] + d2[1]);
num = clipPolygonWithPlane(bary, temp, num, v0[3] - v0[1], d1[3] - d1[1], d2[3] - d2[1]);
}
if ((v0[3] < fabsf(v0[2])) | (v1[3] < fabsf(v1[2])) | (v2[3] < fabsf(v2[2])))
{
F32 temp[18];
num = clipPolygonWithPlane(temp, bary, num, v0[3] + v0[2], d1[3] + d1[2], d2[3] + d2[2]);
num = clipPolygonWithPlane(bary, temp, num, v0[3] - v0[2], d1[3] - d1[2], d2[3] - d2[2]);
}
return num;
}
//------------------------------------------------------------------------
__device__ __inline__ U32 idiv_fast(U32 a, U32 b)
{
return f32_to_u32_sat_rmi(((F32)a + 0.5f) / (F32)b);
}
//------------------------------------------------------------------------
__device__ __inline__ U32 toABGR(float4 color)
{
// 11 instructions: 4*FFMA, 4*F2I, 3*PRMT
U32 x = f32_to_u32_sat_rmi(fma_rm(color.x, (1 << 24) * 255.0f, (1 << 24) * 0.5f));
U32 y = f32_to_u32_sat_rmi(fma_rm(color.y, (1 << 24) * 255.0f, (1 << 24) * 0.5f));
U32 z = f32_to_u32_sat_rmi(fma_rm(color.z, (1 << 24) * 255.0f, (1 << 24) * 0.5f));
U32 w = f32_to_u32_sat_rmi(fma_rm(color.w, (1 << 24) * 255.0f, (1 << 24) * 0.5f));
return prmt(prmt(x, y, 0x0073), prmt(z, w, 0x0073), 0x5410);
}
//------------------------------------------------------------------------
// v0 = subpixels relative to the bottom-left sampling point
__device__ __inline__ uint3 setupPleq(float3 values, int2 v0, int2 d1, int2 d2, F32 areaRcp)
{
F32 mx = fmaxf(fmaxf(values.x, values.y), values.z);
int sh = ::min(::max((__float_as_int(mx) >> 23) - (127 + 22), 0), 8);
S32 t0 = (U32)values.x >> sh;
S32 t1 = ((U32)values.y >> sh) - t0;
S32 t2 = ((U32)values.z >> sh) - t0;
U32 rcpMant = (__float_as_int(areaRcp) & 0x007FFFFF) | 0x00800000;
int rcpShift = (23 + 127) - (__float_as_int(areaRcp) >> 23);
uint3 pleq;
S64 xc = ((S64)t1 * d2.y - (S64)t2 * d1.y) * rcpMant;
S64 yc = ((S64)t2 * d1.x - (S64)t1 * d2.x) * rcpMant;
pleq.x = (U32)(xc >> (rcpShift - (sh + CR_SUBPIXEL_LOG2)));
pleq.y = (U32)(yc >> (rcpShift - (sh + CR_SUBPIXEL_LOG2)));
S32 centerX = (v0.x * 2 + min_min(d1.x, d2.x, 0) + max_max(d1.x, d2.x, 0)) >> (CR_SUBPIXEL_LOG2 + 1);
S32 centerY = (v0.y * 2 + min_min(d1.y, d2.y, 0) + max_max(d1.y, d2.y, 0)) >> (CR_SUBPIXEL_LOG2 + 1);
S32 vcx = v0.x - (centerX << CR_SUBPIXEL_LOG2);
S32 vcy = v0.y - (centerY << CR_SUBPIXEL_LOG2);
pleq.z = t0 << sh;
pleq.z -= (U32)(((xc >> 13) * vcx + (yc >> 13) * vcy) >> (rcpShift - (sh + 13)));
pleq.z -= pleq.x * centerX + pleq.y * centerY;
return pleq;
}
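// The returned coefficients are meant to be evaluated per sample as
//   value(x, y) = pleq.x * x + pleq.y * y + pleq.z
// in unsigned 32-bit arithmetic; because the addition is modular, wraparound in
// the individual terms cancels out as long as the true interpolated value fits in
// 32 bits (which holds for the depth values this is used for).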
//------------------------------------------------------------------------
__device__ __inline__ void cover8x8_setupLUT(volatile U64* lut)
{
for (S32 lutIdx = threadIdx.x + blockDim.x * threadIdx.y; lutIdx < CR_COVER8X8_LUT_SIZE; lutIdx += blockDim.x * blockDim.y)
{
int half = (lutIdx < (12 << 5)) ? 0 : 1;
int yint = (lutIdx >> 5) - half * 12 - 3;
U32 shape = ((lutIdx >> 2) & 7) << (31 - 2);
S32 slctSwapXY = lutIdx << (31 - 1);
S32 slctNegX = lutIdx << (31 - 0);
S32 slctCompl = slctSwapXY ^ slctNegX;
U64 mask = 0;
int xlo = half * 4;
int xhi = xlo + 4;
for (int x = xlo; x < xhi; x++)
{
int ylo = slct(0, ::max(yint, 0), slctCompl);
int yhi = slct(::min(yint, 8), 8, slctCompl);
for (int y = ylo; y < yhi; y++)
{
int xx = slct(x, y, slctSwapXY);
int yy = slct(y, x, slctSwapXY);
xx = slct(xx, 7 - xx, slctNegX);
mask |= (U64)1 << (xx + yy * 8);
}
yint += shape >> 31;
shape <<= 1;
}
lut[lutIdx] = mask;
}
}
//------------------------------------------------------------------------
__device__ __inline__ U64 cover8x8_exact_fast(S32 ox, S32 oy, S32 dx, S32 dy, U32 flips, volatile const U64* lut) // 52 instr
{
F32 yinitBias = (F32)(1 << (31 - CR_MAXVIEWPORT_LOG2 - CR_SUBPIXEL_LOG2 * 2));
F32 yinitScale = (F32)(1 << (32 - CR_SUBPIXEL_LOG2));
F32 yincScale = 65536.0f * 65536.0f;
S32 slctFlipY = flips << (31 - CR_FLIPBIT_FLIP_Y);
S32 slctFlipX = flips << (31 - CR_FLIPBIT_FLIP_X);
S32 slctSwapXY = flips << (31 - CR_FLIPBIT_SWAP_XY);
// Evaluate cross product.
S32 t = ox * dy - oy * dx;
F32 det = (F32)slct(t, t - dy * (7 << CR_SUBPIXEL_LOG2), slctFlipX);
if (flips >= (1 << CR_FLIPBIT_COMPL))
det = -det;
// Represent Y as a function of X.
F32 xrcp = 1.0f / (F32)::abs(slct(dx, dy, slctSwapXY));
F32 yzero = det * yinitScale * xrcp + yinitBias;
S64 yinit = f32_to_s64(slct(yzero, -yzero, slctFlipY));
U32 yinc = f32_to_u32_sat((F32)::abs(slct(dy, dx, slctSwapXY)) * xrcp * yincScale);
// Lookup.
return cover8x8_lookupMask(yinit, yinc, flips, lut);
}
//------------------------------------------------------------------------
__device__ __inline__ U64 cover8x8_lookupMask(S64 yinit, U32 yinc, U32 flips, volatile const U64* lut)
{
// First half.
U32 yfrac = getLo(yinit);
U32 shape = add_clamp_0_x(getHi(yinit) + 4, 0, 11);
add_add_carry(yfrac, yfrac, yinc, shape, shape, shape);
add_add_carry(yfrac, yfrac, yinc, shape, shape, shape);
add_add_carry(yfrac, yfrac, yinc, shape, shape, shape);
int oct = flips & ((1 << CR_FLIPBIT_FLIP_X) | (1 << CR_FLIPBIT_SWAP_XY));
U64 mask = *(U64*)((U8*)lut + oct + (shape << 5));
// Second half.
add_add_carry(yfrac, yfrac, yinc, shape, shape, shape);
shape = add_clamp_0_x(getHi(yinit) + 4, __popc(shape & 15), 11);
add_add_carry(yfrac, yfrac, yinc, shape, shape, shape);
add_add_carry(yfrac, yfrac, yinc, shape, shape, shape);
add_add_carry(yfrac, yfrac, yinc, shape, shape, shape);
mask |= *(U64*)((U8*)lut + oct + (shape << 5) + (12 << 8));
return (flips >= (1 << CR_FLIPBIT_COMPL)) ? ~mask : mask;
}
//------------------------------------------------------------------------
__device__ __inline__ U64 cover8x8_exact_noLUT(S32 ox, S32 oy, S32 dx, S32 dy)
{
S32 curr = ox * dy - oy * dx;
if (dy > 0 || (dy == 0 && dx <= 0)) curr--; // exclusive
return cover8x8_generateMask_noLUT(curr, dx, dy);
}
//------------------------------------------------------------------------
__device__ __inline__ U64 cover8x8_conservative_noLUT(S32 ox, S32 oy, S32 dx, S32 dy)
{
S32 curr = ox * dy - oy * dx;
if (dy > 0 || (dy == 0 && dx <= 0)) curr--; // exclusive
curr += (::abs(dx) + ::abs(dy)) << (CR_SUBPIXEL_LOG2 - 1);
return cover8x8_generateMask_noLUT(curr, dx, dy);
}
//------------------------------------------------------------------------
__device__ __inline__ U64 cover8x8_generateMask_noLUT(S32 curr, S32 dx, S32 dy)
{
curr += (dx - dy) * (7 << CR_SUBPIXEL_LOG2);
S32 stepX = dy << (CR_SUBPIXEL_LOG2 + 1);
S32 stepYorig = -dx - dy * 7;
S32 stepY = stepYorig << (CR_SUBPIXEL_LOG2 + 1);
U32 hi = isetge(curr, 0);
U32 frac = curr + curr;
for (int i = 62; i >= 32; i--)
add_add_carry(frac, frac, ((i & 7) == 7) ? stepY : stepX, hi, hi, hi);
U32 lo = 0;
for (int i = 31; i >= 0; i--)
add_add_carry(frac, frac, ((i & 7) == 7) ? stepY : stepX, lo, lo, lo);
lo ^= lo >> 1, hi ^= hi >> 1;
lo ^= lo >> 2, hi ^= hi >> 2;
lo ^= lo >> 4, hi ^= hi >> 4;
lo ^= lo >> 8, hi ^= hi >> 8;
lo ^= lo >> 16, hi ^= hi >> 16;
if (dy < 0)
{
lo ^= 0x55AA55AA;
hi ^= 0x55AA55AA;
}
if (stepYorig < 0)
{
lo ^= 0xFF00FF00;
hi ^= 0x00FF00FF;
}
if ((hi & 1) != 0)
lo = ~lo;
return combineLoHi(lo, hi);
}
//------------------------------------------------------------------------
template <class T> __device__ __inline__ void sortShared(T* ptr, int numItems)
{
int thrInBlock = threadIdx.x + threadIdx.y * blockDim.x;
int range = 16;
// Use transposition sort within each 16-wide subrange.
int base = thrInBlock * 2;
bool act = (base < numItems - 1);
U32 actMask = __ballot_sync(~0u, act);
if (act)
{
bool tryOdd = (base < numItems - 2 && (~base & (range - 2)) != 0);
T mid = ptr[base + 1];
for (int iter = 0; iter < range; iter += 2)
{
// Evens.
T tmp = ptr[base + 0];
if (tmp > mid)
{
ptr[base + 0] = mid;
mid = tmp;
}
__syncwarp(actMask);
// Odds.
if (tryOdd)
{
tmp = ptr[base + 2];
if (mid > tmp)
{
ptr[base + 2] = mid;
mid = tmp;
}
}
__syncwarp(actMask);
}
ptr[base + 1] = mid;
}
// Multiple subranges => Merge hierarchically.
for (; range < numItems; range <<= 1)
{
// Assuming that we would insert the current item into the other
// subrange, use binary search to find the appropriate slot.
__syncthreads();
T item;
int slot;
if (thrInBlock < numItems)
{
item = ptr[thrInBlock];
slot = (thrInBlock & -range) ^ range;
if (slot < numItems)
{
T tmp = ptr[slot];
bool inclusive = ((thrInBlock & range) != 0);
if (tmp < item || (inclusive && tmp == item))
{
for (int step = (range >> 1); step != 0; step >>= 1)
{
int probe = slot + step;
if (probe < numItems)
{
tmp = ptr[probe];
if (tmp < item || (inclusive && tmp == item))
slot = probe;
}
}
slot++;
}
}
}
// Store the item at an appropriate place.
__syncthreads();
if (thrInBlock < numItems)
ptr[slot + (thrInBlock & (range * 2 - 1)) - range] = item;
}
}
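// sortShared() above is a two-phase shared-memory sort: each 16-element run is
// first ordered with an odd-even transposition sort (two items per thread), and
// the runs are then merged pairwise by letting every thread binary-search the
// rank of its item inside the sibling run and scatter it straight to its final
// slot. Both phases rely on numItems <= threads in the block, as stated in the
// prototype above.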
//------------------------------------------------------------------------
}
......@@ -9,6 +9,106 @@
#include "common.h"
#include "rasterize.h"
//------------------------------------------------------------------------
// Cuda forward rasterizer pixel shader kernel.
__global__ void RasterizeCudaFwdShaderKernel(const RasterizeCudaFwdShaderParams p)
{
// Calculate pixel position.
int px = blockIdx.x * blockDim.x + threadIdx.x;
int py = blockIdx.y * blockDim.y + threadIdx.y;
int pz = blockIdx.z;
if (px >= p.width || py >= p.height || pz >= p.depth)
return;
// Pixel index.
int pidx = px + p.width * (py + p.height * pz);
// Fetch triangle idx.
int triIdx = p.in_idx[pidx] - 1;
if (triIdx < 0 || triIdx >= p.numTriangles)
{
// No or corrupt triangle.
((float4*)p.out)[pidx] = make_float4(0.0, 0.0, 0.0, 0.0); // Clear out.
((float4*)p.out_db)[pidx] = make_float4(0.0, 0.0, 0.0, 0.0); // Clear out_db.
return;
}
// Fetch vertex indices.
int vi0 = p.tri[triIdx * 3 + 0];
int vi1 = p.tri[triIdx * 3 + 1];
int vi2 = p.tri[triIdx * 3 + 2];
// Bail out if vertex indices are corrupt.
if (vi0 < 0 || vi0 >= p.numVertices ||
vi1 < 0 || vi1 >= p.numVertices ||
vi2 < 0 || vi2 >= p.numVertices)
return;
// In instance mode, adjust vertex indices by minibatch index.
if (p.instance_mode)
{
vi0 += pz * p.numVertices;
vi1 += pz * p.numVertices;
vi2 += pz * p.numVertices;
}
// Fetch vertex positions.
float4 p0 = ((float4*)p.pos)[vi0];
float4 p1 = ((float4*)p.pos)[vi1];
float4 p2 = ((float4*)p.pos)[vi2];
// Evaluate edge functions.
float fx = p.xs * (float)px + p.xo;
float fy = p.ys * (float)py + p.yo;
float p0x = p0.x - fx * p0.w;
float p0y = p0.y - fy * p0.w;
float p1x = p1.x - fx * p1.w;
float p1y = p1.y - fy * p1.w;
float p2x = p2.x - fx * p2.w;
float p2y = p2.y - fy * p2.w;
float a0 = p1x*p2y - p1y*p2x;
float a1 = p2x*p0y - p2y*p0x;
float a2 = p0x*p1y - p0y*p1x;
// Perspective correct, normalized barycentrics.
float iw = 1.f / (a0 + a1 + a2);
float b0 = a0 * iw;
float b1 = a1 * iw;
// Compute z/w for depth buffer.
float z = p0.z * a0 + p1.z * a1 + p2.z * a2;
float w = p0.w * a0 + p1.w * a1 + p2.w * a2;
float zw = z / w;
// Clamps to avoid NaNs.
b0 = __saturatef(b0); // Clamp to [+0.0, 1.0].
b1 = __saturatef(b1); // Clamp to [+0.0, 1.0].
zw = fmaxf(fminf(zw, 1.f), -1.f);
// Emit output.
((float4*)p.out)[pidx] = make_float4(b0, b1, zw, (float)(triIdx + 1));
// Calculate bary pixel differentials.
float dfxdx = p.xs * iw;
float dfydy = p.ys * iw;
float da0dx = p2.y*p1.w - p1.y*p2.w;
float da0dy = p1.x*p2.w - p2.x*p1.w;
float da1dx = p0.y*p2.w - p2.y*p0.w;
float da1dy = p2.x*p0.w - p0.x*p2.w;
float da2dx = p1.y*p0.w - p0.y*p1.w;
float da2dy = p0.x*p1.w - p1.x*p0.w;
float datdx = da0dx + da1dx + da2dx;
float datdy = da0dy + da1dy + da2dy;
float dudx = dfxdx * (b0 * datdx - da0dx);
float dudy = dfydy * (b0 * datdy - da0dy);
float dvdx = dfxdx * (b1 * datdx - da1dx);
float dvdy = dfydy * (b1 * datdy - da1dy);
// Emit bary pixel differentials.
((float4*)p.out_db)[pidx] = make_float4(dudx, dudy, dvdx, dvdy);
}
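// Output convention of the kernel above (matching the OpenGL rasterizer path):
//   out    = (u, v, z/w, triangle_id + 1), where (u, v) are the perspective-correct
//            barycentrics b0, b1 and a zero in the last channel marks empty pixels;
//   out_db = (du/dx, du/dy, dv/dx, dv/dy), the analytic image-space derivatives of
//            the barycentrics.
// The derivative expressions follow from the quotient rule on u = a0 / (a0+a1+a2):
//   du/dx = (xs / at) * (d a0/d fx - u * d at/d fx),
// and since the da*dx / da*dy variables hold the negated partials of the edge
// functions, this becomes the (b0 * datdx - da0dx) * dfxdx form used in the code.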
//------------------------------------------------------------------------
// Gradient Cuda kernel.
......@@ -16,7 +116,7 @@ template <bool ENABLE_DB>
static __forceinline__ __device__ void RasterizeGradKernelTemplate(const RasterizeGradParams p)
{
// Temporary space for coalesced atomics.
CA_DECLARE_TEMP(RAST_GRAD_MAX_KERNEL_BLOCK_WIDTH * RAST_GRAD_MAX_KERNEL_BLOCK_HEIGHT);
// Calculate pixel position.
int px = blockIdx.x * blockDim.x + threadIdx.x;
......@@ -64,7 +164,7 @@ static __forceinline__ __device__ void RasterizeGradKernelTemplate(const Rasteri
// Initialize coalesced atomics.
CA_SET_GROUP(triIdx);
// Fetch vertex positions.
float4 p0 = ((float4*)p.pos)[vi0];
float4 p1 = ((float4*)p.pos)[vi1];
......
......@@ -11,9 +11,30 @@
//------------------------------------------------------------------------
// Constants and helpers.
#define RAST_CUDA_FWD_SHADER_KERNEL_BLOCK_WIDTH 8
#define RAST_CUDA_FWD_SHADER_KERNEL_BLOCK_HEIGHT 8
#define RAST_GRAD_MAX_KERNEL_BLOCK_WIDTH 8
#define RAST_GRAD_MAX_KERNEL_BLOCK_HEIGHT 8
//------------------------------------------------------------------------
// CUDA forward rasterizer shader kernel params.
struct RasterizeCudaFwdShaderParams
{
const float* pos; // Vertex positions.
const int* tri; // Triangle indices.
const int* in_idx; // Triangle idx buffer from rasterizer.
float* out; // Main output buffer.
float* out_db; // Bary pixel gradient output buffer.
int numTriangles; // Number of triangles.
int numVertices; // Number of vertices.
int width; // Image width.
int height; // Image height.
int depth; // Size of minibatch.
int instance_mode; // 1 if in instance rendering mode.
float xs, xo, ys, yo; // Pixel position to clip-space x, y transform.
};
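// The xs/xo/ys/yo fields map an integer pixel index to clip-space coordinates at
// the pixel center, i.e. fx = xs * px + xo and fy = ys * py + yo. For a standard
// full-viewport setup this amounts to (a sketch; the struct itself does not
// enforce these values):
//   xs = 2.0f / width;   xo = 1.0f / width  - 1.0f;   // px = 0        -> -1 + 1/width
//   ys = 2.0f / height;  yo = 1.0f / height - 1.0f;   // py = height-1 -> +1 - 1/height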
//------------------------------------------------------------------------
// Gradient CUDA kernel params.
......@@ -35,52 +56,3 @@ struct RasterizeGradParams
};
//------------------------------------------------------------------------
// Do not try to include OpenGL stuff when compiling CUDA kernels for torch.
#if !(defined(NVDR_TORCH) && defined(__CUDACC__))
#include "framework.h"
#include "glutil.h"
//------------------------------------------------------------------------
// OpenGL-related persistent state for forward op.
struct RasterizeGLState // Must be initializable by memset to zero.
{
int width; // Allocated frame buffer width.
int height; // Allocated frame buffer height.
int depth; // Allocated frame buffer depth.
int posCount; // Allocated position buffer in floats.
int triCount; // Allocated triangle buffer in ints.
GLContext glctx;
GLuint glFBO;
GLuint glColorBuffer[2];
GLuint glPrevOutBuffer;
GLuint glDepthStencilBuffer;
GLuint glVAO;
GLuint glTriBuffer;
GLuint glPosBuffer;
GLuint glProgram;
GLuint glProgramDP;
GLuint glVertexShader;
GLuint glGeometryShader;
GLuint glFragmentShader;
GLuint glFragmentShaderDP;
cudaGraphicsResource_t cudaColorBuffer[2];
cudaGraphicsResource_t cudaPrevOutBuffer;
cudaGraphicsResource_t cudaPosBuffer;
cudaGraphicsResource_t cudaTriBuffer;
int enableDB;
int enableZModify; // Modify depth in shader, workaround for a rasterization issue on A100.
};
//------------------------------------------------------------------------
// Shared C++ code prototypes.
void rasterizeInitGLContext(NVDR_CTX_ARGS, RasterizeGLState& s, int cudaDeviceIdx);
bool rasterizeResizeBuffers(NVDR_CTX_ARGS, RasterizeGLState& s, int posCount, int triCount, int width, int height, int depth);
void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, const float* posPtr, int posCount, int vtxPerInstance, const int32_t* triPtr, int triCount, const int32_t* rangesPtr, int width, int height, int depth, int peeling_idx);
void rasterizeCopyResults(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, float** outputPtr, int width, int height, int depth);
void rasterizeReleaseBuffers(NVDR_CTX_ARGS, RasterizeGLState& s);
//------------------------------------------------------------------------
#endif // !(defined(NVDR_TORCH) && defined(__CUDACC__))