Commit 0c39902a authored by Samuli Laine

Bugfixes to #30, #32, #36

parent a4e7a4db
@@ -318,7 +318,9 @@ div.image-parent {
 <li><a href="#geometry-and-minibatches-range-mode-vs-instanced-mode">Geometry and minibatches: Range mode vs Instanced mode</a></li>
 <li><a href="#image-space-derivatives">Image-space derivatives</a></li>
 <li><a href="#mipmaps-and-texture-dimensions">Mipmaps and texture dimensions</a></li>
-<li><a href="#running-on-multiple-gpus">Running on multiple GPUs</a></li>
+<li><a href="#running-on-multiple-gpus">Running on multiple GPUs</a><ul>
+<li><a href="#note-on-torch.nn.dataparallel">Note on torch.nn.DataParallel</a></li>
+</ul></li>
 <li><a href="#rendering-multiple-depth-layers">Rendering multiple depth layers</a></li>
 <li><a href="#differences-between-pytorch-and-tensorflow">Differences between PyTorch and TensorFlow</a><ul>
 <li><a href="#manual-opengl-contexts-in-pytorch">Manual OpenGL contexts in PyTorch</a></li>
@@ -368,6 +370,7 @@ Examples of things we've done with nvdiffrast
 <h3 id="linux">Linux</h3>
 <p>We recommend running nvdiffrast on <a href="https://www.docker.com/">Docker</a>. To build a Docker image with nvdiffrast and PyTorch 1.6 installed, run:</p>
 <div class="sourceCode" id="cb2"><pre class="sourceCode bash"><code class="sourceCode bash"><a class="sourceLine" id="cb2-1" data-line-number="1"><span class="ex">./run_sample.sh</span> --build-container</a></code></pre></div>
+<p>We recommend using Ubuntu, as some Linux distributions might not have all the required packages available — at least CentOS is reportedly problematic.</p>
 <p>To try out some of the provided code examples, run:</p>
 <div class="sourceCode" id="cb3"><pre class="sourceCode bash"><code class="sourceCode bash"><a class="sourceLine" id="cb3-1" data-line-number="1"><span class="ex">./run_sample.sh</span> ./samples/torch/cube.py --resolution 32</a></code></pre></div>
 <p>Alternatively, if you have all the dependencies taken care of (consult the included Dockerfile for reference), you can install nvdiffrast in your local Python site-packages by running</p>
@@ -389,8 +392,8 @@ Examples of things we've done with nvdiffrast
 <p>Nvdiffrast offers four differentiable rendering primitives: <strong>rasterization</strong>, <strong>interpolation</strong>, <strong>texturing</strong>, and <strong>antialiasing</strong>. The operation of the primitives is described here in a platform-agnostic way. Platform-specific documentation can be found in the API reference section.</p>
 <p>In this section we ignore the minibatch axis for clarity and assume a minibatch size of one. However, all operations support minibatches as detailed later.</p>
 <h3 id="rasterization">Rasterization</h3>
-<p>The rasterization operation takes as inputs a tensor of vertex positions and a tensor of vertex index triplets that specify the triangles. Vertex positions are specified in NDC (Normalized Device Coordinate) space, i.e., after modelview and projection transformations. Performing these transformations is left as the user's responsibility. In NDC, the view frustum is a cube in homogeneous coordinates where <span class="math inline"><em>x</em>/<em>w</em></span>, <span class="math inline"><em>y</em>/<em>w</em></span>, <span class="math inline"><em>z</em>/<em>w</em></span> are all between -1 and +1.</p>
-<p>The output of the rasterization operation is a 4-channel float32 image with tuple (<span class="math inline"><em>u</em></span>, <span class="math inline"><em>v</em></span>, <span class="math inline"><em>z</em>/<em>w</em></span>, <span class="math inline"><em>t</em><em>r</em><em>i</em><em>a</em><em>n</em><em>g</em><em>l</em><em>e</em>_<em>i</em><em>d</em></span>) in each pixel. Values <span class="math inline"><em>u</em></span> and <span class="math inline"><em>v</em></span> are the barycentric coordinates within a triangle: the first vertex in the vertex index triplet obtains <span class="math inline">(<em>u</em>, <em>v</em>) = (1, 0)</span>, the second vertex <span class="math inline">(<em>u</em>, <em>v</em>) = (0, 1)</span> and the third vertex <span class="math inline">(<em>u</em>, <em>v</em>) = (0, 0)</span>. NDC-space depth value <span class="math inline"><em>z</em>/<em>w</em></span> is used later by the antialiasing operation to infer occlusion relations between triangles, and it does not propagate gradients to the vertex position input. Field <span class="math inline"><em>t</em><em>r</em><em>i</em><em>a</em><em>n</em><em>g</em><em>l</em><em>e</em>_<em>i</em><em>d</em></span> is the triangle index, offset by one. Pixels where no triangle was rasterized will receive a zero in all channels.</p>
+<p>The rasterization operation takes as inputs a tensor of vertex positions and a tensor of vertex index triplets that specify the triangles. Vertex positions are specified in clip space, i.e., after modelview and projection transformations. Performing these transformations is left as the user's responsibility. In clip space, the view frustum is a cube in homogeneous coordinates where <span class="math inline"><em>x</em>/<em>w</em></span>, <span class="math inline"><em>y</em>/<em>w</em></span>, <span class="math inline"><em>z</em>/<em>w</em></span> are all between -1 and +1.</p>
+<p>The output of the rasterization operation is a 4-channel float32 image with tuple (<span class="math inline"><em>u</em></span>, <span class="math inline"><em>v</em></span>, <span class="math inline"><em>z</em>/<em>w</em></span>, <span class="math inline"><em>t</em><em>r</em><em>i</em><em>a</em><em>n</em><em>g</em><em>l</em><em>e</em>_<em>i</em><em>d</em></span>) in each pixel. Values <span class="math inline"><em>u</em></span> and <span class="math inline"><em>v</em></span> are the barycentric coordinates within a triangle: the first vertex in the vertex index triplet obtains <span class="math inline">(<em>u</em>, <em>v</em>) = (1, 0)</span>, the second vertex <span class="math inline">(<em>u</em>, <em>v</em>) = (0, 1)</span> and the third vertex <span class="math inline">(<em>u</em>, <em>v</em>) = (0, 0)</span>. Normalized depth value <span class="math inline"><em>z</em>/<em>w</em></span> is used later by the antialiasing operation to infer occlusion relations between triangles, and it does not propagate gradients to the vertex position input. Field <span class="math inline"><em>t</em><em>r</em><em>i</em><em>a</em><em>n</em><em>g</em><em>l</em><em>e</em>_<em>i</em><em>d</em></span> is the triangle index, offset by one. Pixels where no triangle was rasterized will receive a zero in all channels.</p>
 <p>Rasterization is point-sampled, i.e., the geometry is not smoothed, blurred, or made partially transparent in any way, in contrast to some previous differentiable rasterizers. The contents of a pixel always represent a single surface point that is on the closest surface visible along the ray through the pixel center.</p>
 <p>Point-sampled coverage does not produce vertex position gradients related to occlusion and visibility effects. This is because the motion of vertices does not change the coverage in a continuous way — a triangle is either rasterized into a pixel or not. In nvdiffrast, the occlusion/visibility related gradients are generated in the antialiasing operation that typically occurs towards the end of the rendering pipeline.</p>
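To make these conventions concrete, here is a minimal PyTorch sketch (matching the `rasterize()` reference further down) that rasterizes one hand-placed triangle and pulls apart the output channels; the positions and resolution are chosen purely for illustration:

```python
import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext()

# One triangle in clip space; with w = 1, x/w, y/w, z/w lie in [-1, +1].
# The 3D shape [minibatch_size, num_vertices, 4] selects instanced mode.
pos = torch.tensor([[[-0.8, -0.8, 0.0, 1.0],
                     [ 0.8, -0.8, 0.0, 1.0],
                     [ 0.0,  0.8, 0.0, 1.0]]], dtype=torch.float32, device='cuda')
tri = torch.tensor([[0, 1, 2]], dtype=torch.int32, device='cuda')

rast, _ = dr.rasterize(glctx, pos, tri, resolution=(256, 256))

u, v = rast[..., 0], rast[..., 1]  # barycentric coordinates
zw   = rast[..., 2]                # z/w, consumed later by antialias()
mask = rast[..., 3] > 0            # triangle_id is offset by one; 0 = empty pixel
```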
<div class="image-parent"> <div class="image-parent">
...@@ -510,7 +513,7 @@ Rendered in 4×4 higher resolution and downsampled ...@@ -510,7 +513,7 @@ Rendered in 4×4 higher resolution and downsampled
<p>Nvdiffrast follows OpenGL's coordinate systems and other conventions. This is partially because we use OpenGL to accelerate the rasterization operation, but mostly so that there is a <a href="https://xkcd.com/927/">single standard to follow</a>.</p> <p>Nvdiffrast follows OpenGL's coordinate systems and other conventions. This is partially because we use OpenGL to accelerate the rasterization operation, but mostly so that there is a <a href="https://xkcd.com/927/">single standard to follow</a>.</p>
<ul> <ul>
<li> <li>
The NDC coordinate system, used for specifying vertex positions in rasterization, maps to screen so that <span class="math inline"><em>x</em></span> increases towards right side of screen, <span class="math inline"><em>y</em></span> increases towards top of screen, and <strong><span class="math inline"><em>z</em></span> increases towards the viewer</strong>. When rasterizing, the normalized device coordinates — i.e., clip-space coordinates after division by <span class="math inline"><em>w</em></span> map to screen so that <span class="math inline"><em>x</em></span> increases towards right side of screen, <span class="math inline"><em>y</em></span> increases towards top of screen, and <strong><span class="math inline"><em>z</em></span> increases towards the viewer</strong>.
</li> </li>
<li> <li>
<strong>The memory order of image data in OpenGL, and consequently in nvdiffrast, is bottom-up.</strong> This means that row 0 of a tensor containing an image is the bottom row of the texture/image, which is the opposite of the more common scanline order. If you want to keep your image data in the conventional top-down order in your code, but have it logically the right way up inside nvdiffrast, you will need to flip the images vertically when crossing the boundary. <strong>The memory order of image data in OpenGL, and consequently in nvdiffrast, is bottom-up.</strong> This means that row 0 of a tensor containing an image is the bottom row of the texture/image, which is the opposite of the more common scanline order. If you want to keep your image data in the conventional top-down order in your code, but have it logically the right way up inside nvdiffrast, you will need to flip the images vertically when crossing the boundary.
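For instance, the vertical flip at the boundary is a one-liner in PyTorch; a minimal sketch assuming the usual [minibatch_size, height, width, channels] tensor layout:

```python
import torch

img = torch.rand(1, 256, 256, 3)    # top-down scanline order in your own code
img_gl = torch.flip(img, dims=[1])  # bottom-up order, as nvdiffrast expects
```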
@@ -729,6 +732,10 @@ Mip level 5
 <p>Nvdiffrast supports computation on multiple GPUs in both PyTorch and TensorFlow. As is the convention in PyTorch, the operations are always executed on the device on which the input tensors reside. All GPU input tensors must reside on the same device, and the output tensors will unsurprisingly end up on that same device. In addition, the rasterization operation requires that its OpenGL context was created for the correct device. In TensorFlow, the OpenGL context is automatically created on the device of the rasterization operation when it is executed for the first time.</p>
 <p>On Windows, nvdiffrast implements OpenGL device selection in a way that can be done only once per process — after one context is created, all future ones will end up on the same GPU. Hence you cannot expect to run the rasterization operation on multiple GPUs within the same process. Trying to do so will either cause a crash or incur a significant performance penalty. However, with PyTorch it is common to distribute computation across GPUs by launching a separate process for each GPU, so this is not a huge concern. Note that any OpenGL context created within the same process, even for something like a GUI window, will prevent changing the device later. Therefore, if you want to run the rasterization operation on other than the default GPU, be sure to create its OpenGL context before initializing any other OpenGL-powered libraries.</p>
 <p>On Linux everything just works, and you can create rasterizer OpenGL contexts on multiple devices within the same process.</p>
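As a sketch of the multi-device case on Linux: create one rasterizer context per GPU and keep each context paired with input tensors on its own device. Creating the context while the target CUDA device is active is one way to select the device; depending on your nvdiffrast version the context constructor may also accept an explicit device argument, so check the API reference of your version.

```python
import torch
import nvdiffrast.torch as dr

# One OpenGL context per GPU, each created while its device is active.
contexts = []
for dev in range(torch.cuda.device_count()):
    with torch.cuda.device(dev):
        contexts.append(dr.RasterizeGLContext())

# Tensors residing on device i must be rasterized with contexts[i].
```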
<h4 id="note-on-torch.nn.dataparallel">Note on torch.nn.DataParallel</h4>
<p>PyTorch offers <code>torch.nn.DataParallel</code> wrapper class for splitting the execution of a minibatch across multiple threads. Unfortunately, this class is fundamentally incompatible with OpenGL-dependent operations, as it spawns a new set of threads at each call (as of PyTorch 1.9.0, at least). Using previously created OpenGL contexts in these new threads, even if taking care to not use the same context in multiple threads, causes them to be migrated around and this has resulted in ever-growing GPU memory usage and abysmal GPU utilization. Therefore, we advise against using <code>torch.nn.DataParallel</code> for rasterization operations that depend on the OpenGL contexts.</p>
<p>Notably, <code>torch.nn.DistributedDataParallel</code> spawns subprocesses that are much more persistent. The subprocesses must create their own OpenGL contexts as part of initialization, and as such they do not suffer from this problem.</p>
<p>GitHub issue <a href="https://github.com/NVlabs/nvdiffrast/issues/23">#23</a>, especially <a href="https://github.com/NVlabs/nvdiffrast/issues/23#issuecomment-851577382">this comment</a>, contains further analysis and suggestions for workarounds.</p>
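A minimal sketch of the process-per-GPU pattern this note suggests, with hypothetical worker scaffolding around the nvdiffrast calls: each spawned process pins itself to one device and creates its own OpenGL context during initialization, which is exactly what <code>torch.nn.DistributedDataParallel</code> workers can do.

```python
import torch
import torch.multiprocessing as mp
import nvdiffrast.torch as dr

def worker(rank):
    torch.cuda.set_device(rank)
    glctx = dr.RasterizeGLContext()  # one context, owned by this process only
    # ... build the model, wrap it in DistributedDataParallel, train ...

if __name__ == '__main__':
    mp.spawn(worker, nprocs=torch.cuda.device_count())
```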
<h3 id="rendering-multiple-depth-layers">Rendering multiple depth layers</h3> <h3 id="rendering-multiple-depth-layers">Rendering multiple depth layers</h3>
<p>Sometimes there is a need to render scenes with partially transparent surfaces. In this case, it is not sufficient to find only the surfaces that are closest to the camera, as you may also need to know what lies behind them. For this purpose, nvdiffrast supports <em>depth peeling</em> that lets you extract multiple closest surfaces for each pixel.</p> <p>Sometimes there is a need to render scenes with partially transparent surfaces. In this case, it is not sufficient to find only the surfaces that are closest to the camera, as you may also need to know what lies behind them. For this purpose, nvdiffrast supports <em>depth peeling</em> that lets you extract multiple closest surfaces for each pixel.</p>
<p>With depth peeling, we start by rasterizing the closest surfaces as usual. We then perform a second rasterization pass with the same geometry, but this time we cull all previously rendered surface points at each pixel, effectively extracting the second-closest depth layer. This can be repeated as many times as desired, so that we can extract as many depth layers as we like. See the images below for example results of depth peeling with each depth layer shaded and antialiased.</p> <p>With depth peeling, we start by rasterizing the closest surfaces as usual. We then perform a second rasterization pass with the same geometry, but this time we cull all previously rendered surface points at each pixel, effectively extracting the second-closest depth layer. This can be repeated as many times as desired, so that we can extract as many depth layers as we like. See the images below for example results of depth peeling with each depth layer shaded and antialiased.</p>
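In the PyTorch API this loop is driven through a depth peeler object; below is a sketch assuming the <code>DepthPeeler</code> context manager, where each <code>rasterize_next_layer()</code> call returns the rasterizer output for the next-closest layer. Treat the exact names as assumptions and confirm against the API reference of your version.

```python
import nvdiffrast.torch as dr

# glctx, pos, tri as in a normal rasterize() call; num_layers chosen by the user.
num_layers = 3
layers = []
with dr.DepthPeeler(glctx, pos, tri, resolution=(256, 256)) as peeler:
    for _ in range(num_layers):
        rast, rast_db = peeler.rasterize_next_layer()
        layers.append(rast)  # layer 0 is the closest surface, layer 1 the next, ...
```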
@@ -897,7 +904,9 @@ device.</td></tr></table><div class="methods">Methods, only available if context
 <div class="apifunc"><h4><code>nvdiffrast.torch.rasterize(<em>glctx</em>, <em>pos</em>, <em>tri</em>, <em>resolution</em>, <em>ranges</em>=<span class="defarg">None</span>, <em>grad_db</em>=<span class="defarg">True</span>)</code>&nbsp;<span class="sym_function">Function</span></h4>
 <p class="shortdesc">Rasterize triangles.</p><p class="longdesc">All input tensors must be contiguous and reside in GPU memory except for
 the <code>ranges</code> tensor that, if specified, has to reside in CPU memory. The
-output tensors will be contiguous and reside in GPU memory.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">glctx</td><td class="arg_short">OpenGL context of type <code>RasterizeGLContext</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor with dtype <code>torch.float32</code>. To enable range
+output tensors will be contiguous and reside in GPU memory.</p><p class="longdesc">Note: For an unknown reason, on Windows the very first rasterization call using
+a newly created OpenGL context may *sometimes* output a blank buffer. This is a
+known bug and has never been observed to affect subsequent calls.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">glctx</td><td class="arg_short">OpenGL context of type <code>RasterizeGLContext</code>.</td></tr><tr class="arg"><td class="argname">pos</td><td class="arg_short">Vertex position tensor with dtype <code>torch.float32</code>. To enable range
 mode, this tensor should have a 2D shape [num_vertices, 4]. To enable
 instanced mode, use a 3D shape [minibatch_size, num_vertices, 4].</td></tr><tr class="arg"><td class="argname">tri</td><td class="arg_short">Triangle tensor with shape [num_triangles, 3] and dtype <code>torch.int32</code>.</td></tr><tr class="arg"><td class="argname">resolution</td><td class="arg_short">Output resolution as integer tuple (height, width).</td></tr><tr class="arg"><td class="argname">ranges</td><td class="arg_short">In range mode, tensor with shape [minibatch_size, 2] and dtype
 <code>torch.int32</code>, specifying start indices and counts into <code>tri</code>.
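To illustrate range mode versus instanced mode: a 2D `pos` tensor plus a CPU-side `ranges` tensor renders a different triangle span for each minibatch entry. A small sketch with placeholder geometry:

```python
import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext()

pos = torch.rand(100, 4, device='cuda')                                 # 2D shape selects range mode
tri = torch.randint(0, 100, (60, 3), dtype=torch.int32, device='cuda')  # placeholder topology

# Row k gives (start, count) into tri: entry 0 draws triangles [0, 40),
# entry 1 draws triangles [40, 60). Note: ranges must reside in CPU memory.
ranges = torch.tensor([[0, 40], [40, 20]], dtype=torch.int32)

rast, _ = dr.rasterize(glctx, pos, tri, resolution=(128, 128), ranges=ranges)
# rast: [2, 128, 128, 4], one output layer per range.
```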
@@ -962,7 +971,8 @@ part of texture coordinates. Mode 'clamp' clamps texture coordinates to the
 centers of the boundary texels. Mode 'zero' virtually extends the texture with
 all-zero values in all directions.</td></tr><tr class="arg"><td class="argname">max_mip_level</td><td class="arg_short">If specified, limits the number of mipmaps constructed and used in mipmap-based
 filter modes.</td></tr></table><div class="returns">Returns:<div class="return_description">A tensor containing the results of the texture sampling with shape
-[minibatch_size, height, width, tex_channels].</div></div></div>
+[minibatch_size, height, width, tex_channels]. Cube map fetches with invalid uv coordinates
+(e.g., zero vectors) output all zeros and do not propagate gradients.</div></div></div>
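A short sketch of the new invalid-uv behavior with a cube map, where the lookup is a per-pixel 3D direction; the tensor contents here are placeholders, and the cube map shape follows the `texture()` reference:

```python
import torch
import nvdiffrast.torch as dr

tex = torch.rand(1, 6, 64, 64, 3, device='cuda')  # cube map: [minibatch, 6, H, W, channels]
uv  = torch.randn(1, 32, 32, 3, device='cuda')    # 3D lookup directions
uv[0, 0, 0] = 0.0                                 # zero vector: an invalid direction

color = dr.texture(tex, uv, filter_mode='linear', boundary_mode='cube')
# color[0, 0, 0] is all zeros, and no gradients flow through that fetch.
```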
<div class="apifunc"><h4><code>nvdiffrast.torch.texture_construct_mip(<em>tex</em>, <em>max_mip_level</em>=<span class="defarg">None</span>, <em>cube_mode</em>=<span class="defarg">False</span>)</code>&nbsp;<span class="sym_function">Function</span></h4> <div class="apifunc"><h4><code>nvdiffrast.torch.texture_construct_mip(<em>tex</em>, <em>max_mip_level</em>=<span class="defarg">None</span>, <em>cube_mode</em>=<span class="defarg">False</span>)</code>&nbsp;<span class="sym_function">Function</span></h4>
<p class="shortdesc">Construct a mipmap stack for a texture.</p><p class="longdesc">This function can be used for constructing a mipmap stack for a texture that is known to remain <p class="shortdesc">Construct a mipmap stack for a texture.</p><p class="longdesc">This function can be used for constructing a mipmap stack for a texture that is known to remain
constant. This avoids reconstructing it every time <code>texture()</code> is called.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">tex</td><td class="arg_short">Texture tensor with the same constraints as in <code>texture()</code>.</td></tr><tr class="arg"><td class="argname">max_mip_level</td><td class="arg_short">If specified, limits the number of mipmaps constructed.</td></tr><tr class="arg"><td class="argname">cube_mode</td><td class="arg_short">Must be set to True if <code>tex</code> specifies a cube map texture.</td></tr></table><div class="returns">Returns:<div class="return_description">An opaque object containing the mipmap stack. This can be supplied in a call to <code>texture()</code> constant. This avoids reconstructing it every time <code>texture()</code> is called.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">tex</td><td class="arg_short">Texture tensor with the same constraints as in <code>texture()</code>.</td></tr><tr class="arg"><td class="argname">max_mip_level</td><td class="arg_short">If specified, limits the number of mipmaps constructed.</td></tr><tr class="arg"><td class="argname">cube_mode</td><td class="arg_short">Must be set to True if <code>tex</code> specifies a cube map texture.</td></tr></table><div class="returns">Returns:<div class="return_description">An opaque object containing the mipmap stack. This can be supplied in a call to <code>texture()</code>
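Usage sketch for the constant-texture case: build the stack once, then pass it to `texture()` through the `mip` argument on every call (tensor contents are placeholders):

```python
import torch
import nvdiffrast.torch as dr

tex   = torch.rand(1, 256, 256, 3, device='cuda')
uv    = torch.rand(1, 128, 128, 2, device='cuda')
uv_da = torch.rand(1, 128, 128, 4, device='cuda') * 0.01  # image-space derivatives

mip = dr.texture_construct_mip(tex)  # constructed once, reused across calls

color = dr.texture(tex, uv, uv_da=uv_da, mip=mip, filter_mode='linear-mipmap-linear')
```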
@@ -977,7 +987,7 @@ known to remain constant. This avoids reconstructing it every time <code>antiali
 GPU memory.</td></tr></table><div class="returns">Returns:<div class="return_description">An opaque object containing the topology hash. This can be supplied in a call to
 <code>antialias()</code> in the <code>topology_hash</code> argument.</div></div></div>
 <div class="apifunc"><h4><code>nvdiffrast.torch.get_log_level(<em></em>)</code>&nbsp;<span class="sym_function">Function</span></h4>
-<p class="shortdesc">Get current log level.</p><p class="longdesc"></p><div class="returns">Returns:<div class="return_description">Current log level in nvdiffrast. See <code>set_log_level()</code> for possible values.</div></div></div>
+<p class="shortdesc">Get current log level.</p><div class="returns">Returns:<div class="return_description">Current log level in nvdiffrast. See <code>set_log_level()</code> for possible values.</div></div></div>
 <div class="apifunc"><h4><code>nvdiffrast.torch.set_log_level(<em>level</em>)</code>&nbsp;<span class="sym_function">Function</span></h4>
 <p class="shortdesc">Set log level.</p><p class="longdesc">Log levels follow the convention on the C++ side of Torch:
 0 = Info,
...
@@ -6,4 +6,4 @@
 # distribution of this software and related documentation without an express
 # license agreement from NVIDIA CORPORATION is strictly prohibited.
-__version__ = '0.2.5'
+__version__ = '0.2.6'
@@ -29,6 +29,18 @@ static int ROUND_UP_BITS(uint32_t x, uint32_t y)
     return (x | m) + 1u;
 }
 
+//------------------------------------------------------------------------
+// Draw command struct used by rasterizer.
+
+struct GLDrawCmd
+{
+    uint32_t count;
+    uint32_t instanceCount;
+    uint32_t firstIndex;
+    uint32_t baseVertex;
+    uint32_t baseInstance;
+};
+
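For context, the field layout above matches the `DrawElementsIndirectCommand` structure that `glMultiDrawElementsIndirect` consumes; the render function below fills one such command per output layer and submits them all in a single indirect draw call.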
 //------------------------------------------------------------------------
 // GL helpers.
@@ -386,14 +398,6 @@ void rasterizeResizeBuffers(NVDR_CTX_ARGS, RasterizeGLState& s, int posCount, in
         for (int i=0; i < num_outputs; i++)
             NVDR_CHECK_CUDA_ERROR(cudaGraphicsGLRegisterImage(&s.cudaColorBuffer[i], s.glColorBuffer[i], GL_TEXTURE_3D, cudaGraphicsRegisterFlagsReadOnly));
     }
-
-    // Resize range arrays?
-    if ((unsigned int)depth > s.drawCmdBuffer.size())
-    {
-        int newSize = (depth > 64) ? ROUND_UP_BITS(depth, 1) : 64;
-        LOG(INFO) << "Increasing range array size to " << newSize << " elements";
-        s.drawCmdBuffer.resize(newSize);
-    }
 }
 
 void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, const float* posPtr, int posCount, int vtxPerInstance, const int32_t* triPtr, int triCount, const int32_t* rangesPtr, int width, int height, int depth, int peeling_idx)
@@ -487,6 +491,9 @@ void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, co
     }
     else
     {
+        // Populate a buffer for draw commands and execute it.
+        std::vector<GLDrawCmd> drawCmdBuffer(depth);
+
         if (!rangesPtr)
         {
             // Fill in range array to instantiate the same triangles for each output layer.
@@ -494,7 +501,7 @@ void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, co
             // the first dimension in addressing the triangle array.
             for (int i=0; i < depth; i++)
             {
-                GLDrawCmd& cmd = s.drawCmdBuffer[i];
+                GLDrawCmd& cmd = drawCmdBuffer[i];
                 cmd.firstIndex = 0;
                 cmd.count = triCount;
                 cmd.baseVertex = vtxPerInstance * i;
@@ -509,7 +516,7 @@ void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, co
             // the first dimension in addressing the triangle array.
             for (int i=0, j=0; i < depth; i++)
             {
-                GLDrawCmd& cmd = s.drawCmdBuffer[i];
+                GLDrawCmd& cmd = drawCmdBuffer[i];
                 int first = rangesPtr[j++];
                 int count = rangesPtr[j++];
                 NVDR_CHECK(first >= 0 && count >= 0, "range contains negative values");
@@ -523,7 +530,7 @@ void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, co
         }
 
         // Draw!
-        NVDR_CHECK_GL_ERROR(glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, &s.drawCmdBuffer[0], depth, sizeof(GLDrawCmd)));
+        NVDR_CHECK_GL_ERROR(glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, &drawCmdBuffer[0], depth, sizeof(GLDrawCmd)));
     }
 }
@@ -557,4 +564,36 @@ void rasterizeCopyResults(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t strea
     NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnmapResources(num_outputs, s.cudaColorBuffer, stream));
 }
 
+void rasterizeReleaseBuffers(NVDR_CTX_ARGS, RasterizeGLState& s)
+{
+    int num_outputs = s.enableDB ? 2 : 1;
+
+    if (s.cudaPosBuffer)
+    {
+        NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnregisterResource(s.cudaPosBuffer));
+        s.cudaPosBuffer = 0;
+    }
+
+    if (s.cudaTriBuffer)
+    {
+        NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnregisterResource(s.cudaTriBuffer));
+        s.cudaTriBuffer = 0;
+    }
+
+    for (int i=0; i < num_outputs; i++)
+    {
+        if (s.cudaColorBuffer[i])
+        {
+            NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnregisterResource(s.cudaColorBuffer[i]));
+            s.cudaColorBuffer[i] = 0;
+        }
+    }
+
+    if (s.cudaPrevOutBuffer)
+    {
+        NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnregisterResource(s.cudaPrevOutBuffer));
+        s.cudaPrevOutBuffer = 0;
+    }
+}
+
 //------------------------------------------------------------------------
@@ -41,22 +41,10 @@ struct RasterizeGradParams
 #include "framework.h"
 #include "glutil.h"
 
-//------------------------------------------------------------------------
-// Draw command struct used by rasterizer.
-
-struct GLDrawCmd
-{
-    uint32_t count;
-    uint32_t instanceCount;
-    uint32_t firstIndex;
-    uint32_t baseVertex;
-    uint32_t baseInstance;
-};
-
 //------------------------------------------------------------------------
 // OpenGL-related persistent state for forward op.
 
-struct RasterizeGLState
+struct RasterizeGLState // Must be initializable by memset to zero.
 {
     int width; // Allocated frame buffer width.
     int height; // Allocated frame buffer height.
@@ -81,7 +69,6 @@ struct RasterizeGLState
     cudaGraphicsResource_t cudaPrevOutBuffer;
     cudaGraphicsResource_t cudaPosBuffer;
     cudaGraphicsResource_t cudaTriBuffer;
-    std::vector<GLDrawCmd> drawCmdBuffer;
     int enableDB;
 };
@@ -92,6 +79,7 @@ void rasterizeInitGLContext(NVDR_CTX_ARGS, RasterizeGLState& s, int cudaDeviceId
 void rasterizeResizeBuffers(NVDR_CTX_ARGS, RasterizeGLState& s, int posCount, int triCount, int width, int height, int depth);
 void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, const float* posPtr, int posCount, int vtxPerInstance, const int32_t* triPtr, int triCount, const int32_t* rangesPtr, int width, int height, int depth, int peeling_idx);
 void rasterizeCopyResults(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, float** outputPtr, int width, int height, int depth);
+void rasterizeReleaseBuffers(NVDR_CTX_ARGS, RasterizeGLState& s);
 
 //------------------------------------------------------------------------
 #endif // !(defined(NVDR_TORCH) && defined(__CUDACC__))
@@ -18,6 +18,8 @@ static __device__ __forceinline__ void accum_from_mem(float* a, int s, float4 b,
 static __device__ __forceinline__ void accum_to_mem(float& a, float* b, int s) { a += b[0]; }
 static __device__ __forceinline__ void accum_to_mem(float2& a, float* b, int s) { float2 v = a; v.x += b[0]; v.y += b[s]; a = v; }
 static __device__ __forceinline__ void accum_to_mem(float4& a, float* b, int s) { float4 v = a; v.x += b[0]; v.y += b[s]; v.z += b[2*s]; v.w += b[3*s]; a = v; }
+static __device__ __forceinline__ bool isfinite_vec3(const float3& a) { return isfinite(a.x) && isfinite(a.y) && isfinite(a.z); }
+static __device__ __forceinline__ bool isfinite_vec4(const float4& a) { return isfinite(a.x) && isfinite(a.y) && isfinite(a.z) && isfinite(a.w); }
 template<class T> static __device__ __forceinline__ T lerp (const T& a, const T& b, float c) { return a + c * (b - a); }
 template<class T> static __device__ __forceinline__ T bilerp(const T& a, const T& b, const T& c, const T& d, const float2& e) { return lerp(lerp(a, b, e.x), lerp(c, d, e.x), e.y); }
@@ -110,6 +112,8 @@ static __device__ __forceinline__ int indexCubeMap(float& x, float& y, float z)
     float m1 = (idx != 2) ? -m : m;
     x = x * m0 + .5;
     y = y * m1 + .5;
+    if (!isfinite(x) || !isfinite(y))
+        return -1; // Invalid uv.
     x = fminf(fmaxf(x, 0.f), 1.f);
     y = fminf(fmaxf(y, 0.f), 1.f);
     return idx;
@@ -137,7 +141,10 @@ static __device__ __forceinline__ float3 indexCubeMapGrad(float3 uv, float gu, f
     float gy = (idx & 0x0c) ? gl : -gv;
     float gz = (idx & 0x30) ? gl : (idx & 0x03) ? gu : gv;
     gz = (idx & 0x09) ? -gz : gz;
-    return make_float3(gx, gy, gz) * (m * .5f);
+    float3 res = make_float3(gx, gy, gz) * (m * .5f);
+    if (!isfinite_vec3(res))
+        return make_float3(0.f, 0.f, 0.f); // Invalid uv.
+    return res;
 }
 
 // Based on dL/d(d{s,t}/s{X,Y}), compute dL/d(d{x,y,z}/d{X,Y}). This is just two
@@ -171,6 +178,11 @@ static __device__ __forceinline__ void indexCubeMapGrad4(float3 uv, float4 dw, f
     }
     g0 = make_float3(gx0, gy0, gz0) * (m * .5f);
     g1 = make_float3(gx1, gy1, gz1) * (m * .5f);
+    if (!isfinite_vec3(g0) || !isfinite_vec3(g1))
+    {
+        g0 = make_float3(0.f, 0.f, 0.f); // Invalid uv.
+        g1 = make_float3(0.f, 0.f, 0.f);
+    }
 }
 
 // Compute d{s,t}/d{X,Y} based on d{x,y,z}/d{X,Y} at a given 3D lookup vector.
@@ -197,27 +209,33 @@ static __device__ __forceinline__ float4 indexCubeMapGradST(float3 uv, float3 dv
     gu *= (idx & 0x34) ? -mm : mm;
     gv *= (idx & 0x2e) ? -mm : mm;
 
+    float4 res;
+
     if (idx & 0x03)
     {
-        return make_float4(gu * dvdX.x + dm * dvdX.z,
-                           gu * dvdY.x + dm * dvdY.z,
-                           gv * dvdX.x - dm * dvdX.y,
-                           gv * dvdY.x - dm * dvdY.y);
+        res = make_float4(gu * dvdX.x + dm * dvdX.z,
+                          gu * dvdY.x + dm * dvdY.z,
+                          gv * dvdX.x - dm * dvdX.y,
+                          gv * dvdY.x - dm * dvdY.y);
     }
     else if (idx & 0x0c)
     {
-        return make_float4(gu * dvdX.y + dm * dvdX.x,
-                           gu * dvdY.y + dm * dvdY.x,
-                           gv * dvdX.y + dm * dvdX.z,
-                           gv * dvdY.y + dm * dvdY.z);
+        res = make_float4(gu * dvdX.y + dm * dvdX.x,
+                          gu * dvdY.y + dm * dvdY.x,
+                          gv * dvdX.y + dm * dvdX.z,
+                          gv * dvdY.y + dm * dvdY.z);
     }
     else // (idx & 0x30)
     {
-        return make_float4(gu * dvdX.z + copysignf(dm, c) * dvdX.x,
-                           gu * dvdY.z + copysignf(dm, c) * dvdY.x,
-                           gv * dvdX.z - dm * dvdX.y,
-                           gv * dvdY.z - dm * dvdY.y);
+        res = make_float4(gu * dvdX.z + copysignf(dm, c) * dvdX.x,
+                          gu * dvdY.z + copysignf(dm, c) * dvdY.x,
+                          gv * dvdX.z - dm * dvdX.y,
+                          gv * dvdY.z - dm * dvdY.y);
     }
+
+    if (!isfinite_vec4(res))
+        return make_float4(0.f, 0.f, 0.f, 0.f);
+    return res;
 }
 
 // Compute d(d{s,t}/d{X,Y})/d{x,y,z}, i.e., how the pixel derivatives of 2D face
@@ -313,7 +331,10 @@ static __device__ __forceinline__ int indexTextureNearest(const TextureKernelPar
     if (CUBE_MODE)
     {
         // No wrap. Fold face index into tz right away.
-        tz = 6 * tz + indexCubeMap(u, v, uv.z); // Rewrites u, v.
+        int idx = indexCubeMap(u, v, uv.z); // Rewrites u, v.
+        if (idx < 0)
+            return -1; // Invalid uv.
+        tz = 6 * tz + idx;
     }
     else
     {
@@ -364,6 +385,11 @@ static __device__ __forceinline__ float2 indexTextureLinear(const TextureKernelP
     {
         // Neither clamp or wrap.
         face = indexCubeMap(u, v, uv.z); // Rewrites u, v.
+        if (face < 0)
+        {
+            tcOut.x = tcOut.y = tcOut.z = tcOut.w = -1; // Invalid uv.
+            return make_float2(0.f, 0.f);
+        }
         u = u * (float)w - 0.5f;
         v = v * (float)h - 0.5f;
     }
@@ -511,14 +537,16 @@ static __device__ __forceinline__ void calculateMipLevel(int& level0, int& level
         float d_f_ddtdX = vscl * (dtdx * (l2aw + AB) + dtdy * Cw);
         float d_f_ddtdY = vscl * (dtdy * (l2aw - AB) + dtdx * Cw);
 
-        *pdw = make_float4(d_f_ddsdX, d_f_ddsdY, d_f_ddtdX, d_f_ddtdY);
+        float4 d_f_dw = make_float4(d_f_ddsdX, d_f_ddsdY, d_f_ddtdX, d_f_ddtdY);
+        if (!CUBE_MODE)
+            *pdw = isfinite_vec4(d_f_dw) ? d_f_dw : make_float4(0.f, 0.f, 0.f, 0.f);
 
         // In cube maps, there is also a texture coordinate vs. mip level gradient.
+        // Only output nonzero vectors if both are free of inf/Nan garbage.
         if (CUBE_MODE)
         {
             float4 dx, dy, dz;
             indexCubeMapGrad2(uv, dvdX, dvdY, dx, dy, dz);
             float3 d_dsdX_dv = make_float3(dx.x, dy.x, dz.x);
             float3 d_dsdY_dv = make_float3(dx.y, dy.y, dz.y);
             float3 d_dtdX_dv = make_float3(dx.z, dy.z, dz.z);
@@ -530,12 +558,14 @@ static __device__ __forceinline__ void calculateMipLevel(int& level0, int& level
             d_f_dv += d_dtdX_dv * d_f_ddtdX;
             d_f_dv += d_dtdY_dv * d_f_ddtdY;
 
-            *pdfdv = d_f_dv;
+            bool finite = isfinite_vec4(d_f_dw) && isfinite_vec3(d_f_dv);
+            *pdw = finite ? d_f_dw : make_float4(0.f, 0.f, 0.f, 0.f);
+            *pdfdv = finite ? d_f_dv : make_float3(0.f, 0.f, 0.f);
         }
     }
 
     // Finally, calculate mip level.
-    flevel = .5f * __log2f(lenMajorSqr);
+    flevel = .5f * __log2f(lenMajorSqr); // May be inf/NaN, but clamp fixes it.
 }
 
 // Bias the mip level and clamp.
@@ -560,6 +590,7 @@ static __device__ __forceinline__ void calculateMipLevel(int& level0, int& level
 template<class T>
 static __device__ __forceinline__ void fetchQuad(T& a00, T& a10, T& a01, T& a11, const float* pIn, int4 tc, bool corner)
 {
+    // For invalid cube map uv, tc will be all negative, and all texel values will be zero.
     if (corner)
     {
         T avg = zero_value<T>();
@@ -584,6 +615,7 @@ static __device__ __forceinline__ void fetchQuad(T& a00, T& a10, T& a01, T& a11,
 static __device__ __forceinline__ void accumQuad(float4 c, float* pOut, int level, int4 tc, bool corner, CA_TEMP_PARAM)
 {
+    // For invalid cube map uv, tc will be all negative, and no accumulation will take place.
     if (corner)
     {
         float cb;
...
@@ -179,7 +179,7 @@ class _rasterize_func(torch.autograd.Function):
     @staticmethod
     def backward(ctx, dy, ddb):
-        pos, tri, out = ctx.saved_variables
+        pos, tri, out = ctx.saved_tensors
         if ctx.saved_grad_db:
            g_pos = _get_plugin().rasterize_grad_db(pos, tri, out, dy, ddb)
         else:
@@ -194,6 +194,10 @@ def rasterize(glctx, pos, tri, resolution, ranges=None, grad_db=True):
     the `ranges` tensor that, if specified, has to reside in CPU memory. The
     output tensors will be contiguous and reside in GPU memory.
 
+    Note: For an unknown reason, on Windows the very first rasterization call using
+    a newly created OpenGL context may *sometimes* output a blank buffer. This is a
+    known bug and has never been observed to affect subsequent calls.
+
     Args:
         glctx: OpenGL context of type `RasterizeGLContext`.
         pos: Vertex position tensor with dtype `torch.float32`. To enable range
@@ -321,7 +325,7 @@ class _interpolate_func_da(torch.autograd.Function):
     @staticmethod
     def backward(ctx, dy, dda):
-        attr, rast, tri, rast_db = ctx.saved_variables
+        attr, rast, tri, rast_db = ctx.saved_tensors
         diff_attrs_all, diff_attrs_list = ctx.saved_misc
         g_attr, g_rast, g_rast_db = _get_plugin().interpolate_grad_da(attr, rast, tri, dy, rast_db, dda, diff_attrs_all, diff_attrs_list)
         return g_attr, g_rast, None, g_rast_db, None, None
@@ -336,7 +340,7 @@ class _interpolate_func(torch.autograd.Function):
     @staticmethod
     def backward(ctx, dy, _):
-        attr, rast, tri = ctx.saved_variables
+        attr, rast, tri = ctx.saved_tensors
         g_attr, g_rast = _get_plugin().interpolate_grad(attr, rast, tri, dy)
         return g_attr, g_rast, None
@@ -415,7 +419,7 @@ class _texture_func_mip(torch.autograd.Function):
     @staticmethod
     def backward(ctx, dy):
-        tex, uv, uv_da, mip_level_bias, *mip_stack = ctx.saved_variables
+        tex, uv, uv_da, mip_level_bias, *mip_stack = ctx.saved_tensors
         filter_mode, mip_wrapper, filter_mode_enum, boundary_mode_enum = ctx.saved_misc
         if filter_mode == 'linear-mipmap-linear':
             g_tex, g_uv, g_uv_da, g_mip_level_bias, g_mip_stack = _get_plugin().texture_grad_linear_mipmap_linear(tex, uv, dy, uv_da, mip_level_bias, mip_wrapper, mip_stack, filter_mode_enum, boundary_mode_enum)
@@ -435,7 +439,7 @@ class _texture_func(torch.autograd.Function):
     @staticmethod
     def backward(ctx, dy):
-        tex, uv = ctx.saved_variables
+        tex, uv = ctx.saved_tensors
         filter_mode, filter_mode_enum, boundary_mode_enum = ctx.saved_misc
         if filter_mode == 'linear':
             g_tex, g_uv = _get_plugin().texture_grad_linear(tex, uv, dy, filter_mode_enum, boundary_mode_enum)
@@ -490,7 +494,8 @@ def texture(tex, uv, uv_da=None, mip_level_bias=None, mip=None, filter_mode='aut
     Returns:
         A tensor containing the results of the texture sampling with shape
-        [minibatch_size, height, width, tex_channels].
+        [minibatch_size, height, width, tex_channels]. Cube map fetches with invalid uv coordinates
+        (e.g., zero vectors) output all zeros and do not propagate gradients.
     """
 
     # Default filter mode.
@@ -580,7 +585,7 @@ class _antialias_func(torch.autograd.Function):
     @staticmethod
     def backward(ctx, dy):
-        color, rast, pos, tri = ctx.saved_variables
+        color, rast, pos, tri = ctx.saved_tensors
         pos_gradient_boost, work_buffer = ctx.saved_misc
         g_color, g_pos = _get_plugin().antialias_grad(color, rast, pos, tri, dy, work_buffer)
         if pos_gradient_boost != 1.0:
...
@@ -34,6 +34,9 @@ RasterizeGLStateWrapper::RasterizeGLStateWrapper(bool enableDB, bool automatic_,
 RasterizeGLStateWrapper::~RasterizeGLStateWrapper(void)
 {
+    setGLContext(pState->glctx);
+    rasterizeReleaseBuffers(NVDR_CTX_PARAMS, *pState);
+    releaseGLContext();
     destroyGLContext(pState->glctx);
     delete pState;
 }
...