Commit ce7063f1 authored by Samuli Laine

Support for multiple GPUs, mip bias input for texture op

parent 2468e2a0
......@@ -315,16 +315,17 @@ div.image-parent {
<li><a href="#geometry-and-minibatches-range-mode-vs-instanced-mode">Geometry and minibatches: Range mode vs Instanced mode</a></li>
<li><a href="#image-space-derivatives">Image-space derivatives</a></li>
<li><a href="#mipmaps-and-texture-dimensions">Mipmaps and texture dimensions</a></li>
<li><a href="#running-on-multiple-gpus">Running on multiple GPUs</a></li>
<li><a href="#differences-between-pytorch-and-tensorflow">Differences between PyTorch and TensorFlow</a><ul>
<li><a href="#manual-opengl-contexts-in-pytorch">Manual OpenGL contexts in PyTorch</a></li>
</ul></li>
</ul></li>
<li><a href="#samples">Samples</a><ul>
<li><a href="#triangle.py">triangle.py</a></li>
<li><a href="#cube.py">cube.py</a></li>
<li><a href="#earth.py">earth.py</a></li>
<li><a href="#envphong.py">envphong.py</a></li>
<li><a href="#pose.py">pose.py</a></li>
<li><a href="#triangle.py"><span>triangle.py</span></a></li>
<li><a href="#cube.py"><span>cube.py</span></a></li>
<li><a href="#earth.py"><span>earth.py</span></a></li>
<li><a href="#envphong.py"><span>envphong.py</span></a></li>
<li><a href="#pose.py"><span>pose.py</span></a></li>
</ul></li>
<li><a href="#pytorch-api-reference">PyTorch API reference</a></li>
<li><a href="#licenses">Licenses</a></li>
......@@ -373,7 +374,13 @@ Examples of things we've done with nvdiffrast
<p>If the compiler binary (<code>cl.exe</code>) cannot be found in <code>PATH</code>, nvdiffrast will search for it heuristically. If this fails you may need to add it manually via</p>
<pre><code>&quot;C:\Program Files (x86)\Microsoft Visual Studio\...\...\VC\Auxiliary\Build\vcvars64.bat&quot;</code></pre>
<p>where the exact path depends on the version and edition of VS you have installed.</p>
<p>To install nvdiffrast in your local site-packages, run <code>pip install .</code> at the root of the repository. Alternatively, you can add the repository root directory to your <code>PYTHONPATH</code>.</p>
<p>To install nvdiffrast in your local site-packages, run:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode bash"><code class="sourceCode bash"><a class="sourceLine" id="cb6-1" data-line-number="1"><span class="co"># Ninja is required run-time to build PyTorch extensions</span></a>
<a class="sourceLine" id="cb6-2" data-line-number="2"><span class="ex">pip</span> install ninja</a>
<a class="sourceLine" id="cb6-3" data-line-number="3"></a>
<a class="sourceLine" id="cb6-4" data-line-number="4"><span class="co"># Run at the root of the repository to install nvdiffrast</span></a>
<a class="sourceLine" id="cb6-5" data-line-number="5"><span class="ex">pip</span> install .</a></code></pre></div>
<p>Instead of <code>pip install .</code> you can also just add the repository root directory to your <code>PYTHONPATH</code>.</p>
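<p>After installation, a quick way to check that the package is importable is a minimal version query; this sketch assumes the <code>__version__</code> attribute is exposed at the package top level:</p>
<pre><code>import nvdiffrast
print(nvdiffrast.__version__)   # e.g., 0.2.1</code></pre>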
<h2 id="primitive-operations">Primitive operations</h2>
<p>Nvdiffrast offers four differentiable rendering primitives: <strong>rasterization</strong>, <strong>interpolation</strong>, <strong>texturing</strong>, and <strong>antialiasing</strong>. The operation of the primitives is described here in a platform-agnostic way. Platform-specific documentation can be found in the API reference section.</p>
<p>In this section we ignore the minibatch axis for clarity and assume a minibatch size of one. However, all operations support minibatches as detailed later.</p>
......@@ -441,7 +448,7 @@ Background replaced with white
</div>
</div>
<p>The middle image above shows the result of texture sampling using the interpolated texture coordinates from the previous step. Why is the background pink? The texture coordinates <span class="math inline">(<em>s</em>, <em>t</em>)</span> read as zero at those pixels, but that is a perfectly valid point to sample the texture. It happens that Spot's texture (left) has pink color at its <span class="math inline">(0, 0)</span> corner, and therefore all pixels in the background obtain that color as a result of the texture sampling operation. On the right, we have replaced the color of the <q>empty</q> pixels with a white color. Here's one way to do this in PyTorch:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb6-1" data-line-number="1">img_right <span class="op">=</span> torch.where(rast_out[..., <span class="dv">3</span>:] <span class="op">&gt;</span> <span class="dv">0</span>, img_left, torch.tensor(<span class="fl">1.0</span>).cuda())</a></code></pre></div>
<div class="sourceCode" id="cb7"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb7-1" data-line-number="1">img_right <span class="op">=</span> torch.where(rast_out[..., <span class="dv">3</span>:] <span class="op">&gt;</span> <span class="dv">0</span>, img_left, torch.tensor(<span class="fl">1.0</span>).cuda())</a></code></pre></div>
<p>where <code>rast_out</code> is the output of the rasterization operation. We simply test if the <span class="math inline"><em>t</em><em>r</em><em>i</em><em>a</em><em>n</em><em>g</em><em>l</em><em>e</em>_<em>i</em><em>d</em></span> field, i.e., channel 3 of the rasterizer output, is greater than zero, indicating that a triangle was rendered in that pixel. If so, we take the color from the textured image, and otherwise we take constant 1.0.</p>
<h3 id="antialiasing">Antialiasing</h3>
<p>The last of the four primitive operations in nvdiffrast is antialiasing. Based on the geometry input (vertex positions and triangles), it will smooth out discontinuities at silhouette edges in a given image. The smoothing is based on a local approximation of coverage — an approximate integral over a pixel is calculated based on the exact location of relevant edges and the point-sampled colors at pixel centers.</p>
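<p>In PyTorch, the antialiasing operation could be applied to a shaded image roughly as in the following sketch, where <code>color</code>, <code>rast_out</code>, <code>pos</code>, and <code>tri</code> are illustrative tensor names:</p>
<pre><code>import nvdiffrast.torch as dr

# Smooth silhouette discontinuities in the shaded image using the same
# clip-space positions and triangles that were used for rasterization.
color_aa = dr.antialias(color, rast_out, pos, tri)</code></pre>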
......@@ -515,7 +522,7 @@ For 2D textures, the coordinate origin <span class="math inline">(<em>s</em>, 
<p>In <strong>instanced mode</strong>, the topology of the mesh will be shared for each minibatch index. The triangle tensor is still a 2D tensor with shape [<em>num_triangles</em>, 3], but the vertex positions are specified using a 3D tensor of shape [<em>minibatch_size</em>, <em>num_vertices</em>, 4]. With a 3D vertex position tensor, the rasterizer will not require the range tensor input, but will take the minibatch size from the first dimension of the vertex position tensor. The same triangles are rendered to each minibatch index, but with vertex positions taken from the corresponding slice of the vertex position tensor. In this mode, the attribute tensor in interpolation has to be a 3D tensor similar to the position tensor, i.e., of shape [<em>minibatch_size</em>, <em>num_vertices</em>, <em>num_attributes</em>]. However, you can provide an attribute tensor with a minibatch size of 1, and it will be broadcast across the minibatch.</p>
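<p>As a hedged sketch of instanced mode in PyTorch (tensor names and the 256×256 resolution are illustrative):</p>
<pre><code>import nvdiffrast.torch as dr

# Instanced mode: pos is [minibatch_size, num_vertices, 4], tri is [num_triangles, 3].
# No ranges tensor is needed; the minibatch size is taken from pos.
rast, rast_db = dr.rasterize(glctx, pos, tri, resolution=[256, 256])

# Attributes follow the same convention, [minibatch_size, num_vertices, num_attributes],
# or minibatch size 1 to broadcast the same attributes across the minibatch.
attr, attr_da = dr.interpolate(col, rast, tri)</code></pre>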
<h3 id="image-space-derivatives">Image-space derivatives</h3>
<p>We skirted around a pretty fundamental question in the description of the texturing operation above. In order to determine the proper amount of prefiltering for sampling a texture, we need to know how densely it is being sampled. But how can we know the sampling density when each pixel knows of just a single surface point?</p>
<p>The solution is to track the image-space derivatives of all things leading up to the texture sampling operation. <em>These are not the same thing as the gradients used in the backward pass</em>, even though they both involve differentiation! Consider the barycentrics <span class="math inline">(<em>u</em>, <em>v</em>)</span> produced by the rasterization operation. They change by some amount when moving horizontally or vertically in the image plane. If we denote the image-space coordinates as <span class="math inline">(<em>X</em>, <em>Y</em>)</span>, the image-space derivatives of the barycentrics would be <span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, and <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>. We can organize these into a 2×2 Jacobian matrix that describes the local relationship between <span class="math inline">(<em>u</em>, <em>v</em>)</span> and <span class="math inline">(<em>X</em>, <em>Y</em>)</span>. This matrix is generally different at every pixel.</p>
<p>The solution is to track the image-space derivatives of all things leading up to the texture sampling operation. <em>These are not the same thing as the gradients used in the backward pass</em>, even though they both involve differentiation! Consider the barycentrics <span class="math inline">(<em>u</em>, <em>v</em>)</span> produced by the rasterization operation. They change by some amount when moving horizontally or vertically in the image plane. If we denote the image-space coordinates as <span class="math inline">(<em>X</em>, <em>Y</em>)</span>, the image-space derivatives of the barycentrics would be <span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, and <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>. We can organize these into a 2×2 Jacobian matrix that describes the local relationship between <span class="math inline">(<em>u</em>, <em>v</em>)</span> and <span class="math inline">(<em>X</em>, <em>Y</em>)</span>. This matrix is generally different at every pixel. For the purpose of image-space derivatives, the units of <span class="math inline"><em>X</em></span> and <span class="math inline"><em>Y</em></span> are pixels. Hence, <span class="math inline">∂<em>u</em>/∂<em>X</em></span> is the local approximation of how much <span class="math inline"><em>u</em></span> changes when moving a distance of one pixel in the horizontal direction, and so on.</p>
<p>Once we know how the barycentrics change w.r.t. pixel position, the interpolation operation can use this to determine how the attributes change w.r.t. pixel position. When attributes are used as texture coordinates, we can therefore tell how the texture sampling position (in texture space) changes when moving around within the pixel (up to a local, linear approximation, that is). This <em>texture footprint</em> tells us the scale on which the texture should be prefiltered. In more practical terms, it tells us which mipmap level(s) to use when sampling the texture.</p>
<p>In nvdiffrast, the rasterization operation can be configured to output the image-space derivatives of the barycentrics in an auxiliary 4-channel output tensor, ordered (<span class="math inline">∂<em>u</em>/∂<em>X</em></span>, <span class="math inline">∂<em>u</em>/∂<em>Y</em></span>, <span class="math inline">∂<em>v</em>/∂<em>X</em></span>, <span class="math inline">∂<em>v</em>/∂<em>Y</em></span>) from channel 0 to 3. The interpolation operation can take this auxiliary tensor as input and compute image-space derivatives of any set of attributes being interpolated. Finally, the texture sampling operation requires the image-space derivatives of the texture coordinates if a prefiltered sampling mode is being used.</p>
<p>There is nothing magic about these image-space derivatives. They are tensors just like, e.g., the texture coordinates themselves; they propagate gradients backwards, and so on. For example, if you want to artificially blur or sharpen the texture when sampling it, you can simply multiply the tensor carrying the image-space derivatives of the texture coordinates <span class="math inline">∂{<em>s</em>, <em>t</em>}/∂{<em>X</em>, <em>Y</em>}</span> by a scalar value before feeding it into the texture sampling operation. This scales the texture footprints and thus adjusts the amount of prefiltering. If your loss function prefers a different level of sharpness, this multiplier will receive a nonzero gradient.</p>
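<p>A minimal sketch of such a sharpness multiplier in PyTorch (the learnable scalar <code>blur</code> and the other tensor names are introduced here purely for illustration):</p>
<pre><code>import torch
import nvdiffrast.torch as dr

blur = torch.tensor(1.0, device='cuda', requires_grad=True)  # footprint scale factor

# uv and uv_da come from the interpolation operation; scaling uv_da scales the
# texture footprints and therefore the amount of prefiltering.
color = dr.texture(tex, uv, uv_da * blur, filter_mode='linear-mipmap-linear')

# Alternatively, the texture op also accepts a per-pixel mip_level_bias input of
# shape [minibatch_size, height, width] that shifts the selected mip level
# directly (see the API reference below).
# color = dr.texture(tex, uv, uv_da, mip_level_bias=bias, filter_mode='linear-mipmap-linear')</code></pre>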
......@@ -713,6 +720,10 @@ Mip level 5
</div>
<p>Scaling the atlas to, say, 256×32 pixels would feel silly because the dimensions of the sub-images are perfectly fine, and downsampling the different sub-images together — which would happen after the 5×1 resolution — would not make sense anyway. For this reason, the texture sampling operation allows the user to specify the maximum number of mipmap levels to be constructed and used. In this case, setting <code>max_mip_level=5</code> would stop at the 5×1 mipmap and prevent the error.</p>
<p>It is a deliberate design choice that nvdiffrast doesn't just stop automatically at a mipmap size it cannot downsample, but requires the user to specify a limit when the texture dimensions are not powers of two. The goal is to avoid bugs where prefiltered texture sampling mysteriously doesn't work due to an oddly sized texture. It would be confusing if a 256×256 texture gave beautifully prefiltered texture samples, a 255×255 texture suddenly had no prefiltering at all, and a 254×254 texture did just a bit of prefiltering (one level) but not more.</p>
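<p>In PyTorch, the limit is passed directly to the texture sampling operation; for the atlas example above this might look like the following sketch:</p>
<pre><code># Construct and use mipmaps only down to level 5, avoiding the error that
# would otherwise be raised for the unevenly sized atlas texture.
color = dr.texture(tex, uv, uv_da, filter_mode='linear-mipmap-linear', max_mip_level=5)</code></pre>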
<h3 id="running-on-multiple-gpus">Running on multiple GPUs</h3>
<p>Nvdiffrast supports computation on multiple GPUs in both PyTorch and TensorFlow. As is the convention in PyTorch, the operations are always executed on the device on which the input tensors reside. All GPU input tensors must reside on the same device, and the output tensors will unsurprisingly end up on that same device. In addition, the rasterization operation requires that its OpenGL context was created for the correct device. In TensorFlow, the OpenGL context is automatically created on the device of the rasterization operation when it is executed for the first time.</p>
<p>On Windows, nvdiffrast implements OpenGL device selection in a way that can be done only once per process — after one context is created, all future ones will end up on the same GPU. Hence you cannot expect to run the rasterization operation on multiple GPUs within the same process. Trying to do so will either cause a crash or incur a significant performance penalty. However, with PyTorch it is common to distribute computation across GPUs by launching a separate process for each GPU, so this is not a huge concern. Note that any OpenGL context created within the same process, even for something like a GUI window, will prevent changing the device later. Therefore, if you want to run the rasterization operation on a GPU other than the default one, be sure to create its OpenGL context before initializing any other OpenGL-powered libraries.</p>
<p>On Linux everything just works, and you can create rasterizer OpenGL contexts on multiple devices within the same process.</p>
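<p>A sketch of running the rasterizer on a non-default GPU in PyTorch, using the <code>device</code> argument documented in the API reference below (tensor names are illustrative):</p>
<pre><code>import nvdiffrast.torch as dr

# Create the OpenGL context for a specific Cuda device. On Windows, do this
# before any other OpenGL context is created in the process.
glctx = dr.RasterizeGLContext(device='cuda:1')

# All GPU input tensors must reside on the same device as the context.
pos = pos.to('cuda:1')
tri = tri.to('cuda:1')
rast, _ = dr.rasterize(glctx, pos, tri, resolution=[512, 512])</code></pre>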
<h3 id="differences-between-pytorch-and-tensorflow">Differences between PyTorch and TensorFlow</h3>
<p>Nvdiffrast can be used from PyTorch and from TensorFlow 1.x; the latter may change to TensorFlow 2.x if there is demand. These frameworks operate somewhat differently and that is reflected in the respective APIs. Simplifying a bit, in TensorFlow 1.x you construct a persistent graph out of persistent nodes, and run many batches of data through it. In PyTorch, there is no persistent graph or nodes, but a new, ephemeral graph is constructed for each batch of data and destroyed immediately afterwards. Therefore, there is also no persistent state for the operations. There is the <code>torch.nn.Module</code> abstraction for festooning operations with persistent state, but we do not use it.</p>
<p>As a consequence, things that would be part of persistent state of an nvdiffrast operation in TensorFlow must be stored by the user in PyTorch, and supplied to the operations as needed. In practice, this is a very small difference and amounts to just a couple of lines of code in most cases.</p>
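<p>For example, a preconstructed mipmap stack that would be persistent op state in TensorFlow is simply held by the user in PyTorch and supplied on every call, as in this sketch using the <code>texture_construct_mip()</code> helper referenced in the API reference:</p>
<pre><code>import nvdiffrast.torch as dr

# Build the mipmap stack once (assuming the texture contents do not change)
# and keep it around instead of rebuilding it inside every texture call.
tex_mip = dr.texture_construct_mip(tex)

color = dr.texture(tex, uv, uv_da, mip=tex_mip, filter_mode='linear-mipmap-linear')</code></pre>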
......@@ -726,7 +737,7 @@ Mip level 5
<p>In manual mode, the user assumes the responsibility of setting and releasing the OpenGL context. Most of the time, if you don't have any other libraries that would be using OpenGL, you can just set the context once after having created it and keep it set until the program exits. However, keep in mind that the active OpenGL context is a thread-local resource, so it needs to be set in the same CPU thread as it will be used, and it cannot be set simultaneously in multiple CPU threads.</p>
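<p>A minimal sketch of manual context handling in PyTorch, assuming no other library in the process touches OpenGL:</p>
<pre><code>import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext(mode='manual')
glctx.set_context()       # Activate the context in this CPU thread.

rast, _ = dr.rasterize(glctx, pos, tri, resolution=[512, 512])

glctx.release_context()   # Deactivate when done.</code></pre>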
<h2 id="samples">Samples</h2>
<p>Nvdiffrast comes with a set of samples that were crafted to support the research paper. Each sample is available in both PyTorch and TensorFlow versions. Details such as command-line parameters, logging format, etc., may not be identical between the versions, and generally the PyTorch versions should be considered definitive. The command-line examples below are for the PyTorch versions.</p>
<h3 id="triangle.py">triangle.py</h3>
<h3 id="triangle.py"><a href="https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/triangle.py">triangle.py</a></h3>
<p>This is a minimal sample that renders a triangle and saves the resulting image into a file (<code>tri.png</code>) in the current directory. Running this should be the first step to verify that you have everything set up correctly. Rendering is done using the rasterization and interpolation operations, so getting the correct output image means that both OpenGL and CUDA are working as intended under the hood.</p>
<p>Example command line:</p>
<pre><code>python triangle.py</code></pre>
......@@ -740,7 +751,7 @@ The expected output image
</div>
</div>
</div>
<h3 id="cube.py">cube.py</h3>
<h3 id="cube.py"><a href="https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/cube.py">cube.py</a></h3>
<p>In this sample, we optimize the vertex positions and colors of a cube mesh, starting from a semi-randomly initialized state. The optimization is based on image-space loss in extremely low resolutions such as 4×4, 8×8, or 16×16 pixels. The goal of this sample is to examine the rate of geometrical convergence when the triangles are only a few pixels in size. It serves to illustrate that the antialiasing operation, despite being approximative, yields good enough position gradients even in 4×4 resolution to guide the optimization to the goal.</p>
<p>Example command line:</p>
<pre><code>python cube.py --resolution 16 --display-interval 10</code></pre>
......@@ -762,7 +773,7 @@ Rendering pipeline
</div>
<p>The image above shows a live view of the sample. Top row shows the low-resolution rendered image and reference image that the image-space loss is calculated from. Bottom row shows the current mesh (and colors) and reference mesh in high resolution so that convergence can be seen more easily visually.</p>
<p>In the pipeline diagram, green boxes indicate nvdiffrast operations, whereas blue boxes are other computation. Red boxes are the learned tensors and gray are non-learned tensors or other data.</p>
<h3 id="earth.py">earth.py</h3>
<h3 id="earth.py"><a href="https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/earth.py">earth.py</a></h3>
<p>The goal of this sample is to compare texture convergence with and without prefiltered texture sampling. The texture is learned based on image-space loss against high-quality reference renderings in random orientations and at random distances. When prefiltering is disabled, the texture is not learned properly because of spotty gradient updates caused by aliasing. This shows as a much worse PSNR for the texture, compared to learning with prefiltering enabled. See the paper for further discussion.</p>
<p>Example command lines:</p>
<table>
......@@ -800,7 +811,7 @@ Rendering pipeline
</div>
</div>
<p>The interactive view shows the current texture mapped onto the mesh, with or without prefiltered texture sampling as specified via the command-line parameter. In this sample, no antialiasing is performed because we are not learning vertex positions and hence need no gradients related to them.</p>
<h3 id="envphong.py">envphong.py</h3>
<h3 id="envphong.py"><a href="https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/envphong.py">envphong.py</a></h3>
<p>In this sample, a more complex shading model is used compared to the vertex colors or plain texture in the previous ones. Here, we learn a reflected environment map and parameters of a Phong BRDF model given a known mesh. The optimization is based on image-space loss against reference renderings in random orientations. The shading model of mirror reflection plus a Phong BRDF is not physically sensible, but it works as a reasonably simple strawman that would not be possible to implement with previous differentiable rasterizers that bundle rasterization, shading, lighting, and texturing together. The sample also illustrates the use of cube mapping for representing a learned texture in a spherical domain.</p>
<p>Example command line:</p>
<pre><code>python envphong.py --display-interval 10</code></pre>
......@@ -821,7 +832,7 @@ Rendering pipeline
</div>
</div>
<p>In the interactive view, we see the rendering with the current environment map and Phong BRDF parameters, both gradually improving during the optimization.</p>
<h3 id="pose.py">pose.py</h3>
<h3 id="pose.py"><a href="https://github.com/NVlabs/nvdiffrast/blob/main/samples/torch/pose.py">pose.py</a></h3>
<p>Pose fitting based on an image-space loss is a classical task in differentiable rendering. In this sample, we solve a pose optimization problem with a simple cube with differently colored sides. We detail the optimization method in the paper, but in brief, it combines gradient-free greedy optimization in an initialization phase and gradient-based optimization in a fine-tuning phase.</p>
<p>Example command line:</p>
<pre><code>python pose.py --display-interval 10</code></pre>
......@@ -838,10 +849,13 @@ Interactive view of pose.py
<p>The interactive view shows, from left to right: target pose, best found pose, and current pose. When viewed live, the two stages of optimization are clearly visible. In the first phase, the best pose updates intermittently when a better initialization is found. In the second phase, the solution converges smoothly to the target via gradient-based optimization.</p>
<h2 id="pytorch-api-reference">PyTorch API reference</h2>
<div style="padding-top: 1em;">
<div class="apifunc"><h4><code>nvdiffrast.torch.RasterizeGLContext(<em>output_db</em>=<span class="defarg">True</span>, <em>mode</em>=<span class="defarg">'automatic'</span>)</code>&nbsp;<span class="sym_class">Class</span></h4>
<div class="apifunc"><h4><code>nvdiffrast.torch.RasterizeGLContext(<em>output_db</em>=<span class="defarg">True</span>, <em>mode</em>=<span class="defarg">'automatic'</span>, <em>device</em>=<span class="defarg">None</span>)</code>&nbsp;<span class="sym_class">Class</span></h4>
<p class="shortdesc">Create a new OpenGL rasterizer context.</p><p class="longdesc">Creating an OpenGL context is a slow operation so you should reuse the same
context in all calls to <code>rasterize()</code> on the same CPU thread. The OpenGL context
is deleted when the object is destroyed.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">output_db</td><td class="arg_short">Compute and output image-space derivatives of barycentrics.</td></tr><tr class="arg"><td class="argname">mode</td><td class="arg_short">OpenGL context handling mode. Valid values are 'manual' and 'automatic'.</td></tr></table><div class="methods">Methods, only available if context was created in manual mode:</div><table class="args"><tr class="arg"><td class="argname">set_context()</td><td class="arg_short">Set (activate) OpenGL context in the current CPU thread.</td></tr><tr class="arg"><td class="argname">release_context()</td><td class="arg_short">Release (deactivate) currently active OpenGL context.</td></tr></table><div class="returns">Returns:<div class="return_description">The newly created OpenGL rasterizer context.</div></div></div>
is deleted when the object is destroyed.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">output_db</td><td class="arg_short">Compute and output image-space derivatives of barycentrics.</td></tr><tr class="arg"><td class="argname">mode</td><td class="arg_short">OpenGL context handling mode. Valid values are 'manual' and 'automatic'.</td></tr><tr class="arg"><td class="argname">device</td><td class="arg_short">Cuda device on which the context is created. Type can be
<code>torch.device</code>, string (e.g., <code>'cuda:1'</code>), or int. If not
specified, context will be created on currently active Cuda
device.</td></tr></table><div class="methods">Methods, only available if context was created in manual mode:</div><table class="args"><tr class="arg"><td class="argname">set_context()</td><td class="arg_short">Set (activate) OpenGL context in the current CPU thread.</td></tr><tr class="arg"><td class="argname">release_context()</td><td class="arg_short">Release (deactivate) currently active OpenGL context.</td></tr></table><div class="returns">Returns:<div class="return_description">The newly created OpenGL rasterizer context.</div></div></div>
<div class="apifunc"><h4><code>nvdiffrast.torch.rasterize(<em>glctx</em>, <em>pos</em>, <em>tri</em>, <em>resolution</em>, <em>ranges</em>=<span class="defarg">None</span>, <em>grad_db</em>=<span class="defarg">True</span>)</code>&nbsp;<span class="sym_function">Function</span></h4>
<p class="shortdesc">Rasterize triangles.</p><p class="longdesc">All input tensors must be contiguous and reside in GPU memory except for
the <code>ranges</code> tensor that, if specified, has to reside in CPU memory. The
......@@ -875,23 +889,24 @@ the image-space derivatives of the selected attributes and has shape
first selected attribute A will be on channels 0 and 1 as (dA/dX, dA/dY), etc.
Otherwise, the second output tensor will be an empty tensor with shape
[minibatch_size, height, width, 0].</div></div></div>
<div class="apifunc"><h4><code>nvdiffrast.torch.texture(<em>tex</em>, <em>uv</em>, <em>uv_da</em>=<span class="defarg">None</span>, <em>mip</em>=<span class="defarg">None</span>, <em>filter_mode</em>=<span class="defarg">'auto'</span>, <em>boundary_mode</em>=<span class="defarg">'wrap'</span>, <em>max_mip_level</em>=<span class="defarg">None</span>)</code>&nbsp;<span class="sym_function">Function</span></h4>
<div class="apifunc"><h4><code>nvdiffrast.torch.texture(<em>tex</em>, <em>uv</em>, <em>uv_da</em>=<span class="defarg">None</span>, <em>mip_level_bias</em>=<span class="defarg">None</span>, <em>mip</em>=<span class="defarg">None</span>, <em>filter_mode</em>=<span class="defarg">'auto'</span>, <em>boundary_mode</em>=<span class="defarg">'wrap'</span>, <em>max_mip_level</em>=<span class="defarg">None</span>)</code>&nbsp;<span class="sym_function">Function</span></h4>
<p class="shortdesc">Perform texture sampling.</p><p class="longdesc">All input tensors must be contiguous and reside in GPU memory. The output tensor
will be contiguous and reside in GPU memory.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">tex</td><td class="arg_short">Texture tensor with dtype <code>torch.float32</code>. For 2D textures, must have shape
[minibatch_size, tex_height, tex_width, tex_channels]. For cube map textures,
must have shape [minibatch_size, 6, tex_height, tex_width, tex_channels] where
tex_width and tex_height are equal. Note that <code>boundary_mode</code> must also be set
to 'cube' to enable cube map mode. Broadcasting is supported along the minibatch axis.</td></tr><tr class="arg"><td class="argname">uv</td><td class="arg_short">Tensor containing per-pixel texture coordinates. When sampling a 2D texture,
tex_width and tex_height are equal. Note that <code>boundary_mode</code> must also be set
to 'cube' to enable cube map mode. Broadcasting is supported along the minibatch axis.</td></tr><tr class="arg"><td class="argname">uv</td><td class="arg_short">Tensor containing per-pixel texture coordinates. When sampling a 2D texture,
must have shape [minibatch_size, height, width, 2]. When sampling a cube map
texture, must have shape [minibatch_size, height, width, 3].</td></tr><tr class="arg"><td class="argname">uv_da</td><td class="arg_short">(Optional) Tensor containing image-space derivatives of texture coordinates.
Must have same shape as <code>uv</code> except for the last dimension that is to be twice
as long.</td></tr><tr class="arg"><td class="argname">mip</td><td class="arg_short">(Optional) Preconstructed mipmap stack from a <code>texture_construct_mip()</code> call. If not
specified, the mipmap stack is constructed internally and discarded afterwards.</td></tr><tr class="arg"><td class="argname">filter_mode</td><td class="arg_short">Texture filtering mode to be used. Valid values are 'auto', 'nearest',
as long.</td></tr><tr class="arg"><td class="argname">mip_level_bias</td><td class="arg_short">(Optional) Per-pixel bias for mip level selection. If <code>uv_da</code> is omitted,
determines mip level directly. Must have shape [minibatch_size, height, width].</td></tr><tr class="arg"><td class="argname">mip</td><td class="arg_short">(Optional) Preconstructed mipmap stack from a <code>texture_construct_mip()</code> call. If not
specified, the mipmap stack is constructed internally and discarded afterwards.</td></tr><tr class="arg"><td class="argname">filter_mode</td><td class="arg_short">Texture filtering mode to be used. Valid values are 'auto', 'nearest',
'linear', 'linear-mipmap-nearest', and 'linear-mipmap-linear'. Mode 'auto'
selects 'linear' if <code>uv_da</code> is not specified, and 'linear-mipmap-linear'
when <code>uv_da</code> is specified, these being the highest-quality modes possible
depending on the availability of the image-space derivatives of the texture
coordinates.</td></tr><tr class="arg"><td class="argname">boundary_mode</td><td class="arg_short">Valid values are 'wrap', 'clamp', 'zero', and 'cube'. If <code>tex</code> defines a
selects 'linear' if neither <code>uv_da</code> nor <code>mip_level_bias</code> is specified, and
'linear-mipmap-linear' when at least one of them is specified, these being
the highest-quality modes possible depending on the availability of the
image-space derivatives of the texture coordinates or direct mip level information.</td></tr><tr class="arg"><td class="argname">boundary_mode</td><td class="arg_short">Valid values are 'wrap', 'clamp', 'zero', and 'cube'. If <code>tex</code> defines a
cube map, this must be set to 'cube'. The default mode 'wrap' takes fractional
part of texture coordinates. Mode 'clamp' clamps texture coordinates to the
centers of the boundary texels. Mode 'zero' virtually extends the texture with
......@@ -915,9 +930,9 @@ GPU memory.</td></tr></table><div class="returns">Returns:<div class="return_des
<p class="shortdesc">Get current log level.</p><p class="longdesc"></p><div class="returns">Returns:<div class="return_description">Current log level in nvdiffrast. See <code>set_log_level()</code> for possible values.</div></div></div>
<div class="apifunc"><h4><code>nvdiffrast.torch.set_log_level(<em>level</em>)</code>&nbsp;<span class="sym_function">Function</span></h4>
<p class="shortdesc">Set log level.</p><p class="longdesc">Log levels follow the convention on the C++ side of Torch:
0 = Info,
1 = Warning,
2 = Error,
0 = Info,
1 = Warning,
2 = Error,
3 = Fatal.
The default log level is 1.</p><div class="arguments">Arguments:</div><table class="args"><tr class="arg"><td class="argname">level</td><td class="arg_short">New log level as integer. Internal nvdiffrast messages of this
severity or higher will be printed, while messages of lower
......
......@@ -6,4 +6,4 @@
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.
__version__ = '0.2.0'
__version__ = '0.2.1'
......@@ -185,6 +185,8 @@ template<class T> static __device__ __forceinline__ void swap(T& a, T& b)
//------------------------------------------------------------------------
// Coalesced atomics. These are all done via macros.
#if __CUDA_ARCH__ >= 700 // Warp match instruction __match_any_sync() is only available on compute capability 7.x and higher
#define CA_TEMP _ca_temp
#define CA_TEMP_PARAM float* CA_TEMP
#define CA_DECLARE_TEMP(threads_per_block) \
......@@ -228,5 +230,24 @@ template<class T> static __device__ __forceinline__ void swap(T& a, T& b)
caAtomicAdd((ptr)+(idx), (value)); \
} while(0)
//------------------------------------------------------------------------
// Disable atomic coalescing for compute capability lower than 7.x
#else // __CUDA_ARCH__ >= 700
#define CA_TEMP _ca_temp
#define CA_TEMP_PARAM float CA_TEMP
#define CA_DECLARE_TEMP(threads_per_block) CA_TEMP_PARAM
#define CA_SET_GROUP_MASK(group, thread_mask)
#define CA_SET_GROUP(group)
#define caAtomicAdd(ptr, value) atomicAdd((ptr), (value))
#define caAtomicAdd3_xyw(ptr, x, y, w) \
do { \
atomicAdd((ptr), (x)); \
atomicAdd((ptr)+1, (y)); \
atomicAdd((ptr)+3, (w)); \
} while(0)
#define caAtomicAddTexture(ptr, level, idx, value) atomicAdd((ptr)+(idx), (value))
#endif // __CUDA_ARCH__ >= 700
//------------------------------------------------------------------------
#endif // __CUDACC__
......@@ -36,6 +36,7 @@ using namespace tensorflow::shape_inference;
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDAUtils.h>
#include <c10/cuda/CUDAGuard.h>
#include <pybind11/numpy.h>
#endif
#define NVDR_CTX_ARGS int _nvdr_ctx_dummy
......
......@@ -37,26 +37,43 @@ struct GLContext
static void setGLContext(GLContext& glctx)
{
if (!glctx.hglrc)
LOG(ERROR) << "setGLContext() called with null gltcx";
LOG(FATAL) << "setGLContext() called with null gltcx";
if (!wglMakeCurrent(glctx.hdc, glctx.hglrc))
LOG(ERROR) << "wglMakeCurrent() failed when setting GL context";
LOG(FATAL) << "wglMakeCurrent() failed when setting GL context";
if (glctx.glewInitialized)
return;
GLenum result = glewInit();
if (result != GLEW_OK)
LOG(ERROR) << "glewInit() failed, return value = " << result;
LOG(FATAL) << "glewInit() failed, return value = " << result;
glctx.glewInitialized = 1;
}
static void releaseGLContext(void)
{
if (!wglMakeCurrent(NULL, NULL))
LOG(ERROR) << "wglMakeCurrent() failed when releasing GL context";
LOG(FATAL) << "wglMakeCurrent() failed when releasing GL context";
}
static GLContext createGLContext(void)
extern "C" int set_gpu(const char*);
static GLContext createGLContext(int cudaDeviceIdx)
{
if (cudaDeviceIdx >= 0)
{
char pciBusId[256] = "";
LOG(INFO) << "Creating GL context for Cuda device " << cudaDeviceIdx;
if (cudaDeviceGetPCIBusId(pciBusId, 255, cudaDeviceIdx) != CUDA_SUCCESS)
{
LOG(INFO) << "PCI bus id query failed";
}
else
{
int res = set_gpu(pciBusId);
LOG(INFO) << "Selecting device with PCI bus id " << pciBusId << " - " << (res ? "failed, expect crash or major slowdown" : "success");
}
}
HINSTANCE hInstance = GetModuleHandle(NULL);
WNDCLASS wc = {};
wc.style = CS_OWNDC;
......@@ -101,7 +118,7 @@ static GLContext createGLContext(void)
static void destroyGLContext(GLContext& glctx)
{
if (!glctx.hglrc)
LOG(ERROR) << "destroyGLContext() called with null gltcx";
LOG(FATAL) << "destroyGLContext() called with null gltcx";
// If this is the current context, release it.
if (wglGetCurrentContext() == glctx.hglrc)
......@@ -109,13 +126,13 @@ static void destroyGLContext(GLContext& glctx)
HWND hwnd = WindowFromDC(glctx.hdc);
if (!hwnd)
LOG(ERROR) << "WindowFromDC() failed";
LOG(FATAL) << "WindowFromDC() failed";
if (!ReleaseDC(hwnd, glctx.hdc))
LOG(ERROR) << "ReleaseDC() failed";
LOG(FATAL) << "ReleaseDC() failed";
if (!wglDeleteContext(glctx.hglrc))
LOG(ERROR) << "wglDeleteContext() failed";
LOG(FATAL) << "wglDeleteContext() failed";
if (!DestroyWindow(hwnd))
LOG(ERROR) << "DestroyWindow() failed";
LOG(FATAL) << "DestroyWindow() failed";
LOG(INFO) << std::hex << std::setfill('0')
<< "WGL OpenGL context destroyed (hdc: 0x" << std::setw(8) << (uint32_t)(uintptr_t)glctx.hdc
......@@ -140,6 +157,7 @@ static void destroyGLContext(GLContext& glctx)
# include <GL/glew.h> // Use system-supplied glew.h
#endif
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GL/gl.h>
#include <cuda_gl_interop.h>
......@@ -148,7 +166,6 @@ static void destroyGLContext(GLContext& glctx)
struct GLContext
{
EGLDisplay display;
EGLSurface surface;
EGLContext context;
int glewInitialized;
};
......@@ -158,9 +175,9 @@ struct GLContext
static void setGLContext(GLContext& glctx)
{
if (!glctx.context)
LOG(ERROR) << "setGLContext() called with null gltcx";
LOG(FATAL) << "setGLContext() called with null gltcx";
if (!eglMakeCurrent(glctx.display, glctx.surface, glctx.surface, glctx.context))
if (!eglMakeCurrent(glctx.display, EGL_NO_SURFACE, EGL_NO_SURFACE, glctx.context))
LOG(ERROR) << "eglMakeCurrent() failed when setting GL context";
if (glctx.glewInitialized)
......@@ -168,7 +185,7 @@ static void setGLContext(GLContext& glctx)
GLenum result = glewInit();
if (result != GLEW_OK)
LOG(ERROR) << "glewInit() failed, return value = " << result;
LOG(FATAL) << "glewInit() failed, return value = " << result;
glctx.glewInitialized = 1;
}
......@@ -178,21 +195,83 @@ static void releaseGLContext(void)
if (display == EGL_NO_DISPLAY)
LOG(WARNING) << "releaseGLContext() called with no active display";
if (!eglMakeCurrent(display, EGL_NO_SURFACE, EGL_NO_SURFACE, EGL_NO_CONTEXT))
LOG(ERROR) << "eglMakeCurrent() failed when releasing GL context";
LOG(FATAL) << "eglMakeCurrent() failed when releasing GL context";
}
static GLContext createGLContext(void)
static EGLDisplay getCudaDisplay(int cudaDeviceIdx)
{
// Initialize.
typedef EGLBoolean (*eglQueryDevicesEXT_t)(EGLint, EGLDeviceEXT, EGLint*);
typedef EGLBoolean (*eglQueryDeviceAttribEXT_t)(EGLDeviceEXT, EGLint, EGLAttrib*);
typedef EGLDisplay (*eglGetPlatformDisplayEXT_t)(EGLenum, void*, const EGLint*);
EGLDisplay display = eglGetDisplay(EGL_DEFAULT_DISPLAY);
if (display == EGL_NO_DISPLAY)
LOG(ERROR) << "eglGetDisplay() failed";
eglQueryDevicesEXT_t eglQueryDevicesEXT = (eglQueryDevicesEXT_t)eglGetProcAddress("eglQueryDevicesEXT");
if (!eglQueryDevicesEXT)
{
LOG(INFO) << "eglGetProcAddress(\"eglQueryDevicesEXT\") failed";
return 0;
}
eglQueryDeviceAttribEXT_t eglQueryDeviceAttribEXT = (eglQueryDeviceAttribEXT_t)eglGetProcAddress("eglQueryDeviceAttribEXT");
if (!eglQueryDeviceAttribEXT)
{
LOG(INFO) << "eglGetProcAddress(\"eglQueryDeviceAttribEXT\") failed";
return 0;
}
eglGetPlatformDisplayEXT_t eglGetPlatformDisplayEXT = (eglGetPlatformDisplayEXT_t)eglGetProcAddress("eglGetPlatformDisplayEXT");
if (!eglGetPlatformDisplayEXT)
{
LOG(INFO) << "eglGetProcAddress(\"eglGetPlatformDisplayEXT\") failed";
return 0;
}
int num_devices = 0;
eglQueryDevicesEXT(0, 0, &num_devices);
if (!num_devices)
return 0;
EGLDisplay display = 0;
EGLDeviceEXT* devices = (EGLDeviceEXT*)malloc(num_devices * sizeof(void*));
eglQueryDevicesEXT(num_devices, devices, &num_devices);
for (int i=0; i < num_devices; i++)
{
EGLDeviceEXT device = devices[i];
intptr_t value = -1;
if (eglQueryDeviceAttribEXT(device, EGL_CUDA_DEVICE_NV, &value) && value == cudaDeviceIdx)
{
display = eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT, device, 0);
break;
}
}
free(devices);
return display;
}
static GLContext createGLContext(int cudaDeviceIdx)
{
EGLDisplay display = 0;
if (cudaDeviceIdx >= 0)
{
char pciBusId[256] = "";
LOG(INFO) << "Creating GL context for Cuda device " << cudaDeviceIdx;
display = getCudaDisplay(cudaDeviceIdx);
if (!display)
LOG(INFO) << "Failed, falling back to default display";
}
if (!display)
{
display = eglGetDisplay(EGL_DEFAULT_DISPLAY);
if (display == EGL_NO_DISPLAY)
LOG(FATAL) << "eglGetDisplay() failed";
}
EGLint major;
EGLint minor;
if (!eglInitialize(display, &major, &minor))
LOG(ERROR) << "eglInitialize() failed";
LOG(FATAL) << "eglInitialize() failed";
// Choose configuration.
......@@ -211,45 +290,32 @@ static GLContext createGLContext(void)
EGLConfig config;
EGLint num_config;
if (!eglChooseConfig(display, context_attribs, &config, 1, &num_config))
LOG(ERROR) << "eglChooseConfig() failed";
// Create dummy pbuffer surface.
const EGLint surface_attribs[] = {
EGL_WIDTH, 1,
EGL_HEIGHT, 1,
EGL_NONE
};
EGLSurface surface = eglCreatePbufferSurface(display, config, surface_attribs);
if (surface == EGL_NO_SURFACE)
LOG(ERROR) << "eglCreatePbufferSurface() failed";
LOG(FATAL) << "eglChooseConfig() failed";
// Create GL context.
if (!eglBindAPI(EGL_OPENGL_API))
LOG(ERROR) << "eglBindAPI() failed";
LOG(FATAL) << "eglBindAPI() failed";
EGLContext context = eglCreateContext(display, config, EGL_NO_CONTEXT, NULL);
if (context == EGL_NO_CONTEXT)
LOG(ERROR) << "eglCreateContext() failed";
LOG(FATAL) << "eglCreateContext() failed";
// Done.
LOG(INFO) << "EGL " << (int)minor << "." << (int)major << " OpenGL context created (disp: 0x"
<< std::hex << std::setfill('0')
<< std::setw(16) << (uintptr_t)display
<< ", surf: 0x" << std::setw(16) << (uintptr_t)surface
<< ", ctx: 0x" << std::setw(16) << (uintptr_t)context << ")";
GLContext glctx = {display, surface, context, 0};
GLContext glctx = {display, context, 0};
return glctx;
}
static void destroyGLContext(GLContext& glctx)
{
if (!glctx.context)
LOG(ERROR) << "destroyGLContext() called with null gltcx";
LOG(FATAL) << "destroyGLContext() called with null gltcx";
// If this is the current context, release it.
if (eglGetCurrentContext() == glctx.context)
......@@ -257,13 +323,10 @@ static void destroyGLContext(GLContext& glctx)
if (!eglDestroyContext(glctx.display, glctx.context))
LOG(ERROR) << "eglDestroyContext() failed";
if (!eglDestroySurface(glctx.display, glctx.surface))
LOG(ERROR) << "eglDestroySurface() failed";
LOG(INFO) << "EGL OpenGL context destroyed (disp: 0x"
<< std::hex << std::setfill('0')
<< std::setw(16) << (uintptr_t)glctx.display
<< ", surf: 0x" << std::setw(16) << (uintptr_t)glctx.surface
<< ", ctx: 0x" << std::setw(16) << (uintptr_t)glctx.context << ")";
memset(&glctx, 0, sizeof(GLContext));
......
......@@ -76,12 +76,12 @@ static void constructGLProgram(NVDR_CTX_ARGS, GLuint* pProgram, GLuint glVertexS
//------------------------------------------------------------------------
// Shared C++ functions.
void rasterizeInitGLContext(NVDR_CTX_ARGS, RasterizeGLState& s)
void rasterizeInitGLContext(NVDR_CTX_ARGS, RasterizeGLState& s, int cudaDeviceIdx)
{
// Create GL context and set it current.
s.glctx = createGLContext();
s.glctx = createGLContext(cudaDeviceIdx);
setGLContext(s.glctx);
// Version check.
GLint vMajor = 0;
GLint vMinor = 0;
......@@ -90,7 +90,7 @@ void rasterizeInitGLContext(NVDR_CTX_ARGS, RasterizeGLState& s)
glGetError(); // Clear possible GL_INVALID_ENUM error in version query.
LOG(INFO) << "OpenGL version reported as " << vMajor << "." << vMinor;
NVDR_CHECK((vMajor == 4 && vMinor >= 4) || vMajor > 4, "OpenGL 4.4 or later is required");
// Number of output buffers.
int num_outputs = s.enableDB ? 2 : 1;
......@@ -319,7 +319,7 @@ void rasterizeResizeBuffers(NVDR_CTX_ARGS, RasterizeGLState& s, int posCount, in
s.width = ROUND_UP(s.width, 32);
s.height = ROUND_UP(s.height, 32);
LOG(INFO) << "Increasing frame buffer size to (width, height, depth) = (" << s.width << ", " << s.height << ", " << s.depth << ")";
// Allocate color buffers.
for (int i=0; i < num_outputs; i++)
{
......
......@@ -83,7 +83,7 @@ struct RasterizeGLState
//------------------------------------------------------------------------
// Shared C++ code prototypes.
void rasterizeInitGLContext(NVDR_CTX_ARGS, RasterizeGLState& s);
void rasterizeInitGLContext(NVDR_CTX_ARGS, RasterizeGLState& s, int cudaDeviceIdx);
void rasterizeResizeBuffers(NVDR_CTX_ARGS, RasterizeGLState& s, int posCount, int triCount, int width, int height, int depth);
void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, const float* posPtr, int posCount, int vtxPerInstance, const int32_t* triPtr, int triCount, const int32_t* rangesPtr, int width, int height, int depth);
void rasterizeCopyResults(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, float** outputPtr, int width, int height, int depth);
......
......@@ -50,7 +50,7 @@ static __device__ __forceinline__ int4 wrapCubeMap(int face, int ix0, int ix1, i
int cy = (iy0 < 0) ? 0 : (iy1 >= w) ? 6 : 3;
int c = cx + cy;
if (c >= 5)
c--;
c--;
c = (face << 3) + c;
// Compute coordinates and faces.
......@@ -250,7 +250,7 @@ static __device__ __forceinline__ void indexCubeMapGrad2(float3 uv, float3 dvdX,
float mv = (idx & 0x2e) ? -mm : mm;
gu *= -2.0 * m * mu;
gv *= -2.0 * m * mv;
if (idx & 0x03)
{
dx.x = gu * dvdX.x + dm * dvdX.z;
......@@ -311,7 +311,7 @@ static __device__ __forceinline__ int indexTextureNearest(const TextureKernelPar
// Cube map indexing.
if (CUBE_MODE)
{
{
// No wrap. Fold face index into tz right away.
tz = 6 * tz + indexCubeMap(u, v, uv.z); // Rewrites u, v.
}
......@@ -379,7 +379,7 @@ static __device__ __forceinline__ float2 indexTextureLinear(const TextureKernelP
// Move to texel space.
u = u * (float)w - 0.5f;
v = v * (float)h - 0.5f;
if (p.boundaryMode == TEX_BOUNDARY_MODE_CLAMP)
{
// Clamp to center of edge texels.
......@@ -387,7 +387,7 @@ static __device__ __forceinline__ float2 indexTextureLinear(const TextureKernelP
v = fminf(fmaxf(v, 0.f), h - 1.f);
clampU = (u == 0.f || u == w - 1.f);
clampV = (v == 0.f || v == h - 1.f);
}
}
}
// Compute texel coordinates and weights.
......@@ -397,7 +397,7 @@ static __device__ __forceinline__ float2 indexTextureLinear(const TextureKernelP
int iv1 = iv0 + (clampV ? 0 : 1);
u -= (float)iu0;
v -= (float)iv0;
// Cube map wrapping.
bool cubeWrap = CUBE_MODE && (iu0 < 0 || iv0 < 0 || iu1 >= w || iv1 >= h);
if (cubeWrap)
......@@ -448,107 +448,109 @@ static __device__ __forceinline__ float2 indexTextureLinear(const TextureKernelP
//------------------------------------------------------------------------
// Mip level calculation.
template <bool CUBE_MODE, int FILTER_MODE>
template <bool CUBE_MODE, bool BIAS_ONLY, int FILTER_MODE>
static __device__ __forceinline__ void calculateMipLevel(int& level0, int& level1, float& flevel, const TextureKernelParams& p, int pidx, float3 uv, float4* pdw, float3* pdfdv)
{
// Do nothing if mips not in use.
if (FILTER_MODE == TEX_MODE_NEAREST || FILTER_MODE == TEX_MODE_LINEAR)
return;
// Get pixel derivatives of texture coordinates.
float4 uvDA;
float3 dvdX, dvdY; // Gradients use these later.
if (CUBE_MODE)
{
// Fetch.
float2 d0 = ((const float2*)p.uvDA)[3 * pidx + 0];
float2 d1 = ((const float2*)p.uvDA)[3 * pidx + 1];
float2 d2 = ((const float2*)p.uvDA)[3 * pidx + 2];
// Map d{x,y,z}/d{X,Y} into d{s,t}/d{X,Y}.
dvdX = make_float3(d0.x, d1.x, d2.x); // d{x,y,z}/dX
dvdY = make_float3(d0.y, d1.y, d2.y); // d{x,y,z}/dY
uvDA = indexCubeMapGradST(uv, dvdX, dvdY); // d{s,t}/d{X,Y}
}
else
{
// Fetch.
uvDA = ((const float4*)p.uvDA)[pidx];
}
// Scaling factors.
float uscl = p.texWidth;
float vscl = p.texHeight;
// d[s,t]/d[X,Y].
float dsdx = uvDA.x * uscl;
float dsdy = uvDA.y * uscl;
float dtdx = uvDA.z * vscl;
float dtdy = uvDA.w * vscl;
// Calculate footprint axis lengths.
float A = dsdx*dsdx + dtdx*dtdx;
float B = dsdy*dsdy + dtdy*dtdy;
float C = dsdx*dsdy + dtdx*dtdy;
float l2b = 0.5 * (A + B);
float l2n = 0.25 * (A-B)*(A-B) + C*C;
float l2a = sqrt(l2n);
float lenMinorSqr = fmaxf(0.0, l2b - l2a);
float lenMajorSqr = l2b + l2a;
// Footprint vs. mip level gradient.
if (pdw && FILTER_MODE == TEX_MODE_LINEAR_MIPMAP_LINEAR)
// Determine mip level based on UV pixel derivatives. If no derivatives are given (mip level bias only), leave as zero.
if (!BIAS_ONLY)
{
float dw = 0.72134752f / (l2n + l2a * l2b); // Constant is 0.5/ln(2).
float AB = dw * .5f * (A - B);
float Cw = dw * C;
float l2aw = dw * l2a;
float d_f_ddsdX = uscl * (dsdx * (l2aw + AB) + dsdy * Cw);
float d_f_ddsdY = uscl * (dsdy * (l2aw - AB) + dsdx * Cw);
float d_f_ddtdX = vscl * (dtdx * (l2aw + AB) + dtdy * Cw);
float d_f_ddtdY = vscl * (dtdy * (l2aw - AB) + dtdx * Cw);
*pdw = make_float4(d_f_ddsdX, d_f_ddsdY, d_f_ddtdX, d_f_ddtdY);
// In cube maps, there is also a texture coordinate vs. mip level gradient.
// Get pixel derivatives of texture coordinates.
float4 uvDA;
float3 dvdX, dvdY; // Gradients use these later.
if (CUBE_MODE)
{
float4 dx, dy, dz;
indexCubeMapGrad2(uv, dvdX, dvdY, dx, dy, dz);
// Fetch.
float2 d0 = ((const float2*)p.uvDA)[3 * pidx + 0];
float2 d1 = ((const float2*)p.uvDA)[3 * pidx + 1];
float2 d2 = ((const float2*)p.uvDA)[3 * pidx + 2];
// Map d{x,y,z}/d{X,Y} into d{s,t}/d{X,Y}.
dvdX = make_float3(d0.x, d1.x, d2.x); // d{x,y,z}/dX
dvdY = make_float3(d0.y, d1.y, d2.y); // d{x,y,z}/dY
uvDA = indexCubeMapGradST(uv, dvdX, dvdY); // d{s,t}/d{X,Y}
}
else
{
// Fetch.
uvDA = ((const float4*)p.uvDA)[pidx];
}
float3 d_dsdX_dv = make_float3(dx.x, dy.x, dz.x);
float3 d_dsdY_dv = make_float3(dx.y, dy.y, dz.y);
float3 d_dtdX_dv = make_float3(dx.z, dy.z, dz.z);
float3 d_dtdY_dv = make_float3(dx.w, dy.w, dz.w);
// Scaling factors.
float uscl = p.texWidth;
float vscl = p.texHeight;
// d[s,t]/d[X,Y].
float dsdx = uvDA.x * uscl;
float dsdy = uvDA.y * uscl;
float dtdx = uvDA.z * vscl;
float dtdy = uvDA.w * vscl;
// Calculate footprint axis lengths.
float A = dsdx*dsdx + dtdx*dtdx;
float B = dsdy*dsdy + dtdy*dtdy;
float C = dsdx*dsdy + dtdx*dtdy;
float l2b = 0.5 * (A + B);
float l2n = 0.25 * (A-B)*(A-B) + C*C;
float l2a = sqrt(l2n);
float lenMinorSqr = fmaxf(0.0, l2b - l2a);
float lenMajorSqr = l2b + l2a;
// Footprint vs. mip level gradient.
if (pdw && FILTER_MODE == TEX_MODE_LINEAR_MIPMAP_LINEAR)
{
float dw = 0.72134752f / (l2n + l2a * l2b); // Constant is 0.5/ln(2).
float AB = dw * .5f * (A - B);
float Cw = dw * C;
float l2aw = dw * l2a;
float d_f_ddsdX = uscl * (dsdx * (l2aw + AB) + dsdy * Cw);
float d_f_ddsdY = uscl * (dsdy * (l2aw - AB) + dsdx * Cw);
float d_f_ddtdX = vscl * (dtdx * (l2aw + AB) + dtdy * Cw);
float d_f_ddtdY = vscl * (dtdy * (l2aw - AB) + dtdx * Cw);
*pdw = make_float4(d_f_ddsdX, d_f_ddsdY, d_f_ddtdX, d_f_ddtdY);
// In cube maps, there is also a texture coordinate vs. mip level gradient.
if (CUBE_MODE)
{
float4 dx, dy, dz;
indexCubeMapGrad2(uv, dvdX, dvdY, dx, dy, dz);
float3 d_f_dv = make_float3(0.f, 0.f, 0.f);
d_f_dv += d_dsdX_dv * d_f_ddsdX;
d_f_dv += d_dsdY_dv * d_f_ddsdY;
d_f_dv += d_dtdX_dv * d_f_ddtdX;
d_f_dv += d_dtdY_dv * d_f_ddtdY;
float3 d_dsdX_dv = make_float3(dx.x, dy.x, dz.x);
float3 d_dsdY_dv = make_float3(dx.y, dy.y, dz.y);
float3 d_dtdX_dv = make_float3(dx.z, dy.z, dz.z);
float3 d_dtdY_dv = make_float3(dx.w, dy.w, dz.w);
*pdfdv = d_f_dv;
float3 d_f_dv = make_float3(0.f, 0.f, 0.f);
d_f_dv += d_dsdX_dv * d_f_ddsdX;
d_f_dv += d_dsdY_dv * d_f_ddsdY;
d_f_dv += d_dtdX_dv * d_f_ddtdX;
d_f_dv += d_dtdY_dv * d_f_ddtdY;
*pdfdv = d_f_dv;
}
}
// Finally, calculate mip level.
flevel = .5f * __log2f(lenMajorSqr);
}
// Calculate true mip level and clamp.
flevel = .5f * __log2f(lenMajorSqr);
// Bias the mip level and clamp.
if (p.mipLevelBias)
flevel += p.mipLevelBias[pidx];
flevel = fminf(fmaxf(flevel, 0.f), (float)p.mipLevelMax);
if (FILTER_MODE == TEX_MODE_LINEAR_MIPMAP_NEAREST)
{
// Linear-mipmap-nearest.
level0 = __float2int_rn(flevel);
}
else
// Calculate levels depending on filter mode.
level0 = __float2int_rd(flevel);
// Leave everything else at zero if flevel == 0 (magnification) or when in linear-mipmap-nearest mode.
if (FILTER_MODE == TEX_MODE_LINEAR_MIPMAP_LINEAR && flevel > 0.f)
{
// Linear-mipmap-linear.
if (flevel > 0.f) // Leave everything at zero if flevel == 0 (magnification)
{
level0 = __float2int_rd(flevel);
level1 = min(level0 + 1, p.mipLevelMax);
flevel -= level0; // Fractional part. Zero if clamped on last level.
}
level1 = min(level0 + 1, p.mipLevelMax);
flevel -= level0; // Fractional part. Zero if clamped on last level.
}
}
......@@ -672,7 +674,7 @@ __global__ void MipBuildKernel4(const TextureKernelParams p) { MipBuildKernelTem
//------------------------------------------------------------------------
// Forward kernel.
template <class T, int C, bool CUBE_MODE, int FILTER_MODE>
template <class T, int C, bool CUBE_MODE, bool BIAS_ONLY, int FILTER_MODE>
static __forceinline__ __device__ void TextureFwdKernelTemplate(const TextureKernelParams p)
{
// Calculate pixel position.
......@@ -702,7 +704,7 @@ static __forceinline__ __device__ void TextureFwdKernelTemplate(const TextureKer
int tc = indexTextureNearest<CUBE_MODE>(p, uv, tz);
tc *= p.channels;
const float* pIn = p.tex;
// Copy if valid tc, otherwise output zero.
for (int i=0; i < p.channels; i += C)
*((T*)&pOut[i]) = (tc >= 0) ? *((const T*)&pIn[tc + i]) : zero_value<T>();
......@@ -714,7 +716,7 @@ static __forceinline__ __device__ void TextureFwdKernelTemplate(const TextureKer
float flevel = 0.f; // Fractional level.
int level0 = 0; // Discrete level 0.
int level1 = 0; // Discrete level 1.
calculateMipLevel<CUBE_MODE, FILTER_MODE>(level0, level1, flevel, p, pidx, uv, 0, 0);
calculateMipLevel<CUBE_MODE, BIAS_ONLY, FILTER_MODE>(level0, level1, flevel, p, pidx, uv, 0, 0);
// Get texel indices and pointer for level 0.
int4 tc0 = make_int4(0, 0, 0, 0);
......@@ -766,30 +768,42 @@ static __forceinline__ __device__ void TextureFwdKernelTemplate(const TextureKer
}
// Template specializations.
__global__ void TextureFwdKernelNearest1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelNearest2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelNearest4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelLinear1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelLinear2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelLinear4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapNearest1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapNearest2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapNearest4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapLinear1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapLinear2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapLinear4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeNearest1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelCubeNearest2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelCubeNearest4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinear1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinear2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinear4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearest1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearest2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearest4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinear1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinear2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinear4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelNearest1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelNearest2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelNearest4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelLinear1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelLinear2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelLinear4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapNearest1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapNearest2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapNearest4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapLinear1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapLinear2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapLinear4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeNearest1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelCubeNearest2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelCubeNearest4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinear1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinear2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinear4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearest1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearest2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearest4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinear1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinear2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinear4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapNearestBO1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapNearestBO2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapNearestBO4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelLinearMipmapLinearBO1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, false, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapLinearBO2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, false, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelLinearMipmapLinearBO4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, false, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearestBO1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearestBO2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapNearestBO4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinearBO1 (const TextureKernelParams p) { TextureFwdKernelTemplate<float, 1, true, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinearBO2 (const TextureKernelParams p) { TextureFwdKernelTemplate<float2, 2, true, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureFwdKernelCubeLinearMipmapLinearBO4 (const TextureKernelParams p) { TextureFwdKernelTemplate<float4, 4, true, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
//------------------------------------------------------------------------
// Gradient mip puller kernel.
......@@ -856,7 +870,7 @@ __global__ void MipGradKernel4(const TextureKernelParams p) { MipGradKernelTempl
//------------------------------------------------------------------------
// Gradient kernel.
template <bool CUBE_MODE, int FILTER_MODE>
template <bool CUBE_MODE, bool BIAS_ONLY, int FILTER_MODE>
static __forceinline__ __device__ void TextureGradKernelTemplate(const TextureKernelParams p)
{
// Temporary space for coalesced atomics.
......@@ -899,9 +913,14 @@ static __forceinline__ __device__ void TextureGradKernelTemplate(const TextureKe
((float3*)p.gradUV)[pidx] = make_float3(0.f, 0.f, 0.f);
if (FILTER_MODE == TEX_MODE_LINEAR_MIPMAP_LINEAR)
{
((float2*)p.gradUVDA)[3 * pidx + 0] = make_float2(0.f, 0.f);
((float2*)p.gradUVDA)[3 * pidx + 1] = make_float2(0.f, 0.f);
((float2*)p.gradUVDA)[3 * pidx + 2] = make_float2(0.f, 0.f);
if (p.gradUVDA)
{
((float2*)p.gradUVDA)[3 * pidx + 0] = make_float2(0.f, 0.f);
((float2*)p.gradUVDA)[3 * pidx + 1] = make_float2(0.f, 0.f);
((float2*)p.gradUVDA)[3 * pidx + 2] = make_float2(0.f, 0.f);
}
if (p.gradMipLevelBias)
p.gradMipLevelBias[pidx] = 0.f;
}
}
else
......@@ -909,7 +928,12 @@ static __forceinline__ __device__ void TextureGradKernelTemplate(const TextureKe
if (FILTER_MODE != TEX_MODE_NEAREST)
((float2*)p.gradUV)[pidx] = make_float2(0.f, 0.f);
if (FILTER_MODE == TEX_MODE_LINEAR_MIPMAP_LINEAR)
((float4*)p.gradUVDA)[pidx] = make_float4(0.f, 0.f, 0.f, 0.f);
{
if (p.gradUVDA)
((float4*)p.gradUVDA)[pidx] = make_float4(0.f, 0.f, 0.f, 0.f);
if (p.gradMipLevelBias)
p.gradMipLevelBias[pidx] = 0.f;
}
}
return;
}
......@@ -944,7 +968,7 @@ static __forceinline__ __device__ void TextureGradKernelTemplate(const TextureKe
float flevel = 0.f; // Fractional level.
int level0 = 0; // Discrete level 0.
int level1 = 0; // Discrete level 1.
calculateMipLevel<CUBE_MODE, FILTER_MODE>(level0, level1, flevel, p, pidx, uv, &dw, &dfdv);
calculateMipLevel<CUBE_MODE, BIAS_ONLY, FILTER_MODE>(level0, level1, flevel, p, pidx, uv, &dw, &dfdv);
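    // In the BIAS_ONLY variants there is no uv_da input; the mip level comes directly from the
    // per-pixel mip_level_bias values.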
// UV gradient accumulators.
float gu = 0.f;
......@@ -977,7 +1001,7 @@ static __forceinline__ __device__ void TextureGradKernelTemplate(const TextureKe
{
float dy = pDy[i];
accumQuad(tw0 * dy, pOut0, level0, tc0, corner0, CA_TEMP);
float a00, a10, a01, a11;
fetchQuad<float>(a00, a10, a01, a11, pIn0, tc0, corner0);
float ad = (a11 + a00 - a10 - a01);
......@@ -1037,7 +1061,7 @@ static __forceinline__ __device__ void TextureGradKernelTemplate(const TextureKe
// Texture gradients for second level.
float dy1 = flevel * dy;
accumQuad(tw1 * dy1, pOut1, level1, tc1, corner1, CA_TEMP);
// UV gradients for second level.
float b00, b10, b01, b11;
fetchQuad<float>(b00, b10, b01, b11, pIn1, tc1, corner1);
......@@ -1058,31 +1082,43 @@ static __forceinline__ __device__ void TextureGradKernelTemplate(const TextureKe
else
((float2*)p.gradUV)[pidx] = make_float2(gu, gv);
// Final UV pixel differential gradients.
    dw *= df; // dL/(d{s,t}/d{X,Y}) = df/(d{s,t}/d{X,Y}) * dL/df.
// Store mip level bias gradient.
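    // df is dL/d(fractional mip level); since mip_level_bias enters the level additively, the
    // bias gradient equals df.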
if (p.gradMipLevelBias)
p.gradMipLevelBias[pidx] = df;
// Store them.
if (CUBE_MODE)
{
    // Remap from dL/(d{s,t}/d{X,Y}) to dL/(d{x,y,z}/d{X,Y}).
float3 g0, g1;
indexCubeMapGrad4(uv, dw, g0, g1);
((float2*)p.gradUVDA)[3 * pidx + 0] = make_float2(g0.x, g1.x);
((float2*)p.gradUVDA)[3 * pidx + 1] = make_float2(g0.y, g1.y);
((float2*)p.gradUVDA)[3 * pidx + 2] = make_float2(g0.z, g1.z);
// Store UV pixel differential gradients.
if (!BIAS_ONLY)
{
// Final gradients.
        dw *= df; // dL/(d{s,t}/d{X,Y}) = df/(d{s,t}/d{X,Y}) * dL/df.
// Store them.
if (CUBE_MODE)
{
            // Remap from dL/(d{s,t}/d{X,Y}) to dL/(d{x,y,z}/d{X,Y}).
float3 g0, g1;
indexCubeMapGrad4(uv, dw, g0, g1);
((float2*)p.gradUVDA)[3 * pidx + 0] = make_float2(g0.x, g1.x);
((float2*)p.gradUVDA)[3 * pidx + 1] = make_float2(g0.y, g1.y);
((float2*)p.gradUVDA)[3 * pidx + 2] = make_float2(g0.z, g1.z);
}
else
((float4*)p.gradUVDA)[pidx] = dw;
}
else
((float4*)p.gradUVDA)[pidx] = dw;
}
// Template specializations.
__global__ void TextureGradKernelNearest (const TextureKernelParams p) { TextureGradKernelTemplate<false, TEX_MODE_NEAREST>(p); }
__global__ void TextureGradKernelLinear (const TextureKernelParams p) { TextureGradKernelTemplate<false, TEX_MODE_LINEAR>(p); }
__global__ void TextureGradKernelLinearMipmapNearest (const TextureKernelParams p) { TextureGradKernelTemplate<false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureGradKernelLinearMipmapLinear (const TextureKernelParams p) { TextureGradKernelTemplate<false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureGradKernelCubeNearest (const TextureKernelParams p) { TextureGradKernelTemplate<true, TEX_MODE_NEAREST>(p); }
__global__ void TextureGradKernelCubeLinear (const TextureKernelParams p) { TextureGradKernelTemplate<true, TEX_MODE_LINEAR>(p); }
__global__ void TextureGradKernelCubeLinearMipmapNearest (const TextureKernelParams p) { TextureGradKernelTemplate<true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureGradKernelCubeLinearMipmapLinear (const TextureKernelParams p) { TextureGradKernelTemplate<true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureGradKernelNearest (const TextureKernelParams p) { TextureGradKernelTemplate<false, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureGradKernelLinear (const TextureKernelParams p) { TextureGradKernelTemplate<false, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureGradKernelLinearMipmapNearest (const TextureKernelParams p) { TextureGradKernelTemplate<false, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureGradKernelLinearMipmapLinear (const TextureKernelParams p) { TextureGradKernelTemplate<false, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureGradKernelCubeNearest (const TextureKernelParams p) { TextureGradKernelTemplate<true, false, TEX_MODE_NEAREST>(p); }
__global__ void TextureGradKernelCubeLinear (const TextureKernelParams p) { TextureGradKernelTemplate<true, false, TEX_MODE_LINEAR>(p); }
__global__ void TextureGradKernelCubeLinearMipmapNearest (const TextureKernelParams p) { TextureGradKernelTemplate<true, false, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureGradKernelCubeLinearMipmapLinear (const TextureKernelParams p) { TextureGradKernelTemplate<true, false, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureGradKernelLinearMipmapNearestBO (const TextureKernelParams p) { TextureGradKernelTemplate<false, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureGradKernelLinearMipmapLinearBO (const TextureKernelParams p) { TextureGradKernelTemplate<false, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
__global__ void TextureGradKernelCubeLinearMipmapNearestBO (const TextureKernelParams p) { TextureGradKernelTemplate<true, true, TEX_MODE_LINEAR_MIPMAP_NEAREST>(p); }
__global__ void TextureGradKernelCubeLinearMipmapLinearBO (const TextureKernelParams p) { TextureGradKernelTemplate<true, true, TEX_MODE_LINEAR_MIPMAP_LINEAR>(p); }
//------------------------------------------------------------------------
......@@ -40,7 +40,8 @@ struct TextureKernelParams
{
const float* tex; // Incoming texture buffer.
const float* uv; // Incoming texcoord buffer.
const float* uvDA; // Incoming uv pixel diffs. NULL if mips disabled.
const float* uvDA; // Incoming uv pixel diffs or NULL.
const float* mipLevelBias; // Incoming mip level bias or NULL.
const float* dy; // Incoming output gradient.
float* mip; // Mip data buffer.
float* out; // Outgoing texture data.
......@@ -48,7 +49,8 @@ struct TextureKernelParams
float* gradTexMip; // Temporary texture gradients for mip levels > 0.
float* gradUV; // Outgoing texcoord gradient.
float* gradUVDA; // Outgoing texcoord pixel differential gradient.
int enableMip; // If true, we have uv_da input and mip output tensor.
float* gradMipLevelBias; // Outgoing mip level bias gradient.
int enableMip; // If true, we have uv_da and/or mip_level_bias input(s), and a mip tensor.
int filterMode; // One of the TEX_MODE_ constants.
    int boundaryMode; // One of the TEX_BOUNDARY_MODE_ constants.
int texConst; // If true, texture is known to be constant.
......
......@@ -19,7 +19,7 @@ from . import plugin_loader
def _get_gl_opts():
libs = {
'posix': ['GL', 'GLEW'],
'nt': ['gdi32', 'glew32s', 'opengl32', 'user32'],
'nt': ['gdi32', 'glew32s', 'opengl32', 'user32', 'setgpu'],
}
return ['-l' + x for x in libs[os.name]]
......
......@@ -12,11 +12,11 @@
struct RasterizeFwdOp : public OpKernel
{
RasterizeGLState m_glState; // OpenGL-related persistent state.
int m_tri_const; // 1 if triangle array is known to be constant.
int m_tri_const; // 1 if triangle array is known to be constant.
RasterizeFwdOp(OpKernelConstruction* ctx):
OpKernel(ctx)
{
{
memset(&m_glState, 0, sizeof(RasterizeGLState));
OP_REQUIRES_OK(ctx, ctx->GetAttr("enable_db", &m_glState.enableDB));
OP_REQUIRES_OK(ctx, ctx->GetAttr("tri_const", &m_tri_const));
......@@ -48,7 +48,7 @@ struct RasterizeFwdOp : public OpKernel
OP_REQUIRES(ctx, pos.dims() == 2 && pos.dim_size(0) > 0 && pos.dim_size(1) == 4, errors::InvalidArgument("range mode - pos must have shape [>0, 4]"));
OP_REQUIRES(ctx, tri.dims() == 2 && tri.dim_size(0) > 0 && tri.dim_size(1) == 3, errors::InvalidArgument("tri must have shape [>0, 3]"));
OP_REQUIRES(ctx, resolution.dims() == 1 && resolution.dim_size(0) == 2, errors::InvalidArgument("resolution must have shape [2]"));
OP_REQUIRES(ctx, ranges.dims() == 2 && ranges.dim_size(0) > 0 && ranges.dim_size(1) == 2, errors::InvalidArgument("range mode - ranges must have shape [>0, 2]"));
OP_REQUIRES(ctx, ranges.dims() == 2 && ranges.dim_size(0) > 0 && ranges.dim_size(1) == 2, errors::InvalidArgument("range mode - ranges must have shape [>0, 2]"));
}
// Get output shape.
......@@ -65,12 +65,16 @@ struct RasterizeFwdOp : public OpKernel
// Init context and GL?
bool initCtx = !m_glState.glFBO;
if (initCtx)
rasterizeInitGLContext(ctx, m_glState); // In common/rasterize.inl
{
const DeviceBase::GpuDeviceInfo* g = ctx->device()->tensorflow_gpu_device_info();
int cudaDeviceIdx = g ? g->gpu_id : -1;
rasterizeInitGLContext(ctx, m_glState, cudaDeviceIdx); // In common/rasterize.cpp
}
else
setGLContext(m_glState.glctx); // (Re-)Activate GL context.
// Resize all buffers.
rasterizeResizeBuffers(ctx, m_glState, posCount, triCount, width, height, depth); // In common/rasterize.inl
rasterizeResizeBuffers(ctx, m_glState, posCount, triCount, width, height, depth); // In common/rasterize.cpp
// Newly created GL objects sometimes don't map properly to CUDA until after first context swap. Workaround.
if (initCtx)
......@@ -79,7 +83,7 @@ struct RasterizeFwdOp : public OpKernel
releaseGLContext();
setGLContext(m_glState.glctx);
}
// Copy input data to GL and render.
const float* posPtr = pos.flat<float>().data();
const int32_t* rangesPtr = instance_mode ? 0 : ranges.flat<int32_t>().data(); // This is in CPU memory.
......@@ -178,7 +182,7 @@ struct RasterizeGradOp : public OpKernel
p.out = out.flat<float>().data();
p.dy = dy.flat<float>().data();
p.ddb = ENABLE_DB ? ddb.flat<float>().data() : 0;
// Set up pixel position to clip space x, y transform.
p.xs = 2.f / (float)p.width;
p.xo = 1.f / (float)p.width - 1.f;
......
......@@ -45,9 +45,9 @@ def _get_plugin():
# Linker options.
if os.name == 'posix':
ldflags = ['-lGL', '-lGLEW']
ldflags = ['-lGL', '-lGLEW', '-lEGL']
elif os.name == 'nt':
libs = ['gdi32', 'glew32s', 'opengl32', 'user32']
libs = ['gdi32', 'glew32s', 'opengl32', 'user32', 'setgpu']
ldflags = ['/LIBPATH:' + lib_dir] + ['/DEFAULTLIB:' + x for x in libs]
# List of source files.
......@@ -103,9 +103,9 @@ def set_log_level(level):
'''Set log level.
Log levels follow the convention on the C++ side of Torch:
0 = Info,
1 = Warning,
2 = Error,
0 = Info,
1 = Warning,
2 = Error,
3 = Fatal.
The default log level is 1.
......@@ -121,7 +121,7 @@ def set_log_level(level):
#----------------------------------------------------------------------------
class RasterizeGLContext:
def __init__(self, output_db=True, mode='automatic'):
def __init__(self, output_db=True, mode='automatic', device=None):
'''Create a new OpenGL rasterizer context.
Creating an OpenGL context is a slow operation so you should reuse the same
......@@ -131,7 +131,10 @@ class RasterizeGLContext:
Args:
    output_db (bool): Compute and output image-space derivatives of barycentrics.
mode: OpenGL context handling mode. Valid values are 'manual' and 'automatic'.
device (Optional): Cuda device on which the context is created. Type can be
`torch.device`, string (e.g., `'cuda:1'`), or int. If not
    specified, the context will be created on the currently active Cuda
device.
Returns:
The newly created OpenGL rasterizer context.
'''
......@@ -139,11 +142,16 @@ class RasterizeGLContext:
assert mode in ['automatic', 'manual']
self.output_db = output_db
self.mode = mode
self.cpp_wrapper = _get_plugin().RasterizeGLStateWrapper(output_db, mode == 'automatic')
if device is None:
cuda_device_idx = torch.cuda.current_device()
else:
with torch.cuda.device(device):
cuda_device_idx = torch.cuda.current_device()
self.cpp_wrapper = _get_plugin().RasterizeGLStateWrapper(output_db, mode == 'automatic', cuda_device_idx)
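# Illustrative multi-GPU sketch (assumption: `dr` stands for `import nvdiffrast.torch as dr`;
# not part of this file): create one context per device and keep each context's inputs on
# that same device.
#
#   glctx0 = dr.RasterizeGLContext(device='cuda:0')
#   glctx1 = dr.RasterizeGLContext(device='cuda:1')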
def set_context(self):
'''Set (activate) OpenGL context in the current CPU thread.
Only available if context was created in manual mode.
Only available if context was created in manual mode.
'''
assert self.mode == 'manual'
self.cpp_wrapper.set_context()
......@@ -316,22 +324,26 @@ def interpolate(attr, rast, tri, rast_db=None, diff_attrs=None):
# Linear-mipmap-linear and linear-mipmap-nearest: Mipmaps enabled.
class _texture_func_mip(torch.autograd.Function):
@staticmethod
def forward(ctx, filter_mode, tex, uv, uv_da, mip, filter_mode_enum, boundary_mode_enum):
out = _get_plugin().texture_fwd_mip(tex, uv, uv_da, mip, filter_mode_enum, boundary_mode_enum)
ctx.save_for_backward(tex, uv, uv_da)
def forward(ctx, filter_mode, tex, uv, uv_da, mip_level_bias, mip, filter_mode_enum, boundary_mode_enum):
if uv_da is None:
uv_da = torch.tensor([])
if mip_level_bias is None:
mip_level_bias = torch.tensor([])
out = _get_plugin().texture_fwd_mip(tex, uv, uv_da, mip_level_bias, mip, filter_mode_enum, boundary_mode_enum)
ctx.save_for_backward(tex, uv, uv_da, mip_level_bias)
ctx.saved_misc = filter_mode, mip, filter_mode_enum, boundary_mode_enum
return out
@staticmethod
def backward(ctx, dy):
tex, uv, uv_da = ctx.saved_variables
tex, uv, uv_da, mip_level_bias = ctx.saved_variables
filter_mode, mip, filter_mode_enum, boundary_mode_enum = ctx.saved_misc
if filter_mode == 'linear-mipmap-linear':
g_tex, g_uv, g_uv_da = _get_plugin().texture_grad_linear_mipmap_linear(tex, uv, dy, uv_da, mip, filter_mode_enum, boundary_mode_enum)
return None, g_tex, g_uv, g_uv_da, None, None, None
g_tex, g_uv, g_uv_da, g_mip_level_bias = _get_plugin().texture_grad_linear_mipmap_linear(tex, uv, dy, uv_da, mip_level_bias, mip, filter_mode_enum, boundary_mode_enum)
return None, g_tex, g_uv, g_uv_da, g_mip_level_bias, None, None, None
else: # linear-mipmap-nearest
g_tex, g_uv = _get_plugin().texture_grad_linear_mipmap_nearest(tex, uv, dy, uv_da, mip, filter_mode_enum, boundary_mode_enum)
return None, g_tex, g_uv, None, None, None, None
g_tex, g_uv = _get_plugin().texture_grad_linear_mipmap_nearest(tex, uv, dy, uv_da, mip_level_bias, mip, filter_mode_enum, boundary_mode_enum)
return None, g_tex, g_uv, None, None, None, None, None
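# Note: backward() must return one value per forward() argument (eight here, excluding ctx);
# gradients that are not produced and non-tensor inputs are returned as None.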
# Linear and nearest: Mipmaps disabled.
class _texture_func(torch.autograd.Function):
......@@ -354,7 +366,7 @@ class _texture_func(torch.autograd.Function):
return None, g_tex, None, None, None
# Op wrapper.
def texture(tex, uv, uv_da=None, mip=None, filter_mode='auto', boundary_mode='wrap', max_mip_level=None):
def texture(tex, uv, uv_da=None, mip_level_bias=None, mip=None, filter_mode='auto', boundary_mode='wrap', max_mip_level=None):
"""Perform texture sampling.
All input tensors must be contiguous and reside in GPU memory. The output tensor
......@@ -364,22 +376,24 @@ def texture(tex, uv, uv_da=None, mip=None, filter_mode='auto', boundary_mode='wr
tex: Texture tensor with dtype `torch.float32`. For 2D textures, must have shape
[minibatch_size, tex_height, tex_width, tex_channels]. For cube map textures,
must have shape [minibatch_size, 6, tex_height, tex_width, tex_channels] where
tex_width and tex_height are equal. Note that `boundary_mode` must also be set
tex_width and tex_height are equal. Note that `boundary_mode` must also be set
to 'cube' to enable cube map mode. Broadcasting is supported along the minibatch axis.
uv: Tensor containing per-pixel texture coordinates. When sampling a 2D texture,
uv: Tensor containing per-pixel texture coordinates. When sampling a 2D texture,
must have shape [minibatch_size, height, width, 2]. When sampling a cube map
texture, must have shape [minibatch_size, height, width, 3].
uv_da: (Optional) Tensor containing image-space derivatives of texture coordinates.
Must have same shape as `uv` except for the last dimension that is to be twice
as long.
mip_level_bias: (Optional) Per-pixel bias for mip level selection. If `uv_da` is omitted,
determines mip level directly. Must have shape [minibatch_size, height, width].
mip: (Optional) Preconstructed mipmap stack from a `texture_construct_mip()` call. If not
specified, the mipmap stack is constructed internally and discarded afterwards.
filter_mode: Texture filtering mode to be used. Valid values are 'auto', 'nearest',
filter_mode: Texture filtering mode to be used. Valid values are 'auto', 'nearest',
'linear', 'linear-mipmap-nearest', and 'linear-mipmap-linear'. Mode 'auto'
selects 'linear' if `uv_da` is not specified, and 'linear-mipmap-linear'
when `uv_da` is specified, these being the highest-quality modes possible
depending on the availability of the image-space derivatives of the texture
coordinates.
    selects 'linear' if neither `uv_da` nor `mip_level_bias` is specified, and
'linear-mipmap-linear' when at least one of them is specified, these being
the highest-quality modes possible depending on the availability of the
image-space derivatives of the texture coordinates or direct mip level information.
boundary_mode: Valid values are 'wrap', 'clamp', 'zero', and 'cube'. If `tex` defines a
cube map, this must be set to 'cube'. The default mode 'wrap' takes fractional
part of texture coordinates. Mode 'clamp' clamps texture coordinates to the
......@@ -395,7 +409,7 @@ def texture(tex, uv, uv_da=None, mip=None, filter_mode='auto', boundary_mode='wr
# Default filter mode.
if filter_mode == 'auto':
filter_mode = 'linear-mipmap-linear' if (uv_da is not None) else 'linear'
filter_mode = 'linear-mipmap-linear' if (uv_da is not None or mip_level_bias is not None) else 'linear'
# Sanitize inputs.
if max_mip_level is None:
......@@ -407,7 +421,7 @@ def texture(tex, uv, uv_da=None, mip=None, filter_mode='auto', boundary_mode='wr
# Check inputs.
assert isinstance(tex, torch.Tensor) and isinstance(uv, torch.Tensor)
if 'mipmap' in filter_mode:
assert isinstance(uv_da, torch.Tensor)
assert isinstance(uv_da, torch.Tensor) or isinstance(mip_level_bias, torch.Tensor)
# If mipping disabled via max level=0, we may as well use simpler filtering internally.
if max_mip_level == 0 and filter_mode in ['linear-mipmap-nearest', 'linear-mipmap-linear']:
......@@ -430,10 +444,10 @@ def texture(tex, uv, uv_da=None, mip=None, filter_mode='auto', boundary_mode='wr
# Choose stub.
if filter_mode == 'linear-mipmap-linear' or filter_mode == 'linear-mipmap-nearest':
return _texture_func_mip.apply(filter_mode, tex, uv, uv_da, mip, filter_mode_enum, boundary_mode_enum)
return _texture_func_mip.apply(filter_mode, tex, uv, uv_da, mip_level_bias, mip, filter_mode_enum, boundary_mode_enum)
else:
return _texture_func.apply(filter_mode, tex, uv, filter_mode_enum, boundary_mode_enum)
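# Illustrative sketch of the new mip_level_bias input (assumption: `dr` is
# `import nvdiffrast.torch as dr`; `tex` and `uv` are tensors shaped as in the docstring above):
#
#   bias = torch.full(uv.shape[:3], 1.5, device=uv.device)   # [minibatch, height, width]
#   mip = dr.texture_construct_mip(tex)                       # optional, reusable mip stack
#   color = dr.texture(tex, uv, mip_level_bias=bias, mip=mip, filter_mode='linear-mipmap-linear')
#
# With uv_da omitted, the bias selects the mip level directly (here, level 1.5).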
# Mipmap precalculation for cases where the texture stays constant.
def texture_construct_mip(tex, max_mip_level=None, cube_mode=False):
"""Construct a mipmap stack for a texture.
......
......@@ -24,6 +24,7 @@ void AntialiasGradKernel (const AntialiasKernelParams p);
TopologyHashWrapper antialias_construct_topology_hash(torch::Tensor tri)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(tri));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
AntialiasKernelParams p = {}; // Initialize all fields to zero.
......@@ -66,6 +67,7 @@ TopologyHashWrapper antialias_construct_topology_hash(torch::Tensor tri)
std::tuple<torch::Tensor, torch::Tensor> antialias_fwd(torch::Tensor color, torch::Tensor rast, torch::Tensor pos, torch::Tensor tri, TopologyHashWrapper topology_hash_wrap)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(color));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
AntialiasKernelParams p = {}; // Initialize all fields to zero.
p.instance_mode = (pos.sizes().size() > 2) ? 1 : 0;
......@@ -112,10 +114,10 @@ std::tuple<torch::Tensor, torch::Tensor> antialias_fwd(torch::Tensor color, torc
p.xh = .5f * (float)p.width;
p.yh = .5f * (float)p.height;
p.allocTriangles = topology_hash.size(0) / (4 * AA_HASH_ELEMENTS_PER_TRIANGLE);
// Allocate output tensors.
torch::Tensor out = color.detach().clone(); // Use color as base.
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::Tensor work_buffer = torch::empty({p.n * p.width * p.height * 8 + 4}, opts); // 8 int for a maximum of two work items per pixel.
p.output = out.data_ptr<float>();
p.workBuffer = (int4*)(work_buffer.data_ptr<float>());
......@@ -153,6 +155,7 @@ std::tuple<torch::Tensor, torch::Tensor> antialias_fwd(torch::Tensor color, torc
std::tuple<torch::Tensor, torch::Tensor> antialias_grad(torch::Tensor color, torch::Tensor rast, torch::Tensor pos, torch::Tensor tri, torch::Tensor dy, torch::Tensor work_buffer)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(color));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
AntialiasKernelParams p = {}; // Initialize all fields to zero.
p.instance_mode = (pos.sizes().size() > 2) ? 1 : 0;
......
......@@ -13,9 +13,10 @@
//------------------------------------------------------------------------
// Op prototypes. Return type macros for readability.
#define OP_RETURN_T torch::Tensor
#define OP_RETURN_TT std::tuple<torch::Tensor, torch::Tensor>
#define OP_RETURN_TTT std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>
#define OP_RETURN_T torch::Tensor
#define OP_RETURN_TT std::tuple<torch::Tensor, torch::Tensor>
#define OP_RETURN_TTT std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>
#define OP_RETURN_TTTT std::tuple<torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor>
OP_RETURN_TT rasterize_fwd (RasterizeGLStateWrapper& stateWrapper, torch::Tensor pos, torch::Tensor tri, std::tuple<int, int> resolution, torch::Tensor ranges);
OP_RETURN_T rasterize_grad (torch::Tensor pos, torch::Tensor tri, torch::Tensor out, torch::Tensor dy);
......@@ -26,11 +27,11 @@ OP_RETURN_TT interpolate_grad (torch::Tensor attr, tor
OP_RETURN_TTT interpolate_grad_da (torch::Tensor attr, torch::Tensor rast, torch::Tensor tri, torch::Tensor dy, torch::Tensor rast_db, torch::Tensor dda, bool diff_attrs_all, std::vector<int>& diff_attrs_vec);
TextureMipWrapper texture_construct_mip (torch::Tensor tex, int max_mip_level, bool cube_mode);
OP_RETURN_T texture_fwd (torch::Tensor tex, torch::Tensor uv, int filter_mode, int boundary_mode);
OP_RETURN_T texture_fwd_mip (torch::Tensor tex, torch::Tensor uv, torch::Tensor uv_da, TextureMipWrapper mip, int filter_mode, int boundary_mode);
OP_RETURN_T texture_fwd_mip (torch::Tensor tex, torch::Tensor uv, torch::Tensor uv_da, torch::Tensor mip_level_bias, TextureMipWrapper mip, int filter_mode, int boundary_mode);
OP_RETURN_T texture_grad_nearest (torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, int filter_mode, int boundary_mode);
OP_RETURN_TT texture_grad_linear (torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, int filter_mode, int boundary_mode);
OP_RETURN_TT texture_grad_linear_mipmap_nearest (torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, torch::Tensor uv_da, TextureMipWrapper mip, int filter_mode, int boundary_mode);
OP_RETURN_TTT texture_grad_linear_mipmap_linear (torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, torch::Tensor uv_da, TextureMipWrapper mip, int filter_mode, int boundary_mode);
OP_RETURN_TT texture_grad_linear_mipmap_nearest (torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, torch::Tensor uv_da, torch::Tensor mip_level_bias, TextureMipWrapper mip, int filter_mode, int boundary_mode);
OP_RETURN_TTTT texture_grad_linear_mipmap_linear (torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, torch::Tensor uv_da, torch::Tensor mip_level_bias, TextureMipWrapper mip, int filter_mode, int boundary_mode);
TopologyHashWrapper antialias_construct_topology_hash (torch::Tensor tri);
OP_RETURN_TT antialias_fwd (torch::Tensor color, torch::Tensor rast, torch::Tensor pos, torch::Tensor tri, TopologyHashWrapper topology_hash);
OP_RETURN_TT antialias_grad (torch::Tensor color, torch::Tensor rast, torch::Tensor pos, torch::Tensor tri, torch::Tensor dy, torch::Tensor work_buffer);
......@@ -39,7 +40,7 @@ OP_RETURN_TT antialias_grad (torch::Tensor color, to
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
// State classes.
pybind11::class_<RasterizeGLStateWrapper>(m, "RasterizeGLStateWrapper").def(pybind11::init<bool, bool>())
pybind11::class_<RasterizeGLStateWrapper>(m, "RasterizeGLStateWrapper").def(pybind11::init<bool, bool, int>())
.def("set_context", &RasterizeGLStateWrapper::setContext)
.def("release_context", &RasterizeGLStateWrapper::releaseContext);
pybind11::class_<TextureMipWrapper>(m, "TextureMipWrapper");
......@@ -58,8 +59,8 @@ PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("interpolate_grad", &interpolate_grad, "interpolate gradient op with attribute derivatives");
m.def("interpolate_grad_da", &interpolate_grad_da, "interpolate gradient op without attribute derivatives");
m.def("texture_construct_mip", &texture_construct_mip, "texture mipmap construction");
m.def("texture_fwd", &texture_fwd, "texture forward op with mipmapping and texcoord derivatives");
m.def("texture_fwd_mip", &texture_fwd_mip, "texture forward op without mipmapping and texcoord derivatives");
m.def("texture_fwd", &texture_fwd, "texture forward op without mipmapping");
m.def("texture_fwd_mip", &texture_fwd_mip, "texture forward op with mipmapping");
m.def("texture_grad_nearest", &texture_grad_nearest, "texture gradient op in nearest mode");
m.def("texture_grad_linear", &texture_grad_linear, "texture gradient op in linear mode");
m.def("texture_grad_linear_mipmap_nearest", &texture_grad_linear_mipmap_nearest, "texture gradient op in linear-mipmap-nearest mode");
......
......@@ -17,7 +17,7 @@
#define __func__ __FUNCTION__
#endif
#define NVDR_CHECK_DEVICE(...) do { TORCH_CHECK(at::cuda::check_device({__VA_ARGS__}), __func__, "(): Inputs " #__VA_ARGS__ " must reside on current GPU device") } while(0)
#define NVDR_CHECK_DEVICE(...) do { TORCH_CHECK(at::cuda::check_device({__VA_ARGS__}), __func__, "(): Inputs " #__VA_ARGS__ " must reside on the same GPU device") } while(0)
#define NVDR_CHECK_CPU(...) do { nvdr_check_cpu({__VA_ARGS__}, __func__, "(): Inputs " #__VA_ARGS__ " must reside on CPU"); } while(0)
#define NVDR_CHECK_CONTIGUOUS(...) do { nvdr_check_contiguous({__VA_ARGS__}, __func__, "(): Inputs " #__VA_ARGS__ " must be contiguous tensors"); } while(0)
#define NVDR_CHECK_F32(...) do { nvdr_check_f32({__VA_ARGS__}, __func__, "(): Inputs " #__VA_ARGS__ " must be float32 tensors"); } while(0)
......
......@@ -41,6 +41,7 @@ static void set_diff_attrs(InterpolateKernelParams& p, bool diff_attrs_all, std:
std::tuple<torch::Tensor, torch::Tensor> interpolate_fwd_da(torch::Tensor attr, torch::Tensor rast, torch::Tensor tri, torch::Tensor rast_db, bool diff_attrs_all, std::vector<int>& diff_attrs_vec)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(attr));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
InterpolateKernelParams p = {}; // Initialize all fields to zero.
bool enable_da = (rast_db.defined()) && (diff_attrs_all || !diff_attrs_vec.empty());
......@@ -86,6 +87,8 @@ std::tuple<torch::Tensor, torch::Tensor> interpolate_fwd_da(torch::Tensor attr,
// Set attribute pixel differential info if enabled, otherwise leave as zero.
if (enable_da)
set_diff_attrs(p, diff_attrs_all, diff_attrs_vec);
else
p.numDiffAttr = 0;
// Get input pointers.
p.attr = attr.data_ptr<float>();
......@@ -95,7 +98,7 @@ std::tuple<torch::Tensor, torch::Tensor> interpolate_fwd_da(torch::Tensor attr,
p.attrBC = (p.instance_mode && attr.size(0) == 1) ? 1 : 0;
// Allocate output tensors.
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::Tensor out = torch::empty({p.depth, p.height, p.width, p.numAttr}, opts);
torch::Tensor out_da = torch::empty({p.depth, p.height, p.width, p.numDiffAttr * 2}, opts);
......@@ -133,6 +136,7 @@ std::tuple<torch::Tensor, torch::Tensor> interpolate_fwd(torch::Tensor attr, tor
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> interpolate_grad_da(torch::Tensor attr, torch::Tensor rast, torch::Tensor tri, torch::Tensor dy, torch::Tensor rast_db, torch::Tensor dda, bool diff_attrs_all, std::vector<int>& diff_attrs_vec)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(attr));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
InterpolateKernelParams p = {}; // Initialize all fields to zero.
bool enable_da = (rast_db.defined()) && (diff_attrs_all || !diff_attrs_vec.empty());
......@@ -190,6 +194,8 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> interpolate_grad_da(torc
// Set attribute pixel differential info if enabled, otherwise leave as zero.
if (enable_da)
set_diff_attrs(p, diff_attrs_all, diff_attrs_vec);
else
p.numDiffAttr = 0;
// Get input pointers.
p.attr = attr.data_ptr<float>();
......@@ -201,7 +207,7 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> interpolate_grad_da(torc
p.attrBC = (p.instance_mode && attr_depth < p.depth) ? 1 : 0;
// Allocate output tensors.
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::Tensor gradAttr = torch::zeros_like(attr);
torch::Tensor gradRaster = torch::empty_like(rast);
torch::Tensor gradRasterDB;
......
......@@ -21,13 +21,14 @@ void RasterizeGradKernelDb(const RasterizeGradParams p);
//------------------------------------------------------------------------
// Python GL state wrapper methods.
RasterizeGLStateWrapper::RasterizeGLStateWrapper(bool enableDB, bool automatic_)
RasterizeGLStateWrapper::RasterizeGLStateWrapper(bool enableDB, bool automatic_, int cudaDeviceIdx_)
{
pState = new RasterizeGLState();
automatic = automatic_;
cudaDeviceIdx = cudaDeviceIdx_;
memset(pState, 0, sizeof(RasterizeGLState));
pState->enableDB = enableDB ? 1 : 0;
rasterizeInitGLContext(NVDR_CTX_PARAMS, *pState);
rasterizeInitGLContext(NVDR_CTX_PARAMS, *pState, cudaDeviceIdx_);
releaseGLContext();
}
......@@ -52,6 +53,7 @@ void RasterizeGLStateWrapper::releaseContext(void)
std::tuple<torch::Tensor, torch::Tensor> rasterize_fwd(RasterizeGLStateWrapper& stateWrapper, torch::Tensor pos, torch::Tensor tri, std::tuple<int, int> resolution, torch::Tensor ranges)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(pos));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
RasterizeGLState& s = *stateWrapper.pState;
......@@ -62,6 +64,9 @@ std::tuple<torch::Tensor, torch::Tensor> rasterize_fwd(RasterizeGLStateWrapper&
NVDR_CHECK_F32(pos);
NVDR_CHECK_I32(tri, ranges);
// Check that GL context was created for the correct GPU.
    NVDR_CHECK(pos.get_device() == stateWrapper.cudaDeviceIdx, "GL context must reside on the same device as input tensors");
// Determine number of outputs
int num_outputs = s.enableDB ? 2 : 1;
......@@ -101,7 +106,7 @@ std::tuple<torch::Tensor, torch::Tensor> rasterize_fwd(RasterizeGLStateWrapper&
rasterizeRender(NVDR_CTX_PARAMS, s, stream, posPtr, posCount, vtxPerInstance, triPtr, triCount, rangesPtr, width, height, depth);
// Allocate output tensors.
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::Tensor out = torch::empty({depth, height, width, 4}, opts);
torch::Tensor out_db = torch::empty({depth, height, width, s.enableDB ? 4 : 0}, opts);
float* outputPtr[2];
......@@ -123,6 +128,7 @@ std::tuple<torch::Tensor, torch::Tensor> rasterize_fwd(RasterizeGLStateWrapper&
torch::Tensor rasterize_grad_db(torch::Tensor pos, torch::Tensor tri, torch::Tensor out, torch::Tensor dy, torch::Tensor ddb)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(pos));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
RasterizeGradParams p;
bool enable_db = ddb.defined();
......@@ -178,7 +184,7 @@ torch::Tensor rasterize_grad_db(torch::Tensor pos, torch::Tensor tri, torch::Ten
p.out = out.data_ptr<float>();
p.dy = dy_.data_ptr<float>();
p.ddb = enable_db ? ddb_.data_ptr<float>() : NULL;
// Set up pixel position to clip space x, y transform.
p.xs = 2.f / (float)p.width;
p.xo = 1.f / (float)p.width - 1.f;
......@@ -209,7 +215,7 @@ torch::Tensor rasterize_grad_db(torch::Tensor pos, torch::Tensor tri, torch::Ten
// Version without derivatives.
torch::Tensor rasterize_grad(torch::Tensor pos, torch::Tensor tri, torch::Tensor out, torch::Tensor dy)
{
{
torch::Tensor empty_tensor;
return rasterize_grad_db(pos, tri, out, dy, empty_tensor);
}
......
......@@ -42,6 +42,18 @@ void TextureFwdKernelCubeLinearMipmapNearest4 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapLinear1 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapLinear2 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapLinear4 (const TextureKernelParams p);
void TextureFwdKernelLinearMipmapNearestBO1 (const TextureKernelParams p);
void TextureFwdKernelLinearMipmapNearestBO2 (const TextureKernelParams p);
void TextureFwdKernelLinearMipmapNearestBO4 (const TextureKernelParams p);
void TextureFwdKernelLinearMipmapLinearBO1 (const TextureKernelParams p);
void TextureFwdKernelLinearMipmapLinearBO2 (const TextureKernelParams p);
void TextureFwdKernelLinearMipmapLinearBO4 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapNearestBO1 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapNearestBO2 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapNearestBO4 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapLinearBO1 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapLinearBO2 (const TextureKernelParams p);
void TextureFwdKernelCubeLinearMipmapLinearBO4 (const TextureKernelParams p);
void MipGradKernel1 (const TextureKernelParams p);
void MipGradKernel2 (const TextureKernelParams p);
void MipGradKernel4 (const TextureKernelParams p);
......@@ -53,6 +65,10 @@ void TextureGradKernelCubeNearest (const TextureKernelParams p);
void TextureGradKernelCubeLinear (const TextureKernelParams p);
void TextureGradKernelCubeLinearMipmapNearest (const TextureKernelParams p);
void TextureGradKernelCubeLinearMipmapLinear (const TextureKernelParams p);
void TextureGradKernelLinearMipmapNearestBO (const TextureKernelParams p);
void TextureGradKernelLinearMipmapLinearBO (const TextureKernelParams p);
void TextureGradKernelCubeLinearMipmapNearestBO (const TextureKernelParams p);
void TextureGradKernelCubeLinearMipmapLinearBO (const TextureKernelParams p);
//------------------------------------------------------------------------
// Modeselektor.
......@@ -81,6 +97,7 @@ static void set_modes(TextureKernelParams& p, int filter_mode, int boundary_mode
TextureMipWrapper texture_construct_mip(torch::Tensor tex, int max_mip_level, bool cube_mode)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(tex));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
TextureKernelParams p = {}; // Initialize all fields to zero.
p.mipLevelLimit = max_mip_level;
......@@ -112,9 +129,9 @@ TextureMipWrapper texture_construct_mip(torch::Tensor tex, int max_mip_level, bo
// Set mip offsets and calculate total size.
int mipTotal = calculateMipInfo(NVDR_CTX_PARAMS, p);
// Allocate and set mip tensor.
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::Tensor mip = torch::empty({mipTotal}, opts);
p.mip = mip.data_ptr<float>();
......@@ -139,7 +156,7 @@ TextureMipWrapper texture_construct_mip(torch::Tensor tex, int max_mip_level, bo
NVDR_CHECK_CUDA_ERROR(cudaLaunchKernel(build_func_tbl[channel_div_idx], gridSize, blockSize, args, 0, stream));
}
// Return the mip tensor in a wrapper.
// Return the mip tensor in a wrapper.
TextureMipWrapper mip_wrap;
mip_wrap.mip = mip;
mip_wrap.max_mip_level = max_mip_level;
......@@ -151,31 +168,46 @@ TextureMipWrapper texture_construct_mip(torch::Tensor tex, int max_mip_level, bo
//------------------------------------------------------------------------
// Forward op.
torch::Tensor texture_fwd_mip(torch::Tensor tex, torch::Tensor uv, torch::Tensor uv_da, TextureMipWrapper mip_wrap, int filter_mode, int boundary_mode)
torch::Tensor texture_fwd_mip(torch::Tensor tex, torch::Tensor uv, torch::Tensor uv_da, torch::Tensor mip_level_bias, TextureMipWrapper mip_wrap, int filter_mode, int boundary_mode)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(tex));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
TextureKernelParams p = {}; // Initialize all fields to zero.
torch::Tensor& mip = mip_wrap.mip; // Unwrap.
int max_mip_level = mip_wrap.max_mip_level;
set_modes(p, filter_mode, boundary_mode, max_mip_level);
// See if we have these tensors or not.
bool has_uv_da = uv_da.defined() && uv_da.nbytes();
bool has_mip_level_bias = mip_level_bias.defined() && mip_level_bias.nbytes();
if (p.enableMip)
{
NVDR_CHECK(uv_da.defined(), "mipmapping filter mode requires uv_da input");
NVDR_CHECK(has_uv_da || has_mip_level_bias, "mipmapping filter mode requires uv_da and/or mip_level_bias input");
NVDR_CHECK(mip.defined(), "mipmapping filter mode requires mip tensor input");
}
// Check inputs.
NVDR_CHECK_DEVICE(tex, uv);
NVDR_CHECK_CONTIGUOUS(tex, uv);
NVDR_CHECK_F32(tex, uv);
if (p.enableMip)
{
NVDR_CHECK_DEVICE(tex, uv, uv_da, mip);
NVDR_CHECK_CONTIGUOUS(tex, uv, uv_da, mip);
NVDR_CHECK_F32(tex, uv, uv_da, mip);
}
else
{
NVDR_CHECK_DEVICE(tex, uv);
NVDR_CHECK_CONTIGUOUS(tex, uv);
NVDR_CHECK_F32(tex, uv);
NVDR_CHECK_DEVICE(mip);
NVDR_CHECK_CONTIGUOUS(mip);
NVDR_CHECK_F32(mip);
if (has_uv_da)
{
NVDR_CHECK_DEVICE(uv_da);
NVDR_CHECK_CONTIGUOUS(uv_da);
NVDR_CHECK_F32(uv_da);
}
if (has_mip_level_bias)
{
NVDR_CHECK_DEVICE(mip_level_bias);
NVDR_CHECK_CONTIGUOUS(mip_level_bias);
NVDR_CHECK_F32(mip_level_bias);
}
}
// Sanity checks and state setters.
......@@ -205,19 +237,25 @@ torch::Tensor texture_fwd_mip(torch::Tensor tex, torch::Tensor uv, torch::Tensor
p.texDepth = tex.size(0);
if (p.enableMip)
{
if (!cube_mode)
NVDR_CHECK(uv_da.sizes().size() == 4 && uv_da.size(0) == p.n && uv_da.size(1) == p.imgHeight && uv_da.size(2) == p.imgWidth && uv_da.size(3) == 4, "uv_da must have shape [minibatch_size, height, width, 4]");
else
NVDR_CHECK(uv_da.sizes().size() == 4 && uv_da.size(0) == p.n && uv_da.size(1) == p.imgHeight && uv_da.size(2) == p.imgWidth && uv_da.size(3) == 6, "uv_da must have shape [minibatch_size, height, width, 6] in cube map mode");
if (has_uv_da)
{
if (!cube_mode)
NVDR_CHECK(uv_da.sizes().size() == 4 && uv_da.size(0) == p.n && uv_da.size(1) == p.imgHeight && uv_da.size(2) == p.imgWidth && uv_da.size(3) == 4, "uv_da must have shape [minibatch_size, height, width, 4]");
else
NVDR_CHECK(uv_da.sizes().size() == 4 && uv_da.size(0) == p.n && uv_da.size(1) == p.imgHeight && uv_da.size(2) == p.imgWidth && uv_da.size(3) == 6, "uv_da must have shape [minibatch_size, height, width, 6] in cube map mode");
}
if (has_mip_level_bias)
NVDR_CHECK(mip_level_bias.sizes().size() == 3 && mip_level_bias.size(0) == p.n && mip_level_bias.size(1) == p.imgHeight && mip_level_bias.size(2) == p.imgWidth, "mip_level_bias must have shape [minibatch_size, height, width]");
}
// Get input pointers.
p.tex = tex.data_ptr<float>();
p.uv = uv.data_ptr<float>();
p.uvDA = p.enableMip ? uv_da.data_ptr<float>() : NULL;
p.uvDA = (p.enableMip && has_uv_da) ? uv_da.data_ptr<float>() : NULL;
p.mipLevelBias = (p.enableMip && has_mip_level_bias) ? mip_level_bias.data_ptr<float>() : NULL;
// Allocate output tensor.
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::TensorOptions opts = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
torch::Tensor out = torch::empty({p.n, p.imgHeight, p.imgWidth, p.channels}, opts);
p.out = out.data_ptr<float>();
......@@ -263,8 +301,8 @@ torch::Tensor texture_fwd_mip(torch::Tensor tex, torch::Tensor uv, torch::Tensor
dim3 blockSize = getLaunchBlockSize(TEX_FWD_MAX_KERNEL_BLOCK_WIDTH, TEX_FWD_MAX_KERNEL_BLOCK_HEIGHT, p.imgWidth, p.imgHeight);
dim3 gridSize = getLaunchGridSize(blockSize, p.imgWidth, p.imgHeight, p.n);
// Choose kernel based on filter mode, cube mode, and datatype.
void* func_tbl[TEX_MODE_COUNT * 3 * 2] = {
// Choose kernel based on filter mode, cube mode, bias-only mode, and datatype.
void* func_tbl[TEX_MODE_COUNT * 2 * 2 * 3] = {
(void*)TextureFwdKernelNearest1,
(void*)TextureFwdKernelNearest2,
(void*)TextureFwdKernelNearest4,
......@@ -289,13 +327,39 @@ torch::Tensor texture_fwd_mip(torch::Tensor tex, torch::Tensor uv, torch::Tensor
(void*)TextureFwdKernelCubeLinearMipmapLinear1,
(void*)TextureFwdKernelCubeLinearMipmapLinear2,
(void*)TextureFwdKernelCubeLinearMipmapLinear4,
NULL,
NULL,
NULL,
NULL,
NULL,
NULL,
(void*)TextureFwdKernelLinearMipmapNearestBO1,
(void*)TextureFwdKernelLinearMipmapNearestBO2,
(void*)TextureFwdKernelLinearMipmapNearestBO4,
(void*)TextureFwdKernelLinearMipmapLinearBO1,
(void*)TextureFwdKernelLinearMipmapLinearBO2,
(void*)TextureFwdKernelLinearMipmapLinearBO4,
NULL,
NULL,
NULL,
NULL,
NULL,
NULL,
(void*)TextureFwdKernelCubeLinearMipmapNearestBO1,
(void*)TextureFwdKernelCubeLinearMipmapNearestBO2,
(void*)TextureFwdKernelCubeLinearMipmapNearestBO4,
(void*)TextureFwdKernelCubeLinearMipmapLinearBO1,
(void*)TextureFwdKernelCubeLinearMipmapLinearBO2,
(void*)TextureFwdKernelCubeLinearMipmapLinearBO4,
};
// Function index.
int func_idx = p.filterMode;
if (cube_mode)
func_idx += TEX_MODE_COUNT;
func_idx = func_idx * 3 + channel_div_idx;
func_idx += TEX_MODE_COUNT; // Cube variant.
if (p.enableMip && !has_uv_da)
func_idx += TEX_MODE_COUNT * 2; // Bias-only variant.
func_idx = func_idx * 3 + channel_div_idx; // Choose vector size.
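    // Worked example (assuming TEX_MODE_COUNT == 4 and TEX_MODE_LINEAR_MIPMAP_LINEAR == 3):
    // a 4-channel cube map in bias-only linear-mipmap-linear mode gives
    // func_idx = ((3 + 4) + 8) * 3 + 2 = 47, i.e. TextureFwdKernelCubeLinearMipmapLinearBO4.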
// Launch kernel.
NVDR_CHECK_CUDA_ERROR(cudaLaunchKernel(func_tbl[func_idx], gridSize, blockSize, args, 0, stream));
......@@ -308,37 +372,52 @@ torch::Tensor texture_fwd_mip(torch::Tensor tex, torch::Tensor uv, torch::Tensor
torch::Tensor texture_fwd(torch::Tensor tex, torch::Tensor uv, int filter_mode, int boundary_mode)
{
torch::Tensor empty_tensor;
return texture_fwd_mip(tex, uv, empty_tensor, TextureMipWrapper(), filter_mode, boundary_mode);
return texture_fwd_mip(tex, uv, empty_tensor, empty_tensor, TextureMipWrapper(), filter_mode, boundary_mode);
}
//------------------------------------------------------------------------
// Gradient op.
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> texture_grad_linear_mipmap_linear(torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, torch::Tensor uv_da, TextureMipWrapper mip_wrap, int filter_mode, int boundary_mode)
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor> texture_grad_linear_mipmap_linear(torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, torch::Tensor uv_da, torch::Tensor mip_level_bias, TextureMipWrapper mip_wrap, int filter_mode, int boundary_mode)
{
const at::cuda::OptionalCUDAGuard device_guard(device_of(tex));
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
TextureKernelParams p = {}; // Initialize all fields to zero.
torch::Tensor& mip = mip_wrap.mip; // Unwrap.
int max_mip_level = mip_wrap.max_mip_level;
set_modes(p, filter_mode, boundary_mode, max_mip_level);
// See if we have these tensors or not.
bool has_uv_da = uv_da.defined() && uv_da.nbytes();
bool has_mip_level_bias = mip_level_bias.defined() && mip_level_bias.nbytes();
if (p.enableMip)
{
NVDR_CHECK(uv_da.defined(), "mipmapping filter mode requires uv_da input in gradient");
NVDR_CHECK(mip.defined(), "mipmapping filter mode requires mip input in gradient");
NVDR_CHECK(has_uv_da || has_mip_level_bias, "mipmapping filter mode requires uv_da and/or mip_level_bias input");
NVDR_CHECK(mip.defined(), "mipmapping filter mode requires mip tensor input");
}
// Check inputs.
NVDR_CHECK_DEVICE(tex, uv);
NVDR_CHECK_CONTIGUOUS(tex, uv);
NVDR_CHECK_F32(tex, uv);
if (p.enableMip)
{
NVDR_CHECK_DEVICE(tex, uv, dy, uv_da, mip);
NVDR_CHECK_CONTIGUOUS(tex, uv, uv_da, mip);
NVDR_CHECK_F32(tex, uv, dy, uv_da, mip);
}
else
{
NVDR_CHECK_DEVICE(tex, uv, dy);
NVDR_CHECK_CONTIGUOUS(tex, uv);
NVDR_CHECK_F32(tex, uv, dy);
NVDR_CHECK_DEVICE(mip);
NVDR_CHECK_CONTIGUOUS(mip);
NVDR_CHECK_F32(mip);
if (has_uv_da)
{
NVDR_CHECK_DEVICE(uv_da);
NVDR_CHECK_CONTIGUOUS(uv_da);
NVDR_CHECK_F32(uv_da);
}
if (has_mip_level_bias)
{
NVDR_CHECK_DEVICE(mip_level_bias);
NVDR_CHECK_CONTIGUOUS(mip_level_bias);
NVDR_CHECK_F32(mip_level_bias);
}
}
// Sanity checks and state setters.
......@@ -368,13 +447,18 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> texture_grad_linear_mipm
p.texDepth = tex.size(0);
if (p.enableMip)
{
if (!cube_mode)
NVDR_CHECK(uv_da.sizes().size() == 4 && uv_da.size(0) == p.n && uv_da.size(1) == p.imgHeight && uv_da.size(2) == p.imgWidth && uv_da.size(3) == 4, "uv_da must have shape [minibatch_size, height, width, 4]");
else
NVDR_CHECK(uv_da.sizes().size() == 4 && uv_da.size(0) == p.n && uv_da.size(1) == p.imgHeight && uv_da.size(2) == p.imgWidth && uv_da.size(3) == 6, "uv_da must have shape [minibatch_size, height, width, 6] in cube map mode");
if (has_uv_da)
{
if (!cube_mode)
NVDR_CHECK(uv_da.sizes().size() == 4 && uv_da.size(0) == p.n && uv_da.size(1) == p.imgHeight && uv_da.size(2) == p.imgWidth && uv_da.size(3) == 4, "uv_da must have shape [minibatch_size, height, width, 4]");
else
NVDR_CHECK(uv_da.sizes().size() == 4 && uv_da.size(0) == p.n && uv_da.size(1) == p.imgHeight && uv_da.size(2) == p.imgWidth && uv_da.size(3) == 6, "uv_da must have shape [minibatch_size, height, width, 6] in cube map mode");
}
if (has_mip_level_bias)
NVDR_CHECK(mip_level_bias.sizes().size() == 3 && mip_level_bias.size(0) == p.n && mip_level_bias.size(1) == p.imgHeight && mip_level_bias.size(2) == p.imgWidth, "mip_level_bias must have shape [minibatch_size, height, width]");
}
NVDR_CHECK(dy.sizes().size() == 4 && dy.size(0) == p.n && dy.size(1) == p.imgHeight && dy.size(2) == p.imgWidth && dy.size(3) == p.channels, "dy must have shape [minibatch_size, height, width, channels]");
// Get contiguous version of dy.
torch::Tensor dy_ = dy.contiguous();
......@@ -382,7 +466,8 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> texture_grad_linear_mipm
p.tex = tex.data_ptr<float>();
p.uv = uv.data_ptr<float>();
p.dy = dy_.data_ptr<float>();
p.uvDA = p.enableMip ? uv_da.data_ptr<float>() : NULL;
p.uvDA = (p.enableMip && has_uv_da) ? uv_da.data_ptr<float>() : NULL;
p.mipLevelBias = (p.enableMip && has_mip_level_bias) ? mip_level_bias.data_ptr<float>() : NULL;
p.mip = p.enableMip ? (float*)mip.data_ptr<float>() : NULL;
// Allocate output tensor for tex gradient.
......@@ -392,16 +477,28 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> texture_grad_linear_mipm
// Allocate output tensor for uv gradient.
torch::Tensor grad_uv;
torch::Tensor grad_uv_da;
torch::Tensor grad_mip_level_bias;
if (p.filterMode != TEX_MODE_NEAREST)
{
grad_uv = torch::empty_like(uv);
p.gradUV = grad_uv.data_ptr<float>();
// Allocate output tensor for uv_da gradient.
// Gradients for things affecting mip level.
if (p.filterMode == TEX_MODE_LINEAR_MIPMAP_LINEAR)
{
grad_uv_da = torch::empty_like(uv_da);
p.gradUVDA = grad_uv_da.data_ptr<float>();
// Allocate output tensor for uv_da gradient.
if (has_uv_da)
{
grad_uv_da = torch::empty_like(uv_da);
p.gradUVDA = grad_uv_da.data_ptr<float>();
}
// Allocate output tensor for mip_level_bias gradient.
if (has_mip_level_bias)
{
grad_mip_level_bias = torch::empty_like(mip_level_bias);
p.gradMipLevelBias = grad_mip_level_bias.data_ptr<float>();
}
}
}
......@@ -457,7 +554,7 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> texture_grad_linear_mipm
dim3 blockSize = getLaunchBlockSize(TEX_GRAD_MAX_KERNEL_BLOCK_WIDTH, TEX_GRAD_MAX_KERNEL_BLOCK_HEIGHT, p.imgWidth, p.imgHeight);
dim3 gridSize = getLaunchGridSize(blockSize, p.imgWidth, p.imgHeight, p.n);
void* func_tbl[TEX_MODE_COUNT * 2] = {
void* func_tbl[TEX_MODE_COUNT * 2 * 2] = {
(void*)TextureGradKernelNearest,
(void*)TextureGradKernelLinear,
(void*)TextureGradKernelLinearMipmapNearest,
......@@ -466,12 +563,22 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> texture_grad_linear_mipm
(void*)TextureGradKernelCubeLinear,
(void*)TextureGradKernelCubeLinearMipmapNearest,
(void*)TextureGradKernelCubeLinearMipmapLinear,
NULL,
NULL,
(void*)TextureGradKernelLinearMipmapNearestBO,
(void*)TextureGradKernelLinearMipmapLinearBO,
NULL,
NULL,
(void*)TextureGradKernelCubeLinearMipmapNearestBO,
(void*)TextureGradKernelCubeLinearMipmapLinearBO,
};
// Function index.
int func_idx = p.filterMode;
if (cube_mode)
func_idx += TEX_MODE_COUNT;
func_idx += TEX_MODE_COUNT; // Cube variant.
if (p.enableMip && !has_uv_da)
func_idx += TEX_MODE_COUNT * 2; // Bias-only variant.
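    // Worked example (assuming TEX_MODE_COUNT == 4 and TEX_MODE_LINEAR_MIPMAP_LINEAR == 3):
    // a cube map in bias-only linear-mipmap-linear mode gives func_idx = 3 + 4 + 8 = 15,
    // i.e. TextureGradKernelCubeLinearMipmapLinearBO.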
// Launch main gradient kernel.
NVDR_CHECK_CUDA_ERROR(cudaLaunchKernel(func_tbl[func_idx], gridSize, blockSize, args, 0, stream));
......@@ -488,14 +595,14 @@ std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> texture_grad_linear_mipm
}
// Return output tensors.
return std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>(grad_tex, grad_uv, grad_uv_da);
return std::tuple<torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor>(grad_tex, grad_uv, grad_uv_da, grad_mip_level_bias);
}
// Version for nearest filter mode.
torch::Tensor texture_grad_nearest(torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, int filter_mode, int boundary_mode)
{
torch::Tensor empty_tensor;
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> result = texture_grad_linear_mipmap_linear(tex, uv, dy, empty_tensor, TextureMipWrapper(), filter_mode, boundary_mode);
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor> result = texture_grad_linear_mipmap_linear(tex, uv, dy, empty_tensor, empty_tensor, TextureMipWrapper(), filter_mode, boundary_mode);
return std::get<0>(result);
}
......@@ -503,14 +610,14 @@ torch::Tensor texture_grad_nearest(torch::Tensor tex, torch::Tensor uv, torch::T
std::tuple<torch::Tensor, torch::Tensor> texture_grad_linear(torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, int filter_mode, int boundary_mode)
{
torch::Tensor empty_tensor;
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> result = texture_grad_linear_mipmap_linear(tex, uv, dy, empty_tensor, TextureMipWrapper(), filter_mode, boundary_mode);
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor> result = texture_grad_linear_mipmap_linear(tex, uv, dy, empty_tensor, empty_tensor, TextureMipWrapper(), filter_mode, boundary_mode);
return std::tuple<torch::Tensor, torch::Tensor>(std::get<0>(result), std::get<1>(result));
}
// Version for linear-mipmap-nearest mode.
std::tuple<torch::Tensor, torch::Tensor> texture_grad_linear_mipmap_nearest(torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, torch::Tensor uv_da, TextureMipWrapper mip, int filter_mode, int boundary_mode)
std::tuple<torch::Tensor, torch::Tensor> texture_grad_linear_mipmap_nearest(torch::Tensor tex, torch::Tensor uv, torch::Tensor dy, torch::Tensor uv_da, torch::Tensor mip_level_bias, TextureMipWrapper mip, int filter_mode, int boundary_mode)
{
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> result = texture_grad_linear_mipmap_linear(tex, uv, dy, uv_da, mip, filter_mode, boundary_mode);
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor> result = texture_grad_linear_mipmap_linear(tex, uv, dy, uv_da, mip_level_bias, mip, filter_mode, boundary_mode);
return std::tuple<torch::Tensor, torch::Tensor>(std::get<0>(result), std::get<1>(result));
}
......
......@@ -15,7 +15,7 @@ class RasterizeGLState;
class RasterizeGLStateWrapper
{
public:
RasterizeGLStateWrapper (bool enableDB, bool automatic);
RasterizeGLStateWrapper (bool enableDB, bool automatic, int cudaDeviceIdx);
~RasterizeGLStateWrapper (void);
void setContext (void);
......@@ -23,6 +23,7 @@ public:
RasterizeGLState* pState;
bool automatic;
int cudaDeviceIdx;
};
//------------------------------------------------------------------------
......