1. 04 Apr, 2025 1 commit
    • Fix edge grad assert mismatch · 3c14f46c
      Tomas Simon authored
      Summary:
      * The sizes for `vi` in the `edge_grad_estimator_fwd` assert were not updated after D68534639 expanded the dimension to 3.
      * This updates the size in the assert and adds an explicit call to `edge_grad_estimator_fwd` (a no-op) in the autograd implementation to make sure the sizes are checked; a sketch of the check follows below.
      
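      A minimal sketch of the kind of check this implies (the helper name and the assert's exact wording are assumptions, not the diff's actual code):

      ```lang=cpp
      #include <torch/torch.h>

      // Hypothetical sketch: after D68534639 the triangle tensor carries a
      // batch dimension, so the size check validates a 3-dim tensor.
      inline void check_vi(const torch::Tensor& vi) {
        TORCH_CHECK(vi.dim() == 3 && vi.size(2) == 3,
                    "vi must be a [Batch x MaxTriangles x 3] tensor");
      }
      ```
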
      Reviewed By: HapeMask, phg1024
      
      Differential Revision: D72433642
      
      fbshipit-source-id: 49dd82e0a07fe174c2157b362eedf464984d386d
  2. 19 Mar, 2025 1 commit
    • Added precise division function to fix rounding issues due to `-use_fast_math` · 85f58cf1
      Stanislav Pidhorskyi authored
      Summary:
      This diff fixes a long-standing issue with skinning weights sometimes being negative.

      Since the initial skinning weights are always non-negative and the blending coefficients are supposed to be in the range [0..1], a blend of skinning weights should also be non-negative.

      Unfortunately, that was not the case in practice, despite various clamps.
      
      The issue was hunted down to this part of the code:
      
        c_a = mass_a / mass_ab;
        c_b = 1.0f - c_a;
      
      Even if `mass_a` matches `mass_ab` bit-perfectly, `c_a` might not be equal to `1.0`, but sometimes to `0.999999940395355224609375` and sometimes to `1.000000119209289550781250`. The latter value causes `c_b` to be negative, which leads to negative skinning weights.
      
      tsimk figured out that this behavior is due to the nvcc flag `-use_fast_math`, which makes all division operators `x/y` compile to `__fdividef(x, y)`, which in turn does not always produce exactly 1.0 when dividing bit-identical numbers. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#intrinsic-functions and D71423305.
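
      A minimal sketch of the kind of helper this diff adds (the actual function name is not shown in the summary, so `precise_div` is hypothetical): it routes division through the IEEE round-to-nearest intrinsic `__fdiv_rn`, which `-use_fast_math` does not rewrite into the approximate `__fdividef`.

      ```lang=cpp
      #include <cuda_runtime.h>

      // Hypothetical sketch: IEEE-compliant single-precision division that
      // is immune to -use_fast_math, which only rewrites the `/` operator.
      __device__ __forceinline__ float precise_div(float x, float y) {
        return __fdiv_rn(x, y);
      }

      // With this helper, c_a is exactly 1.0f whenever mass_a and mass_ab
      // are bit-identical, so c_b = 1.0f - c_a can no longer dip below zero:
      //   float c_a = precise_div(mass_a, mass_ab);
      //   float c_b = 1.0f - c_a;
      ```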
      
      Reviewed By: phg1024
      
      Differential Revision: D71436810
      
      fbshipit-source-id: 64c4e6368d07368ee75997da088d3952ed0c36d0
  3. 24 Feb, 2025 1 commit
    • Updated cuda_math_helper · 9fc8931f
      Stanislav Pidhorskyi authored
      Summary:
      Added the missing `^=` operator to cuda_math_helper.
      Also added the functions `isfinite`, `isinf`, and `isnan`, as well as a vector version of `sqrt`; a sketch of what these look like follows below.
      
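      A sketch of the kind of helpers described (the exact signatures and return types in cuda_math_helper are assumptions; the suffixed names below avoid clashing with the scalar builtins, while the library most likely overloads the plain names):

      ```lang=cpp
      #include <cuda_runtime.h>
      #include <cmath>

      // Hypothetical sketch of the missing compound-assignment operator.
      __host__ __device__ inline int2& operator^=(int2& a, const int2& b) {
        a.x ^= b.x;
        a.y ^= b.y;
        return a;
      }

      // Component-wise NaN test (NaN is the only value unequal to itself).
      __host__ __device__ inline bool isnan3(const float3& v) {
        return v.x != v.x || v.y != v.y || v.z != v.z;
      }

      // Vector version of sqrt, applied component-wise.
      __host__ __device__ inline float3 sqrt3(const float3& v) {
        return make_float3(sqrtf(v.x), sqrtf(v.y), sqrtf(v.z));
      }
      ```
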
      Reviewed By: tsimk
      
      Differential Revision: D70019653
      
      Privacy Context Container: L1258975
      
      fbshipit-source-id: ee7671a73110417bd8a99be87ff6d243c2d07938
  4. 23 Jan, 2025 3 commits
    • 4/N Batchify EdgeGrad · 910228e6
      Kishore Venkateshan authored
      Summary:
      # Problem
      In CT / State Encoding, we expect a scenario where we would like to render a batch of topologies, each with a different number of vertices and triangles. Currently, the only way to support this with DRTK is to iterate over the batch in a for loop and render each topology separately.
      In a series of diffs we would like to solve this issue by making DRTK consume a batch of triangles as opposed to just one set of triangles. However, we would like to achieve this without significantly affecting the most common single-topology case.
      
      # How do we pass in multiple topologies in a single batch?
      We will provide a TopologyBatch structure in xrcia/lib/graphics/structures with functionality to create `Batch x MaxTriangles x 3` triangle and `Batch x MaxVertices x 3` vertex tensors.
      Padded vertices will be zeros and padded triangles will have `MaxVertices - 1` as their value, but these will be discarded as degenerate during rasterization / rendering.
      
      # In this diff
      - Extend `edge_grad_estimator` to support a batch dimension by default.
      - `edge_grad_kernel` will now unsqueeze the batch dimension when using a single topology.
      - We access the vertex indices of triangles by walking an additional `batch stride * n` in the triangles data pointer (see the sketch after this list).
      - Add an extra check for degenerate triangles, which occur when padding the batch.
      - We show that the optimization continues to produce the same results as in D68538236.
      
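      A minimal sketch of the indexing scheme described above (function and parameter names are assumptions, not the diff's actual code):

      ```lang=cpp
      #include <cuda_runtime.h>
      #include <cstdint>

      // Hypothetical sketch: vi points at a [Batch x MaxTriangles x 3] index
      // tensor, n is the batch element, and t the triangle within it. We walk
      // batch_stride * n (batch_stride == MaxTriangles * 3 for a contiguous
      // tensor) before the usual per-triangle offset.
      __device__ int3 load_triangle(
          const int32_t* vi, int64_t batch_stride, int n, int t) {
        const int32_t* tri = vi + batch_stride * n + 3 * static_cast<int64_t>(t);
        return make_int3(tri[0], tri[1], tri[2]);
      }
      ```
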
      Reviewed By: podgorskiy
      
      Differential Revision: D68534639
      
      fbshipit-source-id: 4f0ed24075d71b73b9313ecc61296e9567219b0d
    • 3/N Batchify Interpolate Kernel · bdb1bfdb
      Kishore Venkateshan authored
      Summary:
      # Problem
      In CT / State Encoding, we expect a scenario where we would like to render a batch of topologies, each with a different number of vertices and triangles. Currently, the only way to support this with DRTK is to iterate over the batch in a for loop and render each topology separately.
      In a series of diffs we would like to solve this issue by making DRTK consume a batch of triangles as opposed to just one set of triangles. However, we would like to achieve this without significantly affecting the most common single-topology case.
      
      # How do we pass in multiple topologies in a single batch?
      We will provide a TopologyBatch structure in xrcia/lib/graphics/structures with functionality to create `Batch x MaxTriangles x 3` triangle and `Batch x MaxVertices x 3` vertex tensors.
      Padded vertices will be zeros and padded triangles will have `MaxVertices - 1` as their value, but these will be discarded as degenerate during rasterization / rendering.
      
      # In this diff
      - Extend the `interpolate` kernel and the `interpolate_backward` kernel to support a batch dimension by default (see the sketch after this list).
      - `interpolate` will now unsqueeze the batch dimension when using a single topology.
      - We access the vertex indices of triangles by walking an additional `batch stride * n` in the triangles data pointer.
      - Add an extra check for degenerate triangles, which occur when padding the batch.
      - We show that these three extra operations do not introduce much GPU overhead (same profiling as in D68529076).
      
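      A rough sketch of what the batched interpolation amounts to for one scalar attribute channel (names and signature are assumptions):

      ```lang=cpp
      #include <cuda_runtime.h>
      #include <cstdint>

      // Hypothetical sketch: fetch the triangle for batch element n with the
      // extra batch_stride * n offset, then blend the per-vertex attribute
      // values with the barycentric coordinates of the pixel.
      __device__ float interpolate_attr(
          const float* attr, const int32_t* vi, int64_t batch_stride,
          int n, int t, float3 bary) {
        const int32_t* tri = vi + batch_stride * n + 3 * static_cast<int64_t>(t);
        return attr[tri[0]] * bary.x + attr[tri[1]] * bary.y + attr[tri[2]] * bary.z;
      }
      ```
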
      Reviewed By: podgorskiy
      
      Differential Revision: D68400728
      
      fbshipit-source-id: d13dbde5cc379789132953c05f6f9289748d67c7
    • 2/N batchify render kernel · d4216dd3
      Kishore Venkateshan authored
      Summary:
      # Problem
      In CT / State Encoding, we expect a scenario where we would like to render a batch of topologies, each with a different number of vertices and triangles. Currently, the only way to support this with DRTK is to iterate over the batch in a for loop and render each topology separately.
      In a series of diffs we would like to solve this issue by making DRTK consume a batch of triangles as opposed to just one set of triangles. However, we would like to achieve this without significantly affecting the most common single-topology case.
      
      # How do we pass in multiple topologies in a single batch?
      We will provide a TopologyBatch structure in xrcia/lib/graphics/structures with functionality to create `Batch x MaxTriangles x 3` triangle and `Batch x MaxVertices x 3` vertex tensors.
      Padded vertices will be zeros and padded triangles will have `MaxVertices - 1` as their value, but these will be discarded as degenerate during rasterization / rendering.
      
      # In this diff
      - Extend the render kernel and the render backward kernel to support a batch dimension by default.
      - `render` will now unsqueeze the batch dimension when using a single topology.
      - We access the vertex indices of triangles by walking an additional `batch stride * n` in the triangles data pointer.
      - Add an extra check for degenerate triangles, which occur when padding the batch (see the sketch after this list).
      - We show that these three extra operations do not introduce much GPU overhead (same profiling as in D68423813).
      
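      A minimal sketch of the degeneracy check implied by the padding scheme (the exact predicate used in the diff is an assumption):

      ```lang=cpp
      #include <cuda_runtime.h>

      // Hypothetical sketch: a triangle whose vertex indices are not all
      // distinct is degenerate. Padded triangles repeat the same index
      // (MaxVertices - 1) three times, so this test also skips the padding.
      __device__ bool is_degenerate(int3 tri) {
        return tri.x == tri.y || tri.y == tri.z || tri.x == tri.z;
      }
      ```
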
      Reviewed By: podgorskiy
      
      Differential Revision: D68423409
      
      fbshipit-source-id: e1007b9844658ef6e1bb2267b6a94804f3b6d13b
  5. 22 Jan, 2025 1 commit
    • 1/N Batchify Rasterization Kernel · e0274716
      Kishore Venkateshan authored
      Summary:
      # Problem
      
      In CT / State Encoding, we expect a scenario where we would like to render a batch of topologies, each with a different number of vertices and triangles. Currently, the only way to support this with DRTK is to iterate over the batch in a for loop and render each topology separately.

      In a series of diffs we would like to solve this issue by making DRTK consume a batch of triangles as opposed to just one set of triangles. **However, we would like to achieve this without significantly affecting the most common single-topology case.**
      
      # How do we pass in multiple topologies in a single batch?
      - We will provide a `TopologyBatch` structure in xrcia/lib/graphics/structures with functionality to create `Batch x MaxTriangles x 3` triangle and `Batch x MaxVertices x 3` vertex tensors.
      - Padded vertices will be zeros and padded triangles will have `MaxVertices - 1` as their value, but these will be discarded as degenerate during rasterization / rendering.
      
      # In this diff
      - Extend `rasterize_kernel` and `rasterize_lines_kernel` to support a batch dimension by default.
      - `rasterize` will now unsqueeze the batch dimension when using a single topology (see the sketch after this list).
      - We access the vertex indices of triangles by walking an additional `batch stride * n` in the triangles data pointer.
      - Add an extra check for degenerate triangles, which occur when padding the batch.
      - We show that these three extra operations do not introduce much GPU overhead (same profiling as in D68194200).
      
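      A sketch of the single-topology handling on the host side (`ensure_batched` is a hypothetical helper; the actual wrapper code is not shown in the summary):

      ```lang=cpp
      #include <torch/torch.h>

      // Hypothetical sketch: a single topology passed as [MaxTriangles x 3]
      // is unsqueezed to [1 x MaxTriangles x 3] so the kernels can always
      // assume a leading batch dimension; a batched tensor passes through.
      inline torch::Tensor ensure_batched(const torch::Tensor& vi) {
        return vi.dim() == 2 ? vi.unsqueeze(0) : vi;
      }
      ```
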
      Differential Revision: D68388659
      
      fbshipit-source-id: b4f8a7daab8b133b8538f7e5db4f730f70b71deb
  6. 15 Jan, 2025 1 commit
  7. 19 Dec, 2024 1 commit
  8. 26 Sep, 2024 1 commit
    • Licence change to MIT · 36eb2e83
      Stanislav Pidhorskyi authored
      Summary: Got legal approval 🥳
      
      Reviewed By: una-dinosauria
      
      Differential Revision: D63428775
      
      fbshipit-source-id: 7568ef2861ef10c2bd0367a7195cbbedf96ec8be
  9. 12 Aug, 2024 1 commit
    • grid_scatter · b0810efa
      Stanislav Pidhorskyi authored
      Summary:
      Adds a `grid_scatter` op that is similar to `grid_sample`, but the grid points to the destination location instead of the source.

      `grid_scatter` is indeed dual to `grid_sample`: the forward of `grid_scatter` is the backward of `grid_sample`, and the backward of `grid_scatter` is the forward of `grid_sample` (with the exception of the gradient with respect to the grid), which is reflected in the reference implementation in `drtk/grid_scatter.py`.
      
      ```python
      def grid_scatter(
          input: th.Tensor,
          grid: th.Tensor,
          output_height: int,
          output_width: int,
          mode: str = "bilinear",
          padding_mode: str = "border",
          align_corners: Optional[bool] = None,
      ) -> th.Tensor:
      ```
      
      Where:
      * `input` [N x C x H x W]: the input tensor whose values will be transferred to the result.
      * `grid` [N x H x W x 2]: the grid tensor that points to the locations to which the values from the `input` tensor should be copied. The `W` and `H` sizes of `grid` should match the corresponding sizes of the `input` tensor.
      * `output_height`, `output_width`: size of the output, which will be [N x C x `output_height` x `output_width`]. In contrast to `grid_sample`, we can no longer rely on the sizes of `grid` for this information.
      * `mode`, `padding_mode`, `align_corners`: same as for `grid_sample`, but now for the reverse operation: splatting (or scattering).
      
      At the moment it does not support "nearest" mode, which is rarely needed; it may be added later.
      
      Ideally, we would also want to support autocast mode, where the `input` and output tensors are float16 while the `grid` is float32. This is not the case at the moment, but I'll add that later.
      
      ## Example usage
      
      Let's assume that we have loaded a mesh into `v, vi, vt, vti`, have defined `image_width, image_height`, `cam_pos`, `cam_rot`, `focal`, `princpt`, and have computed the mesh normals `normals`. We also define a shading function, e.g.:
      
      ```lang=python
      def shade(
          vn_img: th.Tensor,
          light_dir: th.Tensor,
          ambient_intensity: float = 1.0,
          direct_intensity: float = 1.0,
          shadow_img: Optional[th.Tensor] = None,
      ):
          ambient = (vn_img[:, 1:2] * 0.5 + 0.5) * th.as_tensor([0.45, 0.5, 0.7]).cuda()[
              None, :, None, None
          ]
          direct = (
              th.sum(vn_img.mul(thf.normalize(light_dir, dim=1)), dim=1, keepdim=True).clamp(
                  min=0.0
              )
              * th.as_tensor([0.65, 0.6, 0.5]).cuda()[None, :, None, None]
          )
          if shadow_img is not None:
              direct = direct * shadow_img
          return th.pow(ambient * ambient_intensity + direct * direct_intensity, 1 / 2.2)
      ```
      
      And we can render the image as:
      
      ```lang=python
      v_pix = transform(v, cam_pos, cam_rot, focal, princpt)
      index_img = rasterize(v_pix, vi, image_height, image_width)
      _, bary_img = render(v_pix, vi, index_img)
      
      # mask image
      mask: th.Tensor = (index_img != -1)[:, None]
      
      # compute vt image
      vt_img = interpolate(vt.mul(2.0).sub(1.0)[None], vti, index_img, bary_img)
      
      # compute normals
      vn_img = interpolate(normals, vi, index_img, bary_img)
      
      diffuse = (
          shade(vn_img, th.as_tensor([0.5, 0.5, 0.0]).cuda()[None, :, None, None]) * mask
      )
      ```
      
       {F1801805545}
      
      ## Shadow mapping
      
      We can use  `grid_scatter` to compute mesh visibility from the camera view:
      
      ```lang=python
      texel_weight = grid_scatter(
          mask.float(),
          vt_img.permute(0, 2, 3, 1),
          output_width=512,
          output_height=512,
          mode="bilinear",
          padding_mode="border",
          align_corners=False,
      )
      threshold = 0.1  # texel_weight is proportional to how much pixel area the texel covers; we can set a threshold for how much covered area counts as visible.
      visibility = (texel_weight > threshold).float()
      ```
       {F1801810094}
      
      Now we can render the scene from a different angle and use the visibility mask for shadows:
      
      ```lang=python
      v_pix = transform(v, cam_pos_new, cam_rot_new, focal, princpt)
      index_img = rasterize(v_pix, vi, image_height, image_width)
      _, bary_img = render(v_pix, vi, index_img)
      
      # mask image
      mask: th.Tensor = (index_img != -1)[:, None]
      
      # compute vt image
      vt_img = interpolate(vt.mul(2.0).sub(1.0)[None], vti, index_img, bary_img)
      
      # compute v image (for near-field)
      v_img = interpolate(v, vi, index_img, bary_img)
      
      # shadow
      shadow_img = thf.grid_sample(visibility, vt_img.permute(0, 2, 3, 1), mode="bilinear", padding_mode="border", align_corners=False)
      
      # compute normals
      vn_img = interpolate(normals, vi, index_img, bary_img)
      
      diffuse = shade(vn_img, cam_pos[:, :, None, None] - v_img, 0.05, 0.4, shadow_img) * mask
      ```
       {F1801811232}
      
      ## Texture projection
      
      Let's load a test image:
      
      ```lang=python
      import skimage
      test_image = (
          th.as_tensor(skimage.data.coffee(), dtype=th.float32).permute(2, 0, 1)[None, ...].mul(1 / 255).contiguous().cuda()
      )
      
      test_image = thf.interpolate(test_image, scale_factor=2.0, mode="bilinear", align_corners=False)
      ```
      
      {F1801814094}
      
      We can use `grid_scatter` to project the image onto the uv space:
      
      ```lang=python
      camera_image_extended = (
          th.cat([test_image, th.ones_like(test_image[:, :1])], dim=1) * mask
      )
      
      texture_weight = grid_scatter(
          camera_image_extended,
          vt_img.permute(0, 2, 3, 1),
          output_width=512,
          output_height=512,
          mode="bilinear",
          padding_mode="border",
          align_corners=False,
      )
      
      texture = texture_weight[:, :3] / texture_weight[:, 3:4].clamp(min=1e-4)
      ```
      
      {F1801816367}
      
      And if we render the scene from a different angle using the projected texture:
      
       {F1801817130}
      
      Reviewed By: HapeMask
      
      Differential Revision: D61006613
      
      fbshipit-source-id: 98c83ba4eda531e9d73cb9e533176286dc699f63
  10. 07 Aug, 2024 1 commit
    • Fixed uninitialized entries in the gradient of `bary_img` for unused pixels · 3c37fcb2
      Stanislav Pidhorskyi authored
      Summary:
      For unused warps we always write zeros to `bary_img_grad`.
      However, it is possible that a warp is used while only a portion of its threads are; in that case, the unused threads did not write zeros to `bary_img_grad`.
      
      For efficiency, `bary_img_grad` is created with `at::empty`, so the aforementioned entries will still hold uninitialized values.
      
      This is not an issue in itself, because the `render` function will never read those entries; however, the uninitialized values may happen to be `nan`, which triggers a false positive in autograd anomaly detection. Please see D60904848 for more details about the issue. A sketch of the implied fix follows below.
      
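      A minimal sketch of the fix this implies (kernel name and signature are assumptions): every launched thread zeroes its `bary_img_grad` entry before any early exit, so the `at::empty`-allocated buffer is fully initialized.

      ```lang=cpp
      #include <cuda_runtime.h>
      #include <cstdint>

      // Hypothetical sketch of the backward kernel's early-exit path.
      __global__ void render_backward_sketch(
          float* bary_img_grad, const int32_t* index_img, int n_pixels) {
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n_pixels) return;
        // Zero first: previously a thread whose pixel had no triangle could
        // return before writing, leaving uninitialized memory behind.
        bary_img_grad[idx] = 0.f;
        if (index_img[idx] == -1) return;
        // ... gradient computation for covered pixels ...
      }
      ```
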
      Differential Revision: D60912474
      
      fbshipit-source-id: 6eda5a07789db456c17eb60de222dd4c7b1c53d2
  11. 08 Jun, 2024 1 commit