Added precise division function to fix rounding issues due to `-use_fast_math`

Summary: This diff fixes a long-standing issue with skinning weights sometimes being negative. Since the initial values of skinning weights are always non-negative and blending coefficients are supposed to be in the range [0..1], any such blend of skinning weights should also be non-negative. Unfortunately, that was not the case in practice, despite various clamps. The issue was tracked down to this part of the code:

    c_a = mass_a / mass_ab;
    c_b = 1.0f - c_a;

Even if `mass_a` matches `mass_ab` bit for bit, `c_a` might not be equal to `1.0`; it is sometimes `0.999999940395355224609375` and sometimes `1.000000119209289550781250`. The latter value makes `c_b` negative, which leads to negative skinning weights. tsimk figured out that this behavior is due to the nvcc flag `-use_fast_math`, which makes every division operator `x/y` compile to `__fdividef(x, y)`, which in turn does not produce exactly 1.0 when dividing identical, bit-perfect numbers. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#intrinsic-functions . D71423305

Reviewed By: phg1024

Differential Revision: D71436810

fbshipit-source-id: 64c4e6368d07368ee75997da088d3952ed0c36d0
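The two values quoted in the summary are the single-precision floats immediately below and above 1.0 (1 - 2^-24 and 1 + 2^-23), so the approximate division is off by one unit in the last place. The host-only sketch below illustrates the arithmetic: it does not reproduce the GPU behavior, it only shows why the upper neighbor of 1.0 yields a negative `c_b`. The helper names (`below_one`, `above_one`, `complement`, `blend`) are hypothetical and do not appear in the commit, and the linear blend formula is an assumption about how the coefficients are used.

```cpp
#include <cmath>

// Largest float strictly below 1.0f, i.e. 1 - 2^-24.
inline float below_one() { return std::nextafter(1.0f, 0.0f); }

// Smallest float strictly above 1.0f, i.e. 1 + 2^-23.
inline float above_one() { return std::nextafter(1.0f, 2.0f); }

// Mirrors the line from the summary: c_b = 1.0f - c_a.
inline float complement(float c_a) { return 1.0f - c_a; }

// Hypothetical linear blend of two skinning weights (assumed form,
// not taken from the commit): w = c_a * w_a + c_b * w_b.
inline float blend(float w_a, float w_b, float c_a) {
  return c_a * w_a + complement(c_a) * w_b;
}
```

With `c_a == above_one()`, `complement(c_a)` is `-2^-23`, so blending `w_a = 0` with `w_b = 1` gives a small negative weight even though both inputs are non-negative.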

85f58cf1 · Stanislav Pidhorskyi · Facebook GitHub Bot · 9fc8931f · 85f58cf1
Commit 85f58cf1 authored Mar 18, 2025 by Stanislav Pidhorskyi Committed by Facebook GitHub Bot Mar 18, 2025

Showing with 28 additions and 0 deletions

src/include/cuda_math_helper.h +28 -0

--- a/src/include/cuda_math_helper.h
+++ b/src/include/cuda_math_helper.h
@@ -73,6 +73,34 @@ HD_FUNC double saturate(double a) {
  return fmin(fmax(a, 0.0), 1.0);
 }

+// Perform IEEE-compliant division even if `-use_fast_math` or `-prec-div=false` is set.
+// Useful when most of the code can be compiled with `-use_fast_math` but individual division
+// operations need to be precise, in particular when dividing a number by itself must return
+// exactly 1.0.
+HD_FUNC float precise_div(float a, float b) {
+  return HOST_DEVICE_DISPATCH(a / b, __fdiv_rn(a, b));
+}
+
+// See the function above. This is the overload for double. There is no fast division for
+// doubles, but it can be merged with additions into a mad operation. Using this function
+// guarantees that it won't be merged.
+HD_FUNC double precise_div(double a, double b) {
+  return HOST_DEVICE_DISPATCH(a / b, __ddiv_rn(a, b));
+}
+
+// Using this function always results in fast division, whether or not `-use_fast_math` or
+// `-prec-div=false` is set. Warning: this might produce a result slightly larger or smaller
+// than 1.0 when dividing exactly the same (bit-wise) numbers, which can lead to unexpected results.
+HD_FUNC float approx_div(float a, float b) {
+  return HOST_DEVICE_DISPATCH(a / b, __fdividef(a, b));
+}
+
+// See the functions above. This variant is not actually useful since there is no fast division
+// for double, but it exists to enable writing templated code that works with both float and double.
+HD_FUNC double approx_div(double a, double b) {
+  return a / b;
+}
+
 // If NVCC then use builtin abs/max/min/sqrt/rsqrt.
 // All of them have overloads for ints, floats, and doubles,defined in
 // `cuda/crt/math_functions.hpp` thus no need for explicit usage of e.g. fabsf
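As a rough illustration of how the new helper might be applied to the code quoted in the summary, here is a minimal sketch. The definitions of `HD_FUNC` and `HOST_DEVICE_DISPATCH` are not shown in this diff, so the sketch substitutes its own stand-ins (`SKETCH_HD` and an `#if defined(__CUDA_ARCH__)` branch); these, along with `sketch_precise_div`, `BlendCoeffs`, and `blend_coeffs`, are hypothetical names and are not part of the commit. It assumes `mass_ab` is non-zero.

```cpp
// Stand-in for HD_FUNC (assumption: the real macro expands to something similar).
#if defined(__CUDACC__)
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

// Stand-in for precise_div(float, float) from the diff above.
SKETCH_HD inline float sketch_precise_div(float a, float b) {
#if defined(__CUDA_ARCH__)
  // Device path: IEEE round-to-nearest division, unaffected by -use_fast_math.
  return __fdiv_rn(a, b);
#else
  // Host path: plain division.
  return a / b;
#endif
}

// Hypothetical pair of blending coefficients.
struct BlendCoeffs {
  float c_a;
  float c_b;
};

// Computes c_a = mass_a / mass_ab and c_b = 1 - c_a using precise division,
// so that mass_a == mass_ab gives exactly c_a == 1 and c_b == 0.
SKETCH_HD inline BlendCoeffs blend_coeffs(float mass_a, float mass_ab) {
  const float c_a = sketch_precise_div(mass_a, mass_ab);
  return BlendCoeffs{c_a, 1.0f - c_a};
}
```

Compiled as plain C++, only the host branch is exercised; under nvcc the device branch calls `__fdiv_rn`, which is the intrinsic the commit relies on for the `float` overload.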