I have two floating point values,
b. I can guarantee they are values in the domain (0, 1). Is there any circumstance where
a * b could equal one? I intend to calculate
1/(1 - a * b), and wish to avoid a divide by zero.
My instinct is that it cannot, because the result should be equal or smaller to
b. But instincts are a poor replacement for understanding the correct behavior.
I do not get to specify the rounding mode, so if there’s a rounding mode where I could get into trouble, I want to know about it.
Edit: I did not specify whether the compiler was IEEE compliant or not because I cannot guarantee that the compiler/CPU running my software will indeed by IEEE compliant.
I have two floating point values,
Since this says we have “values,” not “variables,” it admits a possibility that
1 - a*b may evaluate to 1. When writing about software, people sometimes use names as placeholders for more complicated expressions. For example, one might have an expression
a that is
sin(x)/x and an expression
b that is
1-y*y and then ask about computing
1 - a*b when the code is actually
1 - (sin(x)/x)*(1-y*y). This would be a problem because C++ allows extra precision to be used when evaluating floating-point expressions.
The most common instances of this is that the compiler uses
long double arithmetic while computing expressions containing
double operands or it uses a fused multiply-add instructions while computing an expression of the format
x + y*z.
b have been computed with excess precision and are positive values less than 1 in that excess precision. E.g., for illustration, suppose
double were implemented with four decimal digits but
b were computed with
long double with six decimal digits.
b could both be .999999. Then
a*b is .999998000001 before rounding, .999998 after rounding to six digits. Now suppose that at this point in the computation, the compiler converts from
long double to
double, perhaps because it decides to store this intermediate value on the stack temporarily while it computes some other things from nearby expressions. Converting it to four-digit
double produces 1.000, because that is the four-decimal-digit number nearest .999998. When the compiler later loads this from the stack and continues evaluation, we have
1 - 1.000, and the result is zero.
On the other hand, if
b are variables, I expect your expression is safe. When a value is assigned to a variable or is converted with a cast operation, the C++ standard requires it to be converted to the nominal type; the result must be a value in the nominal type, without any “extra precision.” Then, given 0 <
a < 1 and 0 <
b < 1, the mathematical value (that, without floating-point rounding)
b is less than
a and is less than
b. Then rounding of
b to the nominal type cannot produce a value greater than
b with any IEEE-754 rounding method, so it cannot produce 1. (The only requirement here is that the rounding method never skip over values—it might be constrained to round in a particular direction, upward or downward or toward zero or whatever, but it never goes past a representable value in that direction to get to a value farther away from the unrounded result. Since we know
b is bounded above by both
b, rounding cannot produce any result greater than the lesser of
Formally, the C++ standard does not impose any requirements on the accuracy of floating-point results. So a C++ implementation could use a bonkers rounding mode that produced 3.14 for .9*.9. Aside from implementations flushing subnormals to zero, I am not aware of any C++ implementations that do not obey the requirement above. Flushing subnormals to zero will not affect calculations in
1 - a*b when
b are near 1. (In a perverse floating-point format, with an exponent range narrower than the significand and no subnormal values, .9999 could be representable while .0001 is not because the exponent required for it is out of range. Then
1-.9999*.9999, which would produce .0002 in normal four-digit arithmetic, would produce 0 due to underflow. No such formats are in normal hardware.)
b are variables, 0 <
a < 1 and
b < 1, and your C++ implementation is reasonable (may use extra precision, may flush subnormals, does not use perverse floating-point formats or rounding), then
1 - a*b does not evaluate to zero.