$NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $

SoftFloat Release 2a General Documentation


-------------------------------------------------------------------------------
SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic. As many as four
formats are supported: single precision, double precision, extended double
precision, and quadruple precision. All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat. It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard. Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
SoftFloat is written in C and is designed to work with other C code. The
SoftFloat header files assume an ISO/ANSI-style C compiler. No attempt
has been made to accommodate compilers that are not ISO-conformant. In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic. If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to only single and double precisions. When that is the case, all
references in this document to the extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Standard Arithmetic Functions
    Round-to-Integer Functions
    Signaling NaN Test Functions
    Raise-Exception Function


-------------------------------------------------------------------------------
SoftFloat was written by John R. Hauser. This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704. Funding was partially
provided by the National Science Foundation under grant MIP-9311980. The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types: `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision). The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used. The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types: `float32' and `float64'. Because
ISO/ANSI C guarantees a built-in integer type of at least 32 bits,
the `float32' type is identified with an appropriate integer type. The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types. (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)
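
The memory-layout remark above can be illustrated with a short sketch. It
assumes a host whose native `float' is a 32-bit IEC/IEEE single-precision
value. The `uint32_t' typedef is only a stand-in for whatever integer type
`softfloat.h' selects for `float32' on a given machine, and the two helper
names are hypothetical; they are not part of SoftFloat.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the `float32' type; the real definition is in softfloat.h. */
typedef uint32_t float32;

/* Reinterpret a native float as a float32 without changing any bits. */
float32 float32_from_native(float f)
{
    float32 a;
    memcpy(&a, &f, sizeof a);
    return a;
}

/* Reinterpret a float32 as a native float without changing any bits. */
float native_from_float32(float32 a)
{
    float f;
    memcpy(&f, &a, sizeof f);
    return f;
}
```

Copying with `memcpy' rather than casting pointers keeps the sketch within
the aliasing rules of ISO C. On a host with a different native format the
bit patterns would of course not correspond.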

SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations
   for all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEC/IEEE Standard.

-- For each floating-point format, a ``round to integer'' operation that
   rounds to the nearest integer value in the same format. (The
   floating-point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.


-------------------------------------------------------------------------------
All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding. The rounding mode is selected
by the global variable `float_rounding_mode'. This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'. The rounding mode is initialized
to nearest/even.
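
To make the four modes concrete, here is a small self-contained sketch that
rounds a value given in eighths to an integer under each mode. It is an
integer analogue and not SoftFloat code: the mode names are taken from the
text, but their numeric values, the stand-in global variable, and the
function `round_eighths' are assumptions made for illustration.

```c
/* Mode names follow the text; the numeric values are hypothetical. */
enum {
    float_round_nearest_even,
    float_round_to_zero,
    float_round_down,
    float_round_up
};

/* Stand-in for SoftFloat's global rounding-mode variable. */
int float_rounding_mode = float_round_nearest_even;

/* Round v/8 (v counts eighths) to an integer under the current mode. */
long round_eighths(long v)
{
    long fl = (v >= 0) ? v / 8 : -((-v + 7) / 8);  /* floor(v/8) */
    long frac = v - fl * 8;                        /* discarded part, 0..7 */

    switch (float_rounding_mode) {
    case float_round_down:                         /* toward -infinity */
        return fl;
    case float_round_up:                           /* toward +infinity */
        return frac ? fl + 1 : fl;
    case float_round_to_zero:                      /* truncate */
        return (frac && fl < 0) ? fl + 1 : fl;
    default:                                       /* nearest, ties to even */
        if (frac > 4 || (frac == 4 && fl % 2 != 0)) return fl + 1;
        return fl;
    }
}
```

With the default mode, `round_eighths(20)' (that is, 2.5) gives 2 and
`round_eighths(28)' (3.5) gives 4, since ties go to the even neighbor.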


-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'. The operations affected are:

    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format. Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively. When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified. Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.
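
The effect of reduced rounding precision on a significand can be sketched
as follows. This is only an illustration under stated assumptions, not the
SoftFloat implementation: it takes a 64-bit significand as input, handles
round-to-nearest/even only, and assumes that settings of 32 and 64 keep 24
and 53 significand bits, matching the single- and double-precision formats.
The function name `round_significand' is hypothetical.

```c
#include <stdint.h>

/* Round a 64-bit significand to the width implied by a rounding-precision
   setting of 32, 64, or 80 (nearest/even only).  Returns the rounded
   significand; *carry is set when rounding carried out of the top bit, in
   which case a caller would have to increment the exponent. */
uint64_t round_significand(uint64_t sig, int rounding_precision, int *carry)
{
    int keep = (rounding_precision == 32) ? 24
             : (rounding_precision == 64) ? 53 : 64;
    uint64_t unit, mask, rest, kept;

    *carry = 0;
    if (keep == 64) return sig;           /* full precision: nothing to drop */

    unit = (uint64_t)1 << (64 - keep);    /* weight of the last kept bit */
    mask = unit - 1;                      /* bits beyond the rounding point */
    rest = sig & mask;
    kept = sig & ~mask;                   /* low bits are now zero */

    if (rest > unit / 2 || (rest == unit / 2 && (kept & unit))) {
        kept += unit;                     /* round up */
        if (kept == 0) {                  /* carried out of bit 63 */
            kept = (uint64_t)1 << 63;
            *carry = 1;
        }
    }
    return kept;
}
```

Whatever the input, the bits below the rounding point in the returned
value are zero whenever the precision argument is 32 or 64.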


-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented. Each flag is stored as a unique bit in the global variable
`float_exception_flags'. The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'. The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name. To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).
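
The flag handling described above amounts to ordinary bit-mask operations.
The sketch below is self-contained and does not include `softfloat.h': the
mask names come from the text, but the bit values shown are hypothetical,
the global is a stand-in, and the three `sketch_' helpers are invented for
illustration.

```c
/* Mask names follow the text; the bit values here are hypothetical. */
enum {
    float_flag_inexact   = 1,
    float_flag_underflow = 2,
    float_flag_overflow  = 4,
    float_flag_divbyzero = 8,
    float_flag_invalid   = 16
};

/* Stand-in for SoftFloat's global flags variable: all 0, no exceptions. */
int float_exception_flags = 0;

/* Hypothetical helper: set the given flags, as an operation might. */
void sketch_flags_set(int mask) { float_exception_flags |= mask; }

/* Hypothetical helper: nonzero if any flag in mask is currently set. */
int sketch_flags_test(int mask) { return (float_exception_flags & mask) != 0; }

/* Hypothetical helper: clear flags with the statement given in the text. */
void sketch_flags_clear(int mask) { float_exception_flags &= ~mask; }
```

A typical pattern would be to clear the flags of interest, perform a
sequence of operations, and then test the flags once at the end, since the
flags stay set until they are explicitly cleared.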

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding. The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals. The other option is provided for compatibility
with some systems. Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


-------------------------------------------------------------------------------

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers. The complete set of conversion functions is:

    int32_to_float32      int64_to_float32
    int32_to_float64      int64_to_float64
    int32_to_floatx80     int64_to_floatx80
    int32_to_float128     int64_to_float128

    float32_to_int32      float32_to_int64
    float64_to_int32      float64_to_int64
    floatx80_to_int32     floatx80_to_int64
    float128_to_int32     float128_to_int64

    float32_to_float64    float32_to_floatx80   float32_to_float128
    float64_to_float32    float64_to_floatx80   float64_to_float128
    floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
    float128_to_float32   float128_to_float64   float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result. Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding. Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.
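
Why a 32-bit integer always fits exactly in double precision but not in
single precision can be seen with the host's own types. The sketch below
assumes the native `float' and `double' follow the IEC/IEEE single- and
double-precision formats (24-bit and 53-bit significands); it does not call
SoftFloat, and the function name is hypothetical.

```c
#include <stdint.h>

/* Nonzero if i converts to the native float type without rounding. */
int int32_exact_in_float(int32_t i)
{
    volatile float f = (float)i;      /* volatile forces rounding to float */
    return (double)f == (double)i;    /* double holds every int32 exactly */
}
```

For example, 16777216 (2^24) is exact in single precision, while 16777217
needs 25 significant bits and so must be rounded.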

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits). If the floating-point operand is a NaN, the largest
positive integer is returned. Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'. Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

    float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
    float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
    floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
    float128_to_int32_round_to_zero   float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.
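
The NaN and overflow rules for conversions to integer can be mimicked on
the host's `double' type. The following is a sketch of the behavior
described above for a round-toward-zero conversion to 32 bits, not the
SoftFloat code itself; the function name is hypothetical, and an `invalid'
output parameter stands in for raising the invalid exception.

```c
#include <stdint.h>

/* Convert x to a 32-bit integer, rounding toward zero.  *invalid is set to
   1 when the result is not representable (NaN or out of range). */
int32_t sketch_to_int32_round_to_zero(double x, int *invalid)
{
    *invalid = 0;
    if (x != x) {                     /* NaN compares unequal to itself */
        *invalid = 1;
        return INT32_MAX;             /* largest positive integer */
    }
    if (x >= 2147483648.0) {          /* too large even after truncation */
        *invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {         /* too negative even after truncation */
        *invalid = 1;
        return INT32_MIN;             /* largest magnitude, same sign */
    }
    return (int32_t)x;                /* C conversion truncates toward zero */
}
```

Note that a value such as -2147483648.5 is still valid here, because it
truncates to the representable integer -2147483648.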

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard. The remainder functions are:

    float32_rem
    float64_rem
    floatx80_rem
    float128_rem

Each remainder function takes two operands. The operands and result are all
of the same type. Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y. If x/y is exactly
halfway between two integers, n is the even integer closest to x/y. The
remainder functions are always exact and so require no rounding.

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions. This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.
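
The definition x - n*y, with n rounded to nearest and ties going to the
even integer, can be tried out on plain integers. The sketch below is an
integer analogue written for illustration only; `sketch_rem' is a
hypothetical name, y must be nonzero, and the operands are assumed small
enough that doubling them does not overflow.

```c
/* IEC/IEEE-style remainder on integers: x - n*y, where n is the integer
   nearest to x/y and ties are resolved toward even n.  Requires y != 0. */
long sketch_rem(long x, long y)
{
    long n = x / y;                   /* quotient truncated toward zero */
    long r = x - n * y;               /* same sign as x, |r| < |y| */
    long ar = (r < 0) ? -r : r;
    long ay = (y < 0) ? -y : y;

    /* Move n one step away from zero if that brings it nearer to x/y,
       or if x/y is exactly halfway and the truncated n is odd. */
    if (2 * ar > ay || (2 * ar == ay && n % 2 != 0)) {
        n += ((x < 0) == (y < 0)) ? 1 : -1;
        r = x - n * y;
    }
    return r;
}
```

Unlike the C `%' operator, the result can be negative for positive
operands: 7 and 2 give n = 4 and a remainder of -1, while 5 and 2 give
n = 2 and a remainder of 1.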

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard. The functions are:

    float32_round_to_int
    float64_round_to_int
    floatx80_round_to_int
    float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type. (Note that the result is not an integer type.) The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

    float32_eq    float32_le    float32_lt
    float64_eq    float64_le    float64_lt
    floatx80_eq   floatx80_le   floatx80_lt
    float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_. The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided. The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.
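
The derivations just described can be written out directly. In the sketch
below, the three `sketch_' base functions are stand-ins built on the host's
`float' type so that the example is self-contained; with SoftFloat one
would wrap the corresponding `eq', `le', and `lt' functions of a format
instead. All names here are hypothetical.

```c
/* Stand-ins for one format's eq/le/lt functions, using the host's float. */
int sketch_eq(float a, float b) { return a == b; }
int sketch_le(float a, float b) { return a <= b; }
int sketch_lt(float a, float b) { return a < b; }

/* Not-equal is the logical complement of equal. */
int sketch_ne(float a, float b) { return !sketch_eq(a, b); }

/* Greater-than-or-equal is less-than-or-equal with operands reversed. */
int sketch_ge(float a, float b) { return sketch_le(b, a); }

/* Greater-than is less-than with operands reversed. */
int sketch_gt(float a, float b) { return sketch_lt(b, a); }
```

Built this way, a NaN operand makes `sketch_gt' and `sketch_ge' return 0
and `sketch_ne' return 1, which is the behavior expected of unordered
operands.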

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs. For completeness, SoftFloat provides the following
additional functions:

    float32_eq_signaling    float32_le_quiet    float32_lt_quiet
    float64_eq_signaling    float64_le_quiet    float64_lt_quiet
    floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
    float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input. Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

    float32_is_signaling_nan
    float64_is_signaling_nan
    floatx80_is_signaling_nan
    float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.
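
For single precision, such a test comes down to inspecting a few bit
fields. The sketch below assumes the common convention in which a NaN is
signaling when the most significant fraction bit is 0; which NaN encodings
count as signaling can differ between systems, so this may not match a
particular SoftFloat build. The function name is hypothetical and the
operand is passed as a raw 32-bit pattern.

```c
#include <stdint.h>

/* Return 1 if the 32-bit pattern a encodes a signaling NaN, assuming the
   convention that a clear top fraction bit marks a NaN as signaling. */
int sketch_float32_is_signaling_nan(uint32_t a)
{
    uint32_t exponent = (a >> 23) & 0xFF;       /* 8 exponent bits */
    uint32_t fraction = a & 0x007FFFFF;         /* 23 fraction bits */
    int is_nan = (exponent == 0xFF) && (fraction != 0);
    int quiet_bit = (int)((a >> 22) & 1);       /* top fraction bit */

    return is_nan && !quiet_bit;
}
```

Infinities (all-ones exponent with a zero fraction) and ordinary numbers
are rejected by the `is_nan' test before the quiet bit is considered.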

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise. No
result is returned. In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.