   1 $NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $
   2 $FreeBSD$
   3
   4 SoftFloat Release 2a General Documentation
   5
   6 John R. Hauser
   7 1998 December 13
   8
   9
  10 -------------------------------------------------------------------------------
  11 Introduction
  12
  13 SoftFloat is a software implementation of floating-point that conforms to
  14 the IEC/IEEE Standard for Binary Floating-Point Arithmetic.  As many as four
  15 formats are supported:  single precision, double precision, extended double
  16 precision, and quadruple precision.  All operations required by the standard
  17 are implemented, except for conversions to and from decimal.
  18
  19 This document gives information about the types defined and the routines
  20 implemented by SoftFloat.  It does not attempt to define or explain the
  21 IEC/IEEE Floating-Point Standard.  Details about the standard are available
  22 elsewhere.
  23
  24
  25 -------------------------------------------------------------------------------
  26 Limitations
  27
  28 SoftFloat is written in C and is designed to work with other C code.  The
  29 SoftFloat header files assume an ISO/ANSI-style C compiler.  No attempt
  30 has been made to accommodate compilers that are not ISO-conformant.  In
  31 particular, the distributed header files will not be acceptable to any
  32 compiler that does not recognize function prototypes.
  33
  34 Support for the extended double-precision and quadruple-precision formats
  35 depends on a C compiler that implements 64-bit integer arithmetic.  If the
  36 largest integer format supported by the C compiler is 32 bits, SoftFloat is
  37 limited to only single and double precisions.  When that is the case, all
   38 references in this document to extended double precision, quadruple
  39 precision, and 64-bit integers should be ignored.
  40
  41
  42 -------------------------------------------------------------------------------
  43 Contents
  44
  45     Introduction
  46     Limitations
  47     Contents
  48     Legal Notice
  49     Types and Functions
  50     Rounding Modes
  51     Extended Double-Precision Rounding Precision
  52     Exceptions and Exception Flags
  53     Function Details
  54         Conversion Functions
  55         Standard Arithmetic Functions
  56         Remainder Functions
  57         Round-to-Integer Functions
  58         Comparison Functions
  59         Signaling NaN Test Functions
  60         Raise-Exception Function
  61     Contact Information
  62
  63
  64
  65 -------------------------------------------------------------------------------
  66 Legal Notice
  67
  68 SoftFloat was written by John R. Hauser.  This work was made possible in
  69 part by the International Computer Science Institute, located at Suite 600,
  70 1947 Center Street, Berkeley, California 94704.  Funding was partially
  71 provided by the National Science Foundation under grant MIP-9311980.  The
  72 original version of this code was written as part of a project to build
  73 a fixed-point vector processor in collaboration with the University of
  74 California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.
  75
  76 THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE.  Although reasonable effort
  77 has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
  78 TIMES RESULT IN INCORRECT BEHAVIOR.  USE OF THIS SOFTWARE IS RESTRICTED TO
  79 PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
  80 AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.
  81
  82
  83 -------------------------------------------------------------------------------
  84 Types and Functions
  85
  86 When 64-bit integers are supported by the compiler, the `softfloat.h' header
  87 file defines four types:  `float32' (single precision), `float64' (double
  88 precision), `floatx80' (extended double precision), and `float128'
  89 (quadruple precision).  The `float32' and `float64' types are defined in
  90 terms of 32-bit and 64-bit integer types, respectively, while the `float128'
  91 type is defined as a structure of two 64-bit integers, taking into account
  92 the byte order of the particular machine being used.  The `floatx80' type
  93 is defined as a structure containing one 16-bit and one 64-bit integer, with
  94 the machine's byte order again determining the order of the `high' and `low'
  95 fields.
  96
  97 When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
  98 header file defines only two types:  `float32' and `float64'.  Because
   99 ISO/ANSI C guarantees a built-in integer type of at least 32 bits,
 100 the `float32' type is identified with an appropriate integer type.  The
 101 `float64' type is defined as a structure of two 32-bit integers, with the
 102 machine's byte order determining the order of the fields.
 103
 104 In either case, the types in `softfloat.h' are defined such that if a system
 105 implements the usual C `float' and `double' types according to the IEC/IEEE
 106 Standard, then the `float32' and `float64' types should be indistinguishable
 107 in memory from the native `float' and `double' types.  (On the other hand,
 108 when `float32' or `float64' values are placed in processor registers by
 109 the compiler, the type of registers used may differ from those used for the
 110 native `float' and `double' types.)
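
As a rough illustration of the memory-layout claim above, the sketch below
copies a native `float' into a 32-bit integer and back.  The typedef
`float32_sketch' and both helper functions are names invented for this
example (the real `float32' comes from `softfloat.h'), and the code assumes
the host's native `float' follows the IEC/IEEE single-precision format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SoftFloat's `float32'.  The real type is
   defined in `softfloat.h' and may use a different integer type. */
typedef uint32_t float32_sketch;

/* Reinterpret the bytes of a native float as a 32-bit pattern.
   memcpy is used instead of a pointer cast to avoid aliasing problems. */
float32_sketch float32_sketch_from_native(float f)
{
    float32_sketch bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* The reverse direction: view a 32-bit pattern as a native float. */
float float32_sketch_to_native(float32_sketch bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

On such a host, `float32_sketch_from_native(1.0f)' yields 0x3F800000, the
single-precision encoding of 1.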
 111
 112 SoftFloat implements the following arithmetic operations:
 113
 114 -- Conversions among all the floating-point formats, and also between
 115    integers (32-bit and 64-bit) and any of the floating-point formats.
 116
 117 -- The usual add, subtract, multiply, divide, and square root operations
 118    for all floating-point formats.
 119
 120 -- For each format, the floating-point remainder operation defined by the
 121    IEC/IEEE Standard.
 122
 123 -- For each floating-point format, a ``round to integer'' operation that
 124    rounds to the nearest integer value in the same format.  (The floating-
 125    point formats can hold integer values, of course.)
 126
 127 -- Comparisons between two values in the same floating-point format.
 128
 129 The only functions required by the IEC/IEEE Standard that are not provided
 130 are conversions to and from decimal.
 131
 132
 133 -------------------------------------------------------------------------------
 134 Rounding Modes
 135
 136 All four rounding modes prescribed by the IEC/IEEE Standard are implemented
 137 for all operations that require rounding.  The rounding mode is selected
 138 by the global variable `float_rounding_mode'.  This variable may be set
 139 to one of the values `float_round_nearest_even', `float_round_to_zero',
 140 `float_round_down', or `float_round_up'.  The rounding mode is initialized
 141 to nearest/even.
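
The four modes differ only in which neighboring value an inexact result is
mapped to.  The sketch below applies the same four rules to an integer
quotient rather than to a floating-point significand, which keeps it short.
The enum and function names are placeholders for this example and are not
SoftFloat identifiers.

```c
#include <stdint.h>

/* Placeholder mode names for this sketch only.  SoftFloat's own constants
   are `float_round_nearest_even', `float_round_to_zero',
   `float_round_down', and `float_round_up'. */
enum sketch_round {
    SKETCH_NEAREST_EVEN,
    SKETCH_TO_ZERO,
    SKETCH_DOWN,
    SKETCH_UP
};

/* Round the exact quotient num/den (den > 0) to an integer. */
int64_t sketch_round_quotient(int64_t num, int64_t den, enum sketch_round mode)
{
    int64_t q = num / den;            /* C division truncates toward zero */
    int64_t r = num % den;            /* remainder has the sign of num */

    if (r == 0) return q;             /* exact: no rounding needed */
    if (r < 0) { q -= 1; r += den; }  /* now q is the floor, 0 < r < den */

    switch (mode) {
    case SKETCH_DOWN:    return q;                     /* toward -infinity */
    case SKETCH_UP:      return q + 1;                 /* toward +infinity */
    case SKETCH_TO_ZERO: return (num < 0) ? q + 1 : q; /* drop the fraction */
    default:                                           /* nearest, ties to even */
        if (2 * r > den) return q + 1;
        if (2 * r < den) return q;
        return (q & 1) ? q + 1 : q;
    }
}
```

For the quotient -5/2 (exactly -2.5), the sketch returns -2 for nearest/even,
-2 for round-to-zero, -3 for round-down, and -2 for round-up.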
 142
 143
 144 -------------------------------------------------------------------------------
 145 Extended Double-Precision Rounding Precision
 146
 147 For extended double precision (`floatx80') only, the rounding precision
 148 of the standard arithmetic operations is controlled by the global variable
 149 `floatx80_rounding_precision'.  The operations affected are:
 150
 151    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
 152
 153 When `floatx80_rounding_precision' is set to its default value of 80, these
 154 operations are rounded (as usual) to the full precision of the extended
 155 double-precision format.  Setting `floatx80_rounding_precision' to 32
 156 or to 64 causes the operations listed to be rounded to reduced precision
 157 equivalent to single precision (`float32') or to double precision
 158 (`float64'), respectively.  When rounding to reduced precision, additional
 159 bits in the result significand beyond the rounding point are set to zero.
 160 The consequences of setting `floatx80_rounding_precision' to a value other
  161 than 32, 64, or 80 are not specified.  Operations other than the ones listed
 162 above are not affected by `floatx80_rounding_precision'.
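
To make "rounded to reduced precision" concrete, the sketch below rounds a
64-bit significand to the 24 significant bits of single precision or the 53
bits of double precision, using nearest/even, and zeroes the bits beyond the
rounding point.  It is an illustration under those assumptions and not
SoftFloat's own code.  The function name and the `carry' output are invented
for the example.

```c
#include <stdint.h>

/* Round a 64-bit significand to the given precision (32, 64, or 80) using
   round-to-nearest-even.  *carry is set to 1 when rounding carries out of
   the top bit, in which case a caller would also increment the exponent. */
uint64_t sketch_round_sig(uint64_t sig, int precision, int *carry)
{
    uint64_t mask, half, low;

    *carry = 0;
    if (precision == 32)
        mask = UINT64_C(0x000000FFFFFFFFFF);   /* discard low 40 bits */
    else if (precision == 64)
        mask = UINT64_C(0x00000000000007FF);   /* discard low 11 bits */
    else
        return sig;                            /* 80: keep all 64 bits */

    half = (mask >> 1) + 1;                    /* half a unit in the last kept place */
    low  = sig & mask;                         /* the bits being discarded */
    sig &= ~mask;                              /* bits past the rounding point become 0 */

    /* Round up when above half, or exactly half with an odd kept part. */
    if (low > half || (low == half && (sig & (mask + 1)))) {
        uint64_t before = sig;
        sig += mask + 1;
        if (sig < before) {                    /* wrapped past 2^64 */
            *carry = 1;
            sig = UINT64_C(0x8000000000000000);
        }
    }
    return sig;
}
```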
 163
 164
 165 -------------------------------------------------------------------------------
 166 Exceptions and Exception Flags
 167
 168 All five exception flags required by the IEC/IEEE Standard are
 169 implemented.  Each flag is stored as a unique bit in the global variable
 170 `float_exception_flags'.  The positions of the exception flag bits within
 171 this variable are determined by the bit masks `float_flag_inexact',
 172 `float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
 173 `float_flag_invalid'.  The exception flags variable is initialized to all 0,
 174 meaning no exceptions.
 175
 176 An individual exception flag can be cleared with the statement
 177
 178     float_exception_flags &= ~ float_flag_<exception>;
 179
 180 where `<exception>' is the appropriate name.  To raise a floating-point
 181 exception, the SoftFloat function `float_raise' should be used (see below).
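
The flag handling described above amounts to ordinary bit-mask operations.
The sketch below shows raising, testing, and clearing flags with stand-in
names.  The mask values are placeholders, since the real values are fixed by
`softfloat.h' for each target, and `sketch_raise' only sets flags, whereas
the real `float_raise' may also trap.

```c
/* Stand-in declarations for this sketch only.  In a real build the
   corresponding names come from `softfloat.h', and the mask values below
   are placeholders. */
enum {
    sketch_flag_inexact   = 0x01,
    sketch_flag_underflow = 0x02,
    sketch_flag_overflow  = 0x04,
    sketch_flag_divbyzero = 0x08,
    sketch_flag_invalid   = 0x10
};

unsigned sketch_exception_flags = 0;    /* all 0: no exceptions */

/* Minimal stand-in for `float_raise': sets flags and never traps. */
void sketch_raise(unsigned mask)
{
    sketch_exception_flags |= mask;
}

/* Return 1 if any flag in `mask' is currently set, else 0. */
int sketch_test(unsigned mask)
{
    return (sketch_exception_flags & mask) != 0;
}

/* Clear the flags in `mask', leaving all other flags untouched. */
void sketch_clear(unsigned mask)
{
    sketch_exception_flags &= ~mask;
}
```

Because a flag stays set until it is cleared, a typical pattern is to clear
the flags of interest, perform the operation, and then test them.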
 182
 183 In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
 184 for underflow either before or after rounding.  The choice is made by
 185 the global variable `float_detect_tininess', which can be set to either
 186 `float_tininess_before_rounding' or `float_tininess_after_rounding'.
 187 Detecting tininess after rounding is better because it results in fewer
 188 spurious underflow signals.  The other option is provided for compatibility
 189 with some systems.  Like most systems, SoftFloat always detects loss of
 190 accuracy for underflow as an inexact result.
 191
 192
 193 -------------------------------------------------------------------------------
 194 Function Details
 195
 196 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 197 Conversion Functions
 198
 199 All conversions among the floating-point formats are supported, as are all
 200 conversions between a floating-point format and 32-bit and 64-bit signed
 201 integers.  The complete set of conversion functions is:
 202
 203    int32_to_float32      int64_to_float32
  204    int32_to_float64      int64_to_float64
 205    int32_to_floatx80     int64_to_floatx80
 206    int32_to_float128     int64_to_float128
 207
 208    float32_to_int32      float32_to_int64
  209    float64_to_int32      float64_to_int64
 210    floatx80_to_int32     floatx80_to_int64
 211    float128_to_int32     float128_to_int64
 212
 213    float32_to_float64    float32_to_floatx80   float32_to_float128
 214    float64_to_float32    float64_to_floatx80   float64_to_float128
 215    floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
 216    float128_to_float32   float128_to_float64   float128_to_floatx80
 217
 218 Each conversion function takes one operand of the appropriate type and
 219 returns one result.  Conversions from a smaller to a larger floating-point
 220 format are always exact and so require no rounding.  Conversions from 32-bit
 221 integers to double precision and larger formats are also exact, and likewise
 222 for conversions from 64-bit integers to extended double and quadruple
 223 precisions.
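
The exactness claims follow from significand widths:  double precision
carries 53 significant bits, enough for every 32-bit integer but not for
every 64-bit one.  The sketch below checks this with a round trip through
the host's native `double', which is assumed here to be IEC/IEEE double
precision.  It does not call SoftFloat, and the function names are invented
for the example.

```c
#include <stdint.h>

/* Return 1 if `v' survives a round trip through native double unchanged. */
int sketch_int32_exact_in_double(int32_t v)
{
    return (int32_t)(double)v == v;
}

/* Same check for 64-bit integers. */
int sketch_int64_exact_in_double(int64_t v)
{
    double d = (double)v;

    /* 2^63 does not fit in int64_t, so guard the cast back. */
    if (d >= 9223372036854775808.0) return 0;
    return (int64_t)d == v;
}
```

For instance, 2^53 + 1 (9007199254740993) is the smallest positive integer
that fails the 64-bit check.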
 224
 225 Conversions from floating-point to integer raise the invalid exception if
 226 the source value cannot be rounded to a representable integer of the desired
 227 size (32 or 64 bits).  If the floating-point operand is a NaN, the largest
 228 positive integer is returned.  Otherwise, if the conversion overflows, the
 229 largest integer with the same sign as the operand is returned.
 230
 231 On conversions to integer, if the floating-point operand is not already an
 232 integer value, the operand is rounded according to the current rounding
 233 mode as specified by `float_rounding_mode'.  Because C (and perhaps other
  234 languages) requires that conversions to integers be rounded toward zero, the
 235 following functions are provided for improved speed and convenience:
 236
 237    float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
 238    float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
 239    floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
 240    float128_to_int32_round_to_zero   float128_to_int64_round_to_zero
 241
 242 These variant functions ignore `float_rounding_mode' and always round toward
 243 zero.
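
The NaN and overflow rules for conversions to integer, combined with
rounding toward zero, can be sketched on a native `double' as follows.  This
is an illustration of the documented results only.  It assumes the host's
`double' is IEC/IEEE double precision, the function name is invented, and
the `invalid' output stands in for raising the invalid exception.

```c
#include <stdint.h>

/* Convert x to a 32-bit integer, rounding toward zero.  *invalid is set
   to 1 in the cases where the invalid exception would be raised. */
int32_t sketch_double_to_int32_rtz(double x, int *invalid)
{
    *invalid = 0;

    if (x != x) {                      /* only a NaN is unequal to itself */
        *invalid = 1;
        return INT32_MAX;              /* NaN: largest positive integer */
    }
    if (x >= 2147483648.0) {           /* too large even after truncation */
        *invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {          /* too small even after truncation */
        *invalid = 1;
        return INT32_MIN;              /* largest magnitude with the operand's sign */
    }
    return (int32_t)x;                 /* in range: the C cast truncates toward zero */
}
```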
 244
 245 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 246 Standard Arithmetic Functions
 247
 248 The following standard arithmetic functions are provided:
 249
 250    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
 251    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
 252    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
 253    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt
 254
 255 Each function takes two operands, except for `sqrt' which takes only one.
 256 The operands and result are all of the same type.
 257
 258 Rounding of the extended double-precision (`floatx80') functions is affected
 259 by the `floatx80_rounding_precision' variable, as explained above in the
 260 section _Extended_Double-Precision_Rounding_Precision_.
 261
 262 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 263 Remainder Functions
 264
 265 For each format, SoftFloat implements the remainder function according to
 266 the IEC/IEEE Standard.  The remainder functions are:
 267
 268    float32_rem
 269    float64_rem
 270    floatx80_rem
 271    float128_rem
 272
 273 Each remainder function takes two operands.  The operands and result are all
 274 of the same type.  Given operands x and y, the remainder functions return
 275 the value x - n*y, where n is the integer closest to x/y.  If x/y is exactly
 276 halfway between two integers, n is the even integer closest to x/y.  The
 277 remainder functions are always exact and so require no rounding.
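
The definition x - n*y can be tried out on integers, where every step is
exact.  The sketch below is such an integer-only illustration, with an
invented function name.  It is not how SoftFloat computes the remainder, and
it assumes y > 0.

```c
#include <stdint.h>

/* Return x - n*y, where n is the integer nearest to x/y and ties go to
   the even n.  Assumes y > 0. */
int64_t sketch_ieee_rem(int64_t x, int64_t y)
{
    int64_t q = x / y;                /* truncated quotient */
    int64_t r = x % y;                /* has the sign of x */

    if (r < 0) { q -= 1; r += y; }    /* now q = floor(x/y), 0 <= r < y */

    /* n is either q or q + 1.  Choose q + 1 when x/y lies above the
       midpoint, or exactly at the midpoint with q odd. */
    if (2 * r > y || (2 * r == y && (q & 1)))
        r -= y;

    return r;
}
```

For x = 7 and y = 2, x/y is 3.5, so n is 4 and the result is -1.  Unlike the
C `%' operator, the result can be negative even when both operands are
positive.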
 278
 279 Depending on the relative magnitudes of the operands, the remainder
 280 functions can take considerably longer to execute than the other SoftFloat
 281 functions.  This is inherent in the remainder operation itself and is not a
 282 flaw in the SoftFloat implementation.
 283
 284 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 285 Round-to-Integer Functions
 286
 287 For each format, SoftFloat implements the round-to-integer function
 288 specified by the IEC/IEEE Standard.  The functions are:
 289
 290    float32_round_to_int
 291    float64_round_to_int
 292    floatx80_round_to_int
 293    float128_round_to_int
 294
 295 Each function takes a single floating-point operand and returns a result of
 296 the same type.  (Note that the result is not an integer type.)  The operand
 297 is rounded to an exact integer according to the current rounding mode, and
 298 the resulting integer value is returned in the same floating-point format.
 299
 300 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 301 Comparison Functions
 302
 303 The following floating-point comparison functions are provided:
 304
 305    float32_eq    float32_le    float32_lt
 306    float64_eq    float64_le    float64_lt
 307    floatx80_eq   floatx80_le   floatx80_lt
 308    float128_eq   float128_le   float128_lt
 309
 310 Each function takes two operands of the same type and returns a 1 or 0
 311 representing either _true_ or _false_.  The abbreviation `eq' stands for
 312 ``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
 313 for ``less than'' (<).
 314
 315 The standard greater-than (>), greater-than-or-equal (>=), and not-equal
 316 (!=) functions are easily obtained using the functions provided.  The
 317 not-equal function is just the logical complement of the equal function.
 318 The greater-than-or-equal function is identical to the less-than-or-equal
 319 function with the operands reversed; and the greater-than function can be
 320 obtained from the less-than function in the same way.
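
The derivations just described can be written as thin wrappers.  In the
sketch below, `sketch_eq', `sketch_le', and `sketch_lt' are stand-ins built
on the host's native `float' comparisons (assumed IEC/IEEE), so that the
example compiles without SoftFloat.  With SoftFloat, the wrappers would call
`float32_eq', `float32_le', and `float32_lt' instead.

```c
/* Stand-ins for one format's `eq', `le', and `lt' functions.  Like the
   SoftFloat functions they return 1 or 0, but they set no exception flags. */
static int sketch_eq(float a, float b) { return a == b; }
static int sketch_le(float a, float b) { return a <= b; }
static int sketch_lt(float a, float b) { return a <  b; }

/* not-equal: logical complement of equal */
int sketch_ne(float a, float b) { return !sketch_eq(a, b); }

/* greater-than-or-equal: less-than-or-equal with operands reversed */
int sketch_ge(float a, float b) { return sketch_le(b, a); }

/* greater-than: less-than with operands reversed */
int sketch_gt(float a, float b) { return sketch_lt(b, a); }
```

Reversing the operands, rather than negating the opposite comparison, keeps
the result correct when an operand is a NaN:  every ordered comparison then
yields 0, while not-equal yields 1.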
 321
 322 The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
 323 functions raise the invalid exception if either input is any kind of NaN.
 324 The equal functions, on the other hand, are defined not to raise the invalid
 325 exception on quiet NaNs.  For completeness, SoftFloat provides the following
 326 additional functions:
 327
 328    float32_eq_signaling    float32_le_quiet    float32_lt_quiet
 329    float64_eq_signaling    float64_le_quiet    float64_lt_quiet
 330    floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
 331    float128_eq_signaling   float128_le_quiet   float128_lt_quiet
 332
 333 The `signaling' equal functions are identical to the standard functions
 334 except that the invalid exception is raised for any NaN input.  Likewise,
 335 the `quiet' comparison functions are identical to their counterparts except
 336 that the invalid exception is not raised for quiet NaNs.
 337
 338 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 339 Signaling NaN Test Functions
 340
 341 The following functions test whether a floating-point value is a signaling
 342 NaN:
 343
 344    float32_is_signaling_nan
 345    float64_is_signaling_nan
 346    floatx80_is_signaling_nan
 347    float128_is_signaling_nan
 348
 349 The functions take one operand and return 1 if the operand is a signaling
 350 NaN and 0 otherwise.
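
For single precision, such a test can be sketched directly on the 32-bit
pattern.  The sketch assumes the common convention in which a NaN with the
most significant fraction bit clear is signaling.  Some processors use the
opposite convention, so this is one possible definition and not necessarily
the one a given SoftFloat build uses.  The function name is invented for
the example.

```c
#include <stdint.h>

/* Return 1 if the single-precision bit pattern `a' is a signaling NaN
   under the assumed convention, and 0 otherwise. */
int sketch_float32_is_signaling_nan(uint32_t a)
{
    /* Bits 30..22 hold the 8 exponent bits and the top fraction bit.
       0x1FE means exponent all ones with the top fraction bit clear.
       The remaining 22 fraction bits must not all be zero, otherwise
       the value is an infinity. */
    return (((a >> 22) & 0x1FF) == 0x1FE) && ((a & 0x003FFFFF) != 0);
}
```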
 351
 352 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 353 Raise-Exception Function
 354
 355 SoftFloat provides a function for raising floating-point exceptions:
 356
 357     float_raise
 358
 359 The function takes a mask indicating the set of exceptions to raise.  No
 360 result is returned.  In addition to setting the specified exception flags,
 361 this function may cause a trap or abort appropriate for the current system.
 362
 363 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 364
 365
 366 -------------------------------------------------------------------------------
 367 Contact Information
 368
 369 At the time of this writing, the most up-to-date information about
  370 SoftFloat and the latest release can be found at the Web page
  371 `http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.
 372
 373