1. Floating points

How to represent fractional numbers in binary?

Generalize the decimal representation
    1.23 = 1*10^0 + 2*10^{-1} + 3*10^{-2}
to binary:
    (1.01)_2 = 1*2^0 + 0*2^{-1} + 1*2^{-2} = 1.25

A finite representation results in limited precision:
    0.1 ≈ (0.00011001)_2 = 0.09765625

Naive representation: fixed-point representation
    A naive way to represent fractional numbers in binary.
    32-bit number, with the binary point fixed in the middle (16 integer bits, 16 fraction bits):

        0 . . . . 1   .   1 0 1 0 1 0
       /  |               |  \
   2^15  2^14          2^{-1}  2^{-2}

Why not fixed point? Inefficient.
    Cannot represent very large numbers: limited magnitude (the highest bit weight is 2^15)
    Limited precision for small numbers: the smallest step is 2^{-16}

11. IEEE floating point standard

    float (single precision, 4-byte)
    double (double precision, 8-byte)

     __________________________________________________
    |s|____exponent (8-bit)____|____frac (23-bit)______|

-- Scenario I. Normalized encoding: exp != 0000 0000 && exp != 1111 1111
    value = (-1)^s * 1.frac * 2^{exp-127}    (127 is referred to as the bias)
    E = exp - 127
    E's range: [-126, 127]

    example: 0 0111 1100 01000000000000000000000
        1.frac = 1 + 0*2^{-1} + 1*2^{-2} + ... = 1.25
        exp = 64 + 32 + 16 + 8 + 4 = 124, so E = 124 - 127 = -3
        overall value = 1.25 * 2^{-3} = 0.15625

    another example: float f = 65.0;
        65 = (0100 0001)_2 = (1.000001000...)_2 * 2^6
        E = 6 ==> exp = 133 = 1000 0101
        encoding: 0 1000 0101 0000 0100 0000 0000 0000 000

-- Scenario II. Denormalized encoding: exp = 0000 0000
    value = (-1)^s * 0.frac * 2^{1-127}
    Note E = 1 - 127 (instead of 0 - 127)
    exp = 0000 0000, frac = 0000...00 represents zero
    Note there is also a -0: same bits with the sign bit flipped

-- Scenario III. Special values encoding: exp = 1111 1111
    case exp = 1111 1111, frac = 000...0: represents infinity, e.g. 1.0/0.0
    case exp = 1111 1111, frac != 000...0: represents NaN, e.g. sqrt(-1), \inf * 0

Distribution of values:

    |____-normalized____|__-denorm__|-0|+0|__+denorm__|____+normalized____|
    -\inf                                                              +\inf

    more precision as a (positive) number approaches 0: representable values are packed more densely near 0
    denorm numbers are equidistant, spaced 2^{-126} * 2^{-23} = 2^{-149} apart

rounding: round either up or down to the nearest representable floating-point value, e.g.
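round to the nearest 1/4 (keep 2 bits to the right of the binary point):

    value    binary     rounded
    3/32     0.00011    0.00
    3/16     0.00110    0.01
    7/8      0.11100    1.00
    5/8      0.10100    0.10

    (7/8 and 5/8 lie exactly halfway between two candidates; the default IEEE mode breaks such ties toward the result whose last kept bit is 0)

-- Double precision
    11-bit exponent field (bias = 1023)
    52-bit fraction field
    normalized value = (-1)^s * 1.frac * 2^{exp-1023}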
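
A minimal C sketch of the three single-precision scenarios, for checking the worked examples by hand. It splits a float into its s / exp / frac fields, says which scenario applies, and rebuilds the value for Scenarios I and II. All names (fp_fields, fp_split, fp_classify, fp_value, ...) are hypothetical and not part of any library. The sketch assumes that `float` is a 32-bit IEEE single-precision value on the target machine, which is common but not guaranteed by C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: every name here is hypothetical. Assumes
 * `float` is a 32-bit IEEE single-precision value (not guaranteed by C). */

typedef struct {
    uint32_t s;    /* sign bit */
    uint32_t exp;  /* 8-bit exponent field (biased) */
    uint32_t frac; /* 23-bit fraction field */
} fp_fields;

typedef enum {
    KIND_NORMALIZED,   /* Scenario I */
    KIND_DENORMALIZED, /* Scenario II (includes +0 and -0) */
    KIND_INFINITY,     /* Scenario III, frac == 0 */
    KIND_NAN           /* Scenario III, frac != 0 */
} fp_kind;

/* Copy the raw bit pattern of a float into an integer. */
uint32_t fp_bits(float f) {
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Inverse of fp_bits: reinterpret a 32-bit pattern as a float. */
float fp_from_bits(uint32_t u) {
    float f;
    assert(sizeof f == sizeof u);
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Split the 32 bits into |s| exp (8-bit) | frac (23-bit) |. */
fp_fields fp_split(float f) {
    uint32_t u = fp_bits(f);
    fp_fields p;
    p.s = u >> 31;
    p.exp = (u >> 23) & 0xFFu;
    p.frac = u & 0x7FFFFFu;
    return p;
}

/* Decide which encoding scenario the fields fall under. */
fp_kind fp_classify(fp_fields p) {
    if (p.exp == 0xFFu)
        return p.frac == 0 ? KIND_INFINITY : KIND_NAN;
    return p.exp == 0 ? KIND_DENORMALIZED : KIND_NORMALIZED;
}

/* 2^e as a double, by repeated doubling or halving (avoids linking libm).
 * Exact for the exponents used here, -149 through 127. */
double fp_pow2(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Rebuild the value from the fields for Scenarios I and II.
 * Precondition: not a special value (exp != 1111 1111). */
double fp_value(fp_fields p) {
    double sign = p.s ? -1.0 : 1.0;
    double frac = (double)p.frac / 8388608.0; /* 0.frac = frac / 2^23 */
    assert(p.exp != 0xFFu);
    if (p.exp == 0)
        return sign * frac * fp_pow2(1 - 127);              /* 0.frac * 2^{1-127} */
    return sign * (1.0 + frac) * fp_pow2((int)p.exp - 127); /* 1.frac * 2^{exp-127} */
}
```

Under that assumption, fp_split(65.0f) gives s = 0, exp = 133 (1000 0101) and frac = 0x020000, matching the second worked example. fp_value is only defined for finite inputs, so a caller would check fp_classify first.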