1. Floating point
 How to represent fractional numbers in binary?

 generalize decimal representation
 1.23 = 1*10^0 + 2*10^-1 + 3*10^-2
 to binary:
 (1.01)_2 = 1*2^0 + 0*2^-1 + 1*2^-2 = 1.25

 finite representation results in limited precision
 0.1 ≈ (0.00011001)_2 = 0.09765625 (8 fraction bits; 0.1 has no finite binary expansion)

 Naive representation: fixed point
 32-bit number with the binary point at a fixed position
 (here: 16 integer bits, 16 fraction bits)
   0101....1 . 101010
  / |          |\
 /  |          | \
2^15|       2^-1 2^{-2}
  2^14


 Why not fixed point?
 Inefficient use of the 32 bits:
 Limited magnitude: cannot represent very large numbers (highest bit weight is 2^15)
 Limited precision for small numbers (lowest bit weight is 2^-16)

2. IEEE floating point standard (IEEE 754)

  float (single precision, 4-byte)
  double (double precision, 8-byte)
  __________________________________________________
 |s|__exponent_(8-bit)____|__frac_(23-bit)__________|

-- Scenario I. normalized encoding: exp!=0 && exp!=1111 1111
 (-1)^s * 1.frac * 2^{exp-127} (127 is referred to as the bias)
 E = exp - 127, so E ranges over [-126, 127]


example: 0 0111 1100 01000000000000000000000

 1.frac = 1 + 0*2^-1 + 1*2^-2 + ... = 1.25
 exp = 64 + 32 + 16 + 8 + 4 = 124, so E = exp - 127 = -3

 overall value = 1.25 * 2^-3 = 0.15625

another example: float f = 65.0;
 65 = (0100 0001)_2
 = (1.000001)_2 * 2^6

 E = 6 ==> exp = E + 127 = 133 = 1000 0101

 s | exp | frac:
 0   1000 0101   0000 0100 0000 0000 0000 000

-- Scenario II. denormalized encoding: exp = 0000 0000
 (-1)^s * 0.frac * 2^{1-127}
 Note E = 1-127 (instead of 0-127)

 exp = 0000 0000, frac = 0000...00 represents zero
 Note there is also a -0: same bits with the sign bit set

-- Scenario III. special values encoding: exp = 1111 1111
 case: exp = 1111 1111, frac = 000...0
 represents infinity (+inf or -inf, depending on s)
 e.g. 1.0/0.0

 case: exp = 1111 1111, frac != 000...0
 represents NaN (not a number)
 e.g. sqrt(-1), inf*0

 Distribution of values:
 |_____-normalized________|__-denorm__|-0|+0|__denorm_____|____normalized________________|
 -\inf                                                                                 \inf

 values are denser (finer absolute precision) as the magnitude approaches 0
 denorm numbers are equally spaced, 2^{-126} * 2^{-23} = 2^{-149} apart

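rounding: round to the nearest representable floating point value;
 on a tie, round to even (the last kept bit becomes 0)
 e.g. round to nearest 1/4 (2 bits after the binary point)
 value   binary    rounded
 3/32    0.00011   0.00
 3/16    0.00110   0.01
 7/8     0.11100   1.00   (tie, rounds to even)
 5/8     0.10100   0.10   (tie, rounds to even)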

-- double precision
 11 bit exponent field (bias = 1023)
 52 bit fraction field
 (-1)^s * 1.frac * 2^{exp-1023} (normalized encoding)