Flat Datacenter Storage
Nightingale, Elson, Fan, Hofmann, Howell, Suzue
OSDI 2012
Adapted from Robert Morris's notes

why are we looking at this paper?
  Lab 2 wants to be like this when it grows up
    though the details are all different
  fantastic performance -- world-record cluster sort
  good systems paper -- details from apps all the way down to the network

what is FDS?
  a cluster storage system
  stores giant blobs -- 128-bit ID, multi-megabyte content
  clients and servers connected by a network with high bisection bandwidth
  for big-data processing (like MapReduce)
    cluster of 1000s of computers processing data in parallel

high-level design -- a common pattern
  lots of clients
  lots of storage servers ("tractservers")
  partition the data
  master ("metadata server") controls partitioning
  replica groups for reliability

why is this high-level design useful?
  1000s of disks of space
    store giant blobs, or many big blobs
  1000s of servers/disks/arms of parallel throughput
  can expand over time -- reconfiguration
  large pool of storage servers for instant replacement after failure

motivating app: MapReduce-style sort
  a mapper reads its split, 1/Mth of the input file (e.g., a tract)
  in MapReduce, the master schedules workers on machines that are close to the data,
    e.g., in the same rack
  FDS does not care about such locality

Q: is the abstract's 2 GByte/sec per client impressive?
   what is it bottlenecked by?

what should we want to know from the paper?
  API?
  layout?
  finding data?
  adding a server?
  replication?
  failure handling? failure model?
  consistent reads/writes? (i.e., does a read see the latest write?)
  config mgr failure handling?
  good performance?
  useful for apps?

* API
  Figure 1
  128-bit blob IDs
  blobs have a length
  only whole-tract reads and writes -- 8 MB

Q: why are 128-bit blob IDs a nice interface?
   why not file names?

Q: why use 8 MB tracts? Why not pick 100 KB tracts? Why not pick 80 MB tracts?

Q: what kinds of client applications is the API aimed at?
   and not aimed at?

* Layout: how do they spread data over the servers? (Section 2.2)
  break each blob into 8 MB tracts
  TLT maintained by the metadata server has n entries
  for blob b and tract t, i = (hash(b) + t) mod n
  TLT[i] contains the list of tractservers w/ a copy of that tract
  clients and servers all have copies of the latest TLT

  Example four-entry TLT with no replication:
    0: S1
    1: S2
    2: S3
    3: S4
  suppose hash(27) = 2
  then the tracts of blob 27 are laid out:
    S1: 2 6
    S2: 3 7
    S3: 0 4 8
    S4: 1 5
    ...
  FDS is "striping" blobs over servers at tract granularity

Q: why have tracts at all? why not store each blob on just one server?
   what kinds of apps will benefit from striping?
   what kinds of apps won't?

Q: how fast will a client be able to read a single tract?

Q: where does the abstract's single-client 2 GB/sec number come from?

Q: why not the UNIX i-node approach?
   store an array per blob, indexed by tract #, yielding a tractserver
   then you could make per-tract placement decisions,
     e.g. write each new tract to the most lightly loaded server

Q: why not hash(b + t)?

Q: how many TLT entries should there be?
   how about n = number of tractservers?
   why do they claim this works badly? (Section 2.2)

The system needs to choose server pairs (or triplets, &c) to put in TLT entries,
  for replication (Section 3.3)

Q: how about
     0: S1 S2
     1: S2 S1
     2: S3 S4
     3: S4 S3
     ...
   Why is this a bad idea?
   How long will repair take?
   What are the risks if two servers fail?

Q: why is the paper's n^2 scheme better? (see the sketch below)
   TLT with n^2 entries, with every server pair occurring once:
     0: S1 S2
     1: S1 S3
     2: S1 S4
     3: S2 S1
     4: S2 S3
     5: S2 S4
     ...
   How long will repair take?
   What are the risks if two servers fail?
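To make the layout rule concrete, here is a minimal Go sketch of the two ideas above.
It is illustrative only: the names buildPairTLT and entryFor are made up, and an FNV
hash stands in for whatever hash FDS actually uses. It builds a TLT with one entry per
ordered pair of distinct servers (n*(n-1) entries, matching the example table), and
looks up tract t of blob b with i = (hash(b) + t) mod n. The point of the pair
construction is that when one server dies, every other server holds some of its
replicas, so all of them can help rebuild in parallel.

  package main

  import (
          "fmt"
          "hash/fnv"
  )

  // TLT[i] is the list of tractservers holding replicas of the tracts
  // that map to entry i.
  type TLT [][]string

  // buildPairTLT builds an n^2-style table: one entry for every ordered
  // pair of distinct servers. (Hypothetical helper, not the paper's API.)
  func buildPairTLT(servers []string) TLT {
          var tlt TLT
          for _, a := range servers {
                  for _, b := range servers {
                          if a != b {
                                  tlt = append(tlt, []string{a, b})
                          }
                  }
          }
          return tlt
  }

  // entryFor implements the lookup rule: i = (hash(blob) + tract) mod n.
  func entryFor(tlt TLT, blob string, tract uint64) int {
          h := fnv.New64a() // stand-in hash; the paper doesn't specify one
          h.Write([]byte(blob))
          return int((h.Sum64() + tract) % uint64(len(tlt)))
  }

  func main() {
          tlt := buildPairTLT([]string{"S1", "S2", "S3", "S4"})
          for t := uint64(0); t < 4; t++ {
                  i := entryFor(tlt, "blob-27", t)
                  fmt.Printf("tract %d -> entry %d -> servers %v\n", t, i, tlt[i])
          }
  }

Note how consecutive tracts of one blob land in consecutive TLT entries, which is what
spreads a large sequential read or write across many tractservers at once.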
Q: why do they actually use a minimum replication level of 3?
   same n^2 table as before, but a third server is randomly chosen
   What effect on repair time?
   What effect on two servers failing?
   What if three disks fail?

* Adding a tractserver
  To increase the amount of disk space / parallel throughput
  Metadata server picks some random TLT entries
  Substitutes the new server for an existing server in those TLT entries

* How do they maintain the n^2-plus-one arrangement as servers join and leave?
  Unclear.

Q: how long will adding a tractserver take?

Q: what about client writes while tracts are being transferred?
   the receiving tractserver may have copies from client(s) and from the old server
   how does it know which is newest?

Q: what if a client reads/writes but has an old TLT?

* Replication
  A writing client sends a copy to each tractserver in the TLT entry.
  A reading client asks one of those tractservers.

Q: why don't they send writes through a primary?

Q: what problems are they likely to have because of the lack of a primary?
   why weren't these problems show-stoppers?

Q: if a tractserver's network breaks and is then repaired, might the server serve old data?

* What happens after a tractserver fails?
  Metadata server stops getting heartbeat RPCs
  Picks a random replacement for each TLT entry the failed server was in
  New TLT gets a new version number
  Replacement servers fetch copies
  Example of the tracts each server holds:
    S1: 0 4 8 ...
    S2: 0 1 ...
    S3: 4 3 ...
    S4: 8 2 ...

Q: why not just pick one replacement server?

Q: how long will it take to copy all the tracts?

* What happens when the metadata server crashes?

Q: while the metadata server is down, can the system proceed?

Q: is there a backup metadata server?

Q: how does a rebooted metadata server get a copy of the TLT?

Q: does their scheme seem correct?
   how does the metadata server know it has heard from all tractservers?
   how does it know all tractservers were up to date?

* Performance

Q: how do we know we're seeing "good" performance?
   what's the best you can expect?

Q: what is the limiting resource for the 2 GB/second single-client number?
   (see the back-of-envelope sketch at the end)

Q: Figure 4a: why does it start low? why does it go up? why does it level off?
   why does it level off at that particular performance?

Q: Figure 4b shows random reads/writes as fast as sequential (Figure 4a).
   is this what you'd expect?

Q: why are writes slower than reads with replication in Figure 4c?
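The single-client-bandwidth and repair-time questions above can be approached with
back-of-the-envelope arithmetic. The Go sketch below only shows the shape of the
calculation; the NIC, disk, and cluster numbers are assumptions for illustration, not
figures taken from the paper, though clients with multiple 10 Gbit/s NICs are roughly
what makes a ~2 GB/s, NIC-limited client plausible.

  package main

  import "fmt"

  func main() {
          // Single-client ceiling: data is striped over many tractservers, so
          // aggregate disk bandwidth far exceeds what one client can absorb;
          // the client's own NICs become the bottleneck.
          nicGbits := 2 * 10.0          // assumed: two 10 Gbit/s NICs per client
          ceilingGBps := nicGbits / 8.0 // ~2.5 GB/s raw, so ~2 GB/s after overhead
          fmt.Printf("single-client ceiling ~ %.1f GB/s\n", ceilingGBps)

          // Repair after one tractserver fails, with the n^2-style TLT: every
          // surviving server holds a small share of the lost replicas, so all
          // of them can re-replicate in parallel.
          failedGB := 1000.0 // data on the failed server (assumed)
          servers := 1000.0  // surviving tractservers (assumed)
          diskMBps := 100.0  // usable per-server disk bandwidth (assumed)
          parallelSecs := (failedGB / servers) * 1000.0 / diskMBps
          singleSecs := failedGB * 1000.0 / diskMBps
          fmt.Printf("parallel repair ~ %.0f s; single replacement server ~ %.0f s\n",
                  parallelSecs, singleSecs)
  }

With these (assumed) numbers, parallel repair finishes in seconds while a single
replacement server would take hours, which is the argument for spreading each server's
replica partners across the whole cluster.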