A case study of distributed storage systems: The Google File System (SOSP 2003) Notes partially borrowed from Robert Morris, 6.824. Motivation for GFS (What problem it tries to solve) Google needs to store search indexes all the HTML files on the web all the images on the web Map/reduce uses GFS for input/output ... GFS Goal: - create a shared file system - high aggregate performance (huge storage, huge throughput) - fault tolerant * w/ 1000 machines, ~3 will fail per day. What's GFS' approach: - Supports a file system like API (but not POSIX) - A single master in charge of directory structure, mapping from a file to its chunks etc. - 1000s chunk servers storing 64MB chunks. - 3X replication for chunks How GFS works at a high level? [draw picture] client read: send file name and offset to master master replies with chunk_handle + location (i.e. set of servers storing that chunk) clients cache that information for a little while read from nearest chunk server client write: ask master where to store, master returns chunk_handle + location. send writes to the 3 chunk servers. need to ask master again whenever crossing 64 MB How does this overall approach achieve GFS' goal? - high performance: single master design is sufficient because of GFS' workloads: Files are huge: Multi-terabyte data sets Many of the files are large: 100MB~several GB Mostly large sequential reads/writes - Workload GFS is not good for? - tons of small file operations (git, compile large source tree) What are devils of the details? (technical challenges) - Consistency (i.e. semantics) - Fault tolerance What is consistency? A correctness condition (defines the set of legal outcomes of a system) There's many different models <--------------------------> weak strong The strongest model: - behave identically to an idealized system in which there is no failure and no replication. (read finished writes, reads are consistent with each other) - we'll formalize this in the next lecture. General models: - strong consistency is easy to use for application writers - weak consistency is simpler to implement, better in performance "Ideal" consistency model A replicated files behaves like as a non-replicated file system picture: many clients on the same machine accessing files on a single disk If one application writes, later reads will observe that write What if two application concurrently write to the same file In file systems often undefined --- file may have some mixed content What if two application concurrently write to the same directory One goes first, the other goes second Sources of inconsistency Concurrency Machine failures Network partitions The concurrency challenge: --------------------- draw picture of clients, 3 chunk servers Naive approach: clients write to chunk servers in arbitary order Say client-A write 1, client-B write 2 concurrently. What are "expected" results? all servers store 1 or all servers store 2. Outcome of naive approach? Can this happen in GFS? GFS' solution to achieving consistency under concurrency: Primary/backup One of the replicas is the primary of the chunk (chosen by master) Primary determines orders operations Clients pushes chunk data to replicas Client sends write request to primary Primary assigns sequence number Primary applies change locally Primary forwards request to replicates Primary responds to client after receiving acks from all replicas Wait, concurrent writes to the same file? - Random writes by multiple writers never occur in real workloads. (They are quite meaningless for applications, even under the ideal model) 0000111222222000 - The practical concurrent write situation: Concurrent append e.g. multiple web server (crawler) appending to the same log file Not supported by regular file system API which only supports write at a particular offset. (hence, two conrrent writes will over-write each other's data) GFS provides atomic record append. Another concurrency example: Say client-A write 1, client-B write 2 while client-C also reads concurrenty from two different replica servers What are "expected" results? client-C reads "1" then "2". "2" is stored at the end. client-C reads "2" then "1". "1" is stored at the end. can this happen in GFS? client-C reads "1" then "2". "1" is stored at the end. Yes. So GFS does not offer strong consistency. Fixes? All reads go to primary as well. The Failure Challenge: -------------------- Append under failure: client sends r1 to three chunk servers: primary, backup-A, backup-B client makes append request at primary primary appends r1 locally at offset x primary notifies backup-A, backup-B to append at offset x. backup-A replies success. backup-B fails to reply. Client retries. sends r1... primary appends r1 at offset x+100 succeeds... client reads offset x at backup-A, different from read at backup-B. The Network Partition challenge: (and the uncertainty of failure) ---------------------- Example: Primary is temporarily partitioned from the rest of the network. Master promotes backup-A to be new primary. Network partition heals client-A sends write to old primary client-B sends write to new primary GFS's solution? Master grants leases to primary: A lease has a timeout of 60 seconds. Primary is only valid when its lease is valid. Other challenges ---------------------------------- Concurrency at Master: - Much simpler because there's only one master - Master performs changes to metadata atomically (proper locking) Master fault tolerance Single master stores metadata in memory: 1. file/chunk namespace 2. mapping from files to chunks 3. chunk location What if the master crashes? - Log changes to metadata type 1. and 2. in a log log is replicated on several backups logs play a central role in many systems we will read about Chunk location is soft-state: recreated by asking chunk servers Some external monitoring service detects failure - if crashed master cannot be restarted, a new master is started. - Recovered or new master replays logged changes. How to avoid the two master situaton? Overall consistency guranteeds of GFS - Strong consistency for directory operations - Weak consistency for chunk operations A failed mutation leaves chunks inconsistent The primary chunk server updated chunk But then failed and the replicas are out of date A client may read an not-up-to-date chunk Authors claims weak consistency is not a big problems for apps Most file updates are append-only updates Application can use UID in append records to detect duplicates Application can use temporary files and atomic rename Performance (Figure 3) What's the likely bottleneck of the system? Explain the theorical limit for reads, writes writes to different files lower than possible maximum concurrent appends to single file limited by the server that stores last chunk Summary GFS: achieves scalable, high performance - simple design single master + chunk servers works because of workloads are dominated by large read/write file-system-like API: -simplicity (does not support all FS operations) -customization for its workload (atomic append) - Weak consistency - leads to good performance + simple implementation.