GFS FAQ

Q: Did having a single master turn out to be a good idea?

A: It simplified the initial design and deployment, but it was not so great in the long run. This article (GFS: Evolution on Fast Forward, https://queue.acm.org/detail.cfm?id=1594206) says that as the years went by and GFS use grew, a few things went wrong. The number of files grew enough that it wasn't reasonable to store all files' metadata in the RAM of a single master. The number of clients grew enough that a single master didn't have enough CPU power to serve them. The fact that switching from a failed master to one of its secondaries required human intervention made recovery slow. Apparently Google's replacement for GFS, Colossus, splits the master over multiple servers, and has more automated master failure recovery.

Q: Why is atomic record append at-least-once, rather than exactly once?

A: Section 3.1, Step 7, says that if a write fails at one of the secondaries, the client re-tries the write. That will cause the data to be appended more than once at the non-failed replicas. A different design could detect duplicate client requests despite arbitrary failures (e.g. a primary failure between the original request and the client's retry). You'll implement such a design in Lab 3, at considerable expense in complexity and performance.

Q: How does an application know what sections of a chunk consist of padding and duplicate records?

A: To detect padding, applications can put a predictable magic number at the start of a valid record, or include a checksum that will likely only be valid if the record is valid. Applications can detect duplicates by including a unique ID in each record: if a reader sees a record with the same ID as an earlier record, it knows the two are duplicates. GFS provides a library for applications that handles these cases. This aspect of the GFS design effectively moves complexity from GFS to applications, which is perhaps not ideal.

Q: How can clients find their data given that atomic record append writes it at an unpredictable offset in the file?

A: Append (and GFS in general) is mostly intended for applications that read entire files sequentially. Such applications scan the file looking for valid records (see the previous question), so they don't need to know the record locations in advance. For example, the file might contain the set of URLs encountered by a set of concurrent web crawlers. The file offset of any given URL doesn't matter much; readers just want to be able to read the entire set of URLs.

Q: What's a checksum?

A: A checksum algorithm takes a sequence of bytes as input and returns a single number that's a function of that sequence. For example, a simple checksum might be the sum of all the bytes in the input. GFS stores the checksum of each 64-kilobyte "block" in each chunk. When a chunkserver writes a block of data to its disk, it first computes the checksum of the block and saves the checksum on disk as well. When a chunkserver reads a block from its disk, it also reads the previously-saved checksum, re-computes a checksum from the data read from disk, and checks that the two match. If the data was corrupted on disk, the checksums won't match, and the chunkserver will know to return an error. Separately, some GFS applications store their own checksums, over application-defined records, inside GFS files, to distinguish correct records from padding. CRC32 is an example of a checksum algorithm.
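To make the padding/duplicate-detection and checksum answers above more concrete, here is a small sketch in Go of what an application-level record library might look like. It is only an illustration, not the actual library GFS provides: the record layout, the magic value, and the function names are invented. A writer frames each record with a magic number, a unique ID, and a CRC32 of the payload; a reader scans a chunk, treats anything that doesn't check out as padding, and drops records whose ID it has already seen (duplicates left behind by client retries).

    // Sketch of application-level record framing on top of GFS-style append.
    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "hash/crc32"
    )

    const magic = 0x5D3EC0DE // hypothetical marker for "start of a valid record"

    // marshalRecord frames one record: magic | id | payload length | payload CRC | payload.
    func marshalRecord(id uint64, payload []byte) []byte {
        buf := new(bytes.Buffer)
        binary.Write(buf, binary.LittleEndian, uint32(magic))
        binary.Write(buf, binary.LittleEndian, id)
        binary.Write(buf, binary.LittleEndian, uint32(len(payload)))
        binary.Write(buf, binary.LittleEndian, crc32.ChecksumIEEE(payload))
        buf.Write(payload)
        return buf.Bytes()
    }

    // scanRecords walks chunk data, skipping padding (bytes that don't start a
    // valid record) and records whose unique ID has already been seen.
    func scanRecords(data []byte, seen map[uint64]bool) [][]byte {
        var out [][]byte
        for i := 0; i+20 <= len(data); {
            if binary.LittleEndian.Uint32(data[i:]) != magic {
                i++ // not a record boundary: padding or garbage, keep scanning
                continue
            }
            id := binary.LittleEndian.Uint64(data[i+4:])
            n := int(binary.LittleEndian.Uint32(data[i+12:]))
            sum := binary.LittleEndian.Uint32(data[i+16:])
            if i+20+n > len(data) || crc32.ChecksumIEEE(data[i+20:i+20+n]) != sum {
                i++ // looked like a record but fails the checksum: treat as padding
                continue
            }
            if !seen[id] { // drop duplicates left behind by client retries
                seen[id] = true
                out = append(out, data[i+20:i+20+n])
            }
            i += 20 + n
        }
        return out
    }

    func main() {
        var chunk []byte
        chunk = append(chunk, marshalRecord(1, []byte("http://a"))...)
        chunk = append(chunk, make([]byte, 16)...)                     // padding GFS might leave behind
        chunk = append(chunk, marshalRecord(1, []byte("http://a"))...) // duplicate from a client retry
        chunk = append(chunk, marshalRecord(2, []byte("http://b"))...)
        for _, r := range scanRecords(chunk, map[uint64]bool{}) {
            fmt.Println(string(r)) // prints http://a and http://b, each once
        }
    }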
Q: The paper mentions reference counts -- what are they?

A: They are part of the implementation of copy-on-write for snapshots. When GFS creates a snapshot, it doesn't copy the chunks; instead it increments the reference count of each chunk. This makes creating a snapshot inexpensive. If a client later writes a chunk and the master notices that the reference count is greater than one, the master first makes a copy so that the client can update the copy (instead of the chunk that is part of the snapshot). You can view this as delaying the copy until it is absolutely necessary. The hope is that not all chunks will be modified, so that some copies never have to be made.

Q: If an application uses the standard POSIX file APIs, would it need to be modified in order to use GFS?

A: Yes, but GFS isn't intended for existing applications. It is designed for newly-written applications, such as MapReduce programs.

Q: How does GFS determine the location of the nearest replica?

A: The paper hints that GFS does this based on the IP addresses of the servers storing the available replicas. In 2003, Google must have assigned IP addresses in such a way that if two IP addresses are close to each other in IP address space, then they are also close to each other in machine-room network topology (perhaps plugged into the same Ethernet switch, or into Ethernet switches that are themselves directly connected).

Q: What's a lease?

A: For GFS, a lease is a period of time for which the master grants one chunkserver the right to act as the primary for a particular chunk. The master guarantees not to assign a different primary for the duration of the lease, and the primary agrees to stop acting as primary before the lease expires (unless it first asks the master to extend the lease). Leases save the primary from having to repeatedly ask the master whether it is still primary -- it knows it can act as primary for the next minute (or whatever the lease interval is) without talking to the master again.

Q: Suppose S1 is the primary for a chunk, and the network between the master and S1 fails. The master will notice and designate some other server as primary, say S2. Since S1 didn't actually fail, are there now two primaries for the same chunk?

A: That would be a disaster, since the two primaries might apply different updates to the same chunk. Luckily GFS's lease mechanism prevents this scenario. The master granted S1 a 60-second lease to be primary. S1 knows to stop acting as primary before its lease expires, and the master won't grant a lease to S2 until after S1's lease has expired. So S2 won't start acting as primary until after S1 has stopped.

Q: 64 megabytes sounds awkwardly large for the chunk size!

A: The 64 MB chunk size is the unit of book-keeping in the master, and the granularity at which files are sharded over chunkservers. Clients can issue smaller reads and writes -- they are not forced to deal in whole 64 MB chunks. The point of such a big chunk size is to reduce the size of the metadata tables in the master, and to avoid limiting clients that want to do huge transfers in order to reduce overhead. On the other hand, files smaller than 64 MB do not get much parallelism.
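For the two lease questions above, here is a minimal sketch in Go of the timing rule that keeps S1 and S2 from overlapping: the old primary stops before its lease expires, and the master waits until that lease has certainly expired before granting the chunk to a new primary. The type names, the clock-slack parameter, and the helper functions are invented for illustration; they are not GFS's actual interfaces.

    // Sketch of the lease timing rule that prevents two simultaneous primaries.
    package main

    import (
        "fmt"
        "time"
    )

    const leaseDuration = 60 * time.Second // the paper's 60-second lease

    type primaryLease struct {
        grantedAt time.Time
    }

    // stillPrimary is what a chunkserver checks before serving a mutation: it
    // must stop acting as primary before the lease runs out, with a little
    // slack to allow for clock error.
    func (l primaryLease) stillPrimary(now time.Time, clockSlack time.Duration) bool {
        return now.Before(l.grantedAt.Add(leaseDuration - clockSlack))
    }

    // masterMayRegrant is what the master checks before designating a new
    // primary (e.g. S2) for the same chunk: it must wait until S1's lease has
    // certainly expired, even if S1 is unreachable.
    func masterMayRegrant(old primaryLease, now time.Time) bool {
        return now.After(old.grantedAt.Add(leaseDuration))
    }

    func main() {
        t0 := time.Now()
        s1 := primaryLease{grantedAt: t0}
        fmt.Println(s1.stillPrimary(t0.Add(30*time.Second), time.Second)) // true: mid-lease
        fmt.Println(s1.stillPrimary(t0.Add(61*time.Second), time.Second)) // false: S1 must have stopped
        fmt.Println(masterMayRegrant(s1, t0.Add(61*time.Second)))         // true: safe to pick S2
    }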
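For the chunk-size question above, here is a back-of-the-envelope sketch in Go of why big chunks keep the master's tables small. The figure of roughly 64 bytes of master metadata per chunk comes from the paper; the 1 PB of file data and the 1 MB comparison point are assumptions chosen only for illustration.

    // Rough arithmetic: master metadata needed per chunk size.
    package main

    import "fmt"

    func main() {
        const (
            totalData     = 1 << 50 // 1 PB of file data (assumed example workload)
            bytesPerChunk = 64      // approximate master metadata per chunk, per the paper
        )
        for _, chunkSize := range []int64{1 << 20, 1 << 26} { // 1 MB vs 64 MB chunks
            chunks := totalData / chunkSize
            fmt.Printf("chunk size %4d MB -> %12d chunks -> ~%5.1f GB of master metadata\n",
                chunkSize>>20, chunks, float64(chunks*bytesPerChunk)/(1<<30))
        }
        // With 64 MB chunks the metadata is about 1 GB; with 1 MB chunks it
        // would be about 64 GB -- awkward to keep in one machine's RAM in 2003.
    }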
Q: Does Google still use GFS?

A: Rumor has it that GFS has been replaced by something called Colossus, with the same overall goals but with better master performance and fault-tolerance. In addition, many applications within Google have switched to more database-like storage systems such as BigTable and Spanner. However, much of the GFS design lives on in HDFS, the storage system used by the open-source Hadoop MapReduce. See https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system

Q: How acceptable is it that GFS trades correctness for performance and simplicity?

A: This is a recurring theme in distributed systems. Strong consistency usually requires protocols that are complex and that require communication and waiting for replies (as we will see in the next few lectures). By exploiting the ways in which specific classes of applications can tolerate relaxed consistency, one can design systems that have good performance and sufficient consistency. For example, GFS optimizes for MapReduce applications, which need high read performance for large files and can tolerate holes in files, records showing up several times, and inconsistent reads. On the other hand, GFS would not be a good choice for storing account balances at a bank.

Q: What if the master fails?

A: There are replica masters with a full copy of the master state; the paper's design requires some outside entity (a human?) to decide to switch to one of the replicas after a master failure (Section 5.1.3). We will see later how to build replicated services that switch to a backup automatically when the main server fails, and you'll build such a thing in Lab 2.

Q: Why 3 replicas?

A: Perhaps this was the line of reasoning: two replicas are not enough because, after one fails, there may not be enough time to re-replicate before the remaining replica also fails; three makes that scenario much less likely. With thousands of disks, low-probability events like multiple replicas failing in short order occur uncomfortably often. Here is a study of disk reliability from that era: https://research.google.com/archive/disk_failures.pdf. You need to factor in the time it takes to make new copies of all the chunks that were stored on a failed disk, and perhaps also the frequency of power, server, network, and software failures. The cost of disks (and the associated power, air conditioning, and rent), and the value of the data being protected, are also relevant.

Q: What is internal fragmentation? Why does lazy allocation help?

A: Internal fragmentation is the space wasted when a system uses an allocation unit larger than needed for the requested allocation. If GFS allocated disk space in 64 MB units, then a one-byte file would waste almost 64 MB of disk. GFS avoids this problem by allocating disk space lazily. Every chunk is stored as a Linux file, and Linux file systems use block sizes of a few kilobytes; so when an application creates a one-byte GFS file, the file's chunk consumes only one Linux disk block, not 64 MB.
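For the internal-fragmentation question above, here is a tiny Go sketch of the arithmetic. The 4 KB Linux block size is an assumption for illustration (actual file-system block sizes vary); the point is only that lazy allocation charges a file for the blocks it actually uses rather than for whole 64 MB chunks.

    // Eager whole-chunk allocation vs. lazy block-at-a-time allocation.
    package main

    import "fmt"

    const (
        chunkSize = 64 << 20 // 64 MB GFS chunk
        blockSize = 4 << 10  // assumed Linux file-system block size
    )

    // diskUsed returns how much disk a file of the given size would consume
    // under eager whole-chunk allocation versus lazy block-level allocation.
    func diskUsed(fileSize int64) (eager, lazy int64) {
        chunks := (fileSize + chunkSize - 1) / chunkSize
        blocks := (fileSize + blockSize - 1) / blockSize
        return chunks * chunkSize, blocks * blockSize
    }

    func main() {
        for _, size := range []int64{1, 100 << 10, 200 << 20} { // 1 B, 100 KB, 200 MB files
            eager, lazy := diskUsed(size)
            fmt.Printf("file %9d B: eager %9d B, lazy %9d B\n", size, eager, lazy)
        }
        // A one-byte file costs ~64 MB if allocated eagerly, but only one
        // 4 KB block when the chunk is just an ordinary sparse Linux file.
    }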
Q: What benefit does GFS obtain from the weakness of its consistency?

A: It's easier to think about this the other way around: what additional work would GFS have to do to achieve stronger consistency?

* The primary should not let secondaries apply a write unless all the secondaries will be able to do it. This likely requires two rounds of communication -- one to ask all secondaries whether they are alive and can promise to do the write if asked, and (if all answer yes) a second round telling the secondaries to commit the write.

* If the primary dies, some secondaries may have missed the last few update messages the primary sent, leaving the remaining secondaries with slightly differing copies of the data. Before resuming operation, a new primary would have to ensure that all the secondaries have identical copies.

* Since clients re-send requests when they suspect something has gone wrong, primaries would need to filter out operations that have already been executed.

* Clients cache chunk locations, and may send reads to a chunkserver that holds a stale version of a chunk. GFS would need a way to guarantee that such reads cannot succeed.
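To make that answer more concrete, here is a sketch in Go of two of those extra pieces: a two-round (prepare/commit) write path in the primary, and filtering of duplicate client requests by request ID. This is not GFS's actual design or the Lab 3 protocol; the Secondary interface, the request-ID scheme, and all names are invented for illustration, and a real system would also need persistence, timeouts, and recovery after a primary failure.

    // Sketch of the extra machinery stronger consistency would require.
    package main

    import (
        "errors"
        "fmt"
    )

    type Write struct {
        ReqID string // unique per client request, so retries can be detected
        Data  string
    }

    type Secondary interface {
        Prepare(w Write) error     // promise to apply w if later told to commit
        Commit(reqID string) error // actually apply a previously prepared write
    }

    type Primary struct {
        secondaries []Secondary
        applied     map[string]bool // request IDs already executed (dedup for retries)
        log         []string
    }

    func (p *Primary) Append(w Write) error {
        if p.applied[w.ReqID] {
            return nil // duplicate retry: already executed, do nothing
        }
        // Round 1: every secondary must promise it can apply the write.
        for _, s := range p.secondaries {
            if err := s.Prepare(w); err != nil {
                return errors.New("abort: a secondary cannot apply the write")
            }
        }
        // Round 2: all promised, so tell everyone to commit.
        // (Commit errors are ignored here only to keep the sketch short.)
        for _, s := range p.secondaries {
            s.Commit(w.ReqID)
        }
        p.log = append(p.log, w.Data)
        p.applied[w.ReqID] = true
        return nil
    }

    // inMemSecondary is a toy secondary that keeps prepared writes in memory.
    type inMemSecondary struct {
        prepared map[string]Write
        log      []string
    }

    func (s *inMemSecondary) Prepare(w Write) error {
        s.prepared[w.ReqID] = w
        return nil
    }

    func (s *inMemSecondary) Commit(reqID string) error {
        w, ok := s.prepared[reqID]
        if !ok {
            return errors.New("commit for unknown request")
        }
        s.log = append(s.log, w.Data)
        return nil
    }

    func main() {
        secs := []*inMemSecondary{
            {prepared: map[string]Write{}},
            {prepared: map[string]Write{}},
        }
        p := &Primary{applied: map[string]bool{}}
        for _, s := range secs {
            p.secondaries = append(p.secondaries, s)
        }
        p.Append(Write{ReqID: "c1-1", Data: "hello"})
        p.Append(Write{ReqID: "c1-1", Data: "hello"}) // client retry: filtered out
        fmt.Println(p.log, secs[0].log, secs[1].log)  // each log has exactly one "hello"
    }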