"Wide-area cooperative with CFS" --- Dabek, Kaashoek, Karger, Morris, Stoica Let's use the efficient location primitives (Chord, CAN, PAST) to build a file system to distribute a large, popular software distribution. Assume this API for the lookup primitives: IP_adddress lookup (key k); where key is a large integer What are we trying to accomplish? -------------------------------- - load balance Q: Why distribute? Why not just buy a single machine? A: Single machine must have the resources (bw/cpu) to serve peak load. As a result, average utilization will be low. Most investment wasted. Q: What can we do to improve the utilization of this machines resources? A: Serve multiple content sources w/ independent load (in time) + each source creator pays for the resources to serve fractional load + problem.... content creators must have administrative connections We want: many machines, each serving some fraction of the _average_ load. also: what about balancing space? between hosts w/ different capacities? - performance Same as TCP to (median server? mean?) - reliability - Tolerate node failures (at what rate?) - lookup failures (handled by primitive) - data loss --> arrange for 'good enough' reliability at reasonable cost - what is enough: at least as good as single server? better? + power outage at LCS gives server: 99 percent uptime + 'five nines (99.999%)' -> five minutes down per year - anonymity/authentication? - anonymity ignored by CFS - conflicts with performance / reliability goals of primitives - indexing/keyword search? - different class of application ==> Load balance is main force in designing this system How will we achieve these goals? ------------------------------- - Need a different API - CFS provides DHASH layer: void insert (key k, void *data); void *data = lookup (key k); - really a layer? Example of why it isn't a layer: successor placement, lookup really performed in DHASH --> as we design CFS we are really building DHASH: CFS is really all about DHASH. Chord's main job is maintaining finger tables. - basic DHASH operation * to insert data d under key k 1. get IP = lookup (k); 2. contact IP, send d,k * to fetch 1. get IP = lookup (k); 2. contact IP, send k, get d - we will use this interface and extend it to meet our design goals - load balance: - striping + employed by CFS/Mojonation. Large files distributed across large number of nodes + imposes cost of one lookup per block (i.e. many lookups per file) + insert is: for (all blocks) insert (NAME (block[i]), block[i]); + how to track stripes? akin to forming files from disk sectors --- SFSRO file system layout: merkle tree + can add block caching for small files - how to modify DHASH to do this? --> along path, DHASH must do lookups/be informed of path to cache files - recursive v. iterative operation? - (whole file) caching + employed by PAST/Freenet/Napster + distributes files around the network as they are requested + only one lookup per file + insert is: insert (NAME (file), file); ---> now we just need to get DHASH right (add reliability) and we have a file system --> which is better (whole file/block oriented)? - striping imposes more lookups (requires prefetch) uses less disk space to achieve same level of load balance - whole file caching reduces lookup overhead - dealing w/ heterogeneous node capacities: - virtual servers (CFS) - redirection (PASTRY) - reliability - need replication if we will tolerate failures - where to place replicas? 
- reliability
  - need replication if we are to tolerate failures
  - where to place replicas?
    - need: replica failures must be independent
    - successors (Chord-specific)
    - rehashing: key of replica 2 = SHA(SHA(key))
  - how to modify DHASH to provide reliability?
    - needs to place replicas
    - needs to recognize failure and know how to recover
    --> must be involved in the lookup (DHASH knows where the replicas
        are located)
    --> Chord only provides routing-table information to DHASH, which
        does the lookups
  - how does striping affect reliability? (numeric check in the sketch
    at the end of these notes)
    - all blocks of a file must be found to reconstruct the file
      [assume no lookup failures]

      p(finding a block) = 1 - p(all its servers are down)
                         = 1 - l^r
        where l is the prob. that a server is down and r is the
        replication degree

      p(finding a file) = p(finding a block)^B   where B is the # of blocks
                        = (1 - l^r)^B
                       ~= 1 - B*(l^r)

    ==> easy to drive this to one by increasing r

- performance
  - how does performance depend on the load-balance scheme?
    - whole file: speed will be equal to that of TCP between client
      and server
    - striping: problem: lookup latency leads to poor performance;
      w/ prefetch, lookup latency is hidden by the block fetches
  - interaction w/ the network protocol?
  - what about server selection? choose 'close' servers to fetch blocks
    from (see replication above)
    - explain Chord server selection?

- what about authentication/naming?
  - possible primitives for authentication:
    - public key (PK)
    - MAC
    - content hash
  - CFS uses SFSRO-style authentication: a Merkle tree
    [picture of SFSRO auth tree]
  - nit: how to handle ".." (show the circular dependency)

- anonymity?
  - open question
  - is the system useful w/o anonymity?
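A quick numeric check of the p(finding a file) formula above. This is a
sketch with invented example numbers: l = 0.05 (a 5% chance any given
server is down) and B = 1000 blocks (~8 MB of 8 KB blocks).

    /* Evaluate p(file) = (1 - l^r)^B and the 1 - B*l^r approximation
     * for a range of replication degrees r.  Compile: cc rel.c -lm */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double l = 0.05;    /* prob. a given server is down (assumed) */
        int    B = 1000;    /* blocks per file (assumed) */

        for (int r = 1; r <= 5; r++) {
            double p_block = 1.0 - pow(l, r);        /* 1 - l^r     */
            double exact   = pow(p_block, B);        /* (1 - l^r)^B */
            double approx  = 1.0 - B * pow(l, r);    /* ok if B*l^r << 1 */
            printf("r=%d  p(file)=%.6f  1-B*l^r=%.6f\n", r, exact, approx);
        }
        return 0;
    }

With these numbers, r = 1 or 2 loses most files, r = 4 finds the file
with probability ~0.994, and r = 5 with ~0.9997: a small constant
replication degree drives p(file) toward one, as claimed. (The linear
approximation is only meaningful once B*l^r is well below 1.)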