6.824 2013 Lecture 1: Introduction and lab overview

6.824: Distributed Systems Engineering

What is a distributed system?
  multiple networked cooperating computers
  Examples: Internet E-Mail, Athena file server, Google MapReduce

Why distribute?
  to connect physically separate entities
  to achieve security via physical isolation
  to tolerate faults via replication at separate sites
  to increase performance via parallel CPUs/mem/disk/net
But: complex, hard to debug
  new classes of problems, e.g. partial failure (did he accept my e-mail?)
  Lamport: A distributed system is one in which the failure of a computer
    you didn't even know existed can render your own computer unusable.
advice: don't distribute if a central system will work

Why take this course?
  interesting -- hard problems, non-obvious solutions
  active research area -- lots of progress + big unsolved problems
  used by real systems -- unlike 10 years ago
    driven by the rise of big Web sites
  hands-on -- you'll build a real system in the labs

COURSE STRUCTURE

http://pdos.csail.mit.edu/6.824

Course components:
  Lectures about big ideas, papers, labs
  Readings: research papers as case studies
    please read papers before class
    otherwise boring, and you can't pick it up by listening
    each paper has a question (see web site)
    submit answer before class, one or two paragraphs
  Mid-term quiz in class, and final exam
  Labs: build increasingly sophisticated fault-tolerant services
    First lab is due on Monday
  Project: design and build a distributed system of your choice
    in the last month of the course
    teams of two or three
    project meetings with course staff
    demo in last class meeting
  Neha Narula is TA, office hours on Web.

MAIN TOPICS

Example: a shared file system, so users can cooperate, like AFS
  lots of client computers
  [diagram: clients, network, vague set of servers]

Topic: architecture
  Choice of interfaces
  Monolithic file server?
  Block server(s) -> FS logic in clients?
  Separate naming + file servers?
  Separate FS + block servers?
  Single machine room or unified wide area system?
    Wide-area dramatically more difficult.
  Client/server or peer-to-peer?
  Interact w/ performance, security, fault behavior.

Topic: implementation
  How do clients/servers communicate?
  Direct network communication is pretty painful
  Want to hide network stuff from application logic
  Most systems organize distribution with some structuring framework(s)
    RPC, RMI, DSM, MapReduce, &c

Topic: performance
  Distribution can hurt: network b/w and latency bottlenecks
  Lots of tricks, e.g. caching, threaded servers
  Distribution can help: parallelism, pick server near client
  Idea: scalable design
    Nx servers -> Nx total performance
  Need a way to divide the load by N == divide the state by N
    Split by user
    Split by file name
    "Sharding" or "partitioning"
  Rarely perfect -> only scales so far
    Global operations, e.g. search
    Load imbalance
      One very active user
      One very popular file
      -> one server 100%, added servers mostly idle
      -> Nx servers -> 1x performance

Topic: fault tolerance
  Can I use my files if there's a failure?
    Some part of network, some set of servers
  Maybe: replicate the data on multiple servers
    Perhaps client sends every operation to both
    Maybe only needs to wait for one reply (see the Go sketch below)
  Opportunity: operate from two "replicas" independently if partitioned?
  Opportunity: can 2 servers yield 2x availability AND 2x performance?
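The "send every operation to both replicas, wait for only one reply" idea can be sketched in a few lines of Go. This is a toy made up for illustration, not course code: doRPC stands in for a real network RPC, and the replica names and sendToBoth are placeholders.

  package main

  import "fmt"

  // doRPC stands in for a real RPC to one replica; here it just succeeds
  // immediately so the example runs on its own.
  func doRPC(replica string, op string) string {
          return fmt.Sprintf("%s executed %q", replica, op)
  }

  // sendToBoth issues the same operation to both replicas concurrently and
  // returns as soon as either one replies, masking a slow or dead replica.
  func sendToBoth(op string) string {
          done := make(chan string, 2) // buffered so the slower goroutine can still finish
          for _, r := range []string{"replicaA", "replicaB"} {
                  go func(replica string) { done <- doRPC(replica, op) }(r)
          }
          return <-done // first reply wins
  }

  func main() {
          fmt.Println(sendToBoth("delete grades.txt"))
  }

A real client would also have to decide what to do if neither replica ever answers, and whether the two replicas end up with the same state -- which is where the consistency topic below comes in.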
"read yields most recently written value" hard due to partial failure, replication/caching, concurrency Problem: keep replicas identical If one is down, it will miss operations Must be brought up to date after reboot If net is broken, *both* replicas maybe live, and see different ops Delete file, still visible via other replica "split brain" -- usually bad Problem: clients may see updates in different orders Due to caching or replication I make grades.txt unreadable, then TA writes grades to it What if the operations run in different order on different replicas? Consistency often hurts performance (communication, blocking) Many systems cut corners -- "relaxed consistency" Shifts burden to applications LABS focus: fault tolerance and consistency -- central to distrib sys lab 1: simple replicated lock server labs 2/3/4: storage servers progressively more sophisticated (tolerate more kinds of faults) patterened after real systems, e.g. MongoDB Lab 4 has core of a real-world design for 1000s of servers what you'll learn from the labs easy to listen to lecture / read paper and think you understand building forces you to really understand you'll have to do some design yourself we supply skeleton, requirements, and tests but we leave you substantial scope to solve problems your own way you'll get experience debugging distributed systems we've tried to ensure that the hard problems have to do w/ distrib sys not e.g. fighting against language, libraries, &c thus Go (type-safe, garbage collected, slick RPC library) thus fairly simple services (locks, k/v store) what fault-tolerance properties might we want? available durable consistent what kinds of faults might we want to tolerate? network: lost packets duplicated packets temporary network failure server disconnected network partitioned server: server crash+restart server fails permanently all servers fail simultaneously -- power/earthquake bad case: crash mid-way through complex operation bugs -- but not in this course malice -- but not in this course client fails tools for dealing with faults? retry -- e.g. if pkt is lost, or server crash+restart replicate -- e.g. if one server or part of net has failed replace -- for long-term health Lab 1 is not a very robust distributed system but it will help you get up to speed on Go and distributed programming and it will help you understand problems later labs will solve the lab 1 lock service multiple clients named locks only one client at a time can hold a lock -- e.g. to edit grades file Lock(lockname) -- returns true/false Unlock(lockname) -- returns true/false app code: while Lock("grades") == false: sleep a bit ... 
Unlock("grades") diagram: clients, primary, backup, Lock, forward, &c fault-tolerance scheme: replication via "primary/backup" replicate the service state for each lock, locked or not one copy on primary server, one on backup server clients send operations to primary primary forwards to backup so backup can update its state primary replies to client after backup replies lab 1 "failure model": single fail-stop failure ONLY failure: one server halts NO network failures NO re-start of servers thus: no response means that the server has halted thus: at least one of primary/backup guaranteed to be alive fail-stop reasonable for tightly coupled primary+backup fail-stop not a good model for biggish internet-based systems due to network failures client Lock(l): send msg to primary if no response, send to backup return server response server Lock(l) handler: if locked(l) return false else forward to backup return true Q: does primary have to wait for reply from backup? Q: what should primary do if no response from backup? Q: what if primary gets Lock(a) / Lock(a) at same time from diff clients primary processes them in some order (one wins, other loses) forwards to backup backup receives in the other order is that OK? Q: how to ensure backup sees operations in same order as primary? Q: is it OK that client might send same request to both primary and backup? Q: what happens if the primary fails just after forwarding to the backup? service must act like a single copy! this is what makes lab 1 interesting! Q: what if primary is up but network isn't delivering msgs? you do not have to cope with this scenario for Lab 1 (later...) you'll be writing client/server software using Remote Procedure Call (RPC) idea: applications can just say x = Lock("x") RPC library turns that into messages w/ server RPC usually a library that client/server application code use client app stubs srvr handlers RPC --> RPC client stub e.g. Lock(l): put arguments in packet send to server wait for reply packet extract return value from reply, return server: wait for next client request extract arguments call handler put return value in reply packet send reply pkt to client server/net failure -> client gets no reply we give you lab 1 skeleton code -- let's have a look handout (l01.go) is an abridged version of lab skeleton diagram of lab 1 s/w: app Clerk primary backup RPC RPC RPC net --> net --> net (note multiple concurrent clients) lockmain.go the client has to know how to contact primary and backup "ports" -- network names client.go struct Clerk why an object: client side needs to keep state client may want more than one Clerk add state for backup to detect duplicate how servers are named Lock() application calls Lock; Lock returns yes/no arguments; space for reply call() sends, waits, fills in reply srv, rpcname call() knows how to send/recv lots of arg types if error, maybe never got reply!!! thus ok && reply.OK you must fix to try primary, then backup server.go LockServer why an object: holds server's state so you can have more than one in a process Go basically demands it Lock handler go RPC library calls Lock() when RPC arrives in a new thread mutex locks spring into life fix Lock to forward RPC to the backup, wait for reply StartServer register named handlers -- so Call() can find them create socket on which to listen for RPC requests create thread (why?) to: loop: connection, start Go library *thread*