6.824 Lecture 1: Introduction and lab overview

Introduction
  6.824 is about building distributed systems
    big ideas, useful techniques, implementation, case studies
    lectures on ideas, papers for case studies
    you'll build a real system (yfs, inspired by Frangipani)
  why take the course?
    synthesize many different areas in order to build working systems
    the Internet has made the area much more attractive and timely
    hard/unsolved: not many deployed sophisticated distributed systems
  background: 6.033 ideas and readings
    much more depth here: build a real distributed system

What is a distributed system?
  multiple connected computers cooperate to provide some service
  examples: Internet e-mail, Athena file server, Google MapReduce

What is the point of distributing a system?
  communication, among distant components
    Web
  reliability, from unreliable components
    Argus, BFT, Frangipani
  performance, sum of many components
    aggregate cycles+memory (TreadMarks, Dryad)
    aggregate bandwidth (Coral, Shark)
    aggregate disks (Frangipani)
  isolation, to increase security and failure tolerance
    authentication server, backup server

Challenges
  system design
    how to split functions among computers? (e.g., clients, servers)
    who talks to whom? what info do they exchange?
    today's example: file servers
  performance
    how to divide work? maximize parallelism, minimize interaction
    load balance, avoid bottlenecks (e.g., the network)
  failures
    replication is usually used to cope with failures
    how to tell which replicas are live? or is the network down?
    which replica has the freshest data?
  consistency
    how to keep replicas identical?
    how to sort out many concurrent clients using shared data in a server?
  security
    an adversary may compromise machines or manipulate messages
  network properties
    clusters -- high bandwidth, low latency, high reliability
    wide-area -- low bandwidth, high latency, low reliability
  advice: solutions are complex and hard to get right
    it's easy to make a distributed system slower and less reliable
      than a centralized one
    Lamport's definition: a distributed system is one in which the failure
      of a computer you didn't know existed can render your own unusable
    use a central system if you can
    build a distributed system only if you're forced to

6.824 lecture topics
  infrastructure for building servers and clients: threads+RPC
  distributed programming
  consistency
  fault tolerance
  peer-to-peer/decentralized systems
  security
  case studies
  [continuous thread: system design]

Example: file system (like NFS or AFS)
  lots of client computers
  they all want to see the same file system, so they can cooperate

Simple system design
  one server w/ a disk to store directories and files
  clients send high-level ops, like create(filename), mkdir, &c
  [picture: server, clients, read file, write file, create, etc.]

Topic: implementation
  the server needs to execute multiple operations concurrently
    from multiple clients
    to get good disk and multicore performance
  solution: threads
  server threads may modify shared state (e.g., a file cache) concurrently
    avoid race conditions! (see the mutex/condition-variable sketch below)
  threads often need to wait for each other
    avoid deadlocks!

Topic: consistency
  an operation may update multiple servers, e.g. mv from dir1 to dir2
    will clients see the intermediate state?
    what if something fails partway through?
  multiple copies of each piece of data
    server replicas, caches in clients
    when does a modification appear in all these copies?
  same behavior as a single-machine file system?
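The sketch promised above: a minimal illustration (not lecture or lab code;
the file_cache state and function names are invented) of using a pthreads
mutex to avoid race conditions on shared server state, and a condition
variable so one thread can wait for another:

  #include <pthread.h>
  #include <map>
  #include <string>

  // hypothetical shared state: a cache of file contents
  std::map<std::string, std::string> file_cache;
  pthread_mutex_t cache_mutex = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t cache_cond = PTHREAD_COND_INITIALIZER;
  bool cache_ready = false;

  // writer thread: fills the cache, then wakes up waiters
  void cache_insert(const std::string &name, const std::string &data) {
    pthread_mutex_lock(&cache_mutex);   // avoid races on file_cache
    file_cache[name] = data;
    cache_ready = true;
    pthread_cond_broadcast(&cache_cond);
    pthread_mutex_unlock(&cache_mutex);
  }

  // reader thread: waits until the cache has been filled
  std::string cache_lookup(const std::string &name) {
    pthread_mutex_lock(&cache_mutex);
    while (!cache_ready)                // re-check after every wakeup
      pthread_cond_wait(&cache_cond, &cache_mutex);
    std::string data = file_cache[name];
    pthread_mutex_unlock(&cache_mutex);
    return data;
  }

The while loop around pthread_cond_wait is the standard idiom: a woken
thread must re-check the condition, since other threads may have run first.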
Topic: scalability
  what if the load is too high for one server?
  idea: multiple servers
    split the load among the servers
    we hope that 2 servers will provide 2x performance,
      i.e. that performance will scale w/ # of servers
  how to balance the load? by file ID? by user?
  what if one user suddenly generates a lot of load?
  what if some files are much more popular than others?

Topic: fault tolerance
  can I get to my files when some servers are down or the network fails?
  yes: replicate the data on multiple servers
  problem: replica consistency (a deleted file may still be visible in
    another replica)
  problem: failure agreement, especially if the network is partitioned
  opportunity: operate from two "replicas" independently if partitioned
  opportunity: can 2 servers yield 2x availability AND 2x performance?

Topic: security
  threats: corrupt employees, targeted external attacks,
    spammers collecting botnets
  how does the server know that a request is from me?
  how much do I have to trust the system administrators?
  what if the server code has bugs? (one per 1000 lines...)

Topic: system design
  what if we want to provide an Internet-wide file system?
    aggregate servers at many sites into a unified file system
  more failures, longer delays, multiple administrative domains, security
  we want to understand the individual techniques, and how to assemble them

COURSE STRUCTURE

All announcements and info at http://pdos.csail.mit.edu/6.824
  look at the first lab, due at the end of next week

Meetings: 1/2 lectures, 1/2 paper discussions (or lab help)
  research papers on working systems, starting today
  you must read the papers before class
    otherwise it's boring, and you can't pick it up by listening
  in the future, we will post paper questions 24 hours in advance
    hand in your answer on paper in class, one or two paragraphs

Mid-term quiz in class, and a final exam

Labs: build a real cluster file system a la Frangipani
  labs are due on Fridays

Project: extend the lab in any way you like
  alone or in teams of two
  one-page report
  demo in the last class meeting

Yandong Mao is the TA; office hours are on the Web.

LAB: YET ANOTHER FILE SYSTEM (YFS)

The lab is inspired by Frangipani, a scalable distributed file system
(see the paper).
  designed/built at a research lab, not a product
  you will build a simplified version of Frangipani

Frangipani goals
  scalable storage -- add disk servers
  scalable file service -- add file servers
  consistency, much like a single file server
  adaptive load balance across file servers
  tolerates and recovers from server, network, and disk failures

Frangipani design
  diagram: client workstations, Frangipani servers, Petal servers,
    lock servers
  Petal looks like a single huge disk
    interface: put and get (pretty low-level; see the sketch at the end
      of this section)
    replicates blocks/extents
    add Petal servers to increase storage capacity and throughput
  Frangipani file server
    knows about file names, directories, inodes, &c
    uses Petal to store blocks
    all Frangipani servers serve the same file system
      they communicate only via Petal
    add Frangipani servers to handle more client workstations
  lock servers
    Frangipani servers use the lock server to provide consistency
      e.g., when creating a file, lock the directory
    lock servers are replicated
  no security beyond traditional file system security
    intended for a cluster, not the wide area

This is a common architecture
  many clients
  many front ends to provide CPU to process client requests
  a shared back-end that *only* provides storage
  front ends are independent; they communicate indirectly via
    back-end storage
  only the back-end needs to be fault-tolerant
    since there's no real state in the front-end servers, it's no
      problem if they crash
    easier to provide fault tolerance for storage than for general
      computation
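To make the "back-end that only provides storage" concrete, here is a
hedged sketch of a Petal-like extent interface. The lecture says only
"put and get", so the class name extent_store and the exact signatures
below are invented for illustration; a real Petal also replicates blocks
across servers, which this in-memory sketch omits:

  #include <map>
  #include <string>

  // Hypothetical Petal-like extent store: a flat map from extent IDs
  // to uninterpreted byte strings. Frangipani-style file servers layer
  // names, directories, and inodes on top of this.
  class extent_store {
   public:
    // store (or overwrite) the bytes for extent id
    bool put(unsigned long long id, const std::string &buf) {
      extents[id] = buf;
      return true;
    }
    // fetch the bytes for extent id; false if it doesn't exist
    bool get(unsigned long long id, std::string &buf) {
      std::map<unsigned long long, std::string>::iterator it =
          extents.find(id);
      if (it == extents.end())
        return false;
      buf = it->second;
      return true;
    }
   private:
    std::map<unsigned long long, std::string> extents;
  };

Lab 2's in-memory extent_server has roughly this flavor: yfs_client code
in the front end layers file-system structure over a flat put/get store.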
When might Frangipani *not* scale well?
  i.e. when might we not be able to get more performance by adding servers?
  if the shared LAN is a bottleneck
  if all clients are modifying the same set of files
  if there's a single demanding user

Frangipani scales well in the common case of independent users
  each user's Frangipani server caches that user's files
  Frangipani servers don't need to wait for each other for locks
  so each added server doesn't slow down the other servers => scalable

YFS
  same basic structure as Frangipani, but a single extent server
    i.e. you don't have to implement Petal
  diagram: many clients, one extent_server, multiple lock_servers
  client diagram: app, kernel FUSE module, fuse.cc, yfs_client
  each program is written in C++ with pthreads, using our own RPC library
  the next two lectures cover the infrastructure in detail

Labs: build YFS incrementally
  L1: simple lock server
    threads, locks, condition variables
    then at-most-once RPC
  L2: extent_server, yfs_client
    in-memory store for extent_server
    basic yfs_client: create, lookup, readdir, write, read
  L3: yfs_client + extent_server + lock_server
    mkdir, unlink, and sharing/locks
  L4: caching lock_server
    add revocation to the lock server
  L5: caching extent_server
    consistency using the lock_server
  L6: paxos library
    agreement protocol (who is the master lock server?)
  L7: fault-tolerant lock server
    replicated state machine using paxos
  L8: your choice
    e.g. a distributed extent server for performance or fault tolerance

Lab 1: simple lock server
  what does a lock server do?
    handles acquire and release RPCs from clients
    supports multiple locks; each lock has a number
    acquire(num):
      if any client holds lock num, wait for release(num)
      mark lock num as held
      then the server sends an "OK" reply to the client
    release(num):
      mark lock num as not held
      wake up any waiting acquires
  we supply you with an RPC library and some demo client/server code
    you have to add the locking RPCs -- acquire and release
  RPC library
    greatly simplifies client/server communication over the net
    overall structure:
      client app
      client stubs
      rpcc (one for each server the client talks to)
      RPC library
      ... network ...
      RPC library
      rpcs (just one)
      server handlers
  lock_demo.cc
    new lock_client -- creates the client stubs
    lc->stat(1) -- looks like an ordinary C++ method call
  lock_client.cc
    creates an rpcc -- tells it what server to connect (bind) to
    each stub method: cl->call(...)
      part of the RPC library
      sends the procedure # and the arguments to the server
      waits for the reply
    you need to fill in acquire() and release(); they call cl->call()
  lock_smain.cc
    creates an rpcs
    registers lock_server's handler functions
    you will have to register two new handlers
  lock_server.cc
    handlers
    you will have to add two new handlers
    a table of locks (use the C++ STL map class)
      for each one: held or not held
    (a minimal sketch follows)
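Putting the pieces above together, a minimal sketch of the lock_server
state and handlers. This is illustrative, not the official starter code;
in the real lab the handlers are invoked through the RPC library and
return a status that is sent back to the client:

  #include <pthread.h>
  #include <map>

  // minimal lock_server sketch: a table of locks, each held or not held
  class lock_server {
   public:
    lock_server() {
      pthread_mutex_init(&m, NULL);
      pthread_cond_init(&c, NULL);
    }
    // acquire(num): wait until lock num is free, then mark it held
    int acquire(int num) {
      pthread_mutex_lock(&m);
      while (held[num])            // someone holds it: wait for release
        pthread_cond_wait(&c, &m);
      held[num] = true;
      pthread_mutex_unlock(&m);
      return 0;                    // server then sends "OK" to the client
    }
    // release(num): mark lock num free and wake up waiting acquires
    int release(int num) {
      pthread_mutex_lock(&m);
      held[num] = false;
      pthread_cond_broadcast(&c);
      pthread_mutex_unlock(&m);
      return 0;
    }
   private:
    std::map<int, bool> held;      // lock number -> held?
    pthread_mutex_t m;
    pthread_cond_t c;
  };

Note the single mutex protecting the table and the broadcast on release:
every waiting acquire wakes up, re-checks held[num], and all but one go
back to sleep.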