6.828 2005 Lecture 18: Coordination

Introduction
  JOS and v6 are meant to run on single-CPU machines
  most high-end machines, and soon most desktops, will have multiple CPUs
  diagram: CPU1, CPU2, bus, memory
  what would have to change?

Easy plan: both CPUs can run user programs
  only one CPU can run in the kernel
  if CPU1 is in the kernel (syscall or interrupt) and a user program on CPU2
    makes a system call, CPU2 must wait until CPU1 returns to user space
  also may need to force a re-schedule or TLB flush on the other CPU
    inter-processor interrupts

Hard plan: let both CPUs run in the kernel, to get higher performance
  many workloads are kernel-intensive
  JOS would need one kernel stack per CPU
  would JOS otherwise be correct?

Look at Example 1: would it work correctly in multi-CPU JOS?
  what does "correct" mean?
    as if the insert()s occurred one at a time, in some order
    e.g. two insert()s yield a list that has two more elements
  what could go wrong with two CPUs?
    two CPUs call insert() at about the same time, for the same head
    both read the old head
    both set their nl->next to the old head
    both write the head
    one insert()'s write goes second
      the other insert() is effectively ignored
    so two insert()s could yield a list with only one more element

How can we ensure only one CPU is in lines A/B?
  borrow the B_BUSY flag idea from v6?  does Example 2 work?
  no: there are still two operations that we want to be indivisible:
    MOV busy, %eax
    MOV $1, busy
  but we're in luck: x86 gives us the indivisible instruction we need
    XCHG %eax, (busy)
  "exchange":
    1. freeze other CPUs' memory activity
    2. temp := busy
    3. busy := %eax
    4. %eax := temp
    5. un-freeze other CPUs
  steps 1 and 5 make XCHG special: it is "locked"
    special signal lines on the inter-CPU bus, bus arbitration

We really want "test-and-set-locked" (TSL) rather than raw XCHG
  always set the lock variable
  tell us its old value
    if it was zero, we now hold the lock
    if it was one, it's still one, but someone else holds it
  Example 3 wraps TSL around XCHG,
    and wraps acquire()/release() around TSL

Our acquire/release is ideal for short critical sections
  increment a counter, search the i-node cache, allocate a free buffer
  often called "spin locks"

What are spin locks not so great for?
  long critical sections may waste the waiters' CPU time
  it's bad to sleep while holding locks
    so locks can't replace the v6 ILOCK and B_BUSY mechanisms
    though they can ensure that those flags are set correctly
  acquire doesn't suppress interrupts
    what if the interrupt code itself tries to take a held lock? deadlock!
    could combine spl with acquire (see the sketch after the lock-granularity
      discussion below)
    or could never take interrupts in the kernel, like JOS
      another CPU may still take the interrupt; that's OK: no deadlock
      even if the interrupt handler needs a lock

Lock granularity
  my list_lock protects all lists
    inserts to different lists block each other
    a lock per list would waste less time spinning
  so you might want "fine-grained" locks, one for every object
  BUT acquire/release are expensive (500 cycles on my 3 GHz machine?)
    because they need to talk off-chip
  AND "correctness" is not that simple
    what if the invariant is "every buffer must be on exactly one of the
      free list and a device list"?
    then per-list locks are irrelevant!
  so you might want "large-grained" locks
    a lock around the whole file system?
    a lock around each block device and its buffers?
  tension: concurrency vs lock overhead
  this is hard to get right; Linux fiddles with it all the time
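A rough sketch of the spl+acquire idea, to make it concrete: mask interrupts
on the local CPU before spinning, so an interrupt handler on this CPU can
never deadlock against a lock this CPU already holds.  It assumes the TSL()
from Example 3 below, plus hypothetical splhi()/splx() helpers that raise
the interrupt priority (returning the old level) and restore a saved level;
those names are stand-ins, not v6's exact calls.

    /*
     * Sketch: spl+acquire.  Assumes TSL() from Example 3 and
     * hypothetical splhi()/splx() interrupt-priority helpers.
     */
    struct SpinLock {
      int locked;         // set/cleared with TSL, as in Example 3
      int saved_pl;       // interrupt priority saved by acquire_irq()
    };

    acquire_irq(SpinLock *lck)
    {
      int pl = splhi();   // mask interrupts on this CPU, remember old level
      while(TSL(&(lck->locked)) != 0)
        ;                 // spin; no interrupt can preempt us here
      lck->saved_pl = pl; // stash the old level until release_irq()
    }

    release_irq(SpinLock *lck)
    {
      int pl = lck->saved_pl;
      lck->locked = 0;    // give up the lock first...
      splx(pl);           // ...then restore the old interrupt level
    }

Raising the priority before the TSL matters: an interrupt that slipped in
between winning the lock and masking could spin forever on the same lock
on this CPU.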
Lock-free data structures
  XCHG made two critical instructions atomic
  could we use atomic instructions directly, without locks?
    then we would not need to spin
    and wouldn't need to worry about interrupts in insert()
  can we implement insert() directly with XCHG?
      nl->next = nl;
      XCHG(&head, &nl->next);
    too bad XCHG won't take two memory operands!
  but CMPXCHG will: it checks that head hasn't changed before exchanging
      CMPXCHG %old, %new, addr
    (pseudo-syntax: the real instruction compares %eax against addr and
     stores the new value only if they match)
    a lock-free insert() built on it is sketched at the very end of these notes
  also LOCK ADD %eax, addr, and friends

****

Mellor-Crummey and Scott, "Algorithms for Scalable Synchronization on
Shared-Memory Multiprocessors", TOCS, February 1991
  all about the cost and scalability of locking
  what if you have 10 CPUs waiting for the same lock?

We need to know more about how the memory system works
  mostly interested in how per-CPU caching works, since that's where we get speed
  two architectures: bus-based and directory-based
  we'll only look at bus-based

Bus-based shared memory with caching (e.g. the Sequent Symmetry)
  CPUs, caches, bus, memory
  each CPU's cache controller "snoops" every bus transaction
  each cache line holds: address, value, and a state:
    S   shared / clean
    X   exclusive / dirty
    -   not valid
  invariant: an address is X in exactly one cache,
    or there is no X and all S caches and memory hold the same value
  a CPU can read from its own cache if the line is S or X
  a CPU can write to its own cache (no bus traffic) only if the line is X
  otherwise the CPU must broadcast a request on the bus
    if writing: our cache goes to X, other caches go to -
    if reading: our cache goes to S, any cache holding it X goes to S
      we fetch from the X cache or from memory
  cached reads and writes are ~100x faster than going over the bus

What's the cost of our simple spinning acquire/release?
  (their Algorithm 1, *without* the delays)
  each of the 10 CPUs gets the lock in turn
  meanwhile, the remaining CPUs sit in XCHG on the lock
    the lock's line must be X in a CPU's cache to run XCHG
    otherwise all might read, then all might write
  so the bus is busy all the time with XCHGs!

Can we avoid constant XCHGs while the lock is held?
  test-and-test-and-set: only run the expensive TSL if the lock looks free
  spin on an ordinary load instruction, so the cache line stays S

    acquire(l)
    {
      while(1){
        while(l->locked != 0)
          ;
        if(TSL(&l->locked) == 0)
          return;
      }
    }

  suppose 10 CPUs are waiting; count the cost in total bus transactions
    CPU1 gets the lock with one transaction
      which sets the lock's cache line to - in the other CPUs
    the other 9 CPUs each use the bus once in XCHG
    then everyone has the line S, so they spin locally
    CPU1 releases the lock
    CPU2 gets the lock with one transaction
    8 CPUs each use the bus once...
  so 10 + 9 + 8 + ... ≈ 50 transactions: O(n^2) in the number of CPUs!
  look at the "test-and-test-and-set" line in Figure 6

Can we have n CPUs acquire a lock in O(n) time?

What is the point of the exponential backoff in Algorithm 1?
  does it buy us O(n) time for n acquires?
  is there anything wrong with it?
    it may not be fair
    exponential backoff may stretch the delay after a release

What's the point of the ticket locks, Algorithm 2?  (sketch below)
  one interlocked instruction to get my ticket number
  then I spin on now_serving with ordinary loads
  release() just increments now_serving
  why is that good?
    fair
    no exponential-backoff overshoot
    no spinning on XCHG
  but what's the cost, in bus transactions?
    while the lock is held, now_serving is S in all caches
    release makes it - in all caches
    then each waiter uses a bus transaction to fetch the new value
    so still O(n^2)

What's the point of the array-based queuing locks, Algorithm 3?  (sketch below)
  a lock has an array of "slots"
  a waiter allocates a slot and spins on that slot
  release wakes up just the next slot
  so O(n) bus transactions to get through n waiters: good!
  the anderson lines in Figures 4 and 6 are flat-ish
    they only go up because the lock's own data structures are protected
    by a simpler lock
  but O(n) space *per lock*!
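A rough sketch of a ticket lock in the spirit of Algorithm 2.  The paper's
fetch_and_increment is built here from the x86 LOCK XADD instruction (32-bit
x86, as in Example 3); the names (TicketLock, fetch_and_increment, and so on)
are this sketch's, not the paper's code.

    /*
     * Sketch: ticket lock (in the spirit of Algorithm 2).
     */
    struct TicketLock {
      volatile unsigned int next_ticket;   // next ticket to hand out
      volatile unsigned int now_serving;   // ticket currently allowed in
    };

    // atomically add 1 to *addr and return the old value
    unsigned int
    fetch_and_increment(volatile unsigned int *addr)
    {
      unsigned int old = 1;
      // lock xaddl: the register gets the old *addr,
      // *addr gets old *addr + register, all atomically
      asm volatile ("lock; xaddl %0,%1"
                    : "+r" (old), "+m" (*addr)
                    :
                    : "memory");
      return old;
    }

    ticket_acquire(TicketLock *l)
    {
      unsigned int my_ticket = fetch_and_increment(&l->next_ticket);
      while(l->now_serving != my_ticket)
        ;                        // spin on an ordinary load (line stays S)
    }

    ticket_release(TicketLock *l)
    {
      l->now_serving = l->now_serving + 1;  // wake exactly the next ticket holder
    }

Only the holder writes now_serving, so release needs no locked instruction;
but, as counted above, that one write still invalidates now_serving in every
waiter's cache.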
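A rough sketch of an array-based queuing lock in the spirit of Algorithm 3,
reusing the fetch_and_increment() helper from the ticket-lock sketch.  NSLOT,
the field names, and the layout are this sketch's assumptions; the real
algorithm also pads each slot to its own cache line so waiters don't share
lines.

    /*
     * Sketch: array-based queuing lock (in the spirit of Algorithm 3).
     * Each waiter spins on its own slot; release touches only the next slot.
     * NSLOT bounds the number of simultaneous waiters.
     */
    #define NSLOT 16

    struct ArrayLock {
      volatile int has_lock[NSLOT];     // 1 in exactly one slot
      volatile unsigned int next_slot;  // bumped atomically by each arriving waiter
    };

    array_init(ArrayLock *l)
    {
      int i;
      for(i = 0; i < NSLOT; i++)
        l->has_lock[i] = 0;
      l->has_lock[0] = 1;               // slot 0 starts out holding the lock
      l->next_slot = 0;
    }

    // returns my slot number; caller passes it back to array_release()
    unsigned int
    array_acquire(ArrayLock *l)
    {
      unsigned int me = fetch_and_increment(&l->next_slot) % NSLOT;
      while(l->has_lock[me] == 0)
        ;                               // spin only on my own slot
      l->has_lock[me] = 0;              // reset my slot for its next reuse
      return me;
    }

    array_release(ArrayLock *l, unsigned int me)
    {
      l->has_lock[(me + 1) % NSLOT] = 1;  // hand the lock to the next slot
    }

The slot array is where the O(n) space per lock comes from: there must be a
slot for every CPU that might wait.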
What's the point of the MCS lock, Algorithm 5?
  constant space per lock, rather than O(n)
  one "qnode" per thread, used for whatever lock it's currently waiting for
  the lock holder's qnode points to the start of the waiter list
  the lock variable points to the end of the list
  acquire adds your qnode to the end of the list
    then you spin on your own qnode
  release wakes up just the next qnode
  (a rough sketch appears after Example 3 below)

/*
 * Example 1: linked list insert.
 */

struct List {
  int data;
  struct List *next;
};

insert(List **head, int data)
{
  List *nl = new List;
  nl->data = data;
  nl->next = *head;   /* line A */
  *head = nl;         /* line B */
}

/*
 * Example 2: insert with busy-waiting.
 */

int busy = 0;

insert(List **head, int data)
{
  List *nl = new List;
  nl->data = data;

  while(busy == 1) { }
  busy = 1;

  /* critical section: */
  nl->next = *head;
  *head = nl;

  busy = 0;
}

/*
 * Example 3: acquire() and release() with test-and-set-locked.
 */

struct Lock {
  unsigned int locked;
};

acquire(Lock *lck)
{
  while(TSL(&(lck->locked)) != 0)
    ;
}

release(Lock *lck)
{
  lck->locked = 0;
}

#ifdef i386
int
TSL(int *addr)
{
  register int content = 1;
  // xchgl content, *addr
  // xchgl exchanges the values of its two operands, while
  // locking the memory bus to exclude other operations.
  asm volatile ("xchgl %0,%1"
                : "=r" (content), "=m" (*addr)
                : "0" (content), "m" (*addr));
  return(content);
}
#endif

Lock list_lock;

insert(List **head, int data)
{
  List *nl = new List;
  nl->data = data;

  acquire(&list_lock);
  nl->next = *head;
  *head = nl;
  release(&list_lock);
}
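A rough sketch of the MCS list-based queuing lock (Algorithm 5), to show where
the constant per-lock space comes from: the lock itself is a single tail
pointer, and every waiter spins on a qnode it supplies itself.
fetch_and_store() and compare_and_swap() are built here from x86 XCHG and
LOCK CMPXCHG (32-bit x86, as in Example 3); all the names are this sketch's,
not the paper's code.

/*
 * Sketch: MCS list-based queuing lock (in the spirit of Algorithm 5).
 * The lock is just a pointer to the tail of the waiter list; each
 * thread passes in its own qnode.
 */
struct QNode {
  struct QNode *volatile next;
  volatile int locked;
};

struct MCSLock {
  struct QNode *tail;     // last qnode in line, or 0 if the lock is free
};

// atomically store val into *addr and return the old contents (x86 XCHG)
struct QNode *
fetch_and_store(struct QNode **addr, struct QNode *val)
{
  asm volatile ("xchgl %0,%1"
                : "+r" (val), "+m" (*addr)
                :
                : "memory");
  return val;
}

// atomically: if *addr == old, set *addr = newval and return 1; else return 0
int
compare_and_swap(struct QNode **addr, struct QNode *old, struct QNode *newval)
{
  struct QNode *prev;
  asm volatile ("lock; cmpxchgl %2,%1"
                : "=a" (prev), "+m" (*addr)
                : "r" (newval), "0" (old)
                : "memory");
  return prev == old;
}

mcs_acquire(MCSLock *l, QNode *me)
{
  QNode *pred;

  me->next = 0;
  me->locked = 1;
  pred = fetch_and_store(&l->tail, me);   // put myself at the end of the line
  if(pred != 0){
    pred->next = me;                      // let my predecessor find me
    while(me->locked)
      ;                                   // spin only on my own qnode
  }
}

mcs_release(MCSLock *l, QNode *me)
{
  if(me->next == 0){
    // no visible successor: if the tail is still me, the line is empty
    if(compare_and_swap(&l->tail, me, 0))
      return;
    while(me->next == 0)
      ;                                   // a successor is mid-acquire; wait for its link
  }
  me->next->locked = 0;                   // hand the lock to the next waiter
}

Compared with the array lock, the queue is made of the waiters' own qnodes,
so the lock stays one word no matter how many CPUs contend.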
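Finally, a rough sketch of the lock-free insert() hinted at in the CMPXCHG
discussion: point the new node at the head we read, and retry if the head
changed before we could swap.  It reuses struct List from Example 1; the
CMPXCHG() helper is this sketch's wrapper around x86 LOCK CMPXCHG, not part
of the original examples.

/*
 * Sketch: lock-free insert() using CMPXCHG.
 * CMPXCHG() stores newval into *addr only if *addr still equals old.
 */
int
CMPXCHG(List **addr, List *old, List *newval)
{
  List *prev;
  asm volatile ("lock; cmpxchgl %2,%1"
                : "=a" (prev), "+m" (*addr)
                : "r" (newval), "0" (old)
                : "memory");
  return prev == old;      // 1 if we swapped, 0 if *addr had changed
}

insert(List **head, int data)
{
  List *nl = new List;
  nl->data = data;
  do {
    nl->next = *head;                        // read the current head
  } while(CMPXCHG(head, nl->next, nl) == 0); // install nl only if head unchanged
}

With no lock there is nothing to hold across an interrupt; insert-only is the
easy case, though: deleting nodes lock-free runs into the classic ABA problem.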