6.828 2011 Lecture 16: Scalable Locks

Plan:
  unscalable spinlocks (xv6/JOS style, ticket)
  scalable locks (Anderson, MCS, ...)

Why are we reading this paper?
  multi-processor systems are common; it's hard to buy a single-core laptop
  we want to use multiple CPUs to improve performance
  synchronization between CPUs is often expensive (performance-wise) and tricky
  scalability limitations of kernel-provided services are passed on to all applications

Let's look at the x86 memory system to understand performance.

Back in the locking lecture, we had a fairly simple model of multiple CPUs:
  shared bus
  memory connected to the bus
  to implement acquire, x86's xchg instruction locked the bus
    that provided atomicity for xchg

In reality things are a bit different:
  bus and memory are quite slow compared to CPU speed
  each CPU has a few levels of cache to compensate
    L1 hit: a few cycles
    RAM: 100s of cycles
  how to ensure caches aren't stale?
    CPU 1 reads x=10, CPU 2 writes x=11, CPU 2 reads x=?
    locks don't fix this problem; it occurs even if the accesses aren't concurrent
    answer: a "cache coherence protocol"

How does cache coherence work?
  many schemes; here's a simple one
  each cache line: state, address, 64 bytes of data
  states: Modified, Shared, Invalid [MSI]
  "snooping" idea: CPUs watch each others' bus reads/writes
    CPU 1 has x cached, sees a write to x on the bus -> invalidate x
    CPU 1 reads x on the bus, CPU 2 has x dirty in its cache -> CPU 2 supplies the data
  how do the CPUs coordinate with each other?
    I + local read   -> S
    I + local write  -> M
    S + local read   -> S
    S + local write  -> M
    S + remote write -> I
    M + remote write -> I (+ give them our content)
    M + remote read  -> S (+ give them our content)
  can read without bus traffic if already S
  can write without bus traffic if already M
  "write-back"
  this is a "write-invalidate" scheme
    alternative: a "distributed-write" scheme
  compatibility of states between 2 CPUs:
                 CPU1
                 M   S   I
             M   -   -   +
      CPU2   S   -   +   +
             I   +   +   +

Cache coherence without snooping:
  the bus is a bottleneck
  use a network to connect the processors
  who knows about the state of a cache line?
    each block of memory has an owner; the owner keeps that information
    see rows 5 and 6 of the table in the paper
  "directory-based" cache coherence

The paper builds locks out of atomic operations:
  CompareAndExchange(addr, val1, val2)
  TestAndSet(addr)
  ReadAndIncrement(addr)
  how does the hardware make these operations atomic?
    on a bus: lock the bus
    on a mesh: get the line in exclusive mode,
      and keep it in exclusive mode for the duration of the instruction

What is the performance of locks?
  assume N CPUs want the lock at the same time
  how long until each CPU gets the lock once?
  measure time in protocol messages, since the interconnect is (relatively) slow

test&set spinlock (xv6/JOS):
  constant bus traffic while spinning
  we don't care if waiting CPUs waste their own time
  we do care if waiting CPUs slow down the lock holder!
  single holding time: O(N), because of the N waiters' traffic
  time for all N to get the lock: O(N^2)
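A minimal sketch of such a test-and-set spinlock, assuming C11 atomics in place of xv6's inline xchg assembly; the struct and function names here are illustrative, and the interrupt handling that xv6's real acquire does is omitted:

  #include <stdatomic.h>

  struct ts_lock {
      atomic_int locked;              /* 0 = free, 1 = held */
  };

  static void ts_acquire(struct ts_lock *l)
  {
      /* atomic_exchange is the portable equivalent of x86 xchg;
         every retry needs the line in M state, so N spinning CPUs
         keep the interconnect busy and slow the holder down */
      while (atomic_exchange(&l->locked, 1) != 0)
          ;                           /* spin */
  }

  static void ts_release(struct ts_lock *l)
  {
      atomic_store(&l->locked, 0);    /* invalidates every waiter's copy */
  }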
How can we use less bus traffic?

test-and-set with exponential backoff (t_s_exp_acquire):
  goal: avoid everyone jumping in at once
    space out attempts to acquire the lock
    simultaneous attempts were the reason for N^2 with test&set
    if the total rate of tries is low, only one CPU will attempt per release
  why not a constant delay?
    each CPU re-tries after a random delay with a constant average
    hard to choose the delay time
      too large: wasted time
      too small: all N CPUs probe multiple times per critical section, so N^2 again
  why exponential backoff? i.e. why start with a small delay and double it?
    try to get lucky at first (maybe only a few CPUs are attempting)
    doubling means it takes only a few attempts until delay >= N * critical section,
      i.e. just one attempt per release
  illustration: 8 CPUs, critical section lasts one time unit
    watch as the density of attempts decreases
    the goal is just one attempt between each pair of releases
  can we analyze the number of probes? not that easy
    suppose it takes time O(N) for all CPUs to succeed
    how many probes does each CPU make in time N? log N
    so total probes: N log N
    cache-coherence traffic: N log N
  problem: unlikely to be fair!
    some CPUs will have much lower delays than others
      they will win, and come back, and win again
    some CPUs will have huge delays and will sit idle
      doing no harm, but doing no work

ticket locks (ticket_acquire):
  (Linux uses ticket locks)
  goal: fairer than exponential backoff, cheaper than t-t-s
  idea: assign numbers, wake up one CPU at a time
    avoids repeated test&set after every release
  why is it fair?
  time analysis:
    what happens in acquire?
      N ReadAndIncrements, though spread out
      in steady state each is just one bus transaction (no re-loads after invalidate)
    what happens after release?
      an invalidate, then N re-loads
    so N^2 cache-coherence traffic
  problem: N^2 still blows up rapidly
  paper: how rapidly can it blow up?
    see Figure 2; oops. why so dramatic and sudden?
    Markov chain analysis:
      time to get the lock increases with N
      probability of contention goes up quickly
      move to the left end
    Figure 9:
      long serial sections don't blow up as badly
      but short ones do! counter-intuitive?

Anderson:
  goal: avoid N^2 and be fair
  what if each core spins on a *different* cache line?
  acquire cost?
    N ReadAndIncrements, then a local spin on a per-CPU cache line
  release cost?
    invalidate someone else's slots[] element
    only they have to re-load
    no other CPUs involved
  so O(N)
  problem: high space cost
    often much more than the size of the protected object

MCS (a sketch appears at the end of these notes):
  one "qnode" per thread, used for whatever lock it's currently waiting for
  the lock holder's qnode is at the start of the list
  the lock variable points to the end of the list
  acquire() adds your qnode to the end of the list
    then you spin on your own qnode
  release() wakes up the next qnode
  space: O(1)
  change in API: callers must pass a qnode to acquire and release
    (because callers allocate the qnodes)

Results on a 48-CPU modern cache-coherent AMD machine, slowest to fastest:
  see Figure 10 and Figure 11
  notice they are all super fast for one CPU
    atomic instructions generate no interconnect traffic if the cache line is already in M state

Conclusion:
  use scalable locks
  even better: fix the underlying problem!
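For reference, a minimal sketch of the MCS acquire/release described above, assuming C11 atomics; the struct and function names are illustrative, not the paper's exact code. Each waiter spins only on the locked flag in its own qnode, so a release invalidates one cache line instead of N, and the caller supplies its own qnode, which is the API change noted above.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct qnode {
      _Atomic(struct qnode *) next;
      atomic_bool locked;                 /* true while this waiter must spin */
  };

  struct mcs_lock {
      _Atomic(struct qnode *) tail;       /* lock variable points to end of list */
  };

  static void mcs_acquire(struct mcs_lock *l, struct qnode *me)
  {
      atomic_store(&me->next, (struct qnode *)NULL);
      atomic_store(&me->locked, true);
      /* atomically add our qnode to the end of the list */
      struct qnode *prev = atomic_exchange(&l->tail, me);
      if (prev == NULL)
          return;                         /* list was empty: we hold the lock */
      atomic_store(&prev->next, me);      /* link ourselves behind our predecessor */
      while (atomic_load(&me->locked))
          ;                               /* spin on our own cache line only */
  }

  static void mcs_release(struct mcs_lock *l, struct qnode *me)
  {
      struct qnode *next = atomic_load(&me->next);
      if (next == NULL) {
          /* no known successor: try to swing the tail back to NULL */
          struct qnode *expected = me;
          if (atomic_compare_exchange_strong(&l->tail, &expected, (struct qnode *)NULL))
              return;                     /* nobody was waiting */
          /* a successor is mid-acquire; wait for it to link itself in */
          while ((next = atomic_load(&me->next)) == NULL)
              ;
      }
      atomic_store(&next->locked, false); /* wake exactly one waiter */
  }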