6.828 2010 Lecture 17: Scalable Locks

Thomas Anderson, The Performance of Spin Lock Alternatives for
Shared-Memory Multiprocessors, IEEE TPDS, Jan 1990.

why are we reading this paper?
  20 years old; still useful?
  multi-processor systems are common; it's hard to buy a single-core laptop
  want to use multiple CPUs to improve performance
  synchronization between CPUs is often expensive (performance-wise) and tricky
  scalability limitations of kernel-provided services are passed on to all applications

let's look at the x86 memory system to understand performance
  back in the locking lecture, we had a fairly simple model of multiple CPUs
    shared bus
    memory connected to the bus
    to implement acquire, x86's xchg instruction locked the bus
      provided atomicity for xchg
  in reality things are a bit different
    bus and memory are quite slow compared to CPU speed
    each CPU has a few levels of cache to compensate
      L1 hit: a few cycles
      RAM: 100s of cycles
  how to ensure caches aren't stale?
    CPU 1 reads x=10, CPU 2 writes x=11, CPU 2 reads x=?
    locks don't fix this problem; it occurs even if the accesses aren't concurrent
    "cache coherence protocol"

how does cache coherence work?
  many schemes; here's a simple one
  each cache line: state, address, 64 bytes of data
  states: Modified, Shared, Invalid [MSI]
  "snooping" idea: CPUs watch each other's bus reads/writes
    CPU 1 has x cached, sees a write to x on the bus -> invalidate x
    CPU 1 reads x on the bus, CPU 2 has it dirty in its cache -> CPU 2 supplies the data
  how do the CPUs coordinate with each other?
    I + local read   -> S
    I + local write  -> M
    S + local read   -> S
    S + local write  -> M
    S + remote write -> I
    M + remote write -> I (+ supply our copy)
    M + remote read  -> S (+ supply our copy)
  can read w/o bus traffic if already S
  can write w/o bus traffic if already M
    "write-back"
  this is a "write-invalidate" scheme
    alternative: "distributed-write"
  compatibility of states between 2 CPUs:
              CPU2
              M  S  I
    CPU1  M   -  -  +
          S   -  +  +
          I   +  +  +
  modern cache coherence is more complex than this!

paper builds locks out of atomic operations:
  TestAndSet(addr)
  ReadAndIncrement(addr)
  how does h/w make these operations atomic on a single bus?
    reserve the bus
    read mem (or snoop it from a cache)
    [add]
    write mem (+ invalidate caches)
    release the bus

so how do we implement locks using atomic operations? how well do they perform?

what is performance?
  assume N CPUs want the lock at the same time
  how long until each CPU gets the lock once?
  measure time in bus operations, since the bus is slow and a shared bottleneck

test-and-set (t_s_acquire):
  constant bus traffic while spinning (see the C sketch below)
  we don't care if waiting CPUs waste their own time
  we do care if waiting CPUs slow the lock holder!
  if the bus is fair, a single holding time = O(N) with N waiters
    since the holder only gets 1/Nth of the bus capacity
  time for all N to get the lock: O(N^2)
  why is this bad?
  how to have less bus traffic from the waiters' TAS instructions?

test-and-test-and-set (t_t_s_acquire):
  idea: spin locally in the cache using an ordinary read, not TestAndSet
    to avoid hogging the bus and slowing the holder
  while the holder holds the lock: waiters spin with the cache line in S state
  on release: waiters invalidate, re-read, one runs TAS and wins
  should be much better than t-s
  what happens during release?
    invalidate
    everyone sees "unlocked"
    everyone is going to run t-s
    first core to run t-s wins
    2nd core runs t-s, loses, goes to top
    3rd core runs t-s, invalidates, goes to top; 2nd core re-loads
    4th core runs t-s, invalidates, goes to top; 2nd and 3rd re-load
    ...
    Nth core runs t-s, invalidates, goes to top; N-2 cores re-load
    there were about N^2 re-loads for a single release!
    so maybe N*N*N for all N CPUs to acquire
    best case: the winner runs t-s before any other core sees the unlock
      so only N-1 cores re-load and see locked=1
      so N^2 in total
  problem: N^3 is bad!
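
a minimal C sketch of the t_s_acquire and t_t_s_acquire loops described above,
using GCC's __sync builtins to stand in for the paper's TestAndSet; the names
and the use of compiler builtins are assumptions for illustration, not the
paper's code:

  struct spinlock {
      volatile int locked;              /* 0 = free, 1 = held */
  };

  /* test-and-set: every spin iteration is an atomic bus operation,
   * so N waiters keep the bus busy and slow the holder */
  void
  t_s_acquire(struct spinlock *l)
  {
      while (__sync_lock_test_and_set(&l->locked, 1))
          ;                             /* spin, generating bus traffic */
  }

  /* test-and-test-and-set: spin with an ordinary read, which hits in the
   * local cache (line in S state) while the lock is held; only run the
   * atomic TAS when the lock looks free */
  void
  t_t_s_acquire(struct spinlock *l)
  {
      for (;;) {
          while (l->locked)
              ;                         /* local spin, no bus traffic */
          if (!__sync_lock_test_and_set(&l->locked, 1))
              return;                   /* won the race after a release */
          /* lost to another core's TAS; go back to local spinning */
      }
  }

  void
  t_s_release(struct spinlock *l)
  {
      __sync_lock_release(&l->locked);  /* store 0; invalidates waiters' copies */
  }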
test-and-set with exponential backoff (t_s_exp_acquire):
  goal: avoid everyone jumping in at once
    space out attempts to acquire the lock
    simultaneous attempts were the reason for N^3 (or N^2 w/ t-t-s)
    if the total rate of tries is low, only one CPU will attempt per release
  why not a constant delay?
    each CPU re-tries after a random delay with a constant average
    hard to choose the delay time
      too large: waste
      too small: all N CPUs probe multiple times per critical section, so N^2
  why exponential backoff? i.e. why start with a small delay and double it?
    try to get lucky at first (maybe only a few CPUs are attempting)
    doubling means it takes only a few attempts until delay >= N * critical section
      i.e. just one attempt per release
  illustration: 8 CPUs, critical section lasts one time unit
    watch as the density of attempts decreases
    goal is just one attempt between each pair of releases
  can we analyze the # of probes? not that easy
    suppose it takes time O(N) for all CPUs to succeed
    how many probes does each CPU make in time N? logN
    so total probes: N*logN
  bus cc (cache-coherence traffic): N*logN
  problem: unlikely to be fair!
    some CPUs will have much lower delays than others
      they will win, and come back, and win again
    some CPUs will have huge delays, and will sit idle
      doing no harm, but doing no work

ticket locks (ticket_acquire):
  (Linux uses ticket locks)
  goal: fairer than exponential backoff, cheaper than t-t-s
  idea: assign numbers, wake up one CPU at a time
    avoid the repeated t-s after every release
  why is it fair?
  time analysis:
    what happens in acquire?
      N ReadAndIncrements, though spread out in steady state
      each is just one bus transaction (no re-loads after invalidate)
    what happens after release?
      invalidate
      N re-loads
      no atomic instructions, so no repeated re-loads as in t-t-s
    so N^2
  bus cc: N^2 (but NOT N^3)
  problem: N^2 still blows up rapidly

anderson: can we do better than N^2?
  can we get rid of the N re-loads after release?
  what if each core spins on a *different* cache line?
  acquire cost?
    N ReadAndIncrements, then a local spin on a per-CPU cache line
  release cost?
    invalidate someone else's slots[] element
    only they have to re-load
    no other CPUs involved
  so O(N)
  problem: high space cost
    often much more than the size of the protected object
  (C sketches of the backoff, ticket, and anderson locks are at the end of these notes)

what results does the paper show? are they what we expect?
  Figure 1: time to do 1,000,000 acquires
    why does it go down? why does it then go up?
    we're happy that spin-on-read has a lower cost
    per-acquire time is linear in N
      so total time is O(N^2), as we predicted
    note it takes 10 to 20 microseconds to acquire/release
  Figure 3: not much to say: anderson wins
    he's expensive with small N b/c he's faking ReadAndIncrement

my results on a 48-core modern cache-coherent AMD machine, slowest to fastest:
  ticket (N^2 w/ a high slope: fetch-and-add expensive???, same at n=2 as anderson)
  t-t-s (only looks good b/c grossly unfair)
  t-s (drifts up, unfair but not as bad as t-t-s)
  anderson (flat, starts higher than t-s b/c fetch-and-add expensive??)
  t-s-exp (flat; wins b/c unfair? some CPUs are shut out, so effective N is lower)
  notice they are all super fast for one CPU
    atomic instructions generate no interconnect traffic if the cache line is in M state

if you think you won't have contention, use t-s
  but what if you later get contention?
    anderson? maybe not -- space costs?
    ticket? slow but fair; must be desirable, since Linux uses them
  best: fix the underlying problem!
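
minimal C sketches of the other three variants (exponential backoff, ticket,
and anderson's array-based lock), again with GCC's __sync builtins standing in
for TestAndSet / ReadAndIncrement; the names, the NCPU bound, and the backoff
constants are illustrative assumptions, not code from the paper or from Linux:

  #define NCPU      48                  /* assumed max number of waiters */
  #define CACHELINE 64

  struct spinlock { volatile int locked; };   /* as in the earlier sketch */

  /* test-and-set with exponential backoff: after each failed TAS, spin
   * locally for a delay that doubles (up to a cap), so that soon roughly
   * one CPU probes per release */
  void
  t_s_exp_acquire(struct spinlock *l)
  {
      unsigned delay = 1;
      while (__sync_lock_test_and_set(&l->locked, 1)) {
          for (volatile unsigned i = 0; i < delay; i++)
              ;                         /* back off without bus traffic */
          if (delay < (1u << 16))
              delay *= 2;
      }
  }

  /* ticket lock: ReadAndIncrement hands out numbers; a release advances
   * now_serving, so exactly one waiter proceeds, but every waiter
   * re-loads the same cache line -- the N^2 term above */
  struct ticketlock {
      volatile unsigned next_ticket;
      volatile unsigned now_serving;
  };

  void
  ticket_acquire(struct ticketlock *l)
  {
      unsigned t = __sync_fetch_and_add(&l->next_ticket, 1);
      while (l->now_serving != t)
          ;                             /* re-load after every release */
  }

  void
  ticket_release(struct ticketlock *l)
  {
      l->now_serving++;                 /* only the holder writes this */
  }

  /* anderson's lock: each waiter spins on its own cache line, so a release
   * invalidates and wakes exactly one CPU; slots[0].has_lock must start
   * out as 1, and counter wraparound is ignored for simplicity */
  struct anderson {
      struct {
          volatile int has_lock;
          char pad[CACHELINE - sizeof(int)];  /* one slot per cache line */
      } slots[NCPU];
      volatile unsigned next_slot;
  };

  unsigned
  anderson_acquire(struct anderson *l)
  {
      unsigned me = __sync_fetch_and_add(&l->next_slot, 1) % NCPU;
      while (!l->slots[me].has_lock)
          ;                             /* local spin on a private line */
      l->slots[me].has_lock = 0;        /* reset the slot for later reuse */
      return me;                        /* caller passes this to release */
  }

  void
  anderson_release(struct anderson *l, unsigned me)
  {
      l->slots[(me + 1) % NCPU].has_lock = 1;  /* wake exactly one waiter */
  }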