The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors

1 The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. Austin T. Clements. Thesis advisors: M. Frans Kaashoek, Nickolai Zeldovich, Robert Morris, Eddie Kohler

8 x86 CPU trends [Plot: clock speed (MHz), power (watts), cores per socket, and total megacycles/sec, on a log scale up to 100,000, with 2005 marked. Sources: Stanford CPUDB, Intel ARK]

11 Parallelize or perish: Software must be increasingly parallel to keep up with hardware, but scaling with parallelism is notoriously hard. [Plot: Exim mail server, messages/second (up to 10k) vs. cores] Problem lies in the OS kernel.

12 OS kernel scalability. Kernel scalability is important: many applications depend on the OS kernel, so if the kernel doesn't scale, many applications won't scale. And it is hard: kernel threads > application threads; diverse and unknown workloads.

18 Current approach to scalable software development: Workload → Plot scalability → Differential profile → Fix top bottleneck. (Corey OSDI ', Linux scalability OSDI '10, Bonsai VM ASPLOS '12, RadixVM EuroSys ')

20 Current approach to scalable software development. Successful in practice because it focuses developer effort. Disadvantages: requires huge amounts of effort; new workloads expose new bottlenecks; more cores expose new bottlenecks; the real bottlenecks may be in the interface design.

24 Interface scalability example: creat("x"), creat("y"), and creat("z") issued concurrently in a process whose file descriptor table already holds stdin, stdout, and stderr. Solution: Change the interface?
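To make the creat example concrete, here is a minimal single-threaded Python sketch, not the kernel's code. `LowestFdTable`, `AnyFdTable`, and `run` are hypothetical names, and the per-core numbering scheme in `AnyFdTable` is just one assumed way an "any FD" interface could be satisfied. The sketch only compares what each caller observes under two call orders.

```python
import itertools


class LowestFdTable:
    """Hypothetical FD table with lowest-available allocation."""

    def __init__(self):
        self.in_use = {0, 1, 2}  # stdin, stdout, stderr

    def creat(self, core, name):
        # Every call scans the one shared table for the lowest free slot.
        fd = next(n for n in itertools.count() if n not in self.in_use)
        self.in_use.add(fd)
        return fd


class AnyFdTable:
    """Hypothetical FD table that may return any unused descriptor."""

    def __init__(self, ncores):
        self.ncores = ncores
        self.allocated = [0] * ncores  # one private counter per core

    def creat(self, core, name):
        # Each core hands out descriptors from its own residue class,
        # so it touches only its own counter.
        fd = 3 + core + self.ncores * self.allocated[core]
        self.allocated[core] += 1
        return fd


def run(make_table, order):
    """Apply the calls in the given order; record what each caller got."""
    table = make_table()
    return {name: table.creat(core, name) for core, name in order}


calls = [(0, "x"), (1, "y"), (2, "z")]  # (core, file name)
lowest_a = run(LowestFdTable, calls)
lowest_b = run(LowestFdTable, calls[::-1])
any_a = run(lambda: AnyFdTable(ncores=3), calls)
any_b = run(lambda: AnyFdTable(ncores=3), calls[::-1])

print("lowest FD, same results in both orders:", lowest_a == lowest_b)
print("any FD, same results in both orders:", any_a == any_b)
```

With lowest-FD allocation the descriptor each caller receives depends on who ran first, so the calls must coordinate. With the assumed per-core scheme the results are fixed by the caller's core alone.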

30 Approach: Interface-driven scalability. The scalable commutativity rule: Whenever interface operations commute, they can be implemented in a way that scales. creat with lowest FD: does not commute (two calls return 3 and 4, so the results depend on their order), so the rule does not promise a scalable implementation. creat with any FD: commutes (e.g., the calls return 42 and 17), so by the rule a scalable implementation exists.

31 Advantages of interface-driven scalability. The rule enables reasoning about scalability throughout the software design process. Design: guides design of scalable interfaces. Implement: sets a clear implementation target. Test: systematic, workload-independent scalability testing.

32 Contributions. The scalable commutativity rule: formalization of the rule and proof of its correctness; state-dependent, interface-based commutativity. Commuter: an automated scalability testing tool. sv6: a scalable POSIX-like kernel.

33 Outline. Defining the rule: definition of scalability, intuition, formalization. Applying the rule: Commuter, evaluation.

35 A scalability bottleneck [Plot: normalized throughput vs. cores for gmake and Exim] One contended cache line: a single contended cache line can wreck scalability.

37 Cost of a contended cache line [Plot: cycles to read a contended cache line (axis up to 3.5k) for a writer + N readers]

41 What scales on today's multicores? [Table: whether accesses to one cache line scale when Core X and Core Y each write (W), read (R), or do not touch (-) it] We say two or more operations are scalable if they are conflict-free. Good approximation of current hardware.
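As a rough illustration of "conflict-free", the sketch below assumes the usual reading of the table: accesses conflict when one core writes a cache line that another core reads or writes. The function name `is_conflict_free` and the `(core, line, kind)` trace format are hypothetical, chosen only for this example.

```python
def is_conflict_free(accesses):
    """Check a trace of (core, cache_line, kind) tuples, kind in {'R', 'W'}.

    Assumed definition: the trace is conflict-free if no cache line written
    by one core is read or written by any other core.
    """
    readers, writers = {}, {}
    for core, line, kind in accesses:
        table = writers if kind == "W" else readers
        table.setdefault(line, set()).add(core)

    for line, writing_cores in writers.items():
        # A written line may only ever be touched by a single core.
        touching = writing_cores | readers.get(line, set())
        if len(touching) > 1:
            return False
    return True


shared_reads = [(0, "A", "R"), (1, "A", "R")]     # R / R on one line
write_and_read = [(0, "A", "W"), (1, "A", "R")]   # W / R on one line
private_writes = [(0, "A", "W"), (1, "B", "W")]   # W / - on each line

print(is_conflict_free(shared_reads))
print(is_conflict_free(write_and_read))
print(is_conflict_free(private_writes))
```

Under this assumed definition, only the write-and-read trace is reported as conflicting. Any number of cores reading the same line, or writing disjoint lines, counts as conflict-free.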

42 The intuition behind the rule: Whenever interface operations commute, they can be implemented in a way that scales. Operations commute → results independent of order → communication is unnecessary → without communication, no conflicts.

52 Example: Reference counter. Five threads each perform one operation on a counter: T1 iszero() → F; T2 iszero() → F; T3 dec() → 2; T4 dec() → 1; T5 dec() → 0. R1 (the two iszero() calls) commutes; conflict-free implementation: shared counter. R2 (dec() → 2 and dec() → 1) does not commute because dec() returns counter value. R2' (the same calls with dec() returning only ok) does commute; conflict-free implementation: per-core counter. R3 depends on state: it commutes if the initial value > 3, but not if the initial value is 3.
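The regions in the reference counter example can be checked by brute force. The sketch below uses a simplified, sequential notion of commutativity, not the formal SIM commutativity defined later in the talk: a set of operations commutes from a given state if every order gives each operation the same result and leaves the same final state. The names `commutes`, `dec_returning_value`, and `dec_returning_ok` are hypothetical, and the exact membership of R3 (one iszero() plus three dec() calls) is an assumption.

```python
from itertools import permutations


def iszero(count):
    return count, count == 0


def dec_returning_value(count):
    return count - 1, count - 1      # result exposes the new counter value


def dec_returning_ok(count):
    return count - 1, "ok"           # result reveals nothing about the value


def commutes(initial, ops):
    """True if all orders give each op the same result and same final state."""
    outcomes = set()
    for order in permutations(range(len(ops))):
        state = initial
        results = [None] * len(ops)
        for i in order:
            state, results[i] = ops[i](state)
        outcomes.add((state, tuple(results)))
    return len(outcomes) == 1


r1 = [iszero, iszero]
r2 = [dec_returning_value, dec_returning_value]
r2_prime = [dec_returning_ok, dec_returning_ok]
r3 = [iszero, dec_returning_ok, dec_returning_ok, dec_returning_ok]

print("R1 :", commutes(3, r1))
print("R2 :", commutes(3, r2))
print("R2':", commutes(3, r2_prime))
print("R3 from 4:", commutes(4, r3))
print("R3 from 3:", commutes(3, r3))
```

In this model the only difference between R2 and R2' is the return value of dec(), which is what lets a per-core counter satisfy R2' without cores communicating.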

53 Formalizing the rule. Definitions: history, reordering, commutativity. Formal scalable commutativity rule.

56 Histories capture state and arguments. A history H is a sequence of invocations and responses on threads. Legal history: T1 invokes inc(), T2 invokes iszero(), iszero() returns T, inc() returns ok. Illegal history: inc() returns ok, and only then does iszero() return T. A specification S defines an interface: S is the set of legal histories giving the allowed behavior of an interface [Herlihy & Wing, '90]. This lets us talk about interfaces, arguments, and state without specifying an implementation or a state representation.

59 Reorderings. A reordering H' is a permutation of H that maintains operation order for each individual thread (H|t = H'|t for all t). [Diagram: example reorderings of a history in which T1 runs inc() → ok and T2 runs iszero() → T]
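The definition of a reordering can be sketched as an enumeration of interleavings. In this sketch a history is assumed to be a list of `(thread, event)` pairs. That encoding and the function name `reorderings` are illustrative choices, not notation from the talk.

```python
def reorderings(history):
    """Yield every permutation of history that keeps each thread's own order.

    history is a list of (thread, event) pairs; the result includes the
    original history itself.
    """
    per_thread = {}
    for thread, event in history:
        per_thread.setdefault(thread, []).append((thread, event))

    def interleave(queues):
        if all(not q for q in queues.values()):
            yield []
            return
        for thread, queue in queues.items():
            if queue:
                rest = dict(queues)
                rest[thread] = queue[1:]   # consume this thread's next event
                for tail in interleave(rest):
                    yield [queue[0]] + tail

    yield from interleave(per_thread)


history = [
    ("T1", "inc()"),      # invocation on T1
    ("T2", "iszero()"),   # invocation on T2
    ("T2", "T"),          # response on T2
    ("T1", "ok"),         # response on T1
]

all_reorderings = list(reorderings(history))
for h in all_reorderings:
    print(h)
```

Restricting any yielded history to one thread gives back that thread's original sequence, which is the H|t = H'|t condition.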

64 Commutativity. A region Y of a legal history XY SIM-commutes if every reordering Y' of Y also yields a legal history and every legal extension Z of XY is also a legal extension of XY'. (And this must be true for every prefix of every reordering of Y.) [Diagram: a history on threads T1 and T2 split into X = I1() I2() R2 R1, Y = I3() R3 I4() R4, and Z = I5() R5 I6() R6, where I is an invocation and R a response]

68 The formal scalable commutativity rule. Let S be a specification with a reference implementation M. Consider a history XY where Y commutes in XY and M can generate XY. There exists a correct implementation of S whose execution of XY is conflict-free in the commutative region Y. [Diagram: construction of an implementation M' over regions X and Y that emulates M]

71 Applying the rule to real systems. Commuter takes an interface specification (e.g., POSIX) and an implementation (e.g., Linux) and produces all scalability bottlenecks.

72 Input: Symbolic model

    SymInode    = tstruct(data = tlist(SymByte), nlink = SymInt)
    SymIMap     = tdict(SymInt, SymInode)
    SymFilename = tuninterpreted('filename')
    SymDir      = tdict(SymFilename, SymInt)

    class POSIX:
        def __init__(self):
            self.fname_to_inum = SymDir.any()
            self.inodes = SymIMap.any()

        ... dst=SymFilename)
        def rename(self, src, dst):
            if src not in self.fname_to_inum:
                return (-1, errno.ENOENT)
            if src == dst:
                return 0
            if dst in self.fname_to_inum:
                self.inodes[self.fname_to_inum[dst]].nlink -= 1
            self.fname_to_inum[dst] = self.fname_to_inum[src]
            del self.fname_to_inum[src]
            return 0

77 Commutativity conditions. Pipeline: Symbolic model → Analyzer → Commutativity conditions. rename(a, b) and rename(c, d) commute if: (1) both source files exist and all names are different; (2) neither source file exists; (3) a xor c exists, and it is not the other rename's destination; (4) both calls are self-renames; (5) one call is a self-rename of an existing file and a ≠ c; (6) a and c are hard links to the same inode, a ≠ c, and b = d. Important to have discriminating commutativity conditions: if required to commute in all states, rename almost never commutes. More commutative cases → more opportunities to scale. Captures more operations applications actually do.
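The rename conditions can be spot-checked with a concrete, non-symbolic version of the model. This is a simplified sketch and not Commuter's analyzer: `ConcreteFS` and `renames_commute` are hypothetical names, inodes are reduced to a link count, and commutativity is approximated as "both orders give each call the same return value and leave the same final state".

```python
import copy
import errno


class ConcreteFS:
    """Concrete stand-in for the symbolic model: names, inode numbers, links."""

    def __init__(self, fname_to_inum, nlink):
        self.fname_to_inum = dict(fname_to_inum)
        self.nlink = dict(nlink)  # inode number -> link count

    def rename(self, src, dst):
        if src not in self.fname_to_inum:
            return (-1, errno.ENOENT)
        if src == dst:
            return 0
        if dst in self.fname_to_inum:
            self.nlink[self.fname_to_inum[dst]] -= 1
        self.fname_to_inum[dst] = self.fname_to_inum[src]
        del self.fname_to_inum[src]
        return 0

    def snapshot(self):
        return (sorted(self.fname_to_inum.items()), sorted(self.nlink.items()))


def renames_commute(fs, call_a, call_b):
    """Run the two renames in both orders on copies of fs and compare."""
    one, two = copy.deepcopy(fs), copy.deepcopy(fs)
    a1 = one.rename(*call_a)
    b1 = one.rename(*call_b)
    b2 = two.rename(*call_b)
    a2 = two.rename(*call_a)
    return (a1, b1) == (a2, b2) and one.snapshot() == two.snapshot()


fs = ConcreteFS({"f0": 1, "f2": 2}, {1: 1, 2: 1})

# Both sources exist and all names differ.
print(renames_commute(fs, ("f0", "f1"), ("f2", "f3")))
# Neither source exists.
print(renames_commute(fs, ("a", "b"), ("c", "d")))
# Both sources exist but share a destination.
print(renames_commute(fs, ("f0", "f1"), ("f2", "f1")))
# Only one source exists, and it is renamed onto the other call's source.
print(renames_commute(fs, ("f0", "f1"), ("f1", "f5")))
```

The first two calls fall under conditions listed on the slide, and the last two violate them. Comparing final states stands in for the "every legal extension" clause of the formal definition, which is a deliberate simplification.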

78 Test cases. Pipeline: Symbolic model → Analyzer → Commutativity conditions → Testgen → Test cases. Example test case for rename(a, b) and rename(c, d) when both source files exist and all names are different (+ 26 more):

    #include <fcntl.h>   /* creat */
    #include <stdio.h>   /* rename */
    #include <unistd.h>  /* close */

    void setup() { close(creat("f0", 0666)); close(creat("f2", 0666)); }
    void test_opa() { rename("f0", "f1"); }
    void test_opb() { rename("f2", "f3"); }

79 Output: Conflicting cache lines. Pipeline: Symbolic model → Analyzer → Commutativity conditions → Testgen → Test cases → Linux on Mtrace/QEMU → Conflicting cache lines. For the test case that runs rename("f0", "f1") as test_opa and rename("f2", "f3") as test_opb: d_entry.d_lock, inode_cache, +17 more conflicts.

80 Evaluation Does the rule help build scalable systems?

83 Commuter finds non-scalable cases in Linux (Linux 3.8, ramfs) [Heatmap: every pair of open, link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, and memwrite, shaded from "all tests conflict-free" to "all tests conflicted"] 13,664 total test cases; 68% are conflict-free. Sources of conflicts: directory-wide locking, file descriptor reference counts, address space-wide locking. Many potential future bottlenecks.

84 sv6: A scalable OS. POSIX-like operating system. File system and virtual memory system follow the commutativity rule. Implementation using standard parallel programming techniques, but guided by Commuter.

86 Commutative operations can be made to scale [Heatmap: every pair of open, link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, and memwrite, shaded from "all tests conflict-free" to "all tests conflicted"; annotated "Zero cache lines shared"] 13,664 total test cases; 99% are conflict-free. Remaining 1% are mostly "idempotent updates": two lseeks of same FD to the same offset; two pwrites of same data to same offset.

87 Refining POSIX with the rule: lowest FD versus any FD; stat versus xstat; unordered sockets; delayed munmap; fork+exec versus posix_spawn.

89 Commutative operations matter to app scalability [Plot: total emails/sec (up to 70k) vs. # cores for a qmail-like multithreaded mail server, one curve per API set] Commutative APIs: any FD, unordered sockets, posix_spawn. Non-commutative APIs: lowest FD, ordered sockets, fork+exec.

90 Related work. Commutativity and concurrency [Bernstein '81] [Weihl '88] [Steele '90] [Rinard '97] [Shapiro '11]. Laws of Order [Attiya '11]. Disjoint-access parallelism [Israeli '94]. Scalable locks [MCS '91]. Scalable reference counting [Ellen '07, Corbet '10].

92 Conclusion Whenever interface operations commute, they can be implemented in a way that scales. Design Implement Test Check out the code at

The original version of this paper was published in the Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13).

DOI:10.1145/3068914 The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. By Austin T. Clements, M. Frans Kaashoek, Eddie Kohler, Robert T. Morris, and Nickolai Zeldovich