Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization

Size: px

Start display at page:

Download "Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization"

Mark Peters
5 years ago
Views:

1 Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization Aritra Sengupta, Man Cao, Michael D. Bond and Milind Kulkarni 1 PPPJ 2015, Melbourne, Florida, USA

2 Programming Language Semantics? Data Races Java provides weak semantics 2

3 Weak Semantics T1 A a = null; boolean init = false; T2 a = new A(); init = true; if (init) a.field++; 3

4 Weak Semantics T1 a = new A(); init = true; No data dependence T2 if (init) a.field++; 4

5 Weak Semantics T1 A a = null; boolean init = false; T2 a = new A(); init = true; if (init) a.field++; 5

6 Weak Semantics T1 T2 init = true; if (init) a.field++; a = new A(); 6

7 Weak Semantics T1 T2 Null dereference init = true; if (init) a.field++; a = new A(); 7

8 Java Memory Model JMM (Manson et al., POPL, 2005) variant of DRF0 (Adve and Hill, ISCA, 1990) Atomicity of synchronization-free regions for datarace-free programs Data races: weak semantics 8

9 Need for Stronger Memory Models The inability to define reasonable semantics for programs with data races is not just a theoretical shortcoming, but a fundamental hole in the foundation of our languages and systems Give better semantics to programs with data races Stronger memory models Adve and Boehm, CACM,

10 Run-time cost Memory Models: Run-time cost vs Strength Synchronization-free region serializability 1 SC Statically Bounded Region Serializability DRF0 Strength 1. Ouyang et al.... and region serializability for all. In HotPar,

11 Run-time cost Memory Models: Run-time cost vs Strength Synchronization-free region serializability SC Statically Bounded Region Serializability 2 DRF0 Strength Sengupta et al. Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability. ASPLOS, 2015.

12 Statically Bounded Region Serializability (SBRS) Compiler demarcated regions execute atomically Execution is an interleaving of these regions Sengupta et al., ASPLOS,

13 Statically Bounded Region Serializability (SBRS) acq(lock) Synchronization operations Method calls Loop backedges methodcall() rel(lock) 13

14 Statically Bounded Region Serializability (SBRS) acq(lock) methodcall() rel(lock) 14

15 Statically Bounded Region Serializability (SBRS) Statically and dynamically bounded Loop backedges 15

16 Overview Enforcement of SBRS with dynamic locks Precise dynamic locks: EnfoRSer-D (our prior work), low contention Enforcement of SBRS with static locks Imprecise static locks: EnfoRSer-S, low instrumentation overhead Hybridization of locks EnfoRSer-H: static and dynamic locks, right locks for right sites? Results EnfoRSer-H does at least as well as either. Some cases significant benefit 16

17 Enforcement of SBRS Prevent two concurrent accesses to the same memory location where one is a write 17

18 Enforcement of SBRS Prevent two Prevent concurrent regions accesses that to have the same memory location races where from one running is a write concurrently! 18

19 Enforcement of SBRS Acquire locks before each memory access Acquire locks at the start of the region 19

20 Enforcement of SBRS Acquire locks before each memory access Precise object locks: dynamic locks Acquire locks at the start of the region 20

21 Enforcement of SBRS Acquire locks before each memory access Acquire locks at the start of the region Region level locks: statically chosen for an access site 21

22 EnfoRSer-D Precise dynamic locks Per-access locks with retry mechanism Compiler Transformations: Speculative execution 22

23 SBRS with Dynamic Locks Y = = X Dynamic per-access locks Z = 23

24 SBRS with Dynamic Locks Y = Program access = X Z = 24

25 SBRS with Dynamic Locks Y = = X Ownership transferred Z = 25

26 SBRS with Dynamic Locks Y = Dynamic locks: precise location, hence no reasoning about races statically! = X Ownership transferred Z = 26

27 SBRS with Dynamic Locks Precise conflict detectionefficient for high-conflicting regions Y = High instrumentation cost at each access = X High overhead for lowconflicting programs Ownership transferred Z = 27

28 Experimental Methodology Benchmarks DaCapo 2006, 9.12-bach Fixed-workload versions of SPECjbb2000 and SPECjbb2005 Platform Intel Xeon system: 32 cores 28

Implementation and Evaluation Developed in Jikes RVM 3.1.

29 Implementation and Evaluation Developed in Jikes RVM Code publicly available on Jikes RVM Research Archive 29

30 %overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 30

31 %overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D % overhead on average hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 31

32 %overhead over unmodified JVM Run-time Performance EnfoRSer-D Precise conflict detection but high instrumentation overhead! Complex compiler transformations (additional code) Can we do better? 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 32

33 EnfoRSer with Static Locks Reduce the instrumentation overhead of EnfoRSer-D Less complex code generation 33

34 EnfoRSer-S Static region level locks Racing sites acquire same lock Coarsened locks to reduce instrumentation overhead 34

35 SBRS with Locks on Static Sites L0 Static locks: racy sites protected by same lock L1 L2 Static Region Locks Y = = X Z = 35

36 SBRS with Locks on Static Sites L0 L1 L2 All locks acquired before access Y = = X Z = 36

37 SBRS with Locks on Static Sites L0 L1 L2 Y = Ownership transferred = X Z = 37

38 SBRS with Locks on Static Sites L0 L1 Imprecise and does not lower instrumentation overhead! L2 Y = Ownership transferred = X Z = 38

39 SBRS with Locks on Static Sites L012 Coarsened single lock Y = = X Z = 39

40 SBRS with Locks on Static Sites Low instrumentation L012 overhead Coarsened single lock Imprecise Y = conflict detection = X Z = 40

41 EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 41

42 EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 42

43 EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 RACE 43

44 EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 RACE Naik et al. s 2006 race detection algorithm, Chord 44

45 EnfoRSer-S R1 R2 R3 R4 L3 s0 s2 s5 s7 s3 L2 s8 s1 s4 s6 L4 L0 L1 45

46 EnfoRSer-S R1 R2 R3 R4 L0 L0 L1 L3 s0 s2 s5 s7 s3 s8 s1 s4 s6 46

47 EnfoRSer-S R1 R2 R3 R4 L0 L0 L1 L3 L1 L1 L2 L2 L4 s0 s2 s5 s7 s3 s8 s1 s4 s6 47

48 EnfoRSer-S R1 R2 R3 R4 L0 L0 L1 L3 L1 L1 L2 L2 L4 s0 s2 Does not reduce s3 instrumentation overhead! s5 s7 s8 s1 s4 s6 48

49 EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 49

50 EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 Same Region Static Locks (SRSL) RACE 50

51 EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 Same Region Static Locks (SRSL) RACE L0 L1 51

52 EnfoRSer-S R1 R2 R3 R4 L0 L0 L0 L1 s0 s2 s5 s7 s3 s8 s1 s4 s6 52

53 EnfoRSer-S R1 R2 R3 R4 s0 L0 Reduces instrumentation L0 L0 L1 overhead s2 s5 s7 Increases contention s3 s8 s1 s4 s6 53

54 %overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S 11,000 3,500 12,000 33,000 10,000 4,300 3, , hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 54

55 %overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S 11,000 3,500 12,000 33,000 10,000 4,300 3, , % overhead on average hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 55

56 %overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S 11,000 3,500 12,000 33,000 10,000 4,300 3, , EnfoRSer-S performs better 2600% overhead on average 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 56

57 Hybridizing Locks High contention sites: Precise dynamic locks (precise conflict detection) Low contention sites: Single static lock (low instrumentation overhead) 57

58 EnfoRSer-H Static locks to reduce instrumentation Dynamic locks for precise conflict detection Correctly and efficiently combine: best of both 58

59 %overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S EnfoRSer-H 11,000 3,500 12,000 33,000 10,000 4,300 3, , % overhead on average hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 59

60 %overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S EnfoRSer-H 11,000 3,500 12,000 33,000 10,000 4,300 3, , Provides nearly the best of either approaches! % reduction in overhead 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 60

61 %overhead over unmodified JVM Run-time Performance EnfoRSer-D EnfoRSer-S EnfoRSer-H 11,000 3,500 12,000 33,000 10,000 4,300 3, ,600 49% reduction in lock acquires and not increasing contention! % reduction in overhead hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 61

62 %overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S EnfoRSer-H 11,000 3,500 12,000 33,000 10,000 4,300 3, , % reduction in lock acquires Dramatically improves on both when hybridization helps! 87% reduction in overhead hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 62

63 Hybridizing Locks Right synchronization for right program sites Combining different synchronization mechanisms Best of different synchronization mechanisms 63

64 EnfoRSer-H Costbenefit model Profiling Right locks for right sites Assignment algorithm Combine static and dynamic locks 64

65 Two-iteration Methodology Program execution First iteration Profiling (use RACE relation) Assignment Algorithm Between iterations Compute SRSL and estimate cost Program exectuion Second iteration Use hybridized instrumentation 65

66 Assignment Algorithm 66

67 Profiled data Compute initial cost Change a static lock to dynamic lock Change all its racing sites to dynamic locks Profiled data Recompute cost current cost < previous cost No Yes Retain state Revert to previous state 67

68 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 estimatedcost = N i estimatecost(ri) ; 68

69 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 estimatedcost = N i estimatecost(ri) ; 1. Conflicts on each site 2. Lock acquires on each site 69

70 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 70

71 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 71

72 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 If (current cost < previous cost) // retain state 72

73 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 Switch on next site 73

74 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 74

75 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 If (current cost < previous cost) // retain state 75

76 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 Switch on next site 76

77 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 estimatedcost = N i estimatecost(ri) ; current cost > previous cost // revert state 77

78 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 78

79 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 s1, s9 79

80 Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 Final State! 80

81 Lock Assignment R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 L1 81

82 Lock Assignment R1 R2 R3 R4 L1 L1 L1 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 L1 82

83 Lock Assignment R1 R2 R3 R4 L1 L1 L1 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 L1 83

84 Lock Assignment R1 R2 R3 R4 L1 L1 L1 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 84

85 Lock Assignment R1 R2 R3 R4 L1 L1 L1 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 Per-object locks and transformations 85

86 Related Work Use of static locks Chimera, Lee et al., PLDI 2012 Use of static analysis Static Conflict Analysis for Multi-Threaded Object-Oriented Programs, Von Praun and Gross, PLDI, Goldilocks, Elmas et al., PLDI Red Card, Flanagan and Freund, ECOOP 2013 Hybridizing locks Hybrid Tracking, Cao et al., WODET

87 Conclusion Combine synchronization mechanisms Best of different kinds of synchronization Efficient enforcement of SBRS Static locks Dynamic locks Precise conflict detection Low instrumentation overhead Select the best for a region 87

Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability

Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability Aritra Sengupta, Swarnendu Biswas, Minjia Zhang, Michael D. Bond and Milind Kulkarni ASPLOS 2015, ISTANBUL, TURKEY Programming