CASPAR: BREAKING SERIALIZATION IN LOCK- FREE MULTICORE SYNCHRONIZATION

Size: px

Start display at page:

Download "CASPAR: BREAKING SERIALIZATION IN LOCK- FREE MULTICORE SYNCHRONIZATION"

Tyrone Golden
6 years ago
Views:

1 CASPAR: BREAKING SERIALIZATION IN LOCK- FREE MULTICORE SYNCHRONIZATION Tanmay Gangwani, Adam Morrison *, Josep Torrellas University of Illinois, Urbana- Champaign Technion Israel InsBtute of Technology *

2 Concurrent Data Structures push/pop Core stack- top

3 Concurrent Data Structures push/pop Core stack- top Lock- based SynchronizaJon Core Lock Stack Core Core Core stack- top Time stack- top Unlock Stack

4 Concurrent Data Structures push/pop Core stack- top Lock- free SynchronizaJon X Core 1. read stack- top 2. read- modify- write stack- top CAS (A, X) stack- top stack- top A X A Time Core Core Core

5 CAS Failures Adding node to shared stack Core A stack- top Core B Time void push () { Node *new_top = malloc(); while(true) { old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) return; } } new_top A new_top B

6 CAS Failures Adding node to shared stack Core A stack- top Core B Time void push () { Node *new_top = malloc(); while(true) { old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) return; } } new_top A old_top A = stack- top new_top B old_top B = stack- top

7 CAS Failures Adding node to shared stack Core A stack- top Core B Time void push () { Node *new_top = malloc(); while(true) { old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) return; } } new_top A old_top A = stack- top stack- top new_top A new_top B old_top B = stack- top new_top B

8 CAS Failures Adding node to shared stack Core A stack- top Core B Time void push () { Node *new_top = malloc(); while(true) { old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) return; } } CAS FAILURE new_top A old_top A = stack- top new_top A stack- top stack- top new_top A new_top B old_top B = stack- top new_top B new_top B

9 Load- to- CAS Atomicity Adding node to shared stack void push () { Node *new_top = malloc(); while(true) { ATOMIC old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) } } return; Cache Line Lock NO CAS FAILURES

10 Scourge of SerializaJon Atomic Compute new Load old CAS(old, new) Core A Compute new Core B Compute new Core C Compute new new top old top new old CAS Stall Stall Time CAS CAS

11 ContribuJon Atomic new Compute new Load old CAS(old, new) CASPAR reduces stalls in lock- free programs using two new top top techniques : old new ü Eager old Forwarding ü Group ValidaUon Core A Compute new CAS Core B Compute new CAS Stall Core C Compute new Stall Time CAS

12 CASPAR Fights SerializaJon Baseline (Prior work) ContribuBons Hardware Queueing Directory chains cores in hardware ContribuUon I : Eager Forwarding Directory facilitates passing of specula9ve values between cores ContribuUon II : Parallel ValidaUon Directory facilitates group- valida9on of mul9ple cores q Uses cache line bit for locking q Dynamic detecbon of contended CAS No source code modificabons q Deadlock- free

13 SerializaJon in Hardware Queueing Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E ld old A ld old B ld old C ld old D ld old E ld old F D C Serial passing of cache line between cores B A Directory

14 SerializaJon in Hardware Queueing Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E D Serial passing of cache line between cores Line C B A Directory

15 SerializaJon in Hardware Queueing Atomic Compute new Load old CAS(old, new) Serial passing of cache line between cores Core A Core B Core C Core D Core E Line Core F F E F D E F D C E F D B C E F D A B C E F Directory

16 ContribuJon I : Eager Forwarding - Insight Adding node to shared stack Core A Core B Core C void push () { Node *new_top = malloc(); while(true) { Compute new Compute new Compute new Atomic old_top = stack; new_top- >next = old_top; if(cas(&stack, old_top, new_top)) CAS Stall Stall Time } } return; new_top generated early on, independent of old_top CAS CAS

17 CASPAR Eager Forwarding Adding node to shared stack Core A Core B Core C void push () { Node *new_top = malloc(); while(true) { Compute new new A Compute new new B Compute new Atomic old_top = stack; new_top- >next = old_top; if(cas(&stack, old_top, new_top)) } } return; new_top generated early on, independent of old_top CAS ValidaUon CAS ValidaUon CAS Time Out- of- window SpeculaBve ExecuBon azer register check- poinbng

18 CASPAR Eager Forwarding Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E ld old A ld old B ld old C ld old D ld olde ld old F D C B A Directory

19 CASPAR Eager Forwarding Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E D Line C B A Directory

20 CASPAR Eager Forwarding Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E new A new B new C new D new E Core F F E D C B A new F new E new D new C new B new A Directory

21 CASPAR Eager Forwarding Atomic Compute new Load old CAS(old, new) Serial ValidaBon Core A Core B Core C Core D Core E Line Spec. Core F F E F D E F D C E F D B C E F A D B C E F new F new E F new D E F new C D E F new B D C E F new A D B C E F Directory

22 LimitaJons of Eager Forwarding Core A Core B Core C Compute new Compute new Compute new new A new B Long speculabon Time CAS CAS CAS ValidaUon Compute new SpeculaBve execubon not possible beyond this point ValidaUon Stall

23 ContribuJon II : Parallel ValidaJon Core A Compute new Core B Compute new Core C Compute new TWO- PHASE COMMIT (2PC) new A new B PROTOCOL TO PERFORM PARALLEL VALIDATION Time CAS CAS CAS ValidaUon ValidaUon Idea: Validate in the directory without ever sending the full line to the core Directory

24 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E Parallel passing of speculabve values between cores ld old A ld old B ld old C ld old D ld olde ld old F D C B A Directory

25 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E new A new B new C new D new E Core F F E D C B F A E new F new E new D new C new B F new A E Directory

26 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E Line Line == new A?? Core F F E D C B F A E new F new E new D new C new B F new A E Directory

27 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Parallel ValidaBon Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E Core F F F E D E D C C B F B A E new F new F E new D E new D C new C B F new B A E Directory

28 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Parallel ValidaBon Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E Prep, new B Prep, new C Prep, new D Prep, new E Core F Prep, new F F F E D E D C C B F B A E new F new F E new D E new D C new C B F new B A E Directory

29 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Parallel ValidaBon Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E Ack Ack Ack Nack Ack Core F F F E D E D C C B F B A E new F new F E new D E new D C new C B F new B A E Directory

30 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Spec. Core A Core B Core C Core D Core E Spec. Core F Parallel passing of speculabve values between cores Commit Commit Commit Resume Ack ing cores halt. Resume re- starts execubon Parallel ValidaBon F E new F new E Directory

31 Outcome ü Fully Parallel Data Transfer ü Fully Parallel De- queue

32 EvaluaJon Sniper simulator, 64 cores, OOO Kernels : FIFO, LIFO, LIFO- push- only, Mandelbrot, Larson ApplicaUons : Galois graph analybcs suite (3x2), Barcelona OpenMP Tasks Suite (1x2) 100% 100% 30% 30% 10% 10% 5% 5% L

33 EvaluaJon 100% 100% 30% 30% 10% 10% 5% 5% L LF

34 EvaluaJon 100% 100% 30% 30% 10% 10% 5% 5% L LF Q

35 EvaluaJon 100% 100% 30% 30% 10% 10% 5% 5% L LF Q EF

36 EvaluaJon Independent Set (IS) gains from Eager Forwarding since early data transfer exposes work that could be done speculabvely and in parallel 100% 100% 30% 30% 10% 10% 5% 5% L LF Q EF

37 EvaluaJon Connected Components (CC) and Delaunay TriangulaBon (DT) gain from Parallel ValidaBon as it removes stalls in EF using group commit 100% 100% 30% 30% 10% 10% 5% 5% 2.5x L LF Q EF C

38 Also in the paper ü InformaBon on proposed hardware changes ü Handling memory consistency issues ü Detailed working and correctness of Eager Forwarding and 2PC protocols ü Scalability plots, cycle- level breakdown for applicabons ü SupporBng other primibves (e.g. LL/SC)

39 Conclusions CASPAR enables true parallelism in codes using lock- free synchronizabon - Eager Forwarding : Parallel transfer of data - Parallel ValidaBon : Parallel de- queue No source code modificabons Future Work - Can be used to break conflict serializabon in TM designs

40 CASPAR: BREAKING SERIALIZATION IN LOCK- FREE MULTICORE SYNCHRONIZATION Tanmay Gangwani, Adam Morrison *, Josep Torrellas University of Illinois, Urbana- Champaign Technion Israel InsBtute of Technology *

41 CASPAR BACKUP

42 Maintaining Consistency Core A Core B ld old new A ld old {new value forwarded to core B} Z = 1 CAS(old, new).. = Z (read 0) Time ValidaUon

43 Maintaining Consistency Core A Core B ld old new A ld old {new value forwarded to core B} Z = 1 CAS(old, new) Coherence mechanism Forwarding causes no consistency violauons.. = Z (read 0) Squash and roll- back Time

44 EvaluaJon (kernels) CAS throughput improvements - EF over Base : 53% - C over Base : 83% - EF over Queue : 10% - C over Queue : 32%

45 EvaluaJon (scalability) LIFO- push- only Mandelbrot LIFO CASPAR greatly improves scalability with number of cores

46 CASPAR in EnJrety EffecBve SerializaBon Module I : Hardware Queueing Directory chains cores in hardware Core A Core B Core C Module II : Eager Forwarding Directory facilitates passing of specula9ve values between cores Core A Core B Core C Breaking SerializaBon Module III : Two- Phase Group Commit Passing SpeculaBve Value Directory facilitates collec9ve- valida9on of mul9ple cores Directory Core A Core B Core C Group ValidaBon

c 2016 Tanmay Gangwani

c 2016 Tanmay Gangwani BREAKING SERIALIZATION IN LOCK-FREE MULTICORE SYNCHRONIZATION BY TANMAY GANGWANI THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in