EXPLOITING SEMANTIC COMMUTATIVITY IN HARDWARE SPECULATION

Size: px

Start display at page:

Download "EXPLOITING SEMANTIC COMMUTATIVITY IN HARDWARE SPECULATION"

Primrose Cummings
5 years ago
Views:

1 EXPLOITING SEMANTIC COMMUTATIVITY IN HARDWARE SPECULATION GUOWEI ZHANG, VIRGINIA CHIU, DANIEL SANCHEZ MICRO 2016

2 Executive summary 2 Exploiting commutativity benefits update-heavy apps Software techniques that exploit commutativity incur high runtime overheads (STM is 2-6x slower than HTM) Prior hardware exploits only single-instruction commutative operations (e.g., addition) CommTM exploits multi-instruction commutativity Extends coherence protocol to perform commutative operations locally and concurrently Leverages HTM to support multi-instruction updates Benefits speculative execution by reducing conflicts Accelerates full applications by up to 3.4x at 128 cores

3 Commutativity Commutative operations produce equivalent results when reordered No true data dependence No need for communication Software exploits commutativity but incurs high run-time overheads Single-instruction commutativity ADD MIN OR Coup [Zhang et al, MICRO 2015] Multi-instruction commutativity Ordered put Set insertion Top-K insertion 3

4 Commutativity 4 Commutative operations produce equivalent results when reordered No true data dependence No need for communication Software exploits commutativity but incurs high run-time overheads Multi-instruction example: set (linked-list) insertion head null insert(a); insert(b); head b a null insert(b); insert(a); head a b null Different but semantically equivalent states

5 Commutativity Commutative operations produce equivalent results when reordered No true data dependence No need for communication Software exploits commutativity but incurs high run-time overheads Single-instruction commutativity ADD MIN OR Coup [Zhang et al, MICRO 2015] Multi-instruction commutativity Ordered put Set insertion Top-K insertion CommTM 5

6 Example: addition in conventional HTM 6 read A: 20 read add(a, 1): Txn 0 load A add(a, 1): Txn 1 load A void add (int* counter, int delta) { int v = load(counter); store(counter, nv);

7 Example: addition in conventional HTM 6 add(a, 1): Txn 0 load A store A A: 20 read write read Conflict! load A add(a, 1): Txn 1 void add (int* counter, int delta) { int v = load(counter); store(counter, nv);

8 Example: addition in conventional HTM 6 read write add(a, 1): Txn 0 load A store A A: 20 load A add(a, 1): Txn 1 void add (int* counter, int delta) { int v = load(counter); store(counter, nv);

9 Example: addition in conventional HTM 6 add(a, 1): Txn 0 load A add(a, 1): Txn 2 restart store A load A A: 24 load A load A restart store A load A store A add(a, 1): Txn 1 load A add(a, 1): Txn 3 store A load A restart void add (int* counter, int delta) { int v = load(counter); store(counter, nv);

10 Example: addition in conventional HTM 6 add(a, 1): Txn 0 load A add(a, 1): Txn 2 restart store A load A A: 24 load A load A restart store A load A store A add(a, 1): Txn 1 load A add(a, 1): Txn 3 store A load A restart void add (int* counter, int delta) { int v = load(counter); store(counter, nv); Traffic Serialization Wasted transactional work

11 Example: addition in CommTM 7 A: 20 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv);

12 Example: addition in CommTM 7 ADD A: 20 ADD A: 0 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv);

13 Example: addition in CommTM 7 ADD add(a, 1): Txn0 A: 20 ADD A: 0 read read load[add] A load[add] A add(a, 1): Txn1 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv);

14 Example: addition in CommTM 7 ADD add(a, 1): Txn0 A: 20 read write load[add] A ADD A: 0 read write load[add] A add(a, 1): Txn1 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv);

15 Example: addition in CommTM 7 ADD A: 21 add(a, 1): Txn0 load[add] A ADD A: 1 load[add] A add(a, 1): Txn1 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv);

16 Example: addition in CommTM 7 ADD A: 22 add(a, 1): Txn0 load[add] A add(a, 1): Txn2 load[add] A ADD A: 2 load[add] A load[add] A add(a, 1): Txn1 add(a, 1): Txn3 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv);

17 Example: addition in CommTM 7 ADD add(a, 1): Txn0 A: 22 load[add] A add(a, 1): Txn2 load[add] A ADD A: 2 load[add] A load[add] A add(a, 1): Txn1 add(a, 1): Txn3 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv); load A

18 Example: addition in CommTM 7 add(a, 1): Txn0 A: 24 reduction load[add] A add(a, 1): Txn2 load[add] A load[add] A load[add] A add(a, 1): Txn1 add(a, 1): Txn3 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv); load A User-defined reduction

19 Example: addition in CommTM 7 add(a, 1): Txn0 A: 24 reduction load[add] A add(a, 1): Txn2 load[add] A load[add] A load[add] A add(a, 1): Txn1 add(a, 1): Txn3 void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv); load A User-defined reduction Less traffic Concurrent updates Less wasted transactional work Less run-time/memory overheads than STM

20 CommTM

21 Programming interface 9 Transactional update void add (int* counter, int delta) { int v = load(counter); store(counter, nv);

22 Programming interface 9 Transactional update void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv); Labeled loads/stores

23 Programming interface 9 Transactional update void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv); Labeled loads/stores Non-transactional reduction handler void reduce[add] (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv);

24 Programming interface 9 Transactional update void add (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv); Non-transactional reduction handler void reduce[add] (int* counter, int delta) { int v = load[add](counter); store[add](counter, nv); Labeled loads/stores counter delta reduce[add] counter 36

25 Handling arbitrary object sizes 10 For objects smaller than a cache line, assume lines are full of aligned elements and reduce all of them void reduce[add] (int* counterline, int[] deltas) { for (int i = 0; I < intspercacheline; i++) { int v = load[add](counterline[i]); int nv = v + deltas[i]; store[add](counterline[i], nv); counterline deltas reduce[add] counterline For objects larger than a cache line, use a level of indirection

26 Example: set (linked-list) insertion void insert (SetDesc* s, Node* n) { Node* head = load[insert](s); n->next = head; store[insert](s, n); if (head == nullptr) store[insert](s+sizeof(node*), n); Struct SetDesc { Node* head; Node* tail; ; Struct Node { Node* next; ; 11 void reduce[insert] (SetDesc* s0, SetDesc* s1) { if (s1->head == nullptr) return; Node* head0 = load[insert](s0); if (head0 == nullptr) { store[insert](s0, s1->head); else { Node* tail0 = load[insert](s0+sizeof(node*)); tail0->next = s1->head; store[insert](s0+sizeof(node*), s1->tail);

27 Implementation 12 Baseline HTM MESI coherence protocol Eager conflict detection Timestamp-based conflict resolution Lazy version management (buffer speculative data in L1s) Shared L3 / directory L1I L2 L1D L1I L2 L1D Core Core CommTM can be applied to other HTMs and hardware speculation techniques

28 Coherence protocol 13 M MSI M CommTM-MSI W R W R L W, R W R S W W W, R S R W, L L W, R L U label W I I Legend Initiated by own core (gain permissions) Transitions Initiated by others (lose permissions) States Modified Shared Invalid User-defined (read-only) reducible Requests Read Write Labeled load/store

29 Reducible-state transitions 14 Entering U state triggered by labeled load/store Leaving U state with nontransactional reductions load[add](a) L2 L1 Shared cache A: 0 A: 0 L2 L1 A: 20 Hardware shadow thread/interrupt Cannot access other reducible data load(a) L2 L1 Shared cache A: -- 2 A: 24 2 L2 L1 A: 22 Core A: 241 Core 2 add_reduce Legend Modified A: 20 User-defined reducible A: 3

30 Transactional execution 15 Speculative value management for U state is analogous to M state L2 L1 A: 3 A: A: Core tx_begin() ld[add](a) st[add](a) L2 A: 3 L1 A: 4 11 TX (ts:0) Core tx_end() L2 L1 A: 3 A: 4 00 Core tx_begin() ld[add](a) st[add](a) L2 A: 4 L1 A: 5 11 TX (ts:1) Core speculatively-read speculatively-written Invalidation to speculatively accessed data in U state triggers a conflict

31 Gather requests allows more concurrency 16 Motivation Conditional commutativity: Operations commute only when reducible data meets some conditions Frequent reductions triggered by condition checks limit concurrency Gather requests allow partial updates to move across caches without leaving the reducible state Achieves higher concurrency (e.g., for reference counting)

32 Evaluation

33 Evaluation on microbenchmarks 17 Baseline CommTM counter set insertion ordered put top-k insertion Speedup Threads Threads Threads Threads Up to 128x speedup over baseline TM

34 Evaluation on full applications 18 Speedup Baseline CommTM boruvka kmeans ssca2 genome vacation x Threads Threads Threads Speedup Breakdown of core cycles at 128 threads (lower is better) Threads Threads Non-transactional Transactional, ted Transactional, aborted Normalized core cycles Normalized core cycles Baseline CommTM 0.0 Baseline CommTM 0.0 Baseline CommTM 0.0 Baseline CommTM 0.0 Baseline CommTM

35 Conclusions 19 Leverages HTM to support multi-instruction operations Extends coherence protocol to allow local and concurrent updates Bridges the gap between software and hardware speculation Reduces conflicts and serialized transactions significantly Accelerates challenging workloads by up to 3.4x at 128 cores

36 THANKS FOR YOUR ATTENTION! QUESTIONS ARE WELCOME!

Exploiting Semantic Commutativity in Hardware Speculation

Appears in the Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 16 Exploiting Semantic Commutativity in Hardware Speculation Guowei Zhang Virginia Chiu Daniel