Using Software Logging To Support Multi-Version Buffering in Thread-Level Speculation. María Jesús Garzarán, M. Prvulovic, V. Viñals, J. M. Llabería *, L. Rauchwerger ψ, and J. Torrellas. U. of Illinois at Urbana-Champaign; U. of Zaragoza, Spain; * U.P.C., Spain; ψ Texas A&M University
Thread-Level Speculation (TLS): Execute potentially-dependent tasks in parallel. Assume no cross-task dependence will be violated. Track memory accesses; buffer unsafe state. Detect any dependence violation. Squash offending tasks, repair polluted state, restart tasks. Example: for (i=0; i<n; i++) { ... = A[B[i]]; A[C[i]] = ...; } Task J: ... = A[4]; A[5] = ... Task J+1: ... = A[2]; A[2] = ... Task J+2: ... = A[5]; A[6] = ... (RAW dependence: Task J+2 reads A[5], which Task J writes.)
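
The detection step above can be sketched in Python (used here only so the example runs; it is not the language of the talk). This is a minimal sketch, assuming tasks are numbered in sequential order and that a violation occurs when a task writes a location that a higher-numbered task has already read. The function name `find_violations` and the trace format are illustrative, not from the talk, and the sketch ignores intervening writers, so it errs on the conservative side.

```python
def find_violations(trace):
    """Return (addr, writer_task, reader_task) for each out-of-order RAW.

    trace: list of (task_id, op, addr) tuples in the order the accesses
    happen in time; op is "R" or "W". A lower task_id is earlier in
    sequential order. (Hypothetical format, for illustration only.)
    """
    max_reader = {}  # addr -> highest task ID that has read addr so far
    violations = []
    for task_id, op, addr in trace:
        if op == "R":
            max_reader[addr] = max(task_id, max_reader.get(addr, task_id))
        elif op == "W":
            reader = max_reader.get(addr)
            if reader is not None and reader > task_id:
                # A later task consumed this location too early.
                violations.append((addr, task_id, reader))
    return violations


# Tasks J, J+1, J+2 from the slide, numbered 0, 1, 2 (addresses are A[] indices).
# In the first trace, task 2 reads A[5] before task 0 has written it.
out_of_order = [(0, "R", 4), (1, "R", 2), (2, "R", 5),
                (0, "W", 5), (1, "W", 2), (2, "W", 6)]
in_order = [(0, "R", 4), (0, "W", 5), (1, "R", 2),
            (1, "W", 2), (2, "R", 5), (2, "W", 6)]

print(find_violations(out_of_order))  # [(5, 0, 2)]
print(find_violations(in_order))      # []
```

In the first trace task 2 would be squashed and restarted; in the second, the same accesses happen to occur in a safe order and nothing is squashed.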
State Buffering in TLS: Speculative tasks generate speculative memory state. This speculative state must be buffered and managed.
Approaches to Buffering [HPCA-03]. Architectural Main Memory (AMM): speculative task state is kept in caches; main memory only keeps safe program state; cached state merges with memory when a task commits. Future Main Memory (FMM): speculative task state can merge with memory at any time; main memory keeps the future (unsafe) state; the previous state is saved into an Undo Log.
Our Contributions: Software-only design for the undo-log system in FMM. Simplifies the hardware implementation. Very cost-effective approach: on average, it introduces only a 10% slow-down vs. the hardware-only design.
Roadmap: Taxonomy of Buffering (AMM vs FMM); Hardware Support; Software Support; Evaluation; Conclusions
Task Execution under TLS. [Figure: timeline of tasks 1 to 5 running on processors 1 to 3, each marked speculative (Sp) or non-speculative (Non Sp), with the commit token passed between tasks as they commit.]
Architectural Main Memory [HPCA03]: Main memory keeps the architectural (safe) state. Caches keep the speculative state. [Figure: caches holding tasks 3, 4, and 5, with the non-speculative task marked, above main memory holding the architectural state.]
Future Main Memory [HPCA03]: Main memory keeps the future state. Logs keep the previous state. [Figure: tasks 3, 4, 5, and 6 with caches and a log above main memory, which holds the future state.]
Future Main Memory. Example: Task i writes value 2 to address 0x400; Task i+j writes value 10 to address 0x400. Architectural MM cache (Task ID, Tag, Data): (i, 0x400, 2), (i+j, 0x400, 10). Future MM cache after the first write (Task ID, Tag, Data): (i, 0x400, 2).
Future Main Memory. Example: Task i writes value 2 to address 0x400; Task i+j writes value 10 to address 0x400. Architectural MM cache (Task ID, Tag, Data): (i, 0x400, 2), (i+j, 0x400, 10). Future MM cache (Task ID, Tag, Data): (i+j, 0x400, 10). Future MM log, in cache or memory (Overwriting Task ID, Producer Task ID, Tag, Data): (i+j, i, 0x400, 2). Perf: faster commit but slower version recovery. Cost: needs a log.
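
The two buffering schemes in this example can be contrasted with a small Python model. It is a sketch under stated assumptions: the function names and tuple layouts are made up for illustration, writes to an address are assumed to arrive in task order, and, as on the slide, the log entry that would save the pre-speculation value on the very first write is left out.

```python
def amm_buffer(writes):
    """AMM: every speculative version stays buffered, tagged by its task."""
    cache = []                      # (task_id, addr, value)
    for task_id, addr, value in writes:
        cache.append((task_id, addr, value))
    return cache


def fmm_buffer(writes):
    """FMM: keep only the newest version; move the overwritten one to the log."""
    current = {}                    # addr -> (producer_task, value)
    log = []                        # (overwriting_task, producer_task, addr, old_value)
    for task_id, addr, value in writes:
        if addr in current:
            producer, old_value = current[addr]
            log.append((task_id, producer, addr, old_value))
        current[addr] = (task_id, value)
    return current, log


# The slide's example: task i writes 2, then task i+j writes 10, both to 0x400.
writes = [("i", 0x400, 2), ("i+j", 0x400, 10)]
amm_cache = amm_buffer(writes)
fmm_current, fmm_log = fmm_buffer(writes)
print(amm_cache)    # both versions are kept side by side
print(fmm_current)  # only the newest version remains
print(fmm_log)      # the overwritten version, with producer and overwriter
```

The model makes the trade-off visible: AMM must merge versions at commit, while FMM has nothing to merge but must walk the log to recover an old version.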
Hardware Support: Existing TLS protocol. Add hooks to support Software Logging.
TLS protocol [Zhang99]. [Figure: processor and cache, with R and W bits for cached word X, connected through the network to local memory, which keeps the MaxR and MaxW Task IDs for X.]
Hooks to Support Software Logging: 1) Make Task ID pages visible to the software. 2) Cache Task IDs. [Figure: the TLS protocol diagram with the Task ID of X also held in the cache, plus a software log whose entries hold the Producer Task ID and Data.]
Accessing Task IDs in Software: Fixed offset between the mapping of data pages and the corresponding Task ID pages. Use two different loads: ld variable (data) and ld_tid variable (Task ID). [Figure: memory with data pages and Task ID pages separated by a fixed offset.]
Accessing Data: ld variable. The TLB translates the virtual page to the physical data page, and the access goes to the cache as a data access.
Accessing Task IDs: ld_tid variable. The TLB translates the virtual page to the physical data page, the fixed offset is applied, and the access goes to the cache as a Task ID access.
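
The two access paths can be illustrated with a toy address translation in Python. This is only a sketch: the page size, the offset value, the TLB contents, and the function names are all assumptions made for the example, and the talk does not fix whether the offset is added or subtracted, so the offset is a signed parameter here.

```python
PAGE_SIZE = 4096                 # assumed page size
TID_OFFSET = 0x4000_0000         # assumed fixed distance between the two mappings

# Toy TLB: virtual page number -> physical data page number (made-up values).
TLB = {0x10: 0x2A}


def translate(vaddr):
    """Ordinary translation, as used by `ld`."""
    vpn, page_off = divmod(vaddr, PAGE_SIZE)
    return TLB[vpn] * PAGE_SIZE + page_off


def translate_tid(vaddr, offset=TID_OFFSET):
    """Translation used by `ld_tid`: same TLB entry, then the fixed offset.

    Pass a negative offset if the Task ID pages sit below the data pages.
    """
    return translate(vaddr) + offset


vaddr = 0x10123
print(hex(translate(vaddr)))      # 0x2a123
print(hex(translate_tid(vaddr)))  # 0x4002a123
```

Because both paths share one TLB entry, the only extra hardware in this model is the adder that applies the offset.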
Caching Task IDs: Bring the Task ID into the cache with the ld_tid instruction, as regular data. Task IDs can be reused from the cache. Task IDs in memory are updated in hardware by the TLS protocol. To keep cached Task IDs up-to-date in software: st_tid instruction.
Software Logs: A compiler instruments the application. Insert entry in log: before a store operation, add instructions to save the previous value, address, and Task ID in the log. Recycle log entries: free up a task's log when the task commits. Interrupt handler for Recovery: in case of an out-of-order RAW and squash, undo the modifications using data from the log.
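
The recovery handler described above can be sketched in Python as a reverse walk over the log. This is a minimal model, not the handler from the talk: it assumes a single log list rather than per-task logs, numeric Task IDs in sequential order, and stores to each address arriving in task order; the name `recover` and the entry layout are hypothetical.

```python
def recover(memory, task_ids, log, squashed_from):
    """Undo, newest first, every logged store made by a task >= squashed_from.

    memory:   dict addr -> value
    task_ids: dict addr -> Task ID of the version currently in memory
    log:      list of (overwriting_task, addr, old_task_id, old_value),
              in the order the entries were appended
    Returns the log entries that survive (those of tasks that are not squashed).
    """
    kept = []
    for entry in reversed(log):
        overwriting_task, addr, old_task_id, old_value = entry
        if overwriting_task >= squashed_from:
            memory[addr] = old_value        # restore the previous value
            task_ids[addr] = old_task_id    # and the Task ID that produced it
        else:
            kept.append(entry)
    kept.reverse()
    return kept


# Task 1 overwrote the initial value 0 with 2; task 3 then overwrote 2 with 10.
memory = {0x400: 10}
task_ids = {0x400: 3}
log = [(1, 0x400, 0, 0), (3, 0x400, 1, 2)]

remaining = recover(memory, task_ids, log, squashed_from=3)
print(memory, task_ids, remaining)
```

After squashing task 3, memory again holds task 1's value and only task 1's log entry remains, so task 1 can still be undone later if needed.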
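Software Data Structures: Logs are allocated locally before speculation starts. Task Pointer Table (fields: Valid, Task ID, End, Next) and Log Buffer (fields: Task ID, Vaddr, Value). [Figure: Task Pointer Table entries for tasks i and j pointing into the Log Buffer.]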
Instructions to insert entry in log
  ; logging instructions
  addu r4, r3, offset          ; address of the variable
  sw r4, 0(r2)                 ; store in the log
  ld_tid r4, offset(r3)        ; load Task ID
  sw r4, 4(r2)                 ; store in the log
  lw r4, offset(r3)            ; load value of variable
  sw r4, 8(r2)                 ; store in the log
  addu r2, r2, log_record_size ; advance the log pointer
  st_tid r5, offset(r3)        ; store Task ID
  ; original store
  sw r5, offset(r3)
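
The same sequence can be mirrored in Python to make the log record layout explicit. This is a sketch, assuming three 32-bit little-endian fields per record (address at offset 0, Task ID at 4, value at 8, as in the store offsets above); the function name and the dictionaries standing in for memory and Task ID storage are illustrative only.

```python
import struct

LOG_RECORD_SIZE = 12            # three 32-bit words: address, Task ID, value
RECORD = struct.Struct("<III")  # assumed little-endian, 32-bit fields


def logged_store(log, log_ptr, memory, task_ids, addr, value, current_task):
    """Log (address, old Task ID, old value), then perform the store."""
    RECORD.pack_into(log, log_ptr,
                     addr,                    # sw r4, 0(r2)
                     task_ids.get(addr, 0),   # ld_tid + sw r4, 4(r2)
                     memory.get(addr, 0))     # lw + sw r4, 8(r2)
    log_ptr += LOG_RECORD_SIZE                # addu r2, r2, log_record_size
    task_ids[addr] = current_task             # st_tid
    memory[addr] = value                      # the original store
    return log_ptr


log = bytearray(4 * LOG_RECORD_SIZE)   # room for four records
memory = {0x400: 7}
task_ids = {0x400: 1}

ptr = logged_store(log, 0, memory, task_ids, 0x400, 2, current_task=5)
print(ptr, RECORD.unpack_from(log, 0), memory, task_ids)
```

The record keeps everything recovery needs: where to write, which value to restore, and which task produced that value.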
Reducing Overheads: Only first-stores in the task need to create a log entry. Solution: at run-time, check if the store is a first-store. Use the cached Task ID to filter: if (Cached Task ID == Current Task ID) skip logging; else log entry.
Filtering First Speculative Store
  ; logging instructions
  ld_tid r6, offset(r3)        ; load Task ID
  beq r6, r5, no_insert        ; first store?
  addu r4, r3, offset          ; insert as usual
  sw r4, 0(r2)
  ...
  addu r2, r2, log_record_size
  st_tid r5, offset(r3)        ; store Task ID
no_insert:
  sw r5, offset(r3)            ; original store
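
The effect of the filter can be seen by counting log insertions in a small Python model. This is a sketch, assuming the cached Task ID of a location equals the current task's ID exactly when that task has already stored to it; the function name and the store stream are made up for the example.

```python
def count_log_entries(stores, use_filter):
    """Count log insertions for a stream of (task_id, addr) stores."""
    cached_tid = {}                  # addr -> Task ID last written with st_tid
    entries = 0
    for task_id, addr in stores:
        if use_filter and cached_tid.get(addr) == task_id:
            continue                 # not a first store: skip logging
        entries += 1                 # insert log entry as usual
        cached_tid[addr] = task_id   # st_tid
    return entries


# Task 7 stores three times to 0x400 and once to 0x404; task 8 then stores to 0x400.
stores = [(7, 0x400), (7, 0x400), (7, 0x400), (7, 0x404), (8, 0x400)]

print(count_log_entries(stores, use_filter=True))   # 3
print(count_log_entries(stores, use_filter=False))  # 5
```

With filtering, only the first store of each task to each location is logged, which is all that recovery needs.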
Evaluation Environment: Execution-driven simulator. Multiprocessor with 16 processors; 4-issue out-of-order superscalar processors + 2 levels of cache. Compare against: FMM with Hardware Logging [Zhang99]; Advanced AMM system [HPCA03].
Applications. Numerical applications: Apsi (Specfp2000), Dsmc3d and Euler (HPF-2), P3m (NCSA), Tree (Univ. of Hawaii), Track (Perfect). The non-analyzable loops account on average for 61% of the serial execution time. Non-analyzable loops are identified with the Polaris parallelizing compiler. Speed-ups are shown for the non-analyzable loops only.
Execution Time Comparison. [Figure: bar chart of execution time (y-axis 0.2 to 1.8) for P3m, Tree, Apsi, Track, Dsmc3d, Euler, and the Average, comparing FMM with HW and SW logging against AMM.] On average, FMM.Sw introduces only a 10% slow-down over FMM.Hw. On average, FMM.Sw is similar to the advanced AMM.
First Store Filtering. [Figure: bar chart of execution time (y-axis 0 to 2) for P3m, Tree, Apsi, Track, Dsmc3d, and Euler, comparing FMM.Sw with NoFilter.] Filtering has a significant impact on the execution time in Apsi. Since filtering does not hurt, we recommend using it.
Other Results in Paper. Studied design tradeoffs: log accesses bypass / do not bypass the L1 cache; log space is / is not recycled. These do not affect the performance of FMM.Sw when filtering is used.
Conclusions. FMM.Sw is a cost-effective solution: simplified design relative to FMM.Hw; introduces low execution overhead (10% over FMM.Hw). Filtering first stores is beneficial.
FMM.Sw versus Advanced AMM. Advanced AMM needs Version Combining Logic: collect all the versions that are committed; select the youngest one and invalidate the others. FMM.Sw needs: Task ID in main memory; comparator to pick up the youngest version.
Problem: Address task IDs in software. [Figure: the TLB produces a non-existent physical address; the cache (data and ld/st bits) is tagged with the non-existent physical address; a fixed offset is added to reach the task ID in local memory.]
Problem: Address task IDs in software. [Figure: TLB producing a non-existent physical address; cache with data and ld/st bits; NIC with local and shared physical addresses; local memory with task IDs; network to shared data.]
Problem: Address time stamp in software. The OS allocates a page of task IDs and maps the virtual address to a non-existent physical page. [Figure: TLB producing a non-existent physical address; cache with data and ld/st bits; NIC with local and shared physical addresses; local memory with task IDs; network to shared data.]
Speculative protocol. Cache: load and store bits per word in cache. Local Memory: task ID per word. ISA: new ld/st instructions.
Problem: Address task IDs in Software. The task ID is not mapped in virtual space. How to make the task ID visible to the software? Logging instruction: load r3, addr_task ID? Undo Log fields: Vaddr, task ID, Value. Original store: sw r5, offset(r3).
Problem: Address task IDs in Software. The task ID is not mapped in virtual space. How to make the task ID visible to the software? Use the special instruction lh_tid. Logging instruction: lh_tid r3, offset(r3). Undo Log fields: Vaddr, task ID, Value. Original store: sw r5, offset(r3).
Accessing Task IDs in Software. Option 2) Map to a non-existent physical address. [Figure: virtual page mapped to the physical data page and, at a fixed offset, to a non-existent physical page; lh_tid r3, offset(r3) goes to the cache as a task ID access.]
Reducing Overheads: Only first-stores in the task need to create a log entry. Solution: at run-time, check if the store is a first-store. Use the cached Task ID to filter. [Figure labels: Others, Instrumented, First stores; Speculative, Non speculative.]
Filtering first speculative store using extended loads. [Figure: cache line fields load, store, tag, data, with load = 0 and store = 1.]
  ; logging instructions
  xlw r6, r1, offset(r3)       ; store bit goes to r6
  bgtz r6, no_insert           ; first store?
  addu r4, r3, offset          ; insert as usual
  sw r4, 0(r2)
  lh_ts r4, offset(r3)
  addu r2, r2, log_record_size
  sh_tid r5, offset(r3)
no_insert:
  sw r5, offset(r3)
Software handlers. Recovery: on an out-of-order RAW, undo the modifications using data from the log. Retrieval: on some in-order RAWs, the exposed load needs to dig the version out of the log.
[Figure: bar chart of execution time (y-axis 0 to 2.25) for P3m, Tree, Apsi, Track, Dsmc3d, and Euler, comparing the By.R, By.NoR, NoBy.R, and NoBy.NoR configurations, with filtering.]
Accessing Task IDs in Software. Naïve solution: add an extra field in the TLB. [Figure: TLB entry with Virtual Page, Physical Data Page, and Physical Task ID Page; ld var goes to the cache as a data access.]
Accessing Task IDs in Software. Naïve solution: add an extra field in the TLB. [Figure: TLB entry with Virtual Page, Physical Data Page, and Physical Task ID Page; ld_tid var goes to the cache as a task ID access.]
Accessing Task IDs in Software: Fixed offset between the mapping of data pages and the corresponding Task ID pages. No TLB modifications. In the ld_tid instruction, the hardware subtracts the offset to obtain the Physical Task ID page.