Assisting Cache Replacement by Helper-Threading for MPSoCs

Size: px

Start display at page:

Download "Assisting Cache Replacement by Helper-Threading for MPSoCs"

Abigail Osborne
5 years ago
Views:

1 Assisting Cache Replacement by Helper-Threading for MPSoCs Masaaki Kondo Graduate School of Information Science and Technology, The University of Tokyo MPSoC Background Increasing number of cores in MPSoCs No more increase in clock speed Performance improvement by parallel processing Not easy to fully utilize all the available cores Typically with a shared last-level cache Effective use in cache area Destructive performance degradation by cache conflict Multiple threads contend cache area due to difference of their access patterns Aggravated with increasing C number of cores MPSoC2015 2

Lines with low-locality keep them in cache evict them from cache ASAP Managing cache lines by helper threading Flexible cache line management by software Locality prediction

2 Cache Conflict Example memory access sequence Core-0: A Core-1: B C D E A Core-0 Core-1 Core-0 data with high temporal locality is evicted!! A AB AC B DB AC CA ED B Shared Cache : Core-0 data : Core-1 data MPSoC Objective and Strategy Alleviating the effect of conflict in shared LLC Lines with high-locality Lines with low-locality keep them in cache evict them from cache ASAP Managing cache lines by helper threading Flexible cache line management by software Locality prediction and replacement control for cache lines Making good use of idle cores with helper threading Some cores tend to be idle in manycore SoCs due to lack of parallelism or smaller number of tasks than cores MPSoC2015 4

3 Helper Thread Assisted Cache Replacement Required steps Helper thread (HT) obtains cache miss info. from cache Access MSHR: Miss State Holding Register via Memory-mapped I/O HT predicts data locality for each missed line Flexibly implemented by software demonstrate 3 prediction methods HT stores predicted result and HW manages replacement based on it RDAT: Reusable Data Address Table Keeps addresses (pages) with high data locality Insert lines to position for next line-fill NDAT: Non-Reusable Data Address Table Keeps addresses (pages) with no data locality Insert lines to position for next line-fill RDAT NDAT MPSoC Baseline Architecture and Control Flow Prediction of data locality HT Private cache is uses as SPM to prevent LLC pollution by HT CT CT 1 Instruct cache management policy Obtaining cache miss info. Replacement Stores cache miss info. Cache misses CT: Computing Thread HT: Helper Thread MSHR: Miss State Holding Register MPSoC2015 6

4 Replacement Control Selecting line insert position based on RDAT and NDAT Miss address page Memory Access v page # RDAT Data Fill v page # NDAT Memory MPSoC Locality Prediction Algorithms Profile-based method Data-locality is profiled in advance Stores only non-reusable data addresses to NDAT Stream-based method Detect stream accesses dynamically by access history Stores addresses of stream data to NDAT (data with stream accesses is not likely to have locality) Virtual-set monitoring method Chasing reusability of evicted lines using virtual-set If an evicted line is accessed shortly, stores its address to RDAT If evicted even from virtual set, stores its address to NDAT MPSoC2015 8

5 Virtual-Set Monitoring Method virtually hit in this range to RDAT virtual-sets evicted from virtual- set virtually hit in this range (delete from NDAT/RDAT) to NDAT Monitors evicted lines from LLC For only a few sample sets HT emulates virtually extended stacks for the sample sets Check the reusability of lines by the virtual-set 1. // allocating virtual-set structure in the memory 2. VirtualSet = Create_VirtualSet(SIZE); 3. while (computing-threads-running) { 4. // obtain cache miss info. from MSHR 5. curmshr = Read_current_MSHR(); 6. if!(curmshr) 7. continue; 8. // check if it hits in the virtual set 9. waynum = ishit(curmshr, VirtualSet); 10. // check its locality and set page num. to NDAT/RDAT 11. if (!waynum) 12. set_ndat_entry(get_page(curmshr)); 13. else if isreusablewayno(wayno) 14. set_rdat_entry(get_page(curmshr)); 15. else 16. delete_ndat_rdat(get_page(curmshr)); 17. } Virtual hit at the middle of stack: high data-locality Register entire-page to RDAT Virtual hit at the end of stack: middle data-locality Delete corresponding page from NDAT and RDAT Evicted from virtual-set: low data-locality Register entire-page to NDAT MPSoC Evaluation Environment MARSSx86 multicore simulator Job mixes with SPEC CPU2006 (1-program/core) MA-high: job mix of memory intensive applications MA-low: job mix of CPU intensive applications MA-mix: job mix of both memory and CPU intensive applications Configuration Parameter L1 D/I-Cache L2 Cache Main Memory Executing instructions number of cores Value 32K, 8-way, 64Bline, 2-cycle latency 1MB,8-way, 64Bline, 9-cycle latency 200-cycle latency 100M inst. for all threads (FastForward:1B inst.) 5 (4-application threads + 1-helper thread) MPSoC

6 Result (4-Compute + 1-Helper Threads) 11.5% 10.6% 5.3% MA-high MA-low MA-mix MPSoC Summary Cache data management by helper threading HT predicts data-locality and set hints to cache controller Very flexible since we can implement multiple management algorithms by software modification Performance (weighted speedup) improvement up to 11.5% Future work Develop more accurate locality prediction algorithm Select appropriate algorithm depending on a job mix Compare with purely hardware-based counterpart MPSoC

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each