Improving multicore memory systems

Slide 1: Improving multicore memory systems, and some thoughts on chip multiprocessor programming. NIK Multicore Technology Workshop, 19 November 2007. NTNU Computer Architecture Research Group.

Slide 2: Motivation. [Figure: the processor-memory performance gap; CPU performance improves about 60%/year (Moore's Law) while DRAM improves about 9%/year, so the gap grows roughly 50%/year.] Three important trends in computer architecture: (1) the increasing processor-memory gap (the "memory wall"); (2) increasing memory hierarchy complexity, with the register file, L1 cache and L2 cache on a single chip, which complicates the understanding of performance because cache usage has an increasing influence on it; (3) multicores are here to stay, driven by limitations in instruction-level parallelism (ILP), better power efficiency (Green IT) and reduced design complexity. Our goal: improving multicore memory systems.
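A back-of-the-envelope check of the quoted rates (my own arithmetic, not from the slides): if CPU performance grows by a factor of 1.60 per year and DRAM performance by 1.09 per year, the gap widens by 1.60 / 1.09 ≈ 1.47 per year, which matches the roughly 50%-per-year growth cited above.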

Slide 3: Improving multicore memory systems, overview. Haakon Dybdahl: architectural techniques to improve cache utilization. Marius Grannæs: bandwidth-aware prefetching. Magnus Jahre: miss handling architecture. Research method: find new, preferably simple, cache/memory techniques that can improve performance, and evaluate them by extensive simulations (SimpleScalar, M5). This is a mainstream research theme with high industrial relevance. Goal: publications in top conferences (ISCA, HPCA, PACT, ICS, MICRO), which means really tough competition.

Slide 4: Improving cache utilization. "Architectural Techniques to Improve Cache Utilization", Haakon Dybdahl, PhD thesis, May 2007 [Dybdahl07]. Main results: "An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors", HPCA'07 [DS07]; "A Cache-Partitioning Aware Replacement Policy for Chip Multiprocessors", HiPC 2006 [DSN06] (best paper award); "An LRU-based Replacement Algorithm Augmented with Frequency of Access in Shared Chip-Multiprocessor Caches", Computer Architecture News, 2007 [DSN07]. [Figure: partitioning a 4-way cache between Core 1 and Core 2, rows labelled by set index (HPCA'07).] A sketch of the basic way-partitioning idea follows below.
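To make the partitioning idea concrete, below is a minimal C++ sketch of static way-based partitioning in a 4-way set-associative cache shared by two cores. All names and the fixed 2+2 way split are assumptions made for illustration; the actual HPCA'07 scheme adapts the shared/private boundary dynamically at run time.

    // Minimal illustrative sketch: static way partitioning of a 4-way
    // set-associative cache between two cores. Not the HPCA'07 scheme,
    // which adapts the partition at run time; names are assumptions.
    #include <array>
    #include <cstdint>
    #include <limits>

    constexpr int kWays = 4;

    struct CacheLine {
        std::uint64_t tag = 0;
        bool          valid = false;
        std::uint64_t lastUse = 0;   // timestamp for LRU within the allowed ways
    };

    struct CacheSet {
        std::array<CacheLine, kWays> ways;
    };

    // Static quota: core 0 may allocate in ways 0..1, core 1 in ways 2..3,
    // so one core cannot evict the other core's blocks.
    int victimWay(const CacheSet& set, int core) {
        const int begin = (core == 0) ? 0 : 2;
        const int end   = begin + 2;
        int victim = begin;
        std::uint64_t oldest = std::numeric_limits<std::uint64_t>::max();
        for (int w = begin; w < end; ++w) {
            if (!set.ways[w].valid) return w;      // prefer an invalid (free) way
            if (set.ways[w].lastUse < oldest) {    // otherwise LRU among own ways
                oldest = set.ways[w].lastUse;
                victim = w;
            }
        }
        return victim;
    }

An adaptive version would move the way boundary depending on which core benefits most from extra capacity, which is the general direction of the adaptive scheme cited above.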

Slide 5: Bandwidth-aware prefetching (Marius Grannæs). Off-chip bandwidth is a bottleneck, and the problem is expected to increase. Prefetching can be more aggressive in phases with low traffic, and less aggressive in phases with high traffic; see the sketch below. Diploma thesis [Grannæs06]. [Figure: IPC as a function of prefetching degree and relative off-chip bandwidth.]

Slide 6: Miss handling architecture (Magnus Jahre). Too much miss-level parallelism can reduce performance. NIK presentation on Wednesday. Diploma thesis focusing more on multicore interconnection networks [Jahre07].
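As a minimal sketch of the throttling idea on slide 5 (class name, thresholds and interface are my own assumptions, not the mechanism from [Grannæs06]), the prefetch degree can be raised while measured off-chip bandwidth utilization is low and lowered while it is high:

    // Illustrative sketch only: adjust the prefetch degree from measured
    // off-chip bandwidth utilization. Thresholds and names are assumptions.
    class BandwidthAwarePrefetcher {
    public:
        // Called once per sampling interval with the fraction of off-chip
        // bandwidth used during that interval (0.0 .. 1.0).
        void updateDegree(double bandwidthUtilization) {
            if (bandwidthUtilization < kLowThreshold && degree_ < kMaxDegree) {
                ++degree_;                 // bus mostly idle: prefetch harder
            } else if (bandwidthUtilization > kHighThreshold && degree_ > 0) {
                --degree_;                 // bus congested: back off
            }
        }

        // Number of blocks to prefetch after a demand miss.
        int degree() const { return degree_; }

    private:
        static constexpr double kLowThreshold  = 0.4;
        static constexpr double kHighThreshold = 0.8;
        static constexpr int    kMaxDegree     = 8;
        int degree_ = 2;
    };

Coupling the degree to measured traffic in this way lets the prefetcher exploit idle bus cycles without adding pressure in phases where off-chip bandwidth is already the bottleneck.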

Slide 7: EU 7th Framework Programme, Objective ICT: Computing Systems, Call 1, deadline 8 May 2007. a) Novel architectures for multi-core computing systems: new architectures and the corresponding system-level software and programming environments, advancing from single- to multi-core scalable and customisable on-chip systems incorporating multiple, networked, symmetric or heterogeneous, fixed or reconfigurable processing elements. Priorities include: (1) versatility in terms of performance, power, and coping with the requirements of entire classes of applications and markets, ranging from low-end consumer electronics to high-end computing architectures and applications; (2) programmability to allow harvesting the full potential of the hardware at reasonable effort; and (3) reliability and availability. This includes interconnection (from bus to network-on-chip), memory hierarchies, security, operating systems and run-time tools, languages and resource/domain-aware compilers supporting parallelism and concurrency. b) Reference architectures for generic embedded platforms: (text deleted).

Slide 8: Multi-core computing, a holistic view. There is a need for a holistic view of the research challenges, involving both software and hardware aspects. [Figure: layered view with the keywords versatility; programmability; reliability & availability; parallel & concurrent programming; operating system & system software; multicore, interconnect, memory.]

Slide 9: Our role: linking high-level programming to hardware performance. If interested, contact us.

Slide 10: Parallel programming in the future: homogeneous or heterogeneous? Developing programs is difficult (i.e. the "software crisis"). Developing parallel programs is even more difficult. Developing efficient parallel programs is more difficult still. Developing scalable and efficient parallel programs is more difficult again. Developing portable, scalable and efficient parallel programs is the most difficult of all, but necessary. However, some people still need maximal performance, and they do not want to rewrite their programs (too often, or at all). Shared memory is simpler than message passing! (?) Homogeneous is simpler than heterogeneous! Krste Asanovic, presenting the Landscape report (ACACES, July 2007): parallel programming should be divided into a programmability layer and an efficiency layer (disclaimer: as I remember it).

Slide 11: Conclusion, questions? More information: for multicore programming / HPC, see "The Landscape of Parallel Computing Research: A View from Berkeley" [Berkeley06] and our course TDT6, Heterogeneous and Reconfigurable Parallel Computing; for multicore architecture, see the HiPEAC Roadmap document [HiPEAC07] and our course TDT1, Multicore Architectures and Chip Multiprocessors. We are open to new ideas, projects and collaboration, and we become a member of the HiPEAC-2 NoE (European Network of Excellence on High-Performance Embedded Architecture and Compilation) from February 2008. Questions? And remember the MuCoCoS'08 deadline on 20 November. Also recommended: the HiPEAC'08 conference (Gøteborg, late January 2008) and the HiPEAC summer school ACACES (Italy, July 2008).

Slide 12: References.
[Berkeley06] The Landscape of Parallel Computing Research: A View from Berkeley, December 2006.
[DS07] An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors, H. Dybdahl and P. Stenström, HPCA'07.
[DSN06] A Cache-Partitioning Aware Replacement Policy for Chip Multiprocessors, H. Dybdahl, P. Stenström and L. Natvig, 13th Int'l Conf. on High Performance Computing (HiPC), 2006. Best paper award.
[DSN07] An LRU-based Replacement Algorithm Augmented with Frequency of Access in Shared Chip-Multiprocessor Caches, H. Dybdahl, P. Stenström and L. Natvig, Computer Architecture News, 2007.
[Dybdahl07] Architectural Techniques to Improve Cache Utilization, H. Dybdahl, PhD thesis, IDI, NTNU, May 2007.
[Grannæs06] Bandwidth-Aware Prefetching in Chip Multiprocessors, Marius Grannæs, diploma thesis, IDI, NTNU, June 2006.
[HiPEAC07] The HiPEAC Roadmap, 2007.
[Jahre07] Improving the Performance of Parallel Applications in Chip Multiprocessors with Architectural Techniques, Magnus Jahre, diploma thesis, IDI, NTNU, June 2007.
MuCoCoS'08: Int'l Workshop on MultiCore Computing Systems (Barcelona, March 2008).
NTNU-IDI courses TDT1 and TDT6.
