Programmable Near-Memory Acceleration on ConTutto


1 Programmable Near-Memory Acceleration on ConTutto
Jan van Lunteren, IBM Research
Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit

2 Team
- IBM Zurich (CH): Jan van Lunteren, Christoph Hagleitner
- IBM Dwingeloo (NL): Leandro Fiorin, Erik Vermij
- IBM Boeblingen (DE): Angelo Haller, Jörg-Stephan Vogt, Harald Huels
- IBM Burlington, Poughkeepsie, Rochester, Yorktown (US): Thomas Roewer, Bharat Sukhwani, Adam McPadden, Dean Sanner, Dave Cadigan, Sameh Asaad

6 POWER8™ System
[Figure labels: POWER8™ processor, ConTutto FPGA, New Technologies, Near-Memory Acceleration]

7 Trends
Power consumption is increasingly dominated by data transfer and memory.
[Figure: Chip-level energy trends. Source: S. Borkar, Exascale Computing - a fact or a fiction?, IPDPS]
[Figure: HPC system-level power break-down. Source: R. Nair, Active Memory Cube, 2nd Workshop on Near-Data Processing]

8 Solutions
Specialization
- Workload-optimized systems: holistic optimization of HW/SW stack
- General-purpose accelerators: GPUs, FPGAs, DSPs
- Reduced programmability: fixed-function accelerators (ASICs)
- Orders-of-magnitude performance/power improvements for selected workloads
Near-memory computing
- Bring computation closer to the data (e.g., card, package, chip, memory periphery/array)
- Reduce power-expensive data transfers by moving from a compute-centric to a data-centric model
[Figures: Near-memory computing in 3D stack; Data-centric computing]

9 Can we combine workload optimization and near-memory computing?
Memory performance and power consumption depend on a complex interaction between workload and memory system:
- locality of reference, access patterns/strides, etc.
- size, associativity, replacement policy, etc.
- interleaving, refresh, buffer hits, etc.
The memory system typically is a black box.
Challenges
- Memory system operation is mostly fixed, providing no or very limited options for adaptation to the workload characteristics
- The opposite happens: bare-metal programming to adapt the workload to the memory system
- Can we make the memory system programmable/adaptive?
- How can we integrate programmable compute capabilities to achieve substantial performance and power gains for a wide range of workloads?
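The interaction between access pattern and memory organization can be illustrated with a small model. The following is a minimal sketch, not taken from the slides: it assumes a hypothetical memory with 8 banks and 1 KiB rows, one open row per bank, and two hypothetical address mappings (`map_linear` and `map_xor`). It counts how often an access finds its row already open when two data streams are read alternately.

```python
NUM_BANKS = 8      # assumed number of banks (hypothetical)
ROW_BYTES = 1024   # assumed row size in bytes (hypothetical)

def map_linear(addr):
    """Consecutive rows are spread over consecutive banks."""
    row_index = addr // ROW_BYTES
    return row_index % NUM_BANKS, row_index // NUM_BANKS  # (bank, row)

def map_xor(addr):
    """Same row number, but the bank is hashed with the row number."""
    row_index = addr // ROW_BYTES
    row = row_index // NUM_BANKS
    return (row_index ^ row) % NUM_BANKS, row  # (bank, row)

def row_buffer_hits(addresses, map_fn):
    """Count accesses whose row is already open in the target bank."""
    open_rows = {}  # bank -> currently open row
    hits = 0
    for addr in addresses:
        bank, row = map_fn(addr)
        if open_rows.get(bank) == row:
            hits += 1
        open_rows[bank] = row  # the accessed row stays open
    return hits

def two_streams(n, base_b, step=64):
    """Alternate between stream A (from address 0) and stream B (from base_b)."""
    addrs = []
    for i in range(n):
        addrs.append(i * step)
        addrs.append(base_b + i * step)
    return addrs

# Stream B starts exactly NUM_BANKS rows after stream A.
pattern = two_streams(16, ROW_BYTES * NUM_BANKS)
print("linear mapping hits:", row_buffer_hits(pattern, map_linear))
print("xor mapping hits:   ", row_buffer_hits(pattern, map_xor))
```

Under these assumed parameters the linear mapping places both streams in the same bank, so every access closes the other stream's row and none of the 32 accesses hits. The XOR mapping separates the streams into different banks, and 30 of the 32 accesses hit. The workload is identical in both cases; only the mapping changed, which is the kind of adaptation a programmable memory system would allow.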

12 Programmable Near-Memory Acceleration
Conventional computer architecture
- Memory system is a slave of the host processor
Novel approach
- Memory system actively participates to ensure that data is stored, accessed and transferred in the most (power-)efficient way, resulting in the highest performance/watt
- Memory system integrates compute capabilities
Access Processor
- Novel programmable architecture
- Enabling/differentiating technologies:
  - programmable state machine technology
  - programmable address mapping scheme
  - power-efficient self-running instructions
- Near-memory accelerators attach to the Access Processor
[Figure labels: shared L3, Memory Controller, Access Processor, Near-Memory Accelerators, Main Memory]

17 Access Processor (AP)
Basic memory controller functions
- Address mapping, access scheduling, refresh, page open/close, etc., are programmable
- Memory details (organization, retention times, etc.) are exposed to the AP program
Near-Memory Accelerator (NMA) support
- AP-NMA interface types
  - L1: tightly coupled, AP generates addresses
  - L2: loosely coupled, AP generates addresses
  - L3: loosely coupled, NMA generates addresses
- Arbitration of processor and NMA accesses
  - fine-grained access bandwidth control
- Interception/redirection/copy of processor accesses to enable on-the-fly processing, snooping/caching address translation tables (virtual/physical)
[Figure labels: Processor, shared L3, NMA, On-the-Fly Processing, Access Processor, Near-Memory Accelerators, Main Memory]
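As an illustration of what fine-grained access bandwidth control between processor and NMA accesses could look like, here is a minimal sketch of a credit-based weighted round-robin arbiter. It is an assumption for illustration only: the slides do not describe the arbitration scheme the AP uses, and the function name `arbitrate`, the requester names, and the weights are hypothetical.

```python
from collections import deque

def arbitrate(queues, weights, slots):
    """Grant up to `slots` access slots among requesters.

    queues:  dict requester -> list of pending requests (hypothetical format)
    weights: dict requester -> relative bandwidth share
    Returns the list of (requester, request) grants in order.
    """
    pending = {name: deque(reqs) for name, reqs in queues.items()}
    credits = {name: 0 for name in queues}
    grants = []
    for _ in range(slots):
        # Only requesters with outstanding requests compete for the slot.
        active = [name for name in pending if pending[name]]
        if not active:
            break
        for name in active:
            credits[name] += weights[name]
        # The requester with the most accumulated credit wins the slot
        # and pays back the total weight handed out this round.
        winner = max(active, key=lambda name: credits[name])
        credits[winner] -= sum(weights[name] for name in active)
        grants.append((winner, pending[winner].popleft()))
    return grants

# Example: the processor is given three times the share of the NMA.
grants = arbitrate(
    {"processor": list(range(8)), "nma": list(range(8))},
    {"processor": 3, "nma": 1},
    slots=8,
)
print([name for name, _ in grants])
```

With both queues backlogged, the 3:1 weights yield six processor grants and two NMA grants over eight slots, and the NMA grants are interleaved rather than bunched at the end. If one requester is idle, the other receives every slot, so no bandwidth is wasted.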

19 Access Processor (AP)
Near-Memory Accelerator support (continued)
- Applications executed on the host processor interact with the AP through special instructions (e.g., PowerEN icswx) and/or special data structures mapped on the AP command port
- AP can be dynamically (re)programmed during runtime
  - binary loaded from [...] or main memory
- AP is multi-threaded, provides multi-session support
- AP manages NMA configuration
  - configures execution pipelines, loads parameters, constants, etc.
  - dynamic reconfiguration of FPGA-based NMAs
  - controls storage, access and transfer of configuration data from main memory to NMAs
- Performance monitoring
- Multiple APs interconnect to scale to larger systems
[Figure labels: shared L3, Access Processor, Near-Memory Accelerators, Main Memory]

20 Near-Memory Acceleration on ConTutto
ConTutto
- Ideal platform to investigate and experiment with Near-Memory Acceleration on a commercial OpenPOWER server, addressing multiple aspects:
  - design of near-memory accelerator devices
  - integration into computer system architecture
  - use of multiple devices to scale to larger storage and processing capabilities
  - programming of a hybrid system based on near-memory computing applications
- Demonstration of initial implementation of the Programmable Near-Memory Accelerator concept on ConTutto for FFT computation at the IBM booth
Ongoing work
- design space exploration covering device, system and application levels
- development of near-memory computing tool set and ecosystem including compiler, debugger, performance analysis, and run-time optimization tools

21 Concluding remarks
- This work has been initiated as part of the DOME project, in which IBM and the Netherlands Institute for Radio Astronomy (ASTRON) jointly perform fundamental research on large-scale green Exascale computing for the Square Kilometre Array (SKA), which will become the largest and most sensitive radio telescope in the world
- Three PhD positions available as part of the European Union Horizon 2020 / Marie Curie ITN-EID program NeMeCo, which is aimed at developing power-efficient HPC systems for Big-data processing based on the exploitation of near-memory computing
  - topics: run-time optimization, compiler technologies, near-memory accelerator architecture
  - more information at keyword: NeMeCo

22 Backup Material

23 B-FSM Technology
Programmable state machine
- Efficient multi-way branches involving evaluation of many (combinations of) conditions in parallel: loop conditions, counters, timers, data arrival, etc.
- Compact data structure
- Fast deterministic reaction time: dispatch instructions within 2 cycles (@ > 2 GHz)
- Multi-threaded operation
Successful application to a range of accelerators
- Regular expression scanners, protocol engines, XML parsers, near-memory accelerators
- Processing rates of ~20 Gbit/s per B-FSM in 45 nm
- Small area cost enables scaling to extremely high aggregate processing rates
[Figure labels: B-FSM, Access Processor]
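The idea of a multi-way branch over several conditions can be sketched in software as a table-driven state machine. This is an illustrative model only and does not reflect the actual B-FSM data structure or encoding; the state names, condition bits, and actions are hypothetical. Each rule tests a masked condition vector, so a single lookup decides among timer, counter, and data-arrival events at once.

```python
# Condition vector bits (hypothetical assignment):
DATA_ARRIVED = 0b001   # requested data has arrived
COUNTER_ZERO = 0b010   # loop counter reached zero
TIMER_EXPIRED = 0b100  # e.g., a refresh timer fired

# Rules per state: (mask, value, next_state, action); first match wins.
# A mask of 0 matches any condition vector (default rule).
RULES = {
    "LOOP": [
        (TIMER_EXPIRED, TIMER_EXPIRED, "REFRESH", "issue_refresh"),
        (COUNTER_ZERO, COUNTER_ZERO, "DONE", "signal_done"),
        (DATA_ARRIVED, DATA_ARRIVED, "LOOP", "issue_read"),
        (0, 0, "LOOP", "wait"),
    ],
    "REFRESH": [(0, 0, "LOOP", "resume")],
    "DONE": [(0, 0, "DONE", "idle")],
}

def step(state, conditions):
    """Select the next state and action from all conditions in one lookup."""
    for mask, value, next_state, action in RULES[state]:
        if conditions & mask == value:
            return next_state, action
    raise ValueError(f"no rule for state {state!r}")

def run(state, condition_trace):
    """Feed a sequence of condition vectors and collect the actions."""
    actions = []
    for conditions in condition_trace:
        state, action = step(state, conditions)
        actions.append(action)
    return state, actions

final_state, actions = run("LOOP", [0b001, 0b000, 0b101, 0b001, 0b010])
print(final_state, actions)
```

In this sketch the rule order encodes priority: when the timer bit and the data bit are both set, the timer rule wins. Software scans the rules one by one, whereas the slide's point is that the hardware evaluates the combinations in parallel with a fixed, short reaction time.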

24 Near-Memory Acceleration in 3D Stack
