Programmable Near-Memory Acceleration on ConTutto
Transcription
1 Programmable Near-Memory Acceleration on ConTutto
Jan van Lunteren, IBM Research
Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
2 Team
- IBM Zurich (CH): Jan van Lunteren, Christoph Hagleitner
- IBM Dwingeloo (NL): Leandro Fiorin, Erik Vermij
- IBM Boeblingen (DE): Angelo Haller, Jörg-Stephan Vogt, Harald Huels
- IBM Burlington, Poughkeepsie, Rochester, Yorktown (US): Thomas Roewer, Bharat Sukhwani, Adam McPadden, Dean Sanner, Dave Cadigan, Sameh Asaad
3-6 POWER8™ System
- POWER8™ processor
- ConTutto FPGA
- New Memory Technologies
- Near-Memory Acceleration
7 Trends
- Power consumption is increasingly dominated by data transfer and memory
- Figure: Chip-level energy trends. Source: S. Borkar, "Exascale Computing - a fact or a fiction?", IPDPS
- Figure: HPC system-level power break-down. Source: R. Nair, "Active Memory Cube", 2nd Workshop on Near-Data Processing
8 Solutions
Specialization
- Workload-optimized systems: holistic optimization of HW/SW stack
- General-purpose accelerators: GPUs, FPGAs, DSPs
- Reduced programmability: fixed-function accelerators (ASICs)
- Orders-of-magnitude performance/power improvements for selected workloads
Near-memory computing
- Bring computation closer to the data (e.g., card, package, chip, memory periphery/array)
- Reduce power-expensive data transfers by moving from a compute-centric to a data-centric model
Figures: Near-memory computing in 3D stack; Data-centric computing
9 Can we combine workload optimization and near-memory computing?
Challenges
- Performance and power consumption depend on a complex interaction between workload and memory system
  - locality of reference, access patterns/strides, etc.
  - size, associativity, replacement policy, etc.
  - interleaving, refresh, buffer hits, etc.
- Memory system typically is a black box
- Memory system operation is mostly fixed, providing no or very limited options for adaptation to the workload characteristics
- The opposite happens: bare-metal programming to adapt the workload to the memory system
Can we make the memory system programmable/adaptive?
How can we integrate programmable compute capabilities to achieve substantial performance and power gains for a wide range of workloads?
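The interaction between access stride and the memory system named on this slide can be illustrated with a toy open-page model. This is a minimal sketch, not a model of any real DRAM or of ConTutto: the row size, bank count, and row-to-bank interleaving are all assumptions chosen for illustration.

```python
def row_buffer_hit_rate(addresses, row_bytes=2048, num_banks=8):
    """Fraction of accesses that hit the currently open row of their bank.

    Toy open-page model (assumed, not from the slides): consecutive rows
    are interleaved across banks, and each bank keeps exactly one row open.
    """
    open_row = {}  # bank index -> row currently open in that bank
    hits = 0
    for addr in addresses:
        row_id = addr // row_bytes    # global row number
        bank = row_id % num_banks     # rows are interleaved over banks
        row = row_id // num_banks     # row index inside the bank
        if open_row.get(bank) == row:
            hits += 1
        else:
            open_row[bank] = row      # row miss: open the new row
    return hits / len(addresses)


def strided(count, stride_bytes):
    """Byte addresses of `count` accesses separated by `stride_bytes`."""
    return [i * stride_bytes for i in range(count)]


# Cache-line-sized stride: 32 accesses fall into every 2 KiB row.
sequential = row_buffer_hit_rate(strided(4096, 64))
# Stride of one row per bank sweep: every access opens a new row in bank 0.
large_stride = row_buffer_hit_rate(strided(4096, 2048 * 8))
```

In this model the small stride hits the open row on 31 of every 32 accesses, while the large stride never does, which is the kind of workload dependence a fixed memory system cannot adapt to.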
10-12 Programmable Near-Memory Acceleration
Conventional computer architecture
- Memory system is a slave of the host processor
Novel approach
- Memory system actively participates to ensure that data is stored, accessed and transferred in the most (power-)efficient way, resulting in the highest performance/watt
- Memory system integrates compute capabilities
Novel programmable architecture
- Enabling/differentiating technologies:
  - programmable state machine technology
  - programmable address mapping scheme
  - power-efficient self-running instructions
- Near-memory accelerators attach to the Access Processor
Diagram labels: shared L3, Memory Controller, Access Processor, Near-Memory Accelerators, Main Memory
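What a programmable address mapping scheme could look like in software terms is sketched below. The slides do not describe the actual mapping format, so the field names, widths, and the idea of passing the field order as a "program" are assumptions made purely for illustration.

```python
def make_address_mapper(field_order, field_bits):
    """Build a mapper from a linear address to named memory coordinates.

    field_order lists field names from least- to most-significant bits;
    field_bits gives the width of each field. Both are hypothetical
    inputs, not the Access Processor's real configuration format.
    """
    def mapper(addr):
        coords = {}
        for name in field_order:
            width = field_bits[name]
            coords[name] = addr & ((1 << width) - 1)  # extract low bits
            addr >>= width                            # move to next field
        return coords
    return mapper


bits = {"column": 7, "bank": 3, "row": 14}

# Two mappings of the same address space: banks interleaved below the row
# bits, or banks selected by the topmost bits.
bank_interleaved = make_address_mapper(["column", "bank", "row"], bits)
bank_on_top = make_address_mapper(["column", "row", "bank"], bits)
```

Swapping the field order changes which accesses land in the same bank or row without touching the application, which is the kind of adaptation a programmable mapping is meant to allow.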
13-17 Access Processor (AP)
Basic memory controller functions
- Address mapping, access scheduling, refresh, page open/close, etc., are programmable
- Memory details (organization, retention times, etc.) are exposed to the AP program
Near-Memory Accelerator (NMA) support
- AP-NMA interface types
  - L1: tightly coupled, AP generates addresses
  - L2: loosely coupled, AP generates addresses
  - L3: loosely coupled, NMA generates addresses
- Arbitration of processor and NMA accesses
  - fine-grained access bandwidth control
- Interception/redirection/copy of processor accesses
  - to enable on-the-fly processing, snooping/caching
  - address translation tables (virtual/physical)
Diagram labels: Processor, shared L3, Access Processor, Near-Memory Accelerators (NMA), Main Memory, On-the-Fly Processing
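Arbitration between processor and NMA accesses with bandwidth control can be pictured as a weighted round-robin over two request queues. The sketch below is an assumption about one possible policy, not the AP's actual arbiter; the share parameters are illustrative knobs.

```python
from collections import deque


def arbitrate(processor_reqs, nma_reqs, processor_share=3, nma_share=1):
    """Interleave two request streams in a processor_share:nma_share ratio.

    A simple credit-based weighted round-robin, shown only as a sketch;
    the share values are hypothetical, not real Access Processor settings.
    """
    queues = {"processor": deque(processor_reqs), "nma": deque(nma_reqs)}
    shares = {"processor": processor_share, "nma": nma_share}
    schedule = []
    while any(queues.values()):
        for source, credits in shares.items():
            queue = queues[source]
            # Grant up to `credits` slots, fewer if the queue runs dry.
            for _ in range(min(credits, len(queue))):
                schedule.append((source, queue.popleft()))
    return schedule


# Six processor requests and three NMA requests with a 3:1 split.
schedule = arbitrate(range(6), range(3))
```

With the default 3:1 split the processor receives three slots for every NMA slot while both queues are busy, and the NMA drains the remainder once the processor queue is empty.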
18-19 Access Processor (AP)
Near-Memory Accelerator support (continued)
- Applications executed on the host processor interact with the AP through special instructions (e.g., PowerEN icswx) and/or special data structures mapped on the AP command port
- AP can be dynamically (re)programmed during runtime
  - binary loaded from [...] or main memory
- AP is multi-threaded, provides multi-session support
- AP manages NMA configuration
  - configures execution pipelines, loads parameters, constants, etc.
  - dynamic reconfiguration of FPGA-based NMAs
  - controls storage, access and transfer of configuration data from main memory to NMAs
- Performance monitoring
- Multiple APs interconnect to scale to larger systems
Diagram labels: shared L3, Access Processor, Near-Memory Accelerators, Main Memory
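The idea of a data structure mapped on a command port can be sketched as a fixed-layout descriptor that the host serializes before writing it out. The layout below is entirely hypothetical: the slides do not define the descriptor format, so the fields, widths, and byte order are assumptions for illustration only.

```python
import struct

# Hypothetical command descriptor layout (an assumption for illustration):
# opcode (u16), session id (u16), length in bytes (u32),
# source address (u64), destination address (u64); little-endian, unpadded.
COMMAND_FORMAT = "<HHIQQ"
COMMAND_FIELDS = ("opcode", "session", "length", "src", "dst")


def pack_command(opcode, session, length, src, dst):
    """Serialize one command as it might be written to a command port."""
    return struct.pack(COMMAND_FORMAT, opcode, session, length, src, dst)


def unpack_command(raw):
    """Decode a descriptor back into a dict of named fields."""
    return dict(zip(COMMAND_FIELDS, struct.unpack(COMMAND_FORMAT, raw)))


raw = pack_command(opcode=1, session=7, length=4096, src=0x1000, dst=0x8000)
decoded = unpack_command(raw)
```

Carrying a session identifier in every descriptor is one simple way a multi-session device could tell concurrent clients apart.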
20 Near-Memory Acceleration on ConTutto
ConTutto
- Ideal platform to investigate and experiment with Near-Memory Acceleration on a commercial OpenPOWER server, addressing multiple aspects:
  - design of near-memory accelerator devices
  - integration into computer system architecture
  - use of multiple devices to scale to larger storage and processing capabilities
  - programming of a hybrid system based on near-memory computing
  - applications
Demonstration
- Initial implementation of the Programmable Near-Memory Accelerator concept on ConTutto for FFT computation, at the IBM booth
Ongoing work
- design space exploration covering device, system and application levels
- development of a near-memory computing tool set and ecosystem including compiler, debugger, performance analysis, and run-time optimization tools
21 Concluding remarks
- This work has been initiated as part of the DOME project, in which IBM and the Netherlands Institute for Radio Astronomy (ASTRON) jointly perform fundamental research on large-scale green Exascale computing for the Square Kilometre Array (SKA), which will become the largest and most sensitive radio telescope in the world
- Three PhD positions available as part of the European Union Horizon 2020 / Marie Curie ITN-EID program NeMeCo, which is aimed at developing power-efficient HPC systems for Big-data processing based on the exploitation of near-memory computing
  - topics: run-time optimization, compiler technologies, near-memory accelerator architecture
  - more information at keyword: NeMeCo
22 Backup Material
23 B-FSM Technology
Programmable state machine
- Efficient multi-way branches involving evaluation of many (combinations of) conditions in parallel: loop conditions, counters, timers, data arrival, etc.
- Compact data structure
- Fast deterministic reaction time: dispatch instructions within 2 cycles (@ > 2 GHz)
- Multi-threaded operation
Successful application to a range of accelerators
- Regular expression scanners, protocol engines, XML parsers, near-memory accelerators
- Processing rates of ~20 Gbit/s per B-FSM in 45 nm
- Small area cost enables scaling to extremely high aggregate processing rates
Diagram labels: B-FSM, Access Processor
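A multi-way branch over combinations of conditions can be sketched as a prioritized rule table with "don't care" bits. This is not the B-FSM data structure or algorithm, which the slides do not detail; it is a plain software analogy, and the states, condition bits, and actions are hypothetical. Software checks the rules one by one, whereas the slide describes hardware evaluating them in parallel.

```python
def step(rules, state, conditions):
    """Return (next_state, action) for the first rule that matches.

    Each rule is (state, mask, value, next_state, action). A rule matches
    when it belongs to the current state and the condition bits selected
    by `mask` equal `value`; bits outside the mask are "don't care".
    """
    for rule_state, mask, value, next_state, action in rules:
        if rule_state == state and (conditions & mask) == value:
            return next_state, action
    raise LookupError(f"no rule for state {state!r}, conditions {conditions:#b}")


# Hypothetical condition bits: data arrived, loop counter reached zero,
# timer expired.
DATA, DONE, TIMEOUT = 0b001, 0b010, 0b100

# Rules are listed in priority order; the first match wins.
RULES = [
    ("run", TIMEOUT, TIMEOUT, "idle", "abort"),
    ("run", DATA | DONE, DATA | DONE, "idle", "issue_last"),
    ("run", DATA, DATA, "run", "issue"),
    ("run", 0, 0, "run", "wait"),    # default rule for state "run"
    ("idle", 0, 0, "idle", "wait"),  # default rule for state "idle"
]
```

For example, `step(RULES, "run", DATA | DONE)` selects the "issue_last" transition in a single table lookup rather than through a chain of nested branches.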
24 Near-Memory Acceleration in 3D Stack
More informationBuilding NVLink for Developers
Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized
More informationLec 13: Linking and Memory. Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University. Announcements
Lec 13: Linking and Memory Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University PA 2 is out Due on Oct 22 nd Announcements Prelim Oct 23 rd, 7:30-9:30/10:00 All content up to Lecture on Oct
More informationFET Proactive initiative: Advanced Computing Architectures
FET Proactive initiative: Advanced Computing Architectures Terms of Reference 1 INTRODUCTION Fuelled by Moore's law, progress in electronics continues at an unabated pace permitting the design of devices
More informationA Low Latency Solution Stack for High Frequency Trading. High-Frequency Trading. Solution. White Paper
A Low Latency Solution Stack for High Frequency Trading White Paper High-Frequency Trading High-frequency trading has gained a strong foothold in financial markets, driven by several factors including
More informationPower 7. Dan Christiani Kyle Wieschowski
Power 7 Dan Christiani Kyle Wieschowski History 1980-2000 1980 RISC Prototype 1990 POWER1 (Performance Optimization With Enhanced RISC) (1 um) 1993 IBM launches 66MHz POWER2 (.35 um) 1997 POWER2 Super
More informationEnergy Efficient Transparent Library Accelera4on with CAPI Heiner Giefers IBM Research Zurich
Energy Efficient Transparent Library Accelera4on with CAPI Heiner Giefers IBM Research Zurich Revolu'onizing the Datacenter Datacenter Join the Conversa'on #OpenPOWERSummit Towards highly efficient data
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationDesigning Parallel Programs. This review was developed from Introduction to Parallel Computing
Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis
More informationEmbedded processors. Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.
Embedded processors Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.fi Comparing processors Evaluating processors Taxonomy of processors
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationMedical practice: diagnostics, treatment and surgery in supercomputing centers
International Advanced Research Workshop on High Performance Computing from Clouds and Big Data to Exascale and Beyond Medical practice: diagnostics, treatment and surgery in supercomputing centers Prof.
More informationOpenRadio. A programmable wireless dataplane. Manu Bansal Stanford University. Joint work with Jeff Mehlman, Sachin Katti, Phil Levis
OpenRadio A programmable wireless dataplane Manu Bansal Stanford University Joint work with Jeff Mehlman, Sachin Katti, Phil Levis HotSDN 12, August 13, 2012, Helsinki, Finland 2 Opening up the radio Why?
More informationThe DEEP (and DEEP-ER) projects
The DEEP (and DEEP-ER) projects Estela Suarez - Jülich Supercomputing Centre BDEC for Europe Workshop Barcelona, 28.01.2015 The research leading to these results has received funding from the European
More informationBig Data Systems on Future Hardware. Bingsheng He NUS Computing
Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big
More informationLECTURE 11. Memory Hierarchy
LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed
More informationUsing FPGAs as Microservices
Using FPGAs as Microservices David Ojika, Ann Gordon-Ross, Herman Lam, Bhavesh Patel, Gaurav Kaul, Jayson Strayer (University of Florida, DELL EMC, Intel Corporation) The 9 th Workshop on Big Data Benchmarks,
More information