Reactive Runtime Systems for Heterogeneous Extreme Computations

Size: px
Start display at page:

Download "Reactive Runtime Systems for Heterogeneous Extreme Computations"

Transcription

1 Reactive Runtime Systems for Heterogeneous Extreme Computations Thomas Sterling Chief Scientist, CREST Professor, School of Informatics and Computing Indiana University November 19, 2014

2 Shifting Paradigms of Computing Abacus Counting tables Pascaline Difference engine Charles Babbage Per Georg Scheutz Tabulators Herman Hollerith Analog computer Vannevar Bush Machine Harvard Architecture Howard Aiken Konrad Zuse Charles Babbage Analytical Engine 2

3 Foundations: The von Neumann Age Information Theory Claude Shannon Computabilty Turing/Church Cybernetics Norbert Wiener Stored Program Computer Architecture von Neumann The von Neumann Shift: Vacuum tubes, core memory Technology assumptions ALUs are most expensive components Memory capacity and clock rate are scale drivers mainframes Data movement of secondary importance Von Neumann extended: Semiconductor, Exploitation of parallelism Out of order completion Vector SIMD Multiprocessor (MIMD) SMP Maintain sequential consistency MPP/Clusters Ensemble computations with message passing 3

4 Capability Computing Memory Requirements 100 PB Planned EU HBP Exaflop machine Subcellular Detail Cellular Human Brain 100 TB 1 TB 10 GB Jülich 28-rack BlueGene/Q machine BBP / CSCS 4-rack BlueGene/Q + BGAS CADMOS 4-rack BlueGene/P EPFL 4-rack BlueGene/L Cellular Neocortical Column Cellular Mesocircuit Cellular Rodent Brain Glia-Cell/Vasculature O(1-10x) Reaction-Diffusion O(100-1,000x) Molecular Dynamics O(>1,000,000,000x?) Plasticity O(1-10x) Learning & Memory O(10-100x) Behavior O( x) 1 MB 1 Gigaflops Single Cellular Model 1 Teraflops 1 Petaflops 1 Exaflops Computational Complexity

5 The Negative Impact of Global Barriers in Astrophysics Codes Computational phase diagram from the MPI based GADGET code (used for N- body and SPH simulations) using 1M particles over four time steps on 128 procs. Red indicates computation Blue indicates waiting for communication

6 1/1/1992 1/1/1996 1/1/2000 1/1/2004 1/1/2008 1/1/2012 Clock Rate (MHz) Clock Rate 10,000 1, Heavyweight Lightweight Heterogeneous Courtesy of Peter K ogge, U N D

7 1/1/1992 1/1/1996 1/1/2000 1/1/2004 1/1/2008 1/1/2012 TC (Flops/Cycle) Total Concurrency 1.E+07 1.E+06 1.E+05 1.E+04 1.E+03 1.E+02 1.E+01 Heavyweight Lightweight Heterogeneous Courtesy of Peter K ogge, U N D

8 Performance Gain Gain with Respect to Cores per Node and Latency; 8 Tasks per Core, Overhead of 16 reg-ops, 8% Network Ops Performance Gain of Non-Blocking Programs over Blocking Programs with Respect to Core Count (Memory Contention) and Net. Latency Cores per Node Latency (Reg-ops)

9 Performance Gain Gain with Respect to Cores per Node and Overhead; Latency of 8192 reg-ops, 64 Tasks per Core Performance Gain of Non-Blocking Programs over Blocking Programs with Varying Core Counts (Memory Contention) and Overheads

10 Objectives High utilization of each core Scaling to large number of cores Synchronization reducing algorithms Methodology The Purpose of a QUARK Dynamic DAG scheduling (QUARK) Explicit parallelism Implicit communication Fine granularity / block data layout Arbitrary DAG with dynamic scheduling Runtime DAG scheduled parallelism Fork-join parallelism Notice the synchronization penalty in the presence of heterogeneity. Courtesy of Jack Dongarra, U T K

11 Asynchronous Many-Task Runtime Systems Dharma Sandia National Laboratories Legion Stanford University Charm++ University of Illinois Uintah University of Utah STAPL Texas A&M University HPX Indiana University OCR Rice University 11

12 Semantic Components of ParalleX

13 Overlapping computational phases for hydrodynamics Computational phases for LULESH (mini-app for hydrodynamics codes). Red indicates work White indicates waiting for communication MPI HPX Overdecomposition: MPI used 64 process while HPX used 1E3 threads spread across 64 cores.

14 HPX-5 Development Progress All cases run on 16 cores (1 locality) Zoom-in on best performers Courtesy of M att Anderson, I ndiana U niversity

15 Memory Vault E I N T E R C O N N E C T G I H Parcel Handler J AGAS Translation Memory Interface Row Buffers Scratchpad Thread 0 registers C D Thread N-1 registers F Access Control K... Wide-word Struct ALU Wide Instruction Buffer L Datapaths with associated control Control-only interfaces Power Management Thread Manager A B N Dataflow Control State M PRECISE Decoder O Fault Detection P and Handling

16 Goal and Objectives Neo-Digital Age Devise means of exploiting nano-scale semiconductor technologies at end of Moore s Law to leverage fabrication facilities investments Create scalable structures and semantics capable of optimal performance (time to solution) within technology and power limitations Technical Strategy 1. Liberate parallel computer architecture from von Neumann (vn) archaism for efficiency and scalability; eliminate vn bottleneck 2. Rebalance and integration of functional elements for data movement, operations, storage, control to minimize time & energy 3. Emphasize tight coupled logic locality for low time & energy 4. Dynamic adaptive localized control to address asynchrony and expose parallelism with emerging behavior of global computing 5. Innovation in execution model for governing principles at all levels 16

17 Neo-Digital Age extending past foundations Near nano-scale semiconductor technology Flat-lining of Moore s Law at single-digit nano-meters Cost benefits of fab lines for economy of scale through mass-market Logic function modules Exploit existing and new IP for efficient functional units ALUs, latches/registers, nearest-neighbor data paths Integrated optics Orders of magnitude bandwidth increase Inter-socket On chip Advanced packaging and cooling Dramatic improvement opportunities in volumetric utilization 3-D integration of combined memory/logic/communication dies Return to wafer-scale integration 17

18 Neo-Digital Age Principles Elimination of vn-based parallel architecture Constrained parallel control state Processor-centric approach optimizes for ALU utilization with very deep memory hierarchy; wrong answer ALU-pervasive structures Merge ALUs with storage and communication structures High availability of ALUs rather than high utilization Slashes access latencies to save time and energy Cells of logic/memory/data-passing for single-cycle actions Optimized for space/energy/performance Optimized for memory bandwidth utilization Emphasis on fine-grain nearest-neighbor data-movement structures Direct access to adjacent state storage Enables communication through nearest neighbor Virtual objects in global name space Data, instructions, synchronization Intra-medium packet-switched wormhole routing Dynamic allocation 18

19 Prior Art: Non-von Neumann Architectures Dataflow Static (Dennis) Dynamic (Arvind) Systolic Arrays Streaming (HT Kung) Neural Networks connectionist Processor in Memory (PIM) Bit or word-level logic on-chip at/near sense amps of memory SIMD (Iobst) or MIMD (Kogge) Cellular Automata (CA) Global emergent behavior from local rules and state von Neumann (ironically) Continuum Computer Architecture CA with global naming and active packets (Sterling/Brodowicz) 19

20 Local communication Systolic array: mostly Dataflow: classically not Properties Processor in Memory: logic at the sense amps Cellular Automata: fully Continuum Computer Architecture: fully, with pipelined packet switched Neural networks: not Event driven for asynchrony management Systolic array: classically not, iwarp yes Dataflow: yes PIM: no Cellular Automata: can be CCA: yes Neural networks: classically no, neuromorphic yes Merged logic/storage/communication Systolic array: yes Dataflow: classically not, possible PIM: yes Cellular Automata: yes CCA: yes Neural networks: yes 20

21 tag tag tag match logic register file value value value ALU network interface control

22 fonton 3-D die stack TSVs and on-chip network optical fiber bundle integrated optical transceiver CCA system CCA die

23 Performance Maximum ALUs For high utilization Maximum memory bandwidth Lots of inter-cell communication bandwidth Reduced overhead Reduced latency Adaptive routing for contention avoidance Multi-variate storage beyond the bit Multi-variate logic beyond base-2 Boolean Dynamic adaptive execution model 23

24 Energy Minimize distance between elements Permeate structures with ALUs Short latencies between memory & registers Multiple clock rates to match timings between logic and storage and Neighbor cell memory/register access Eliminate large caches and multi-level caches Eliminate speculative execution 24

25 Questions can be Answered Now 1. Trade-offs of DRAM bits and SRAM bits Space, Time, Energy 2. Cell rules derived from parallel execution model 3. Cell granularity Breakpoint where mitosis is better 4. Design of logic cell (Fonton) Very simple compared to full conventional processors 5. Reference implementation Emulation, Simulation, FPGA, ASIC, custom 6. 3-D packaging 7. Software environment 8. Application analysis 25

26 Conclusions New high density functional structures are required at end of Moore s Law and are emerging Reactive Runtime systems supported by innovations in hardware architecture mechanisms will exploit extremes of parallelism at high efficiency Neo-Digital age advances beyond von Neumann architectures to maximize execution concurrency and react to uncertainties of asynchrony 26

27

Continuum Computer Architecture

Continuum Computer Architecture Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Enabling Exascale Computing through the ParalleX Execution Model

Enabling Exascale Computing through the ParalleX Execution Model Invited Presentation to: ECMWF Workshop 2010 Enabling Exascale Computing through the ParalleX Execution Model Thomas Sterling Arnaud & Edwards Professor, Department of Computer Science Adjunct Professor,

More information

Toward a Memory-centric Architecture

Toward a Memory-centric Architecture Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains

More information

BlueGene/L (No. 4 in the Latest Top500 List)

BlueGene/L (No. 4 in the Latest Top500 List) BlueGene/L (No. 4 in the Latest Top500 List) first supercomputer in the Blue Gene project architecture. Individual PowerPC 440 processors at 700Mhz Two processors reside in a single chip. Two chips reside

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

MIND: Scalable Embedded Computing through Advanced Processor in Memory

MIND: Scalable Embedded Computing through Advanced Processor in Memory Presentation to the High Performance Embedded Computing Conference 2002: From PIM to Petaflops Computing MIND: Scalable Embedded Computing through Advanced Processor in Memory Thomas Sterling California

More information

Computer Architecture. Prologue. Topics Computer Architecture. Computer Organization. Organization vs. Architecture. History of Computers

Computer Architecture. Prologue. Topics Computer Architecture. Computer Organization. Organization vs. Architecture. History of Computers Computer Architecture Prologue 1 Topics Computer Architecture Computer Organization Organization vs. Architecture History of Computers Generations of Computers Moore s Law 2 Computer Architecture (1) Definition?

More information

Processor Architecture and Interconnect

Processor Architecture and Interconnect Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing

More information

HPC Technology Trends

HPC Technology Trends HPC Technology Trends High Performance Embedded Computing Conference September 18, 2007 David S Scott, Ph.D. Petascale Product Line Architect Digital Enterprise Group Risk Factors Today s s presentations

More information

Top500 Supercomputer list

Top500 Supercomputer list Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity

More information

The Dawning of the Neo-Digital Age in the Era of Nano-Scale Technology

The Dawning of the Neo-Digital Age in the Era of Nano-Scale Technology CHPC 2017 Plenary Presentation The Dawning of the Neo-Digital Age in the Era of Nano-Scale Technology Thomas Sterling Professor of Intelligent Systems Engineering Director, Center for Research in Extreme

More information

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

The Future of GPU Computing

The Future of GPU Computing The Future of GPU Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University November 18, 2009 The Future of Computing Bill Dally Chief Scientist

More information

Parallel Processors. Session 1 Introduction

Parallel Processors. Session 1 Introduction Parallel Processors Session 1 Introduction Applications of Parallel Processors Structural Analysis Weather Forecasting Petroleum Exploration Fusion Energy Research Medical Diagnosis Aerodynamics Simulations

More information

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack

More information

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

What is DARMA? DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes.

What is DARMA? DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes. DARMA Janine C. Bennett, Jonathan Lifflander, David S. Hollman, Jeremiah Wilke, Hemanth Kolla, Aram Markosyan, Nicole Slattengren, Robert L. Clay (PM) PSAAP-WEST February 22, 2017 Sandia National Laboratories

More information

Lecture 8: RISC & Parallel Computers. Parallel computers

Lecture 8: RISC & Parallel Computers. Parallel computers Lecture 8: RISC & Parallel Computers RISC vs CISC computers Parallel computers Final remarks Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation in computer

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Computer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015

Computer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April

More information

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power

More information

CS550. TA: TBA Office: xxx Office hours: TBA. Blackboard:

CS550. TA: TBA   Office: xxx Office hours: TBA. Blackboard: CS550 Advanced Operating Systems (Distributed Operating Systems) Instructor: Xian-He Sun Email: sun@iit.edu, Phone: (312) 567-5260 Office hours: 1:30pm-2:30pm Tuesday, Thursday at SB229C, or by appointment

More information

Introduction to Computer Science. What is Computer Science?

Introduction to Computer Science. What is Computer Science? Introduction to Computer Science CS A101 What is Computer Science? First, some misconceptions. Misconception 1: I can put together my own PC, am good with Windows, and can surf the net with ease, so I

More information

What are Clusters? Why Clusters? - a Short History

What are Clusters? Why Clusters? - a Short History What are Clusters? Our definition : A parallel machine built of commodity components and running commodity software Cluster consists of nodes with one or more processors (CPUs), memory that is shared by

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary

More information

THREAD LEVEL PARALLELISM

THREAD LEVEL PARALLELISM THREAD LEVEL PARALLELISM Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 4 is due on Dec. 11 th This lecture

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Dataflow Architectures. Karin Strauss

Dataflow Architectures. Karin Strauss Dataflow Architectures Karin Strauss Introduction Dataflow machines: programmable computers with hardware optimized for fine grain data-driven parallel computation fine grain: at the instruction granularity

More information

Leveraging Flash in HPC Systems

Leveraging Flash in HPC Systems Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,

More information

18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013

18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 Readings: Multiprocessing Required Amdahl, Validity of the single processor

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

Preliminary Design Examination of the ParalleX System from a Software and Hardware Perspective

Preliminary Design Examination of the ParalleX System from a Software and Hardware Perspective Preliminary Design Examination of the ParalleX System from a Software and Hardware Perspective Alexandre Tabbal Department of Electrical Engineering atabbal@ece.lsu.edu Matthew Anderson Department of Physics

More information

The future is parallel but it may not be easy

The future is parallel but it may not be easy The future is parallel but it may not be easy Michael J. Flynn Maxeler and Stanford University M. J. Flynn 1 HiPC Dec 07 Outline I The big technology tradeoffs: area, time, power HPC: What s new at the

More information

Moore s Law. Computer architect goal Software developer assumption

Moore s Law. Computer architect goal Software developer assumption Moore s Law The number of transistors that can be placed inexpensively on an integrated circuit will double approximately every 18 months. Self-fulfilling prophecy Computer architect goal Software developer

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

3D systems-on-chip. A clever partitioning of circuits to improve area, cost, power and performance. The 3D technology landscape

3D systems-on-chip. A clever partitioning of circuits to improve area, cost, power and performance. The 3D technology landscape Edition April 2017 Semiconductor technology & processing 3D systems-on-chip A clever partitioning of circuits to improve area, cost, power and performance. In recent years, the technology of 3D integration

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University

More information

Arquitecturas y Modelos de. Multicore

Arquitecturas y Modelos de. Multicore Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing

More information

Moore s Law. Computer architect goal Software developer assumption

Moore s Law. Computer architect goal Software developer assumption Moore s Law The number of transistors that can be placed inexpensively on an integrated circuit will double approximately every 18 months. Self-fulfilling prophecy Computer architect goal Software developer

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu Mohsen Imani University of California San Diego Winter 2016 Technology Trend for IoT http://www.flashmemorysummit.com/english/collaterals/proceedi ngs/2014/20140807_304c_hill.pdf 2 Motivation IoT significantly

More information

Lecture 24: Memory, VM, Multiproc

Lecture 24: Memory, VM, Multiproc Lecture 24: Memory, VM, Multiproc Today s topics: Security wrap-up Off-chip Memory Virtual memory Multiprocessors, cache coherence 1 Spectre: Variant 1 x is controlled by attacker Thanks to bpred, x can

More information

Introducing the Cray XMT. Petr Konecny May 4 th 2007

Introducing the Cray XMT. Petr Konecny May 4 th 2007 Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions

More information

Babbage Analytical Machine

Babbage Analytical Machine Von Neumann Machine Babbage Analytical Machine The basis of modern computers is proposed by a professor of mathematics at Cambridge University named Charles Babbage (1972-1871). He has invented a mechanical

More information

FLYNN S TAXONOMY OF COMPUTER ARCHITECTURE

FLYNN S TAXONOMY OF COMPUTER ARCHITECTURE FLYNN S TAXONOMY OF COMPUTER ARCHITECTURE The most popular taxonomy of computer architecture was defined by Flynn in 1966. Flynn s classification scheme is based on the notion of a stream of information.

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Marching Memory マーチングメモリ. UCAS-6 6 > Stanford > Imperial > Verify 中村維男 Based on Patent Application by Tadao Nakamura and Michael J.

Marching Memory マーチングメモリ. UCAS-6 6 > Stanford > Imperial > Verify 中村維男 Based on Patent Application by Tadao Nakamura and Michael J. UCAS-6 6 > Stanford > Imperial > Verify 2011 Marching Memory マーチングメモリ Tadao Nakamura 中村維男 Based on Patent Application by Tadao Nakamura and Michael J. Flynn 1 Copyright 2010 Tadao Nakamura C-M-C Computer

More information

Computer Architecture Crash course

Computer Architecture Crash course Computer Architecture Crash course Frédéric Haziza Department of Computer Systems Uppsala University Summer 2008 Conclusions The multicore era is already here cost of parallelism is dropping

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

Roadmap and Change. Presentation to 2004 Workshop on Extreme Supercomputing Panel: How Much and How Fast. Thomas Sterling

Roadmap and Change. Presentation to 2004 Workshop on Extreme Supercomputing Panel: How Much and How Fast. Thomas Sterling Presentation to 2004 Workshop on Extreme Supercomputing Panel: Roadmap and Change How Much and How Fast Thomas Sterling California Institute of Technology and NS Jet Propulsion aboratory October 12, 2004

More information

Copyright 2012 EMC Corporation. All rights reserved.

Copyright 2012 EMC Corporation. All rights reserved. 1 FLASH 1 ST THE STORAGE STRATEGY FOR THE NEXT DECADE Iztok Sitar Sr. Technology Consultant EMC Slovenia 2 Information Tipping Point Ahead The Future Will Be Nothing Like The Past 140,000 120,000 100,000

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter IT 3123 Hardware and Software Concepts Notice: This session is being recorded. CPU and Memory June 11 Copyright 2005 by Bob Brown Latches Can store one bit of data Can be ganged together to store more

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information

Computer Organization

Computer Organization Computer Organization KR Chowdhary Professor & Head Email: kr.chowdhary@gmail.com webpage: krchowdhary.com Department of Computer Science and Engineering MBM Engineering College, Jodhpur November 14, 2013

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Computer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Parallel Processing Basics Prof. Onur Mutlu Carnegie Mellon University Readings Required Hill, Jouppi, Sohi, Multiprocessors and Multicomputers, pp. 551-560 in Readings in Computer

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory - CAPSL. Introduction

Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory - CAPSL. Introduction Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory - CAPSL Introduction CPEG 852 - Spring 2014 Advanced Topics in Computing Systems Guang R. Gao ACM

More information

Introduction to GPU computing

Introduction to GPU computing Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

CHAPTER 1 Introduction

CHAPTER 1 Introduction CHAPTER 1 Introduction 1.1 Overview 1 1.2 The Main Components of a Computer 3 1.3 An Example System: Wading through the Jargon 4 1.4 Standards Organizations 15 1.5 Historical Development 16 1.5.1 Generation

More information

Emerging DRAM Technologies

Emerging DRAM Technologies 1 Emerging DRAM Technologies Michael Thiems amt051@email.mot.com DigitalDNA Systems Architecture Laboratory Motorola Labs 2 Motivation DRAM and the memory subsystem significantly impacts the performance

More information

General introduction: GPUs and the realm of parallel architectures

General introduction: GPUs and the realm of parallel architectures General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years

More information

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor

More information

Automata Computing. Mircea R. Stan UVA-ECE Center for Automata Processing (CAP) Co-Director Kevin Skadron, UVA-CS, CAP Director

Automata Computing. Mircea R. Stan UVA-ECE Center for Automata Processing (CAP) Co-Director Kevin Skadron, UVA-CS, CAP Director Automata Computing Mircea R. Stan (mircea@virginia.edu), UVA-ECE Center for Automata Processing (CAP) Co-Director Kevin Skadron, UVA-CS, CAP Director 2013 Micron Technology, Inc. All rights reserved. Products

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

High performance, power-efficient DSPs based on the TI C64x

High performance, power-efficient DSPs based on the TI C64x High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research

More information

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

HPX A GENERAL PURPOSE C++ RUNTIME SYSTEM FOR PARALLEL AND DISTRIBUTED APPLICATIONS OF ANY SCALE

HPX A GENERAL PURPOSE C++ RUNTIME SYSTEM FOR PARALLEL AND DISTRIBUTED APPLICATIONS OF ANY SCALE HPX A GENERAL PURPOSE C++ RUNTIME SYSTEM FOR PARALLEL AND DISTRIBUTED APPLICATIONS OF ANY SCALE The Venture Point TECHNOLOGY DEMANDS NEW RESPONSE 2 Technology Demands new Response 3 Technology Demands

More information

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines

More information

Efficient Parallel Programming on Xeon Phi for Exascale

Efficient Parallel Programming on Xeon Phi for Exascale Efficient Parallel Programming on Xeon Phi for Exascale Eric Petit, Intel IPAG, Seminar at MDLS, Saclay, 29th November 2016 Legal Disclaimers Intel technologies features and benefits depend on system configuration

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Introduction. CSCI 4850/5850 High-Performance Computing Spring 2018

Introduction. CSCI 4850/5850 High-Performance Computing Spring 2018 Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel

More information

COE 561 Digital System Design & Synthesis Introduction

COE 561 Digital System Design & Synthesis Introduction 1 COE 561 Digital System Design & Synthesis Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Outline Course Topics Microelectronics Design

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş Evolution of Computers & Microprocessors Dr. Cahit Karakuş Evolution of Computers First generation (1939-1954) - vacuum tube IBM 650, 1954 Evolution of Computers Second generation (1954-1959) - transistor

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

2011 Francisco Delgadillo

2011 Francisco Delgadillo 1800 s: Analytical Engine Charles Babbage Dawn of Human Concept of Numbers Abacus 1642: Pascal s Machine 1880: Mechanical Tabulator Herman Hollerith 1674: Leibniz Calculating Machine 1911: Hollerith s

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information