Dataflow Supercomputers
|
|
- Brandon West
- 5 years ago
- Views:
Transcription
1 Dataflow Supercomputers Michael J. Flynn Maxeler Technologies and Stanford University
2 Outline History Dataflow as a supercomputer technology openspl: generalizing the dataflow programming model Optimizing the hardware for dataflow
3 The great parallel precessor debate of 1967 Amdahl espouses to sequential machine and posits Amdahl s law Danial Slotnick (father of ILLIAC IV) posits the parallel approach while recognizing the problem of programming
4 The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage. -Daniel Slotnick (1967).Speedup in parallel processing is achieved by programming effort Michael J Flynn
5 The (multi core) Parallel Processor Problem Efficient distribution of tasks Inter-node communications (data assembly & dispatch) reduces computational efficiency: speedup/nodes Memory bottleneck limitations The sequential programming model: Layers of abstraction hide critical sources of and limits to efficient parallel execution
6 May s first law: Software efficiency halves every 18 months, exactly compensating for Moore s Law May s second law: Compiler technology doubles efficiency no faster than once a decade David May
7 Dataflow as a supercomputer technology Looking for another way. experience from Maxeler Some Dataflow is an old technology (70 s and 80 s), conceptually creates and ideal machine to match the program but the interconnect problem was insurmountable for the day. Today s FPGAs have come a long way and enable an emulation of the dataflow machine.
8 Hardware and Software Alternatives Hardware: A reconfigurable heterogeneous accelerator array model Software: A spatial (2D) dataflow programming model rather than a sequential model
9 Accelerator HW model Assumes host CPU + FPGA accelerator Application consists of two parts Essential (high usage, >99%) part (kernel(s)) Bulk part (<1% dynamic activity) Essential part is executed on accelerator; Bulk part on host So Slotnick s law of effort now only applies to a small portion of the application
10 FPGA accelerator hardware model: server with acceleration cards 10
11 Each (essential) program has a data flow graph (DFG) The ideal HW to execute the DFG is a data flow machine that exactly matches the DFG A compiler / translator transforms the DF machine so that it can be emulated by the FPGA. FPGA based accelerators, while slow in cycle time, offer much more flexibility in matching DFGs. Limitation 1: The DFG is limited in (static) size to O (104) nodes. Limitation 2: Only the control structure is matched not the data access patterns
12 Acceleration with Static, Synchronous, Streaming DFMs Create a static DFM (unroll loops, etc.); generally the goal is throughput not latency. Create a fully synchronous DFM synchronized to multiple memory channels. The time through the DFM is always the same. Stream computations across the long DFM array, creating MISD or pipelined parallelism. If silicon area and pin BW allow, create multiple copies of the DFM (as with SIMD or vector computations). Iterate on the DFM aspect ratio to optimize speedup. 12
13 Acceleration with Static, Synchronous, Streaming DFMs Create a fully synchronous data flow machine synchronized to multiple memory channels, then stream computations across a long array PCIe accelerator card w memory is DFE (Engine) Computation #2 Data from node memory Computation #1 Results to memory FPGA based DFM Buffer intermediate results 13
14 Example: X x x SCSVar x = io.input("x", scsint(32)); 30 SCSVar result = x * x + 30; io.output("y", result, scsint(32)); + y
15 Example: Moving Average Y = (Xn-1 + X + Xn+1) / 3 SCSVar x = io.input( x, scsfloat(7,17)); SCSVar prev = stream.offset(x, -1); SCSVar next = stream.offset(x, 1); SCSVar sum = prev + x + next; SCSVar result = sum / 3; io.output( y, result, scsfloat(7,17));
16 Example: Choices x SCSVar x = io.input( x, scsuint(24)); SCSVar result = (x>10)? x+1 : x-1; io.output( y, result, scsuint(24)); > y
17 Data flow graph as generated by compiler 4866 nodes; about 250x100 Each node represents a line of JAVA code with area time parameters, so that the designer can change the aspect ratio to improve pin BW, area usage and speedup
18 18
19 8 dataflow engines ( GB RAM) High-speed MaxRing Zero-copy RDMA between CPUs and DFEs over Infiniband Dynamic CPU/DFE balancing 19
20 Example: Seismic Data Processing For Oil & Gas exploration: distribute grid of sensors over large area Sonic impulse the area and record reflections: frequency, amplitude, delay at each sensor Sea based surveys use 30,000 sensors to record data (120 db range) each sampled at more than 2kbps with new sonic impulse every 10 seconds Order of terabytes of data each day
21 1200m Generates >1GB every 10s 1200m 1200m 1200m 1200m
22 Up to 240x speedup for 1 MAX2 card compared to single CPU core Speedup increases with cube size 1 billion point modelling domain using single FPGA card 22
23 Achieved Computational Speedup for the entire application (not just the kernel) compared to Intel server 624 RTM with Chevron VTI 19x and TTI 25x Credit 32x and Rates 26x 62 4 Sparse Matrix 20-40x Lattice Boltzman Fluid Flow 30x Seismic Trace Processing 24x Conjugate Gradient Opt 26x
24 So for HPC, how can emulation (FPGA) be better than high performance x86 processor(s)? Multi core approach lacks robustness in streaming hardware (spanning area, time, power) Multi core lacks robust parallel software methodology and tools FPGAs emulate the ideal data flow machine Success comes about from their flexibility in matching the DFG with a synchronous DFM and streaming data through and shear size > 1 million cells Effort and support tools provide significant application speedup 24
25 Generalizing the programming model openspl Open spatial programming language, an orderly way to expose parallelism 2D dataflow is the programmer s model, JAVA the syntax Could target hardware implementations, beyond the DFEs map on to CPUs (e.g. using OpenMP/MPI) GPUs Other accelerators
26 Temporal Computing (1D) A program is a sequence of instructions Performance is dominated by: CPU Memory latency ALU availability Memory Actual computation time Get Read Inst data. 1 1 C Write Get Read Inst O Result data M. 1 2 P 2 C Write Get Read Inst O Result data M. 2 3 P 3 C Write O Result M 3 P Time
27 Spatial Computing (2D) Synchronous data movement data in ALU Contr ol Contr ol ALU ALU Buffe r data out ALU ALU Read data [1..N] Computation Write results [1..N] Throughput dominated Time
28 OpenSPL Basics Control and Data-flows are decoupled both are fully programmable can run in parallel for maximum performance Operations exist in space and by default run in parallel their number is limited only by the available space All operations can be customized at various levels e.g., from algorithm down to the number representation Data sets (actions) streams through the operations The data transport and processing can be matched
29 OpenSPL Models Memory: Fast Memory (FMEM): many, small in size, low latency Large Memory (LMEM): few, large in size, high latency Scalars: many, tiny, lowest latency, fixed during exec. Execution: datasets + scalar settings sent as atomic actions all data flows through the system synchronously in ticks Programming: API allows construction of a graph computation meta-programming allows complex construction
30 Spatial Arithmetic Operations instantiated as separate arithmetic units Units along data paths use custom arithmetic and number representation The above may reduce individual unit sizes can maximize the number that fit on a given SCS Data rates of memory and I/O communication may also be maximized due to scaled down data sizes Exponent (8) s S Mantissa (23) S S S S S S Exponent (3) Potentially optimal encoding s S S S Mantissa (10)
31 Spatial Arithmetic at All Levels Arithmetic optimizations at the bit level e.g., minimizing the number of 1 s in binary numbers, leading to linear savings of both space and power (the zeros are omitted in the implementation) Higher level arithmetic optimizations e.g., in matrix algebra, the location of all non-zero elements in sparse matrix computations is important Spatial encoding of data structures can reduce transfers between memory and computational units (boost performance and improve efficiency) In temporal computing encoding and decoding would take time and eventually can cancel out all of the advantages In spatial computing, encoding and decoding just consume a bit more of additional space
32 Benchmarking Spatial Computers Spatial computing systems generate one result during every tick SC system efficiency is strongly determined by how efficiently data can be fed from external sources Fair comparison metrics are needed, among others: computations per cubic foot of datacenter space computations per Watt operational costs per computation
33 Hardware: FPGA pros & Cons The FPGA while quite suitable for emulation is not an ideal hardware substrate Too fine grain, wasted area Expensive Place and route times are excessive and will get longer Slow cycle time FPGA advantages Commodity part with best process technology Flexible interconnect Transistor density scaling In principle, possible to quickly reduce to ASIC
34 Silicon device density scaling (ITRS 10 year projections) Net: there s either 20 billion transistors or 50 Giga Bytes of Flash on a 1cm2 die
35 Hardware alternatives: DFArray Clearly an array structure is attractive The LUT is inefficient in the context of dataflow Dataflow operations are relative few and well defined (arithmetic, logic, mux and FIFO and store) Flexibility in data sizing is important but not necessarily to the bit. Existing DSPs (really MACs) in FPGAs are prized and well used (2000 or so per chip)
36 Hardware alternatives: DFArray The flexible interconnect required by the dataflow will still limit the cycle time (perhaps 5x slower than CPU). This also reduces power requirements. The great advantage is in the increased operational density (perhaps 10x over existing FPGAs) enabling much larger dataflow machines and greater parallelism. Avoids very long (many hours) place & route time required by FPGA
37 Hardware alternatives: DFArray The big disadvantage: it s not a commodity part (even an expensive one). There s a lot of research issues in determining the best dataflow hardware fabric.
38 Conclusions Parallel Processing demands rethinking algorithms, programming approach and environment and hardware. The success of FPGA acceleration points to the weakness of evolutionary approaches to parallel processing: hardware (multi core) and software (C++, etc.), at least for some applications The automation of acceleration is still early on; still required: tools, methodology for writing apps., analysis methodology and (maybe) a new hardware basis For FPGA success software is key: VHDL, inefficient place and route, SW are big limitations 38
39 Conclusions 2 In parallel processing: to find success, start with the problem not the solution. There s a lot of research ahead to effectively create parallel translation technology.
40 Thank you
Accelerating computation with FPGAs
Accelerating computation with FPGAs Michael J. Flynn Maxeler Technologies and Stanford University M. J. Flynn Maxeler Technologies 1 Based on work done by my colleagues at Maxeler, especially Oskar Mencer,
More informationThe future is parallel but it may not be easy
The future is parallel but it may not be easy Michael J. Flynn Maxeler and Stanford University M. J. Flynn 1 HiPC Dec 07 Outline I The big technology tradeoffs: area, time, power HPC: What s new at the
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationA Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis
A Data-Parallel Genealogy: The GPU Family Tree John Owens University of California, Davis Outline Moore s Law brings opportunity Gains in performance and capabilities. What has 20+ years of development
More informationTracking Acceleration with FPGAs. Future Tracking, CMS Week 4/12/17 Sioni Summers
Tracking Acceleration with FPGAs Future Tracking, CMS Week 4/12/17 Sioni Summers Contents Introduction FPGAs & 'DataFlow Engines' for computing Device architecture Maxeler HLT Tracking Acceleration 2 Introduction
More informationReconfigurable Computing. Introduction
Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally
More informationFPGA: What? Why? Marco D. Santambrogio
FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much
More informationEE282 Computer Architecture. Lecture 1: What is Computer Architecture?
EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer
More informationThe S6000 Family of Processors
The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which
More informationTrends in the Infrastructure of Computing
Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationHigher Level Programming Abstractions for FPGAs using OpenCL
Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*
More informationA Data-Parallel Genealogy: The GPU Family Tree
A Data-Parallel Genealogy: The GPU Family Tree Department of Electrical and Computer Engineering Institute for Data Analysis and Visualization University of California, Davis Outline Moore s Law brings
More informationThemes. The Network 1. Energy in the DC: ~15% network? Energy by Technology
Themes The Network 1 Low Power Computing David Andersen Carnegie Mellon University Last two classes: Saving power by running more slowly and sleeping more. This time: Network intro; saving power by architecting
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationComputer Architecture s Changing Definition
Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction
More informationHardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University
Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis
More informationCO405H. Department of Compu:ng Imperial College London. Computing in Space with OpenSPL Topic 4: DataFlow Engines (DFEs)
Oskar Mencer CO405H Computing in Space with OpenSPL Topic 4: DataFlow Engines (DFEs) Georgi Gaydadjiev Department of Compu:ng Imperial College London h#p://www.doc.ic.ac.uk/~oskar/ h#p://www.doc.ic.ac.uk/~georgig/
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationOverview of ROCCC 2.0
Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationEECS150 - Digital Design Lecture 09 - Parallelism
EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization
More information04 - DSP Architecture and Microarchitecture
September 11, 2015 Memory indirect addressing (continued from last lecture) ; Reality check: Data hazards! ; Assembler code v3: repeat 256,endloop load r0,dm1[dm0[ptr0++]] store DM0[ptr1++],r0 endloop:
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationSimplifying FPGA Design for SDR with a Network on Chip Architecture
Simplifying FPGA Design for SDR with a Network on Chip Architecture Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 RF NoC 3 Status and Conclusions USRP FPGA Capability Gen
More informationFundamentals of Computers Design
Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationIntroduction to High-Performance Computing
Introduction to High-Performance Computing Simon D. Levy BIOL 274 17 November 2010 Chapter 12 12.1: Concurrent Processing High-Performance Computing A fancy term for computers significantly faster than
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationEECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141
EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 14 EE141 Outline Parallelism EE141 2 Parallelism Parallelism is the act of doing more
More informationAltera SDK for OpenCL
Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group
More informationTDT 4260 lecture 3 spring semester 2015
1 TDT 4260 lecture 3 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU http://research.idi.ntnu.no/multicore 2 Lecture overview Repetition Chap.1: Performance,
More informationChapter 1. Introduction
Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic
More informationFundamentals of Computer Design
Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationParallelism and Concurrency. COS 326 David Walker Princeton University
Parallelism and Concurrency COS 326 David Walker Princeton University Parallelism What is it? Today's technology trends. How can we take advantage of it? Why is it so much harder to program? Some preliminary
More informationAcceleration in the Wild, with Data Flow Computing
Acceleration in the Wild, with Data Flow Computing James Spooner, VP of Acceleration QCon, Finance Track, 08 March 2012 Acceleration in the Wild with Data Flow Deliberate, focused approach to improving
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationCustom computing systems
Custom computing systems difference engine: Charles Babbage 1832 - compute maths tables digital orrery: MIT 1985 - special-purpose engine, found pluto motion chaotic Splash2: Supercomputing esearch Center
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationCSL 860: Modern Parallel
CSL 860: Modern Parallel Computation Course Information www.cse.iitd.ac.in/~subodh/courses/csl860 Grading: Quizes25 Lab Exercise 17 + 8 Project35 (25% design, 25% presentations, 50% Demo) Final Exam 25
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationContinuum Computer Architecture
Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005
More informationESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder
ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators
More informationDHANALAKSHMI SRINIVASAN INSTITUTE OF RESEARCH AND TECHNOLOGY. Department of Computer science and engineering
DHANALAKSHMI SRINIVASAN INSTITUTE OF RESEARCH AND TECHNOLOGY Department of Computer science and engineering Year :II year CS6303 COMPUTER ARCHITECTURE Question Bank UNIT-1OVERVIEW AND INSTRUCTIONS PART-B
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationCOE 561 Digital System Design & Synthesis Introduction
1 COE 561 Digital System Design & Synthesis Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Outline Course Topics Microelectronics Design
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete
More informationMoore s Law. Computer architect goal Software developer assumption
Moore s Law The number of transistors that can be placed inexpensively on an integrated circuit will double approximately every 18 months. Self-fulfilling prophecy Computer architect goal Software developer
More informationMoore s Law. Computer architect goal Software developer assumption
Moore s Law The number of transistors that can be placed inexpensively on an integrated circuit will double approximately every 18 months. Self-fulfilling prophecy Computer architect goal Software developer
More informationEECS4201 Computer Architecture
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be
More informationOverview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips
Overview CSE372 Digital Systems Organization and Design Lab Prof. Milo Martin Unit 5: Hardware Synthesis CAD (Computer Aided Design) Use computers to design computers Virtuous cycle Architectural-level,
More informationChapter 5: ASICs Vs. PLDs
Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationScaling Throughput Processors for Machine Intelligence
Scaling Throughput Processors for Machine Intelligence ScaledML Stanford 24-Mar-18 simon@graphcore.ai 1 MI The impact on humanity of harnessing machine intelligence will be greater than the impact of harnessing
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationCourse II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor
More informationOverview. Processor organizations Types of parallel machines. Real machines
Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500, clusters, DAS Programming methods, languages, and environments
More informationOverview. Memory Classification Read-Only Memory (ROM) Random Access Memory (RAM) Functional Behavior of RAM. Implementing Static RAM
Memories Overview Memory Classification Read-Only Memory (ROM) Types of ROM PROM, EPROM, E 2 PROM Flash ROMs (Compact Flash, Secure Digital, Memory Stick) Random Access Memory (RAM) Types of RAM Static
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationParallel Computing Ideas
Parallel Computing Ideas K. 1 1 Department of Mathematics 2018 Why When to go for speed Historically: Production code Code takes a long time to run Code runs many times Code is not end in itself 2010:
More information"On the Capability and Achievable Performance of FPGAs for HPC Applications"
"On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationFast Hardware For AI
Fast Hardware For AI Karl Freund karl@moorinsightsstrategy.com Sr. Analyst, AI and HPC Moor Insights & Strategy Follow my blogs covering Machine Learning Hardware on Forbes: http://www.forbes.com/sites/moorinsights
More informationSDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center
SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently
More informationMemory. Lecture 22 CS301
Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationTop500 Supercomputer list
Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity
More informationBoosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationThe Challenges of System Design. Raising Performance and Reducing Power Consumption
The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software
More informationConstructing Virtual Architectures on Tiled Processors. David Wentzlaff Anant Agarwal MIT
Constructing Virtual Architectures on Tiled Processors David Wentzlaff Anant Agarwal MIT 1 Emulators and JITs for Multi-Core David Wentzlaff Anant Agarwal MIT 2 Why Multi-Core? 3 Why Multi-Core? Future
More informationComputer Architecture. R. Poss
Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion
More informationApproaches to Parallel Computing
Approaches to Parallel Computing K. Cooper 1 1 Department of Mathematics Washington State University 2019 Paradigms Concept Many hands make light work... Set several processors to work on separate aspects
More information10th August Part One: Introduction to Parallel Computing
Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer
More informationBeyond Latency and Throughput
Beyond Latency and Throughput Performance for Heterogeneous Multi-Core Architectures JoAnn M. Paul Virginia Tech, ECE National Capital Region Common basis for two themes Flynn s Taxonomy Computers viewed
More informationPerformance of computer systems
Performance of computer systems Many different factors among which: Technology Raw speed of the circuits (clock, switching time) Process technology (how many transistors on a chip) Organization What type
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationLecture 2: Performance
Lecture 2: Performance Today s topics: Technology wrap-up Performance trends and equations Reminders: YouTube videos, canvas, and class webpage: http://www.cs.utah.edu/~rajeev/cs3810/ 1 Important Trends
More informationEmbedded Systems: Hardware Components (part I) Todor Stefanov
Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationTools and techniques for optimization and debugging. Fabio Affinito October 2015
Tools and techniques for optimization and debugging Fabio Affinito October 2015 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object, made of different
More informationSystem-on-Chip Architecture for Mobile Applications. Sabyasachi Dey
System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution
More informationLECTURE 1. Introduction
LECTURE 1 Introduction CLASSES OF COMPUTERS When we think of a computer, most of us might first think of our laptop or maybe one of the desktop machines frequently used in the Majors Lab. Computers, however,
More informationAlgorithms and Architecture. William D. Gropp Mathematics and Computer Science
Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationECE 154A. Architecture. Dmitri Strukov
ECE 154A Introduction to Computer Architecture Dmitri Strukov Lecture 1 Outline Admin What this class is about? Prerequisites ii Simple computer Performance Historical trends Economics 2 Admin Office Hours:
More informationMarching Memory マーチングメモリ. UCAS-6 6 > Stanford > Imperial > Verify 中村維男 Based on Patent Application by Tadao Nakamura and Michael J.
UCAS-6 6 > Stanford > Imperial > Verify 2011 Marching Memory マーチングメモリ Tadao Nakamura 中村維男 Based on Patent Application by Tadao Nakamura and Michael J. Flynn 1 Copyright 2010 Tadao Nakamura C-M-C Computer
More informationHigh Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx
High Capacity and High Performance 20nm FPGAs Steve Young, Dinesh Gaitonde August 2014 Not a Complete Product Overview Page 2 Outline Page 3 Petabytes per month Increasing Bandwidth Global IP Traffic Growth
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationGraphics Hardware 2008
AMD Smarter Choice Graphics Hardware 2008 Mike Mantor AMD Fellow Architect michael.mantor@amd.com GPUs vs. Multi-core CPUs On a Converging Course or Fundamentally Different? Many Cores Disruptive Change
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationCenter Extreme Scale CS Research
Center Extreme Scale CS Research Center for Compressible Multiphase Turbulence University of Florida Sanjay Ranka Herman Lam Outline 10 6 10 7 10 8 10 9 cores Parallelization and UQ of Rocfun and CMT-Nek
More informationHigh Performance Computing Systems
High Performance Computing Systems Shared Memory Doug Shook Shared Memory Bottlenecks Trips to memory Cache coherence 2 Why Multicore? Shared memory systems used to be purely the domain of HPC... What
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More information