International IEEE Symposium on Field-Programmable Custom Computing Machines

Size: px
Start display at page:

Download "International IEEE Symposium on Field-Programmable Custom Computing Machines"

Transcription

1 - International IEEE Symposium on ield-programmable Custom Computing Machines Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Bandwidth Kentaro Sano Yoshiaki Hatsuda Satoru Yamamoto Graduate School of Information Sciences Tohoku University, Japan Sendai north-east Tokyo 1 International Symposium May 3, Outline Introduction Stencil computation and its streaming Architecture and design Performance model Evaluation Conclusions 9-PGA prototype system 2 International Symposium May 3,

2 Introduction General-purpose µ-processors used for high-performance computing systems Increasing cores for higher performance, but low growing rate of off-chip I/O bandwidth Insufficient memory/network bandwidth for the entire arithmetic performance Only a fraction of the peak performance usually utilized in large-scale systems Stencil computation Typical scientific computing kernel Small operational intensity :.5 lop/byte µ-processors assume higher op. intensity Core Core Core Core Cache off-chip I/O BW /Network 3.7 lop/byte is lost. (88% of the peak) Poor performance! 3 AMD Opteron X4 : 4.2 lop/byte International Symposium May 3, Stencil Computation on CPUs/GPUs Highly-optimized implementation techniques achieve : Processor Peak Performance [Glop/s] Sustained (efficiency) 1 core of Xeon E522 (quad-core) (31%) 8 cores of Xeon E522 SMP node (22%) 1 x NVidia Tesla C (65%) 16 x NVidia Tesla C (42%) 4 Inefficiency is caused by General-purpose structure, unsuitable for stencil computation Imbalanced bandwidth for the entire performance Another promising way : scalable custom computing on PGAs International Symposium May 3,

3 This Research Linear scalability for high-performance stencil computation by custom computing on multiple PGAs Scalable Streaming-Array (SSA) Domain-specific architecture Policy to overcome high design-cost (Custom & programmable soft-cores) SW layer Common HW layer PGA device easy to change static, but customizable 5 Major contributions Extensible architecture for scalable multi-pga system with constant memory-bandwidth Programmable design of soft-cores tailored for the app domain Performance model to predict the scalability 9-PGA system demonstrating 26 Glop/s with only 2 GB/s International Symposium May 3, Iterative Stencil Computation i -1 i i+1 for (n=; n< N max ; n++) for (j=; j< J max ; j++) for (i=; i< I max ; i++) v i,j = f (v i,j, v i+1,j, v i-1,j, v i,j+1, v i,j-1 ) local computation Time-step iteration Grid traverse j -1 j j +1 Stencil 2D iterative stencil computation Computational grid 6 Update all the grid-points with local stencil computation Repeat for successive time-steps Ex: Linear eq. solvers (Jacobi method, etc.) Difference-scheme computations (CD, etc.) International Symposium May 3,

4 Streaming with a Cyclic Buffer I-1 active stencil S i-1,j-1 cyclic buffer (size= 2I ) v i,j data input v i-1,j-1 J-1 time-step iteration n time-step iteration n+1 Cyclic buffer 7 Tomoyoshi Kobori and Tsutomu Maruyama, A High Speed Computation System for 3D CHC Lattice Gas Model with PGA, Proceedings of PL23, , 23. International Symposium May 3, Pipelining Multiple Iterations I-1 active stencil S i-1,j-1 S i-2,j-2 S i-3,j-3 cyclic buffer (size=2i ) v i,j data input v i-1,j-1 v i-2,j-2 J-1 time-step iteration n time-step iteration n+1 time-step iteration n+2 Pipelining execution with constant bandwidth! 8 Tomoyoshi Kobori and Tsutomu Maruyama, A High Speed Computation System for 3D CHC Lattice Gas Model with PGA, Proceedings of PL23, , 23. International Symposium May 3,

5 Scalable Streaming-Array Architecture I-1 active stencil S i-1,j-1 S i-2,j-2 S i-3,j-3 cyclic buffer (size=2i ) v i,j data input v i-1,j-1 v i-2,j-2 J-1 time-step n time-step n+1 time-step n+2 Buffer Stage 1 Buffer Stage 2 Buffer Stage 3 Multiple time-step execution per memory-read. 9 International Symposium May 3, SSA Design for Jacobi Computation Requirements High density & utilization of P units : simple processing elements (PEs) lexibility in application domain : programmable PEs & sequencers 1 Pipelined Stage Module Overview of SSA on a single PGA International Symposium May 3,

6 Processing Element (PE) from / to PE x, y-1 8-stage pipeline Buffer Constant Multiplexor loating-point multiply and accumulate unit (MAC) from PE x-1,y I PE x, y data addr Constants Buffer 1 2 MR EX1 EX2 EX3 EX4 EX5 MW x MAC + to PE x+1, y from / to PE x, y+1 Sequencer of each PSM Control PEs in PSM Micro operations PSM = SIMD core 11 International Symposium May 3, SSA Assembly Language Two opecodes (control, comput.) lset bnz mulp accp loop-set with # of iterations branch if loop-counter is not zero multiply multiply and accumulate-positive 12 International Symposium May 3,

7 Performance Model of SSA 2 words /cycle PGA 1 8 PEs 8 PEs 8 PEs PGA 2 SSA 8 x 16 PGA n SSA 8 x PSMs 13 Cycles of pipelined exec. on multiple PGAs Speedup to single PGA Grid : to Iteration : 5 to 8 Linear speedup up to 1s PGAs for medium size Speedup to a single PGA Linear x=y=248, g=8 x=y=512, g=8 x=y=256, g=8 x=y=128, g=8 x=y=128, g=5 x=y=128, g= Number of PGAs (each has 16 PSMs.) International Symposium May 3, Prototyping SSA for Evaluation max 17MHz Peak of 25.5 Glop/s to Host DDR2 Master Master PGA PC GB/s Slave PGA 2 USB I/ (NIOS II) SSA Scalable Streaming 8 Array x 12 63% (8 x 12 ALUTs PEs) 99% DSPs Controller Stream Controller I/O Module I/O Module SSA Scalable Streaming Array 8 x 16 79% (8 x 16 ALUTs PEs) 1% DSPs I/O Module max 137MHz Peak of 34 Glop/s to/from Next Slave Slaves 3, 4, 5, 6, 7, 8, 9 14 ALTERA Stratix III PGA x 9 IEEE754 single-precision P Sustained 2GB/s for reading and writing DDR2 PSM has 8 PEs. req = 133MHz International Symposium May 3,

8 Photograph of 9-PGA System Slave 8 Slave 6 Slave 9 Slave 4 Slave 7 Slave 5 Slave 2 Master DDR2 15 Terasic DE3-15 boards Slave 3 International Symposium May 3, Benchmark Problems 2D heat-propagation Laplace s equation Jacobi method 78 iterations Grid sizes = 128 x 128, 512 x 512, 248 x 248 simple condition Same hardware, different programs 16 complex condition Boundary condition and results International Symposium May 3,

9 Results Performance [Glop/s] Peak 128 x x x Performance per Watt [Glop/sW] Glop/sW w/o fans w/ fans computing idle 1.36 Power (w/ fans) Power [W] Number of PGAs Performance Number of PGAs Performance per Watt 17 Linear scalability already for small problem Sustained 26Glop/s for 2GB/s bandwidth Utilization : 87.5 % (ideal for 4 mul. & 3 add) Power-efficiency of 1.36Glop/sW competing with the top 3 of Green5 list Glop/sW Machine 1st 1.68 Blue Gene/Q, US 2nd 1.44 GRAPE-DR, JP 3rd.96 Tsubame, JP Green5 Nov International Symposium May 3, Conclusions Scalable Streaming-Array High-performance stencil-computation Extensible multi-pga system Programmable simple soft-cores Performance model SW layer Common HW layer PGA device Design policy 9-PGA prototype system Linear scalability & 26Glop/s for only 2GB/s 1.36Glops/W competing with Green5 list uture work Larger and complex 3D problems (fluid dynamics) SSA compiler SSA synthesis tool 18 International Symposium May 3,

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Large scale Imaging on Current Many- Core Platforms

Large scale Imaging on Current Many- Core Platforms Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

More information

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2 Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era 11/16/2011 Many-Core Computing 2 Gene M. Amdahl, Validity of the Single-Processor Approach to Achieving

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Design of an FPGA-Based FDTD Accelerator Using OpenCL

Design of an FPGA-Based FDTD Accelerator Using OpenCL Design of an FPGA-Based FDTD Accelerator Using OpenCL Yasuhiro Takei, Hasitha Muthumala Waidyasooriya, Masanori Hariyama and Michitaka Kameyama Graduate School of Information Sciences, Tohoku University

More information

General Purpose Signal Processors

General Purpose Signal Processors General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:

More information

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION

More information

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

VLIW DSP Processor Design for Mobile Communication Applications. Contents crafted by Dr. Christian Panis Catena Radio Design

VLIW DSP Processor Design for Mobile Communication Applications. Contents crafted by Dr. Christian Panis Catena Radio Design VLIW DSP Processor Design for Mobile Communication Applications Contents crafted by Dr. Christian Panis Catena Radio Design Agenda Trends in mobile communication Architectural core features with significant

More information

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Efficient Hardware Acceleration on SoC- FPGA using OpenCL Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

A Study of Data Partitioning on OpenCL-based FPGAs. Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST)

A Study of Data Partitioning on OpenCL-based FPGAs. Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST) A Study of Data Partitioning on OpenC-based FPGAs Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST) 1 Outline Background and Motivations Data Partitioning on FPGA OpenC on FPGA

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

Bandwidth Avoiding Stencil Computations

Bandwidth Avoiding Stencil Computations Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu

More information

XT Node Architecture

XT Node Architecture XT Node Architecture Let s Review: Dual Core v. Quad Core Core Dual Core 2.6Ghz clock frequency SSE SIMD FPU (2flops/cycle = 5.2GF peak) Cache Hierarchy L1 Dcache/Icache: 64k/core L2 D/I cache: 1M/core

More information

Memory Access Scheduling

Memory Access Scheduling Memory Access Scheduling ECE 5900 Computer Engineering Seminar Ying Xu Mar 4, 2005 Instructor: Dr. Chigan 1 ECE 5900 spring 05 1 Outline Introduction Modern DRAM architecture Memory access scheduling Structure

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Outline Introduction to CA (LGA HPP model) CA computation model for real CA applications Detailed computation method Validate the model using specific

Outline Introduction to CA (LGA HPP model) CA computation model for real CA applications Detailed computation method Validate the model using specific Performance evaluation of FPGA-based Cellular Automata accelerators S. Murtaza, A. G. Hoekstra, P. M. A. Sloot Section Computational Science Institute for Informatics University of Amsterdam Amsterdam,

More information

Development of Parallel Queue Processor and its Integrated Development Environment

Development of Parallel Queue Processor and its Integrated Development Environment Master thesis of the University of Aizu, Feb. 21, 2011 Development of Parallel Queue Processor and its Integrated Development Environment m5131139, Hiroki Hoshino Supervised by Prof. Ben Abdallah Abderazek

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc.

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. CT Image Reconstruction Herman Head Sinogram Herman Head Reconstruction CT Image Reconstruction for all

More information

Energy scalability and the RESUME scalable video codec

Energy scalability and the RESUME scalable video codec Energy scalability and the RESUME scalable video codec Harald Devos, Hendrik Eeckhaut, Mark Christiaens ELIS/PARIS Ghent University pag. 1 Outline Introduction Scalable Video Reconfigurable HW: FPGAs Implementation

More information

A Multi-Tiered Optimization Framework for Heterogeneous Computing

A Multi-Tiered Optimization Framework for Heterogeneous Computing A Multi-Tiered Optimization Framework for Heterogeneous Computing IEEE HPEC 2014 Alan George Professor of ECE University of Florida Herman Lam Assoc. Professor of ECE University of Florida Andrew Milluzzi

More information

Application Performance on Dual Processor Cluster Nodes

Application Performance on Dual Processor Cluster Nodes Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys

More information

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer

More information

Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor

Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor Intel K. K. E-mail: hirokazu.kobayashi@intel.com Yoshifumi Nakamura RIKEN AICS E-mail: nakamura@riken.jp Shinji Takeda

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Peak Performance for an Application in CUDA

Peak Performance for an Application in CUDA Peak Performance for an Application in CUDA Nicolás Wolovick 1 Fa.M.A.F., Universidad Nacional de Córdoba, Argentina May 4, 2010 Fa.M.A.F. - U.N.C. Outline The Numeric Problem Serial Codes Our aim CUDA

More information

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung

More information

BlueGene/L (No. 4 in the Latest Top500 List)

BlueGene/L (No. 4 in the Latest Top500 List) BlueGene/L (No. 4 in the Latest Top500 List) first supercomputer in the Blue Gene project architecture. Individual PowerPC 440 processors at 700Mhz Two processors reside in a single chip. Two chips reside

More information

OASIS Network-on-Chip Prototyping on FPGA

OASIS Network-on-Chip Prototyping on FPGA Master thesis of the University of Aizu, Feb. 20, 2012 OASIS Network-on-Chip Prototyping on FPGA m5141120, Kenichi Mori Supervised by Prof. Ben Abdallah Abderazek Adaptive Systems Laboratory, Master of

More information

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Dipl.-Inform. Dominik Göddeke (dominik.goeddeke@math.uni-dortmund.de) Mathematics III: Applied Mathematics and Numerics

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo Semiconductor System

More information

HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs

HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs Yuetsu Kodama Division of High Performance Computing Systems Center for Computational Sciences University of Tsukuba,

More information

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System

More information

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,

More information

swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu

swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu 1 2 3 Outline 1. Background 2. Sunway architecture

More information

asoc: : A Scalable On-Chip Communication Architecture

asoc: : A Scalable On-Chip Communication Architecture asoc: : A Scalable On-Chip Communication Architecture Russell Tessier, Jian Liang,, Andrew Laffely,, and Wayne Burleson University of Massachusetts, Amherst Reconfigurable Computing Group Supported by

More information

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental

More information

How much energy can you save with a multicore computer for web applications?

How much energy can you save with a multicore computer for web applications? How much energy can you save with a multicore computer for web applications? Peter Strazdins Computer Systems Group, Department of Computer Science, The Australian National University seminar at Green

More information

IMAGINE: Signal and Image Processing Using Streams

IMAGINE: Signal and Image Processing Using Streams IMAGINE: Signal and Image Processing Using Streams Brucek Khailany William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles Concurrent VLSI Architecture

More information

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,

More information

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

Performance comparison between a massive SMP machine and clusters

Performance comparison between a massive SMP machine and clusters Performance comparison between a massive SMP machine and clusters Martin Scarcia, Stefano Alberto Russo Sissa/eLab joint Democritos/Sissa Laboratory for e-science Via Beirut 2/4 34151 Trieste, Italy Stefano

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Trends in the Infrastructure of Computing

Trends in the Infrastructure of Computing Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Lab 1: Using the LegUp High-level Synthesis Framework

Lab 1: Using the LegUp High-level Synthesis Framework Lab 1: Using the LegUp High-level Synthesis Framework 1 Introduction and Motivation This lab will give you an overview of how to use the LegUp high-level synthesis framework. In LegUp, you can compile

More information

Distributed Vision Processing in Smart Camera Networks

Distributed Vision Processing in Smart Camera Networks Distributed Vision Processing in Smart Camera Networks CVPR-07 Hamid Aghajan, Stanford University, USA François Berry, Univ. Blaise Pascal, France Horst Bischof, TU Graz, Austria Richard Kleihorst, NXP

More information

Computer Architecture. Introduction. Lynn Choi Korea University

Computer Architecture. Introduction. Lynn Choi Korea University Computer Architecture Introduction Lynn Choi Korea University Class Information Lecturer Prof. Lynn Choi, School of Electrical Eng. Phone: 3290-3249, 공학관 411, lchoi@korea.ac.kr, TA: 윤창현 / 신동욱, 3290-3896,

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

EECS150 - Digital Design Lecture 13 - Accelerators. Recap and Outline

EECS150 - Digital Design Lecture 13 - Accelerators. Recap and Outline EECS150 - Digital Design Lecture 13 - Accelerators Oct. 10, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy of Prof. John Wawrzynek)

More information

KeyStone II. CorePac Overview

KeyStone II. CorePac Overview KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB

More information

Convergence of Parallel Architecture

Convergence of Parallel Architecture Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty

More information

VLSI Signal Processing

VLSI Signal Processing VLSI Signal Processing Programmable DSP Architectures Chih-Wei Liu VLSI Signal Processing Lab Department of Electronics Engineering National Chiao Tung University Outline DSP Arithmetic Stream Interface

More information

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know. Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Yet Another Implementation of CoRAM Memory

Yet Another Implementation of CoRAM Memory Dec 7, 2013 CARL2013@Davis, CA Py Yet Another Implementation of Memory Architecture for Modern FPGA-based Computing Shinya Takamaeda-Yamazaki, Kenji Kise, James C. Hoe * Tokyo Institute of Technology JSPS

More information

High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters

High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters SIAM PP 2014 High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters C. Riesinger, A. Bakhtiari, M. Schreiber Technische Universität München February 20, 2014

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING

SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING 2/20/13 CS 594: SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING Heike McCraw mccraw@icl.utk.edu 1. Basic Essentials OUTLINE Abstract architecture model Communication, Computation, and Locality

More information

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI Introduction to Parallel Programming for Multi/Many Clusters Part II-3: Parallel FVM using MPI Kengo Nakajima Information Technology Center The University of Tokyo 2 Overview Introduction Local Data Structure

More information

Next Generation Multi-Purpose Microprocessor

Next Generation Multi-Purpose Microprocessor Next Generation Multi-Purpose Microprocessor Presentation at MPSA, 4 th of November 2009 www.aeroflex.com/gaisler OUTLINE NGMP key requirements Development schedule Architectural Overview LEON4FT features

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT

ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT THE FREE AND OPEN RISC INSTRUCTION SET ARCHITECTURE Codasip is the leading provider of RISC-V processor IP Codasip Bk: A portfolio of RISC-V processors Uniquely

More information

Design of Scalable Network Considering Diameter and Cable Delay

Design of Scalable Network Considering Diameter and Cable Delay Tohoku Design of Scalable etwork Considering Diameter and Cable Delay Kentaro Sano Tohoku University, JAPA Agenda Introduction Assumption Preliminary evaluation & candidate networks Cable length and delay

More information

Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective

Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Nathan Woods XtremeData FPGA 2007 Outline Background Problem Statement Possible Solutions Description

More information

Performance of computer systems

Performance of computer systems Performance of computer systems Many different factors among which: Technology Raw speed of the circuits (clock, switching time) Process technology (how many transistors on a chip) Organization What type

More information

Jim Keller. Digital Equipment Corp. Hudson MA

Jim Keller. Digital Equipment Corp. Hudson MA Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

Windowing System on a 3D Pipeline. February 2005

Windowing System on a 3D Pipeline. February 2005 Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April

More information

Architecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC

Architecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC BWCCA 2010 Fukuoka, Japan November 4-6 2010 Architecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC Akram Ben Ahmed, Abderazek Ben Abdallah, Kenichi Kuroda The University of Aizu

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

Activities of Cyberscience Center and Performance Evaluation of the SX-9 Supercomputer

Activities of Cyberscience Center and Performance Evaluation of the SX-9 Supercomputer Activities of Cyberscience Center and Performance Evaluation of the SX-9 Supercomputer KOBAYASHI Hiroaki, EGAWA Ryusuke, OKABE Kouki, ITO Eiichi, OIZUMI Kenji Abstract The Cyberscience Center at Tohoku

More information

Turbostream: A CFD solver for manycore

Turbostream: A CFD solver for manycore Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware

More information

A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering

A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering HPRCTA 2010 Stefan Craciun Dr. Alan D. George Dr. Herman Lam Dr. Jose C. Principe November 14, 2010 NSF CHREC Center ECE Department,

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

The AMD64 Technology for Server and Workstation. Dr. Ulrich Knechtel Enterprise Program Manager EMEA

The AMD64 Technology for Server and Workstation. Dr. Ulrich Knechtel Enterprise Program Manager EMEA The AMD64 Technology for Server and Workstation Dr. Ulrich Knechtel Enterprise Program Manager EMEA Agenda Direct Connect Architecture AMD Opteron TM Processor Roadmap Competition OEM support The AMD64

More information

Multicore Scaling: The ECM Model

Multicore Scaling: The ECM Model Multicore Scaling: The ECM Model Single-core performance prediction The saturation point Stencil code examples: 2D Jacobi in L1 and L2 cache 3D Jacobi in memory 3D long-range stencil G. Hager, J. Treibig,

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information

Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation

Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation 2nd International Workshop on Overlay Architectures for FPGAs (OLAF) 2016 Kevin Andryc, Tedy Thomas and Russell Tessier University of Massachusetts

More information

are Softw Instruction Set Architecture Microarchitecture are rdw

are Softw Instruction Set Architecture Microarchitecture are rdw Program, Application Software Programming Language Compiler/Interpreter Operating System Instruction Set Architecture Hardware Microarchitecture Digital Logic Devices (transistors, etc.) Solid-State Physics

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

LegUp: Accelerating Memcached on Cloud FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs 0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are

More information