The Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.
MAP C
- Pure C runs on the MAP. Generated code: circuits.
- Basic blocks in outer loops become special-purpose hardware function units.
- Basic blocks in inner loop bodies are merged and become pipelined circuits.
- Sequential semantics are obeyed: one basic block executes at a time.
- Pipelined inner loops are slowed down to disambiguate read/write conflicts if necessary; the MAP C compiler identifies the cause of any loop slowdown.
- You want more parallelism? Unroll loops, or use (OpenMP-style) parallel sections.
Memory Model
- Variable = register; array = memory.
- OBM: declare OBMx to be an array with element type t, where t is some 64-bit type. SRC-6: 6 OBMs.
- Block RAM: an array lives by default in Block RAM(s), with more freedom for the element type. SRC-6: 444 Block RAMs (VP100).
- 1 array access = 1 clock cycle, but the compiler can slow down loops for read/write disambiguation. Loop slowdown: the loop runs at more than 1 clock cycle per iteration.

Programming with Loops and Arrays: Transformational Approach
- Start with pure C code; the compiler indicates (reasons for) loop slowdowns.
- Performance tune (removing inefficiencies): avoid re-reading data from OBMs by using Delay Queues; avoid read/write conflicts in the same iteration.
- Partition code and data: distribute data over OBMs and Block RAMs and unroll loops; distribute code over two FPGAs with MPI-style communication (only one chip at a time can access a particular OBM).
- Goal 1: 1 loop iteration = 1 clock cycle. Can very often be achieved.
- Goal 2: exploit as much bandwidth as possible, through loop unrolling and task parallelism. Balancing act: chip area versus bandwidth.
How to Performance Tune: Macros
- C code can be extended using macros, allowing program transformations that cannot be expressed straightforwardly in C.
- Macros have semantics unlike C functions: they have a rate (number of clocks between inputs) and a delay (number of clocks between input and output), so they can be integrated in loops.
- Macros can have state (kept between macro calls).
- Two types of macros: system provided, and user provided (Verilog or schematic capture).

Delay Queues for Efficient Window Access
- Keep data on chip using Delay Queues.
- Delay queue = system macro, implemented with SRL16s (small, fixed size) or BRAMs (large, variable size).
- Simple example: a 3x3 window stepping 1 by 1 in column-major order over an image 16 pixels deep. Of the 9 points in the stencil, 8 have been seen before; per window, the leading point should be the only data access to the input array.
Delay Queues (animation, frames 1-37)
(Figure frames: each clock, one new pixel is read (data access input(i)); two 16-word delay queues shift and store the previous two columns. At frame 35 the first window is computed; frames 36 and 37 compute the next window, and the next: 1 clock cycle per window.)
Delay Queues: Performance
3x3 window access code, 512 x 512 pixel image:

  Routine style      Number of clocks   Comment
  Straight window    2,376,617          close to 9 clocks per iteration (2,340,900); the difference is the pipeline prime effect
  Delay queue          279,999          close to 1 clock per iteration: the theoretical limit

Streams
- A stream is a flexible communication mechanism providing parallelism, locality, and implicit synchronization, freeing the programmer of timing headaches.
- On the MAP it provides a uniform communication model: from µ-proc to MAP (DMA stream), from Global Common Memory to MAP, from MAP to MAP (GPIO stream), from FPGA to FPGA (Bridge stream), and inside an FPGA between parallel sections.
- Stream buffer = FIFO; an element can travel through in one clock.
- Processes, FPGAs, MAPs, and GCM are all a few clock cycles away from each other.
Streams (continued)
- Stream producers and consumers live in their own parallel sections.
- Loop-based MAP streams are producer driven: the producer can conditionally put elements into the stream; the consumer cannot conditionally get, but can run at its own speed. Conditional gets would cause loop slowdown.
- Two case studies: Dilate/Erode (image processing kernel) and LU Decomposition (numerical kernel).

Dilate / Erode Image Operators
- Dilate enlarges light objects and shrinks dark objects by taking the max pixel value in a (masked) window.
- Erode enlarges dark objects and shrinks light objects by taking the min pixel value in a (masked) window.
- Together they simplify the image, leaving only dominant trends.
- Focus Of Attention (FOA) codes often use a number of dilates followed by the same number of erodes: an ideal scenario for multiple stream-connected loops. Internally the loops use delay queues for windows.
Erode / Dilate Example
(Figure: sample image windows; each dilate output pixel is the MAX of its window, each erode output pixel the MIN of its window.)
Dilate / Erode on 8 3x3 Windows
(Figure.)
Dilate / Erode Pipeline
(Figure: operators connected by streams; 1 word = 8 pixels; delay queues inside each operator.)

Down Sampling
- Complication: in FOA the image is often first down-sampled, for (sequential) efficiency reasons. When the Dilate/Erode pipe is preceded by a down sample, elements are accessed out of (streaming) order.
- This can be handled by storing the image in 1 or 4 memories, which requires DMA-ing the data in before the computation starts (with 1 memory: loop slowdown), or by streaming the image through delay queues, in which case the computation runs fully pipelined at stream speed.
Dilate / Erode Results
- No problem with producer-controlled streams; pipeline parallelism works great here.
- 8 dilate/erodes (16 sections) run about as fast as 1; the delay is ~2 microseconds per dilate/erode.
- 16 operators plus 1 down sample in a parallel pipeline, each loop-unrolled 8 ways (it all fits in one FPGA): 8 down samples (24 maxes) and 128 dilates/erodes every clock cycle.
- A dilate/erode takes 4 mins/maxes for a cross mask and 8 mins/maxes for a square mask, totalling 792 ops per clock cycle; at 100 MHz that is 79.2 GigaOpS (ideal).
- No time difference for 2 FPGAs vs 1 FPGA: the bridge stream is just as fast as an internal stream. But there is no need for the second FPGA (27% of the LUTs on 1 FPGA).

Downsample, Dilate, Erode Performance on a 640 x 480 Pixel Image
- Down sample: 4*640*480 = 1,228,800 ops.
- Dilates + erodes: 14,745,600 ops.
- Total number of ops: 15,974,400.

  Code version             DMA into one OBM   DMA into four OBMs   Stream in from GCM
  MAP performance (GOPS)   > 20               > 46                 > 69
  MAP speedup vs Pentium   > 46               > 104                > 156

Pentium speed: 36 milliseconds (~444 MOPS).
LU Decomposition
You know it [Cormen et al., p. 750]:

  for k = 1 to n
      for i = k+1 to n
          A[i][k] /= A[k][k]
      for i = k+1 to n
          for j = k+1 to n
              A[i][j] -= A[i][k] * A[k][j]

Pipelining
- Iteration k+1 can start after iteration k has finished row k+1: neither row k nor column k is read or written anymore.
- This naturally leads to a streaming algorithm: every iteration becomes a process; results from iteration/process k flow to process k+1; process k uses row k to update the submatrix A[k+1:n, k+1:n].
- Actually, only a fixed number P of processes run in parallel; after that, data is re-circulated.
Streaming LU
- Pipeline of P processes (parallel sections in OpenMP lingo); the algorithm runs in ceil(N/P) sweeps.
- In one sweep, P outer k iterations run in parallel. After each sweep, P rows and columns are done, so the problem shrinks from n x n to (n-P) x (n-P); the rest of the (updated) array is re-circulated.
- Each process owns a row: process p owns row p + (i-1)*P in sweep i, i = 1 to ceil(N/P).
- Array elements are updated while streaming through the pipe of processes.
- Total work O(n^3), speedup ~P. Forward/backward substitution are simple loops.

First, Flip the Matrix Vertically
(Figure: the flipped matrix, about to be fed into the P1/P2 pipeline.)
Feed the Matrix to the Processors
(Figure frames: the matrix is fed to processor 1, then to processor 2, and so on down the pipeline.)
Process Pipeline (P = 12 on the SRC-6)
- OBMs: OBM1 -> OBM2 on even sweeps, OBM2 -> OBM1 on odd sweeps; OBM3 siphons off finished results.
- Processes are grouped in a parallel sections construct; each process is a section.
- Process 0 reads from either OBM1 or OBM2 and writes to stream S0.
- Process i reads from stream S(i-1) and writes to stream S(i).
- Process P-1 reads from stream S(P-2), writes finished data to OBM3, and writes the rest back to OBM1 or OBM2.
Inside a Process (P = 12 on the SRC-6)

  for (i = (s-1)*p; i < n; i++) {
      for (j = (s-1)*p; j < n; j++) {
          // ping-pong: read from OBM1 or OBM2
          if (k & 0x1) w = OBM1[i*n + j];
          else         w = OBM2[i*n + j];
          // if (i < me) leave the data unchanged
          if (i == me) {                 // store my row
              if (j == i) piv = w;
              myrow[j] = w;
          } else if (i > me) {           // update this row with my row
              // if (j < me) leave the data unchanged
              if (j == me) { w /= piv; mul = w; }
              else if (j > me) w -= mul * myrow[j];
          }
          put_stream(&S0, w);
      }
  }

LU (plus Solvers) Performance
For n = 64, 128, 256, 512: Pentium IV (2.8 GHz) performance (MFlops), MAP (P=12) performance (MFlops), MAP (P=23*) performance (MFlops), and MAP speedup (P=23).

*: FPGA1 runs 10 LU processes plus the forward and backward solvers; FPGA2 runs 13 LU processes.
Conclusions
- C (and Fortran) run on an FPGA-based HPC system.
- Debug mode allows most code development and performance tuning on a workstation.
- We can apply standard software design: stepwise refinement using macros, and parallel tasks connected by streams (put/get: implicit timing).
- Streams allow for massive bandwidth (800 MB/s per stream) and very short latency (a few clock cycles) between on-chip processes, between processors, and between processors and memories.
- Future work: multi-MAP stream and barrier codes.
More informationCode Performance Analysis
Code Performance Analysis Massimiliano Fatica ASCI TST Review May 8 2003 Performance Theoretical peak performance of the ASCI machines are in the Teraflops range, but sustained performance with real applications
More informationOptimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic
More informationVHDL for Synthesis. Course Description. Course Duration. Goals
VHDL for Synthesis Course Description This course provides all necessary theoretical and practical know how to write an efficient synthesizable HDL code through VHDL standard language. The course goes
More informationApplication Performance on Dual Processor Cluster Nodes
Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys
More informationMODELING LANGUAGES AND ABSTRACT MODELS. Giovanni De Micheli Stanford University. Chapter 3 in book, please read it.
MODELING LANGUAGES AND ABSTRACT MODELS Giovanni De Micheli Stanford University Chapter 3 in book, please read it. Outline Hardware modeling issues: Representations and models. Issues in hardware languages.
More informationKeck-Voon LING School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore
MPC on a Chip Keck-Voon LING (ekvling@ntu.edu.sg) School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore EPSRC Project Kick-off Meeting, Imperial College, London,
More informationA Preliminary Assessment of the ACRI 1 Fortran Compiler
A Preliminary Assessment of the ACRI 1 Fortran Compiler Joan M. Parcerisa, Antonio González, Josep Llosa, Toni Jerez Computer Architecture Department Universitat Politècnica de Catalunya Report No UPC-DAC-94-24
More information10th August Part One: Introduction to Parallel Computing
Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer
More informationFPGA BASED ACCELERATION OF THE LINPACK BENCHMARK: A HIGH LEVEL CODE TRANSFORMATION APPROACH
FPGA BASED ACCELERATION OF THE LINPACK BENCHMARK: A HIGH LEVEL CODE TRANSFORMATION APPROACH Kieron Turkington, Konstantinos Masselos, George A. Constantinides Department of Electrical and Electronic Engineering,
More informationESE534: Computer Organization. Tabula. Previously. Today. How often is reuse of the same operation applicable?
ESE534: Computer Organization Day 22: April 9, 2012 Time Multiplexing Tabula March 1, 2010 Announced new architecture We would say w=1, c=8 arch. 1 [src: www.tabula.com] 2 Previously Today Saw how to pipeline
More informationLecture 12 VHDL Synthesis
CPE 487: Digital System Design Spring 2018 Lecture 12 VHDL Synthesis Bryan Ackland Department of Electrical and Computer Engineering Stevens Institute of Technology Hoboken, NJ 07030 1 What is Synthesis?
More informationIs There A Tradeoff Between Programmability and Performance?
Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable
More informationMohamed Taher The George Washington University
Experience Programming Current HPRCs Mohamed Taher The George Washington University Acknowledgements GWU: Prof.\Tarek El-Ghazawi and Esam El-Araby GMU: Prof.\Kris Gaj ARSC SRC SGI CRAY 2 Outline Introduction
More informationEfficient Multi-GPU CUDA Linear Solvers for OpenFOAM
Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More informationElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer
More informationParallel Poisson Solver in Fortran
Parallel Poisson Solver in Fortran Nilas Mandrup Hansen, Ask Hjorth Larsen January 19, 1 1 Introduction In this assignment the D Poisson problem (Eq.1) is to be solved in either C/C++ or FORTRAN, first
More informationParallel Programming Multicore systems
FYS3240 PC-based instrumentation and microcontrollers Parallel Programming Multicore systems Spring 2011 Lecture #9 Bekkeng, 4.4.2011 Introduction Until recently, innovations in processor technology have
More informationECE453: Advanced Computer Architecture II Homework 1
ECE453: Advanced Computer Architecture II Homework 1 Assigned: January 18, 2005 From Parallel Computer Architecture: A Hardware/Software Approach, Culler, Singh, Gupta: 1.15) The cost of a copy of n bytes
More informationPrinciple Of Parallel Algorithm Design (cont.) Alexandre David B2-206
Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction
More informationHigh-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication
High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication Erik H. D Hollander Electronics and Information Systems Department Ghent University, Ghent, Belgium Erik.DHollander@ugent.be
More informationCSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing
HW #9 10., 10.3, 10.7 Due April 17 { } Review Completing Graph Algorithms Maximal Independent Set Johnson s shortest path algorithm using adjacency lists Q= V; for all v in Q l[v] = infinity; l[s] = 0;
More informationAdvanced optimizations of cache performance ( 2.2)
Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationEECS150 - Digital Design Lecture 5 - Verilog Logic Synthesis
EECS150 - Digital Design Lecture 5 - Verilog Logic Synthesis Jan 31, 2012 John Wawrzynek Spring 2012 EECS150 - Lec05-verilog_synth Page 1 Outline Quick review of essentials of state elements Finite State
More informationFPGA for Software Engineers
FPGA for Software Engineers Course Description This course closes the gap between hardware and software engineers by providing the software engineer all the necessary FPGA concepts and terms. The course
More information