The Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.


MAP C

- Pure C runs on the MAP
- Generated code: circuits
  - Basic blocks in outer loops become special-purpose hardware function units
  - Basic blocks in inner loop bodies are merged and become pipelined circuits
- Sequential semantics obeyed
  - One basic block executed at a time
  - Pipelined inner loops are slowed down to disambiguate read/write conflicts if necessary
  - MAP C compiler identifies (the cause of) loop slowdown
- You want more parallelism?
  - Unroll loops
  - Use (OpenMP-style) parallel sections

Memory Model

- Variable = register
- Array = memory
- OBM
  - Declare OBMx to be an array with element type t, where t is some 64-bit type
  - SRC6: 6 OBMs
- Block RAM
  - An array lives by default in Block RAM(s)
  - More freedom for element type
  - SRC6: 444 Block RAMs (VP100)
- 1 array access = 1 clock cycle
  - but the compiler can slow down loops for read/write disambiguation
  - loop slowdown: the loop runs at a rate of more than 1 clock cycle per iteration

Programming with loops and arrays

- Transformational approach
  - Start with pure C code
  - Compiler indicates (reasons for) loop slowdowns
- Performance tune (removing inefficiencies)
  - avoid re-reading data from OBMs by using Delay Queues
  - avoid read/write conflicts in the same iteration
- Partition code and data
  - distribute data over OBMs and Block RAMs, unroll loops
  - distribute code over two FPGAs, MPI-type communication
  - only one chip at a time can access a particular OBM
- Goal 1: 1 loop iteration = 1 clock cycle
  - Can very often be achieved
- Goal 2: Exploit as much bandwidth as possible
  - Through loop unrolling and task parallelism
- Balancing act: chip area versus bandwidth

How to performance tune: Macros

- C code can be extended using macros, allowing for program transformations that cannot be expressed straightforwardly in C
- Macros have semantics unlike C functions: they have a rate and a delay (so they can be integrated in loops)
  - rate (#clocks between inputs)
  - delay (#clocks between input and output)
  - macros can have state (kept between macro calls)
- Two types of macros
  - system provided
  - user provided (Verilog or schematic capture)

Delay Queues for efficient Window Access

- Keep data on chip using Delay Queues
- Delay queue = system macro, implemented with
  - SRL16s (small, fixed size)
  - BRAMs (large, variable size)
- Simple example: 3x3 window stepping 1 by 1 in column-major order
  - image 16 deep
  - 9 points in stencil, 8 have been seen before
  - Per window, the leading point should be the only data access.

[Figure: input array traversal]

[Figure: Delay Queues, steps 1 to 3. Data access input(i); two 16-word Delay Queues shift and store previous columns.]

[Figure: Delay Queues, steps 16 and 32. The queues fill with the first two columns.]

[Figure: Delay Queues, step 35. Compute first window.]

[Figure: Delay Queues, step 36. Compute next window.]

[Figure: Delay Queues, step 37. And next: 1 clock cycle per window.]

Delay Queues: Performance

3x3 window access code, 512 x 512 pixel image

Routine style     Number of clocks   Comment
Straight window   2,376,617          close to 9 clocks per iteration (2,340,900); the difference is the pipeline prime effect
Delay Queue       279,999            close to 1 clock per iteration: theoretical limit

Streams

- A stream is a flexible communication mechanism
  - Providing parallelism, locality and implicit synchronization
  - Freeing the programmer of timing headaches
- On the MAP it provides a uniform communication model
  - From µ-proc to MAP (DMA stream)
  - From Global Common Memory to MAP
  - From MAP to MAP (GPIO stream)
  - From FPGA to FPGA (Bridge stream)
  - Inside an FPGA, between parallel sections
- Stream buffer = FIFO
  - An element can travel through in one clock
- Processes, FPGAs, MAPs, GCM are all a few clock cycles away from each other

Streams cont.

- Stream producers and consumers
  - Live in their own parallel sections
  - Loop based
- MAP streams are producer driven
  - Producer can conditionally put elements in the stream
  - Consumer cannot conditionally get, but can run at its own speed
  - Conditional gets would cause loop slowdown
- Two case studies
  - Dilate / Erode (image processing kernel)
  - LU Decomposition (numerical kernel)

Dilate / Erode image operators

- Dilate enlarges light objects and shrinks dark objects by taking the max pixel value in a (masked) window
- Erode enlarges dark objects and shrinks light objects by taking the min pixel value in a (masked) window
- Together they simplify the image, leaving only dominant trends
- Focus Of Attention (FOA) codes often use a number of dilates followed by the same number of erodes
  - ideal scenario for multiple stream-connected loops
  - internally the loops use delay queues for windows

[Figure: erode / dilate example, with MAX( ) and MIN( ) taken over pixel windows]


[Figure: Dilate / erode on 8 3x3 windows]

Dilate / Erode Pipeline

[Figure: pipeline diagram; labels: 1 Word (8 Pixels), Streams, Delay Queues]

Down Sampling

- Complication: in FOA the image is often first down-sampled for (sequential) efficiency reasons.
- When the Dilate/Erode pipe is preceded by a down sample, elements are accessed out of (streaming) order.
- This can be achieved by
  - Storing the image in 1 or 4 memories
    - this requires DMA-ing the data in before the computation starts
    - 1 memory: loop slowdown
  - or streaming the image through delay queues
    - The computation now runs fully pipelined at stream speed

Erode Dilate Results

- No problem with producer-controlled streams
- Pipeline parallelism works great here
  - 8 D/Es (16 sections) run about as fast as 1 D/E
  - the delay is ~2 microseconds per D/E
- 16 operators plus 1 down sample in a parallel pipeline
  - each 8 iterations loop-unrolled (it all fits in one FPGA)
  - 8 down samples: 24 maxs
  - 128 dilates/erodes every clock cycle
  - dilate / erode: 4 mins / maxs for cross mask, 8 mins / maxs for square mask
  - totalling 792 ops / clock cycle
  - at 100 MHz: 79.2 GigaOps (ideal)
- No time difference for 2 FPGAs vs 1 FPGA
  - bridge stream just as fast as internal stream
  - But there is no need for the second FPGA (27% LUTs on 1 FPGA)

Downsample, dilate, erode performance on a 640 x 480 pixel image

Down sample: 4*640*480 = 1,228,800
Dilates / Erodes: 16*6*320*240 = 14,645,600
Total number of ops: 15,974,400

Code versions            DMA into one OBM   DMA into four OBMs   stream from GCM   (units)
MAP                                                                                (µ-secs)
MAP                      > 20               > 46                 > 69              (GOPS)
MAP speedup vs Pentium   > 46               > 104                > 156

Pentium speed: 36 milliseconds (~444 MOPS)

LU Decomposition

You know it: [Cormen et al., pp. 750]

    for k = 1 to n {
        for i = k+1 to n
            Aik /= Akk
        for i = k+1 to n
            for j = k+1 to n
                Aij -= Aik*Akj
    }

Pipelining

- Iteration k+1 can start after iteration k has finished row k+1
  - neither row k nor column k are read or written anymore
- This naturally leads to a streaming algorithm
  - every iteration becomes a process
  - results from iteration/process k flow to process k+1
  - process k uses row k to update sub-matrix A[k+1:n, k+1:n]
- Actually, only a fixed number P of processes run in parallel
  - after that, data is re-circulated

Streaming LU

- Pipeline of P processes
  - parallel sections in OpenMP lingo
  - algorithm runs in ceil(N/P) sweeps
- In one sweep, P outer k iterations run in parallel
- After each sweep, P rows and columns are done
  - Problem size is reduced to (n-P)^2
  - Rest of the (updated) array is re-circulated
- Each process owns a row
  - Process p owns row p + (i-1)P in sweep i, i = 1 to ceil(N/P)
- Array elements are updated while streaming through the pipe of processes
- Total work O(n^3), speedup ~P
- Forward / backward substitution are simple loops

First, flip the matrix vertically

[Figure: flipped matrix entering processes P1 and P2]

[Figure: Feed matrix to Processor 1 (P1, P2)]

[Figure: Feed matrix to Processor 2 (P2, P3)]

[Figure: matrix streaming on through P3 and P4]

Process Pipeline

[Figure: process pipeline; labels: OBM1, OBM2, processes 0, 1, ..., P-1, streams S0, S1, OBM3]

- OBMs:
  - OBM1 -> OBM2 on even sweeps
  - OBM2 -> OBM1 on odd sweeps
  - OBM3: siphoning off finished results
- P = 12 on SRC6
- Processes are grouped in a parallel sections construct
  - each process is a section
- Process 0
  - reads either from OBM1 or OBM2
  - writes to stream S0
- Process i
  - reads from Si-1
  - writes to Si
- Process P-1
  - reads from SP-2
  - writes finished data to OBM3
  - writes the rest to OBM1 or OBM2

Inside a process (P = 12 on SRC6)

[Figure: OBM1 and OBM2 feeding P0, P1]

    for (i = (s-1)*p; i < n; i++) {
        for (j = (s-1)*p; j < n; j++) {
            // ping-pong: read from OBM 1 or 2
            if (k & 0x1) w = OBM1[i * n + j];
            else         w = OBM2[i * n + j];
            // if (i < me) leave data unchanged
            if (i == me) {
                // store my row
                if (j == i) piv = w;
                myrow[j] = w;
            } else if (i > me) {
                // update this row with my row
                // if (j < me) leave data unchanged
                if (j == me) { w /= piv; mul = w; }
                else if (j > me) w -= mul * myrow[j];
            }
            put_stream(&S0, w);
        }
    }

LU (plus solvers) Performance

                       n=64   n=128   n=256   n=512   (unit)
Pentium IV (2.8 GHz)                                  (MFlops)
MAP (P=12)                                            (MFlops)
MAP (P=23*)                                           (MFlops)
MAP speedup (P=23)

*: FPGA1: 10 LU processes + forward and backward solvers; FPGA2: 13 LU processes

Conclusions

- C (and FTN) run on an FPGA-based HPC system
- DEBUG Mode allows most code development and performance tuning on a workstation
- We can apply standard software design
  - stepwise refinement using macros
  - parallel tasks connected by streams (put / get: implicit timing)
- Streams allow for massive bandwidth (800 MB/s per stream) and very short latency (a few clock cycles)
  - between on-chip processes
  - between processors and processors
  - between processors and memories
- Future work
  - multi-MAP stream and barrier codes

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Lecture 4: Synchronous Data Flow Graphs - HJ94 goal: Skiing down a mountain

Lecture 4: Synchronous Data Flow Graphs - HJ94 goal: Skiing down a mountain Lecture 4: Synchronous ata Flow Graphs - I. Verbauwhede, 05-06 K.U.Leuven HJ94 goal: Skiing down a mountain SPW, Matlab, C pipelining, unrolling Specification Algorithm Transformations loop merging, compaction

More information

6.375 Ray Tracing Hardware Accelerator

6.375 Ray Tracing Hardware Accelerator 6.375 Ray Tracing Hardware Accelerator Chun Fai Cheung, Sabrina Neuman, Michael Poon May 13, 2010 Abstract This report describes the design and implementation of a hardware accelerator for software ray

More information

Overpartioning with the Rice dhpf Compiler

Overpartioning with the Rice dhpf Compiler Overpartioning with the Rice dhpf Compiler Strategies for Achieving High Performance in High Performance Fortran Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/hug00overpartioning.pdf

More information

EE178 Spring 2018 Lecture Module 4. Eric Crabill

EE178 Spring 2018 Lecture Module 4. Eric Crabill EE178 Spring 2018 Lecture Module 4 Eric Crabill Goals Implementation tradeoffs Design variables: throughput, latency, area Pipelining for throughput Retiming for throughput and latency Interleaving for

More information

High Level, high speed FPGA programming

High Level, high speed FPGA programming Opportunity: FPAs High Level, high speed FPA programming Wim Bohm, Bruce Draper, Ross Beveridge, Charlie Ross, Monica Chawathe Colorado State University Reconfigurable Computing technology High speed at

More information

L2: Design Representations

L2: Design Representations CS250 VLSI Systems Design L2: Design Representations John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA) Engineering Challenge Application Gap usually too large to bridge in one step,

More information

Lect. 2: Types of Parallelism

Lect. 2: Types of Parallelism Lect. 2: Types of Parallelism Parallelism in Hardware (Uniprocessor) Parallelism in a Uniprocessor Pipelining Superscalar, VLIW etc. SIMD instructions, Vector processors, GPUs Multiprocessor Symmetric

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

2-D Wave Equation Suggested Project

2-D Wave Equation Suggested Project CSE 260 Introduction to Parallel Computation 2-D Wave Equation Suggested Project 10/9/01 Project Overview Goal: program a simple parallel application in a variety of styles. learn different parallel languages

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,

More information

SRC MAPstation Image Processing: Edge Detection

SRC MAPstation Image Processing: Edge Detection SRC MAPstation Image Processing: Edge Detection David Caliga, Director Software Applications SRC Computers, Inc. dcaliga@srccomputers.com Motivations The purpose of detecting sharp changes in image brightness

More information

EEL 4783: HDL in Digital System Design

EEL 4783: HDL in Digital System Design EEL 4783: HDL in Digital System Design Lecture 9: Coding for Synthesis (cont.) Prof. Mingjie Lin 1 Code Principles Use blocking assignments to model combinatorial logic. Use nonblocking assignments to

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

1-D Time-Domain Convolution. for (i=0; i < outputsize; i++) { y[i] = 0; for (j=0; j < kernelsize; j++) { y[i] += x[i - j] * h[j]; } }

1-D Time-Domain Convolution. for (i=0; i < outputsize; i++) { y[i] = 0; for (j=0; j < kernelsize; j++) { y[i] += x[i - j] * h[j]; } } Introduction: Convolution is a common operation in digital signal processing. In this project, you will be creating a custom circuit implemented on the Nallatech board that exploits a significant amount

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

MPI Programming. Henrik R. Nagel Scientific Computing IT Division

MPI Programming. Henrik R. Nagel Scientific Computing IT Division 1 MPI Programming Henrik R. Nagel Scientific Computing IT Division 2 Outline Introduction Finite Difference Method Finite Element Method LU Factorization SOR Method Monte Carlo Method Molecular Dynamics

More information

DK2. Handel-C code optimization

DK2. Handel-C code optimization DK2 Handel-C code optimization Celoxica, the Celoxica logo and Handel-C are trademarks of Celoxica Limited. All other products or services mentioned herein may be trademarks of their respective owners.

More information

Scan Algorithm Effects on Parallelism and Memory Conflicts

Scan Algorithm Effects on Parallelism and Memory Conflicts Scan Algorithm Effects on Parallelism and Memory Conflicts 11 Parallel Prefix Sum (Scan) Definition: The all-prefix-sums operation takes a binary associative operator with identity I, and an array of n

More information

Evaluation and Improvements of Programming Models for the Intel SCC Many-core Processor

Evaluation and Improvements of Programming Models for the Intel SCC Many-core Processor Evaluation and Improvements of Programming Models for the Intel SCC Many-core Processor Carsten Clauss, Stefan Lankes, Pablo Reble, Thomas Bemmerl International Workshop on New Algorithms and Programming

More information

ECE 5775 Student-Led Discussions (10/16)

ECE 5775 Student-Led Discussions (10/16) ECE 5775 Student-Led Discussions (10/16) Talks: 18-min talk + 2-min Q&A Adam Macioszek, Julia Currie, Nick Sarkis Sparse Matrix Vector Multiplication Nick Comly, Felipe Fortuna, Mark Li, Serena Krech Matrix

More information

Computer Architecture: Dataflow/Systolic Arrays

Computer Architecture: Dataflow/Systolic Arrays Data Flow Computer Architecture: Dataflow/Systolic Arrays he models we have examined all assumed Instructions are fetched and retired in sequential, control flow order his is part of the Von-Neumann model

More information

ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road

ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM ESE532: System-on-a-Chip Architecture Day 20: April 3, 2017 Pipelining, Frequency, Dataflow Today What drives cycle times Pipelining in Vivado HLS C Avoiding bottlenecks feeding data in Vivado HLS C Penn

More information

Parallel Processing. Parallel Processing. 4 Optimization Techniques WS 2018/19

Parallel Processing. Parallel Processing. 4 Optimization Techniques WS 2018/19 Parallel Processing WS 2018/19 Universität Siegen rolanda.dwismuellera@duni-siegena.de Tel.: 0271/740-4050, Büro: H-B 8404 Stand: September 7, 2018 Betriebssysteme / verteilte Systeme Parallel Processing

More information

Performance and Overhead in a Hybrid Reconfigurable Computer

Performance and Overhead in a Hybrid Reconfigurable Computer Performance and Overhead in a Hybrid Reconfigurable Computer Osman Devrim Fidanci 1, Dan Poznanovic 2, Kris Gaj 3, Tarek El-Ghazawi 1, Nikitas Alexandridis 1 1 George Washington University, 2 SRC Computers

More information

Outcomes. Spiral 1 / Unit 6. Flip Flops FLIP FLOPS AND REGISTERS. Flip flops and Registers. Outputs only change once per clock period

Outcomes. Spiral 1 / Unit 6. Flip Flops FLIP FLOPS AND REGISTERS. Flip flops and Registers. Outputs only change once per clock period 1-6.1 1-6.2 Spiral 1 / Unit 6 Flip flops and Registers Mark Redekopp Outcomes I know the difference between combinational and sequential logic and can name examples of each. I understand latency, throughput,

More information

Cost-Effective Parallel Computational Electromagnetic Modeling

Cost-Effective Parallel Computational Electromagnetic Modeling Cost-Effective Parallel Computational Electromagnetic Modeling, Tom Cwik {Daniel.S.Katz, cwik}@jpl.nasa.gov Beowulf System at PL (Hyglac) l 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory,

More information

COSC 6385 Computer Architecture - Memory Hierarchy Design (III)

COSC 6385 Computer Architecture - Memory Hierarchy Design (III) COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses

More information

Structure of Computer Systems

Structure of Computer Systems 288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram

More information

Parallel Processing SIMD, Vector and GPU s

Parallel Processing SIMD, Vector and GPU s Parallel Processing SIMD, ector and GPU s EECS4201 Comp. Architecture Fall 2017 York University 1 Introduction ector and array processors Chaining GPU 2 Flynn s taxonomy SISD: Single instruction operating

More information

Improving Area and Resource Utilization Lab

Improving Area and Resource Utilization Lab Lab Workbook Introduction This lab introduces various techniques and directives which can be used in Vivado HLS to improve design performance as well as area and resource utilization. The design under

More information

FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY

FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY Kieron Turkington, Konstantinos Masselos, George A. Constantinides Department of Electrical and Electronic

More information

COSC 6385 Computer Architecture - Memory Hierarchies (II)

COSC 6385 Computer Architecture - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity

More information

Exploring Parallelism At Different Levels

Exploring Parallelism At Different Levels Exploring Parallelism At Different Levels Balanced composition and customization of optimizations 7/9/2014 DragonStar 2014 - Qing Yi 1 Exploring Parallelism Focus on Parallelism at different granularities

More information

Spiral 1 / Unit 6. Flip-flops and Registers

Spiral 1 / Unit 6. Flip-flops and Registers 1-5.1 Spiral 1 / Unit 6 Flip-flops and Registers 1-5.2 Outcomes I know the difference between combinational and sequential logic and can name examples of each. I understand latency, throughput, and at

More information

CS575: Parallel Processing Sanjay Rajopadhye CSU. Course Topics

CS575: Parallel Processing Sanjay Rajopadhye CSU. Course Topics CS575: Parallel Processing Sanjay Rajopadhye CSU Lecture 2: Parallel Computer Models CS575 lecture 1 Course Topics Introduction, background Complexity, orders of magnitude, recurrences Models of parallel

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Basic Communication Operations (Chapter 4)

Basic Communication Operations (Chapter 4) Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:

More information

12. Use of Test Generation Algorithms and Emulation

12. Use of Test Generation Algorithms and Emulation 12. Use of Test Generation Algorithms and Emulation 1 12. Use of Test Generation Algorithms and Emulation Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Cache Memories. Topics. Next time. Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance

Cache Memories. Topics. Next time. Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Next time Dynamic memory allocation and memory bugs Fabián E. Bustamante,

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

Outcomes. Spiral 1 / Unit 6. Flip Flops FLIP FLOPS AND REGISTERS. Flip flops and Registers. Outputs only change once per clock period

Outcomes. Spiral 1 / Unit 6. Flip Flops FLIP FLOPS AND REGISTERS. Flip flops and Registers. Outputs only change once per clock period 1-5.1 1-5.2 Spiral 1 / Unit 6 Flip flops and Registers Mark Redekopp Outcomes I know the difference between combinational and sequential logic and can name examples of each. I understand latency, throughput,

More information
