Lecture 6: Hard vs Soft Logic. James C. Hoe Department of ECE Carnegie Mellon University
1 Lecture 6: Hard vs Soft Logic
James C. Hoe, Department of ECE, Carnegie Mellon University
F17 L06 S1, James C. Hoe, CMU/ECE/CALCM, 2017
2 Housekeeping
Your goal today: understand the difference between hard and soft logic
Notices
- Handout #3: lab 1, due noon, 9/22
Readings (skim)
- Kuon and Rose, Measuring the gap between FPGAs and ASICs, ISFPGA
- M. Papamichael, et al., CONNECT, ISFPGA
- Chung, et al., Single Chip Heterogeneous Computing, MICRO
3 The Project Template
- Pick a compute application
- Pick a metric of merit
- Study implementation options on the Zedboard
  - a good software implementation must be one option
  - the rest is up to you
- Report findings
Keep in mind, you have optimistically 6 weeks; don't forget you are taking other courses
4 DoF: you pick the application
- The problem could be
  - well studied (expect thoroughness and depth)
  - unproven (credit for honest attempts)
- Convince us it is 6 weeks of effort
- Something there is a reason to do on FPGAs
- Best if it is something you want to or have to do anyway
- Need to find and study (at least) 1 closely relevant research paper as a starting point
5 DoF: you define the metric
- What you can study
  - performance (throughput or latency?)
  - cost (in terms of what?)
  - power and energy (how will you measure?)
  - design effort (what will you measure?)
  - app specific metrics (e.g., numerical accuracy)
  - composite metrics: energy-delay product, performance/watt, performance/$
- Must commit up front
  - measurement procedure/benchmark
  - testable good-enough target condition
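The composite metrics named on this slide can be written down in a few lines. The following is a minimal sketch, not part of the lecture; the function names and units (joules, seconds, watts, dollars) are assumptions chosen for illustration.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def perf_per_watt(ops: float, seconds: float, avg_power_watts: float) -> float:
    """Throughput (ops/s) divided by average power: higher is better."""
    return (ops / seconds) / avg_power_watts


def perf_per_dollar(ops: float, seconds: float, cost_dollars: float) -> float:
    """Throughput (ops/s) divided by cost: higher is better."""
    return (ops / seconds) / cost_dollars


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    print(energy_delay_product(2.0, 0.5))
    print(perf_per_watt(1e9, 2.0, 5.0))
    print(perf_per_dollar(1e9, 2.0, 500.0))
```

Whichever metric is chosen, the measurement procedure for each input (how energy, time, and cost are obtained) still has to be fixed up front, as the slide requires.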
6 DoF: Platform
- You have the Zedboard
- You may substitute a reconfigurable platform you are already using (check with me first)
- You have access to more advanced platforms
  - risky learning curve to fit in 6 weeks
  - only if this plays into what else you are doing in life
7 DoF: Approach
- 1 option must be a good software-only baseline
- This is a study; do more than crank out implementations
  - think about what the design choices are
  - hypothesize the expected effects of your choices
  - corroborate the hypothesis by implementation and evaluation
- Implementation approach: no artificial bounds
  - how would you work in real life?
  - if you have access, you can use it (including tools and IPs)
- Convince us it is 6 weeks of effort
8 What makes a good project
- Interesting and/or important
- Not totally obvious (to you)
- You have special insights or interest
- Hard enough for 6 weeks
- Not too hard, too risky
- Most importantly, you should enjoy it
The above need not be an AND
9 We now return you to our regularly scheduled program
10 FPGA Pros and Cons
- Reasons for FPGAs
  - no manufacturing NRE (non-recurring engineering) cost
  - faster design time: try out increments as you go
  - less validation time: debug as you go at full speed / can also patch after shipping
- The price of FPGAs
  - high unit cost (not for high volume products)
  - ~10x overhead in area/speed/power/...
  - RTL-level design abstraction (relative to SW)
11 Hard vs Soft Processor Cores
Table 4.2, The Zynq Book: DMIPS by processor configuration
- MicroBlaze, area optimized (3 stage): 196
- MicroBlaze, perf. optimized (5 stage) with branch optimizations: (value missing)
- MicroBlaze, perf. optimized (5 stage) without branch optimizations: 228
- ARM Cortex A9, 1GHz, both cores combined: 5000
MicroBlaze resource cost: 900LUT/700FF/2BRAM to 3800LUT/3200FF/6DSP/21BRAM
Table 4.3, The Zynq Book: CoreMark by processor configuration
- MicroBlaze, 125MHz, 5 stage (Virtex 5): 238
- ARM Cortex A9, 1GHz, both cores combined: 5927
- ARM Cortex A9, 800MHz, both cores combined: ??from book
12 [Kuon and Rose, 2006]
- Altera Stratix II FPGA, 90nm
- Quartus II balanced, standard fit
- hard multipliers and memory blocks
- ST Micro 90nm standard cells
- Synopsys high effort, add scan chain
- ST Micro memory compiler
- Cadence place and route
Basic Results
- avg 21x/40x in area (with/without using hard macros)
- 3~4x critical path
- ~12x dynamic power
13 Benchmarking
- Opencores and local designs
- removed cases where FPGA and ASIC are more than 5% different in FF count (Bias?)
- Metrics evaluated
  - logic density
  - circuit speed
  - power consumption
[Table 1: Kuon and Rose, Measuring the Gap between FPGAs and ASICs, 2006]
14 Area Ratios
Differences attributed to overhead surrounding LUTs and FFs
[Table 2: Kuon and Rose, Measuring the Gap between FPGAs and ASICs, 2006]
15 Critical Path Ratios
[Table 3&4: Kuon and Rose, Measuring the Gap between FPGAs and ASICs, 2006]
16 Dynamic Power Ratios
[Table 5: Kuon and Rose, Measuring the Gap between FPGAs and ASICs, 2006]
17 Actual Mileage Varies
- Comparisons strongly affected by exact design, FPGA/ASIC target, methodology
  - comparing less-than-best-effort designs can bias in either direction
  - a design is not a point
  - a full comparison would have to be Pareto front to Pareto front
- Either precise in a specific context, or a warm fuzzy rule of thumb (~10x all around): 2x << ~10x << 100x
- A moving target with architecture and process changes
18 Effects of Tuning
- FPGA design is not a scaled version of ASIC design
  - different relative cost in logic vs. wires vs. mem
  - different relative speed in logic vs. wires vs. mem
  - also unique usage and operating characteristics
- Designed-for-FPGA RTLs need different tuning
19 FPGA Wire Peculiarities
- Routing architecture over-provisioned to handle the worst case
- In a typical design, wires appear cheaper relative to other resource types
- Best case is nearest neighbor, regular grid
- Counterintuitively, you SHOULD use wider busses
  - consume unused free wires
  - compensate for lower frequency
20 FPGA Memory Peculiarities
- Large memories are abnormally fast
- Large memories are free until you run out
- Quantized memory options
  - jumps between FF-based vs. LUT RAM vs. BRAMs
  - choose from a fixed menu of sizes and aspect ratios
- Must manage RAM usage
  - don't waste BRAM on small buffers
  - tune buffer sizes to natural granularities, e.g., zero incremental cost to go from 2Kb to 4Kb
  - pack buffers to share the same physical array
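The quantization effect on this slide can be illustrated with a small calculation. This is a rough sketch only: the 18 Kb block size is an assumed value picked for illustration (actual block sizes and aspect ratios depend on the device), and the packing estimate ignores port and width constraints that limit how buffers can really share one physical array.

```python
# ASSUMPTION: a hypothetical block RAM granularity of 18 Kb per block.
BLOCK_BITS = 18 * 1024


def blocks_needed(buffer_bits: int) -> int:
    """Number of whole blocks a single buffer occupies (integer ceiling)."""
    return -(-buffer_bits // BLOCK_BITS)


def blocks_unpacked(buffers: list[int]) -> int:
    """Each buffer gets its own block(s)."""
    return sum(blocks_needed(b) for b in buffers)


def blocks_packed(buffers: list[int]) -> int:
    """Optimistic bound if buffers share physical arrays (ignores port limits)."""
    return blocks_needed(sum(buffers))


if __name__ == "__main__":
    kb = 1024
    # Growing a buffer from 2Kb to 4Kb costs nothing extra at this granularity.
    print(blocks_needed(2 * kb), blocks_needed(4 * kb))
    # Three small buffers: one block each, or a single shared block if packed.
    small = [4 * kb, 4 * kb, 4 * kb]
    print(blocks_unpacked(small), blocks_packed(small))
```

Under these assumptions a buffer's cost only changes when it crosses a block boundary, which is why tuning sizes to the natural granularity and packing small buffers both pay off.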
21 FPGA Logic Peculiarities
- Logic slower than expected
- Sharp aberrations around hard macro use, e.g., faster mult than add in Virtex II
- CLB-mapped logic not divisible for pipelining
  - over-pipelining adds cycles without frequency increase
  - sweet-spot frequency that is easy to reach but hard to exceed
- Design for performance
  - correct and maximal usage of hard macros
  - shallowly pipelined, wide datapath
22 E.g., FPGA vs ASIC tuned NoC on FPGA
ASIC RTL from nocs.stanford.edu/cgi-bin/trac.cgi/wiki/resources/router
FPGA RTL from [source lost]
[Figure: FPGA Resource Usage in LUTs (same router/NoC configuration); panels: Single Router, 4x4 Mesh NoC; annotation: fixing general FPGA style guidelines]
[Figure: Network Performance (uniform random traffic, 100MHz); Avg. Packet Latency (in ns) vs. Load (in Gbps); annotations: same config, at same cost]
[Papamichael, ISFPGA 2012]
23 Soft IPs need not be general purpose
- Reconfigurable fabric provides generality
- Soft IPs should be maximally specialized to usage
[Figure: Latency (in cycles) vs. Load (in flits/cycle) for Ring, Fat Tree, Mesh, and High Radix topologies; panels: Uniform Random Traffic, 90% Neighbor Traffic]
24 It is not just FPGA vs ASIC
- CPU: highest level abstraction / most general purpose support (Software end)
- Multicore: still high level abstraction / general parallelism
- GPU: explicitly parallel programs / best for SIMD, regular
- FPGA: ASIC-like abstraction / overhead for reprogrammability
- ASIC: lowest level abstraction / fixed application and tuning (Hardware end)
Tradeoff between efficiency and effort
25 Case Study [Chung, MICRO 2010]
Devices (Year row values lost in transcription)
- CPU: Intel Core i7 960, 45nm, die area 263mm^2, clock rate 3.2GHz
- GPU: Nvidia GTX285, 55nm, die area 470mm^2, clock rate 1.5GHz
- GPU: ATI R5870, 40nm, die area 334mm^2, clock rate 1.5GHz
- FPGA: Xilinx V6 LX760, 40nm, clock rate 0.3GHz
- ASIC: Std. Cell, 65nm
Single prec. floating point apps: MMMult, FFT, Black Scholes
- CPU implementations: MKL multithreaded (MMMult), Spiral.net multithreaded (FFT), PARSEC multithreaded (Black Scholes)
- Other implementations listed: CUBLAS 2.3, CAL++, CUFFT /3.1, CUDA 2.3, Spiral.net, hand coded
26 Best Case Performance and Energy
[Table: MMM. Columns: GFLOP/s (actual); (GFLOP/s)/mm^2 normalized to 40nm; GFLOP/J normalized to 40nm. Rows: Intel Core i7 (45nm), Nvidia GTX285 (55nm), ATI R5870 (40nm), Xilinx V6 LX760 (40nm), same RTL std cell (65nm)]
- CPU and GPU benchmarking is compute bound; FPGA and Std Cell effectively compute bound (no off-chip I/O)
- Power (switching+leakage) measurements isolated the core from the system
- For details see [Chung, et al. MICRO 2010]
27 Less Regular Applications
[Table: FFT 2^10. Columns: GFLOP/s, (GFLOP/s)/mm^2, GFLOP/J. Rows: Intel Core i7 (45nm), Nvidia GTX285 (55nm), ATI R5870 (40nm), Xilinx V6 LX760 (40nm), same RTL std cell (65nm)]
[Table: Black Scholes. Columns: Mopts/s, (Mopts/s)/mm^2, Mopts/J. Rows: Intel Core i7 (45nm), Nvidia GTX285 (55nm), ATI R5870 (40nm), Xilinx V6 LX760 (40nm), same RTL std cell (65nm)]
28 Tradeoff in Heterogeneity?
[Figure: a die with one Big Core, several little cores, and a region that could hold GPGPU, Custom Logic, or FPGA]
What will you choose to put on it?
29 Amdahl's Law on Multicore
[Figure: die tiled with 16 BCEs]
A program is rarely completely parallelizable; let's say a fraction f is perfectly parallelizable
Speedup of n cores over sequential:
\[ \mathrm{Speedup} = \frac{1}{(1-f) + \frac{f}{n}} \]
For small f, die area is under-utilized
Base Core Equivalent (BCE) as in [Hill and Marty, 2008]
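As a quick numerical check of Amdahl's Law, Speedup = 1 / ((1 - f) + f / n), here is a minimal sketch; the function name is an assumption made for illustration.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup of n base cores over one, with parallelizable fraction f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


if __name__ == "__main__":
    # The same 16-core die gives very different returns depending on f.
    for f in (0.5, 0.9, 0.99):
        print(f, amdahl_speedup(f, 16))
```

With a small f the sequential term (1 - f) dominates, so most of the cores sit idle most of the time, which is the under-utilization the slide points out.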
30 Asymmetric Multicores
[Figure: die with one Fast Core and the remaining area tiled with BCEs]
Trade pwr/area-efficient slow BCEs for a pwr/area-hungry fast core
- fast core for sequential code
- slow cores for parallel sections
\[ \mathrm{Speedup} = \frac{1}{\frac{1-f}{perf_{seq}(r)} + \frac{f}{perf_{seq}(r) + (n - r)}} \]
- r = cost of fast core in BCE
- perf_seq = speedup of fast core over BCE
- solve for optimal die area allocation given f
Base Core Equivalent (BCE) as in [Hill and Marty, 2008]
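"Solve for optimal die area allocation given f" can be sketched as a brute-force search over the fast core size r. The sketch below uses the asymmetric model Speedup = 1 / ((1 - f) / perf_seq(r) + f / (perf_seq(r) + (n - r))). It assumes perf_seq(r) = sqrt(r), a common modeling assumption that these slides do not state; the function names are likewise illustrative.

```python
import math


def perf_seq(r: float) -> float:
    # ASSUMPTION: fast-core performance grows as sqrt(area); not from the slides.
    return math.sqrt(r)


def asymmetric_speedup(f: float, n: int, r: int) -> float:
    """One fast core of r BCEs plus (n - r) base cores on an n-BCE die."""
    p = perf_seq(r)
    return 1.0 / ((1.0 - f) / p + f / (p + (n - r)))


def best_fast_core_size(f: float, n: int) -> tuple[int, float]:
    """Try every integer r in 1..n and return the best (r, speedup)."""
    best_r = max(range(1, n + 1), key=lambda r: asymmetric_speedup(f, n, r))
    return best_r, asymmetric_speedup(f, n, best_r)


if __name__ == "__main__":
    for f in (0.5, 0.9, 0.99):
        print(f, best_fast_core_size(f, 64))
```

With r = 1 the model reduces to plain Amdahl's Law on n base cores, so any improvement found by the search is attributable to the fast core.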
31 Heterogeneous Multicores [Chung, et al. MICRO 2010]
[Figure: an asymmetric die (Fast Core plus BCEs) next to a heterogeneous die (Fast Core plus non-BCE cores)]
Asymmetric ([Hill and Marty, 2008], simplified):
\[ \mathrm{Speedup} = \frac{1}{\frac{1-f}{perf_{seq}(r)} + \frac{f}{perf_{seq}(r) + (n - r)}} \]
Heterogeneous:
\[ \mathrm{Speedup} = \frac{1}{\frac{1-f}{perf_{seq}(r)} + \frac{f}{\mu (n - r)}} \]
- f is the fraction parallelizable
- n is the total die area in BCE units
- r is the fast core area in BCE units
- perf_seq(r) is the fast core performance relative to a BCE
For the sake of analysis, break the area for GPU/FPGA/etc. into units of cores that are the same size as BCEs. Each core type is characterized by a relative performance μ and relative power Φ compared to a BCE
32 Modeling Power and Bandwidth Budgets
Heterogeneous speedup with a Fast Core:
\[ \mathrm{Speedup} = \frac{1}{\frac{1-f}{perf_{seq}(r)} + \frac{f}{\mu (n - r)}} \]
- The above is based on area alone
- Power or bandwidth budget limits the usable die area
  - if P is the total power budget expressed as a multiple of a BCE's power, then usable core area \( n - r \le P / \Phi \)
  - if B is the total memory bandwidth expressed also as a multiple of BCEs, then usable core area \( n - r \le B / \mu \)
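The budget limits can be folded into the heterogeneous speedup by capping the usable core area. The sketch below reads the constraints as n - r ≤ P/Φ and n - r ≤ B/μ, which is an interpretation of the garbled slide text rather than a verified transcription; the function and parameter names are assumptions.

```python
import math


def heterogeneous_speedup(
    f: float,
    n: float,
    r: float,
    perf_seq: float,
    mu: float,
    phi: float,
    power_budget: float = math.inf,      # P, in multiples of a BCE's power
    bandwidth_budget: float = math.inf,  # B, in multiples of a BCE's bandwidth
) -> float:
    """Speedup with a fast core of r BCEs and the rest of the die as other cores.

    The usable area for the other cores is the smallest of the area left over,
    the power limit P/phi, and the bandwidth limit B/mu.
    """
    usable = min(n - r, power_budget / phi, bandwidth_budget / mu)
    return 1.0 / ((1.0 - f) / perf_seq + f / (mu * usable))


if __name__ == "__main__":
    # mu = 3.41 and phi = 0.74 are the example values quoted on the next slide;
    # the remaining numbers are hypothetical.
    print(heterogeneous_speedup(0.99, 64, 4, 2.0, 3.41, 0.74))
    print(heterogeneous_speedup(0.99, 64, 4, 2.0, 3.41, 0.74, power_budget=20))
    print(heterogeneous_speedup(0.99, 64, 4, 2.0, 3.41, 0.74, bandwidth_budget=50))
```

When either budget binds, adding more die area no longer helps, which is how a power or bandwidth bound can compress the differences between core types.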
33 Φ and μ example values
[Table: Φ and μ for MMM, Black Scholes, and FFT 2^10 on Nvidia GTX285, Nvidia GTX480, ATI R5870, Xilinx LX760, and Custom Logic; only two entries survive in the transcription (Φ 1.27, μ 8.47), and their cell positions are not recoverable]
Callout on one entry: on an equal area basis, 3.41x performance at 0.74x power relative to a BCE
Nominal BCE based on an Intel Atom in-order processor, 26mm^2 in a 45nm process
34 Combine Model with ITRS Trends
[Table: one column per technology node: 40nm, 32nm, 22nm, 16nm, 11nm. Rows: Year; Core die budget (mm^2); Normalized area (BCE), growing 16x; Core power (W), 1x; Bandwidth (GB/s), 1.4x; Rel pwr per device: 1X, 0.75X, 0.5X, 0.36X, 0.25X. Other cell values lost in transcription]
- 2011 parameters reflect high end systems of the day; future parameters extrapolated from ITRS
- the core die budget (mm^2) is populated by an optimally sized Fast Core and cores of choice
35 Single Prec. MMMult (f=99%)
[Plot: Speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm), f=0.990; curves: ASIC, GPU R5870, FPGA LX760, symmetric and asymmetric multicore; points marked Power Bound or Mem Bound]
36 Single Prec. MMMult (f=90%)
[Plot: Speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm), f=0.900; curves: ASIC, GPU R5870, FPGA LX760, asymmetric multicore, symmetric multicore; points marked Power Bound or Mem Bound]
37 Single Prec. MMMult (f=50%)
[Plot: Speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm); curves: ASIC/GPU/FPGA (labeled together), asymmetric multicore, symmetric multicore; points marked Power Bound or Mem Bound]
38 Single Prec. FFT 1024 (f=99%)
[Plot: Speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm); curves: ASIC/GPU/FPGA (labeled together; legend lists ASIC, FPGA LX760, GPU R5870, GPU GTX480), asymmetric multicore, symmetric multicore; points marked Power Bound or Mem Bound]
39 FFT 1024 (f=99%) if hypothetical 1TB/sec bandwidth
[Plot: Speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm), f=0.990; curves: ASIC, FPGA LX760, GPU GTX480, symmetric and asymmetric multicore; points marked Power Bound or Mem Bound]
40 Parting Thoughts
- FPGAs pay an overhead for reconfigurability
  - significant but reducing
  - power and BW bottlenecks can compress differences
- FPGAs differ from ASICs in more than the reconfiguration overhead
  - require distinct architecture and tuning
- {Multicore/GPU/FPGA} are all midpoints between the CPU and ASIC extremes
  - none is a panacea
  - go with the easier option unless it is not good enough
More informationWhite Paper Assessing FPGA DSP Benchmarks at 40 nm
White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of
More informationMicroprocessor Soft-Cores: An Evaluation of Design Methods and Concepts on FPGAs
Microprocessor Soft-Cores: An Evaluation of Design Methods and Concepts on FPGAs Pieter Anemaet (1159100), Thijs van As (1143840) {P.A.M.Anemaet, T.vanAs}@student.tudelft.nl Computer Architecture (Special
More information6.375: Complex Digital Systems. Something new and exciting as well as useful. Fun: Design systems that you never thought you could design in a course
6.375: Complex Digital Systems Lecturer: Arvind TA: Richard S. Uhler Administration: Sally Lee February 6, 2013 http://csg.csail.mit.edu/6.375 L01-1 Why take 6.375 Something new and exciting as well as
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationNetworks-on-Chip for FPGAs: Hard, Soft or Mixed?
Networks-on-Chip for FPGAs: Hard, Soft or Mixed? MOHAMED S. ABDELFATTAH and VAUGHN BETZ, University of Toronto As FPGA capacity increases, a growing challenge is connecting ever-more components with the
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationFinal Lecture. A few minutes to wrap up and add some perspective
Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection
More informationStacked Silicon Interconnect Technology (SSIT)
Stacked Silicon Interconnect Technology (SSIT) Suresh Ramalingam Xilinx Inc. MEPTEC, January 12, 2011 Agenda Background and Motivation Stacked Silicon Interconnect Technology Summary Background and Motivation
More informationThere s STILL plenty of room at the bottom! Andreas Olofsson
There s STILL plenty of room at the bottom! Andreas Olofsson 1 Richard Feynman s Lecture (1959) There's Plenty of Room at the Bottom An Invitation to Enter a New Field of Physics Why cannot we write the
More information2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today.
ESE534: Computer Organization Previously Day 7: February 6, 2012 Memories Arithmetic: addition, subtraction Reuse: pipelining bit-serial (vectorization) Area/Time Tradeoffs Latency and Throughput 1 2 Today
More informationLecture: Interconnection Networks
Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet
More informationMulticore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.
CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous
More informationOpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel
OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel Hyoukjun Kwon and Tushar Krishna Georgia Institute of Technology Synergy Lab (http://synergy.ece.gatech.edu) hyoukjun@gatech.edu April
More information2011 Signal Processing CoDR: Technology Roadmap W. Turner SPDO. 14 th April 2011
2011 Signal Processing CoDR: Technology Roadmap W. Turner SPDO 14 th April 2011 Technology Roadmap Objectives: Identify known potential technologies applicable to the SKA Provide traceable attributes of
More informationToday. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses
Today Comments about assignment 3-43 Comments about assignment 3 ASICs and Programmable logic Others courses octor Per should show up in the end of the lecture Mealy machines can not be coded in a single
More information1. NoCs: What s the point?
1. Nos: What s the point? What is the role of networks-on-chip in future many-core systems? What topologies are most promising for performance? What about for energy scaling? How heavily utilized are Nos
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationUnderstanding Peak Floating-Point Performance Claims
white paper FPGA Understanding Peak ing-point Performance Claims Learn how to calculate and compare the peak floating-point capabilities of digital signal processors (DSPs), graphics processing units (GPUs),
More informationLec 25: Parallel Processors. Announcements
Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza
More informationECE 1160/2160 Embedded Systems Design. Midterm Review. Wei Gao. ECE 1160/2160 Embedded Systems Design
ECE 1160/2160 Embedded Systems Design Midterm Review Wei Gao ECE 1160/2160 Embedded Systems Design 1 Midterm Exam When: next Monday (10/16) 4:30-5:45pm Where: Benedum G26 15% of your final grade What about:
More informationEXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu
Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationCS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures CS61C L22 Caches II (1) CPS today! Lecture #22 Caches II 2005-11-16 There is one handout today at the front and back of the room! Lecturer PSOE,
More informationProtoFlex: FPGA-Accelerated Hybrid Simulator
ProtoFlex: FPGA-Accelerated Hybrid Simulator Eric S. Chung, Eriko Nurvitadhi James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at Multiprocessor Simulation Simulating one processor in software
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationLUMOS. A Framework with Analy1cal Models for Heterogeneous Architectures. Liang Wang, and Kevin Skadron (University of Virginia)
LUMOS A Framework with Analy1cal Models for Heterogeneous Architectures Liang Wang, and Kevin Skadron (University of Virginia) What is LUMOS A set of first- order analy1cal models targe1ng heterogeneous
More informationECE 471 Embedded Systems Lecture 2
ECE 471 Embedded Systems Lecture 2 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 7 September 2018 Announcements Reminder: The class notes are posted to the website. HW#1 will
More informationLecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)
Lecture 12: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) 1 Topologies Internet topologies are not very regular they grew
More informationThemes. The Network 1. Energy in the DC: ~15% network? Energy by Technology
Themes The Network 1 Low Power Computing David Andersen Carnegie Mellon University Last two classes: Saving power by running more slowly and sleeping more. This time: Network intro; saving power by architecting
More informationCS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 35 Caches IV / VM I 2004-11-19 Andy Carle inst.eecs.berkeley.edu/~cs61c-ta Google strikes back against recent encroachments into the Search
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate
More information