Two routes to specialisation: Loki and lowRISC. Robert Mullins, University of Cambridge. WEEE, September 2015, Espoo, Finland
1 Two routes to specialisation: Loki and lowRISC. Robert Mullins, University of Cambridge. WEEE, September 2015, Espoo, Finland
2 Specialisation
- More transistors, but the end of Dennard scaling
- Dark silicon, utilisation wall, etc.
- Specialisation is an answer, but not without problems
3 Specialisation
- More transistors, but the end of Dennard scaling
- Dark silicon, utilisation wall, etc.
- Specialisation is an answer, but not without problems
- Some possible directions:
  - Many heterogeneous SoCs
    - Tackle complexity with open source? (lowRISC)
    - Explore how to make SoC designs more flexible (target broader markets)
  - Homogeneous sea of resources
    - FPGA -> CGRA -> manycore/MPPA (Loki)
    - Specialise software for each application
4 Loki
- Simple tiled many-core processor
  - 8 cores per tile + 64KB SRAM, 40nm
  - Each core is a complete 32-bit processor
  - 40nm (<2W for 128 cores)
- Message-passing support at ISA level
  - Every instruction can send its result to a remote location on chip
  - Register-mapped FIFOs
  - Fast multicast support within a tile
  - No cache coherency support between tiles at present (can share data via L2)
- Configurable on-chip memory system
  - Each tile contains SRAM that may be dedicated as scratchpad, L1 or L2 cache
5 Loki
Figure labels: inter-tile routers, 8 cores, local interconnects, 64KB
Chip-wide networks:
1. L1$ to L2$ requests
2. L2$ to memory requests
3. Core-to-core data
4. Mem/L2$ responses
5. Credit network
6 A Loki tile
7 A Loki tile
Sequential consistency is retained within a tile, as operations arrive at each bank in the order they entered the network (crossbar) [see Zhang, PDCN '05]
10 Loki's memory system
- Each bank can service a miss and offers hit-under-miss support
- Synchronization/atomics
  - Load-and-OP (AND, OR, XOR, ADD), Exchange
  - LL/SC
- Can access command set at memory banks (sendconfig instruction)
  - Send cache line to another bank
  - Flush, invalidate or prefetch cache lines
  - Bypass L1/L2
  - Memset cache line
- Same mechanism can be used to form packets on the core-to-core network
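The slide names the atomic operations but not their exact semantics. As a rough software analogue, the sketch below uses C11 `<stdatomic.h>` and assumes that a load-and-OP returns the previous contents of the word while leaving the combined value in memory (the usual fetch-and-op behaviour). The function names `load_and_add`, `load_and_or`, `exchange`, `lock_acquire` and `lock_release` are hypothetical and are not Loki instructions.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed semantics: return the old value, leave old + delta in memory. */
static uint32_t load_and_add(_Atomic uint32_t *word, uint32_t delta) {
    return atomic_fetch_add(word, delta);
}

/* Assumed semantics: return the old value, leave old | mask in memory. */
static uint32_t load_and_or(_Atomic uint32_t *word, uint32_t mask) {
    return atomic_fetch_or(word, mask);
}

/* Swap in a new value and return the old one. */
static uint32_t exchange(_Atomic uint32_t *word, uint32_t value) {
    return atomic_exchange(word, value);
}

/* A simple spinlock built only on exchange: 0 = free, 1 = held. */
static void lock_acquire(_Atomic uint32_t *lock) {
    while (exchange(lock, 1u) != 0u) {
        /* spin until the previous value was 0 (lock was free) */
    }
}

static void lock_release(_Atomic uint32_t *lock) {
    atomic_store(lock, 0u);
}
```

On hardware of the kind described here the operation would be carried out at the memory bank, so only one request and one response cross the network per update.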
13 Area (approx.)
Cores    ~50%
SRAM     40-45%
Routers  4-6%
Other    ~2-3%
14 Loki pipeline
- Small custom ISA
  - Incl. support for predicated execution
- 6 register-mapped network FIFOs (blocking reads)
- Decoupled loads
- Every instruction can send its result on the network
  - Can send instructions too!
- Channel map table
  - Read in decode stage
  - 16-entry table that maps channel names to network addresses
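To make the channel map table concrete, here is a minimal software model of a 16-entry table that translates a small channel index into a network address. Only the table size comes from the slide. The address layout (`tile`, `component`, `channel`) and the names `net_addr_t`, `channel_map_t`, `chmap_set` and `chmap_lookup` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHANNEL_MAP_ENTRIES 16u  /* table size taken from the slide */

/* Hypothetical network address layout; the real encoding is not given. */
typedef struct {
    uint8_t tile;       /* destination tile */
    uint8_t component;  /* core or memory bank within the tile */
    uint8_t channel;    /* input channel (FIFO) at the destination */
} net_addr_t;

typedef struct {
    net_addr_t addr[CHANNEL_MAP_ENTRIES];
    bool       valid[CHANNEL_MAP_ENTRIES];
} channel_map_t;

/* Bind a channel name (index) to a destination; false if out of range. */
static bool chmap_set(channel_map_t *m, unsigned idx, net_addr_t dest) {
    if (idx >= CHANNEL_MAP_ENTRIES) return false;
    m->addr[idx] = dest;
    m->valid[idx] = true;
    return true;
}

/* Translate a channel name into a destination, as the decode stage would
   before an instruction sends its result. False if unmapped or out of range. */
static bool chmap_lookup(const channel_map_t *m, unsigned idx, net_addr_t *out) {
    if (idx >= CHANNEL_MAP_ENTRIES || !m->valid[idx]) return false;
    *out = m->addr[idx];
    return true;
}
```

Keeping the instruction encoding to a 4-bit channel name, with the full address held in the table, is what lets any instruction name a destination cheaply.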
15 Example
C source:
    /* One step of a table-driven CRC-32: fold byte ch into the running crc. */
    uint32_t updatecrc32(uint8_t ch, uint32_t crc) {
        return table[(crc ^ ch) & 0xff] ^ (crc >> 8);
    }
Loki assembly:
    setchmapi 1, r15
    [...]
    fetch r10
    xor r11, r14, r13
    lli r12, %lo(table)
    lui r12, %hi(table)
    andi r11, r11, 255
    slli r11, r11, 2
    addu r11, r12, r11
    ldw 0(r11) -> 1
    srli r12, r14, 8
    xor.eop r11, r2, r12
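The slide shows only the per-byte update step. The sketch below wraps it into something that can be compiled and checked on a host machine. It assumes the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF); the slide does not say which variant is used. The helpers `crc32_init_table` and `crc32_buffer` are hypothetical additions.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t table[256];

/* Fill the lookup table for the reflected polynomial 0xEDB88320
   (assumed variant; not stated on the slide). */
static void crc32_init_table(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
}

/* Per-byte update step, as on the slide. */
static uint32_t updatecrc32(uint8_t ch, uint32_t crc) {
    return table[(crc ^ ch) & 0xff] ^ (crc >> 8);
}

/* CRC of a whole buffer: initial value and final XOR are both 0xFFFFFFFF. */
static uint32_t crc32_buffer(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = updatecrc32(data[i], crc);
    return crc ^ 0xFFFFFFFFu;
}
```

After calling `crc32_init_table()`, `crc32_buffer` over the nine ASCII bytes of "123456789" yields 0xCBF43926, the standard check value for this CRC variant.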
16 L0 I$ / scratchpad
- Fetch stage contains a small (64-instruction), fully associative I$
  - Can skip tag checks with in-buffer jmp
  - Instructions are just executed in FIFO order until end of packet (don't have an actual PC)
- Execute stage contains a small (256-word) local scratchpad
17 Execution patterns (within a tile)
- MIMD
- DLP (SIMD)
- DLP with helper core (scalarization)
  - One core is dedicated to providing common data over the multicast bus
  - Enables work done by the remaining data-parallel cores to be reduced
- Worker farm
- Task-level pipelines
- Dataflow (single persistent instruction packet per core)
  - Can support a single instruction per core
[See UCAM-CL-TR-846 for full details]
18 Example: JPEG colour conversion [Bates13]
19 DOACROSS loops [Campanoni et al., ISCA 2014]
- Substantial speedup available from exploiting DOACROSS parallelism
  - 16 in-order cores ("Atom-like")
- Much improved performance with a low-latency communication mechanism ("ring cache", RC) for signals and values
20 Example: ADPCM (encoder)
- We can exploit some DOACROSS parallelism in the case of ADPCM
  - Achieves 2X using 3 cores
- Can do slightly better by simply splitting the loop body across two cores
  - Body then fits in the core's L0 I$
  - ~2.5X on 2 cores
- Plan to explore a simplified HELIX implementation for our Loki LLVM port
  - Fast signals and shared L1 should make Loki a good target
21 ILP splitting
- Another approach to grouping/fusing cores
- LLVM pass to automatically split a program across N cores in a tile using available ILP; communicates values over the local tile core-to-core network
- Early results:
  - Stencil2D (MachSuite): 1.78X (3 cores)
  - Gemm/Blocked (MachSuite): 1.75X (3 cores)
  - Matrix Multiply (2 cores)
    - Initial attempt: 0.72X
    - With use of restrict: 1.41X
    - Exploit commutativity: 1.86X
- Currently exploring optimisations to consider the order basic blocks are visited, and some microarchitectural enhancements (input FIFO issues)
[Alex Bradbury]
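The jump from 0.72X to 1.41X "with use of restrict" is easier to picture with code. The following is a generic matrix multiply written with C99 `restrict`; it is not the benchmark source from the talk, and the function name `matmul` and the row-major layout are assumptions. Without `restrict`, a compiler must assume that a store to `c` could alias `a` or `b`, which orders loads after stores and leaves little independent work to hand to a second core.

```c
#include <stddef.h>

/* c = a * b for n x n row-major matrices.
   restrict promises the compiler that the three arrays do not overlap,
   so loads from a and b can be reordered freely around stores to c. */
void matmul(size_t n,
            const float *restrict a,
            const float *restrict b,
            float *restrict c) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;  /* accumulator stays in a register */
            for (size_t k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}
```

The promise is the caller's responsibility: passing overlapping arrays to a function declared this way is undefined behaviour.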
22 AES case study
- AES-128-CTR mode
  - 2 days' work for a recent graduate
- Want to avoid running the same code on each core:
  - Would have poor L0 I$ performance
  - Cores would produce less regular memory accesses
- Instead, the AES code is mapped as a task pipeline
- Loki results
  - 5.1 cycles/byte on one tile
  - 2.5 cycles/byte on two tiles
  - 11.5 Gbps at 450MHz for 128 cores
- Comparison: ARM + NEON
  - Bitsliced implementation
  - Lower bound is 13 cycles/byte
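The headline throughput can be sanity-checked from the per-tile numbers. The slide does not say how the 128-core figure was derived; the sketch below assumes the chip's 16 tiles (128 cores at 8 per tile) run as 8 independent two-tile pipelines, each at 2.5 cycles/byte. The function name `throughput_gbps` is hypothetical.

```c
/* Aggregate throughput in Gbit/s for `groups` independent pipelines,
   each consuming one byte every `cycles_per_byte` clock cycles. */
double throughput_gbps(double clock_hz, double cycles_per_byte, unsigned groups) {
    double bytes_per_second = clock_hz / cycles_per_byte;  /* per pipeline */
    return bytes_per_second * 8.0 * (double)groups / 1e9;
}
```

Under that assumption, `throughput_gbps(450e6, 2.5, 8)` gives 11.52, in line with the quoted figure. Running 16 independent single-tile pipelines at 5.1 cycles/byte instead would give about 11.29.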
23 AES example: single tile (8-core) mapping
Cores 1-5 address banks 2-5 using 4 separate channels to save 1 or 2 address manipulation instructions in the loop body
24 Current status
- Loki LLVM compiler implementation
- ISS + complete SystemC model
- SystemVerilog implementation is complete (< 30K LOC)
  - Generates 128-core ASIC version and 32-core FPGA implementation
  - Test infrastructure, including random program generator
- Promising single-tile and multi-tile results
- Will tape out very soon!
  - 4mm x 4mm die, TSMC 40nm (128 cores, 1MB on-chip cache)
  - Off-chip I/O to FPGA Northbridge
  - 4 x 13-bit length-matched, full-duplex, source-synchronous DDR channels
25 Development boards
- Dev. boards will be available next year
- Package (352-ball BGA) and board from Michael Taylor's group at UCSD
- See Community
- Aim to distribute boards to research groups or provide remote access
- Support research in compilers, mapping, applications, etc.
26 Subject: Redo BBC Micro (2008)
27 Subject: Development of an open-source SoC (2014)
- Create an open-source SoC capable of running Linux well
- Make it real to encourage contributions and grow community
  - Volume silicon manufacture
  - Ability to purchase in small quantities
  - Low-cost development board
  - Regular updates to SoC
  - Events, training and documentation
- lowRISC C.I.C. (not-for-profit company)
28 Why create an open-source SoC?
- Research and teaching
- Serve the open-source community
- Demand from industry
  - Remove constraints on use of processor IP
  - Use lots of cores freely to provide flexible implementation
  - Lower costs; create proven base for derivatives
- Why now?
29 Approach to design
- Aim for simplicity: no backwards compatibility issues, no baggage, clean-sheet design
- Think about security from the start
- Free from commercial influences and release cycles
  - Cores are free and customisable (one ISA)
  - Aim to maximise functionality and flexibility (no trade-offs to create product range)
30 RISC-V
- RISC-V ISA from UC Berkeley
  - Aim to create open ISA standard for industry
  - Explicitly designed to be extensible
  - Simple base integer ISA (~40 instructions)
  - 32-bit, 64-bit, 128-bit (!) variants
- Rocket SoC: cores, L1, L2 cache, interconnect
  - Silicon proven (45nm and 28nm)
  - Chisel (open-source HW construction language)
31 lowRISC SoC
32 Current status
33 General-purpose tagged memory
- Prevent control-flow hijacking attacks
- Accelerate debug tools
- Use-after-free detection
- Per-word locks, full/empty bits for synchronization
- Control-flow integrity
- Assist garbage collection
- Dynamic information flow tracking (DIFT)
- Capabilities
- Transactional memory
- Provenance tracking
34 General-purpose tagged memory
- An LLVM pass has been implemented to tag sensitive pointers, i.e. code pointers, virtual function table pointers, function pointers, ...
- Every load of a sensitive pointer is replaced with a load that expects a particular tag to be read; if this is not the case, an exception is raised
- Prevents classic buffer overflow attacks and return-oriented programming
- Some other related attacks may remain if code has the right/wrong(!) bugs
- Overheads and future work
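The checked-load idea can be modelled in a few lines of software. This sketch is not the lowRISC hardware or its LLVM pass. It assumes one tag per 64-bit word and that an ordinary store clears the tag, so a buffer overflow that overwrites a protected pointer also destroys its tag. The names `tagged_mem_t`, `store_plain`, `store_tagged`, `load_checked` and the tag values are hypothetical, and a `false` return stands in for the exception.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS 64u

enum { TAG_NONE = 0, TAG_CODE_PTR = 1 };  /* hypothetical tag values */

typedef struct {
    uint64_t word[MEM_WORDS];
    uint8_t  tag[MEM_WORDS];  /* one tag per word (assumed granularity) */
} tagged_mem_t;

/* Ordinary store: writes the data and clears the tag (assumed policy). */
static void store_plain(tagged_mem_t *m, size_t i, uint64_t v) {
    m->word[i] = v;
    m->tag[i] = TAG_NONE;
}

/* Store emitted for a sensitive pointer: writes data and sets its tag. */
static void store_tagged(tagged_mem_t *m, size_t i, uint64_t v, uint8_t tag) {
    m->word[i] = v;
    m->tag[i] = tag;
}

/* Checked load: succeeds only if the word carries the expected tag.
   Returning false models the exception raised on a mismatch. */
static bool load_checked(const tagged_mem_t *m, size_t i,
                         uint8_t expected, uint64_t *out) {
    if (m->tag[i] != expected) return false;
    *out = m->word[i];
    return true;
}
```

In this model a return address written with `store_tagged` loads normally, while the same slot after an attacker-controlled `store_plain` fails the check before the corrupted value can be used as a jump target.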
35 Minion cores
- Will initially support DMA and programmable I/O
  - Use minions to generate I/O signals, pre-process I/O data, etc.
- Would like to also use minions to support tagged memory
  - Particular tags trigger a message to a minion from the application processor
  - Minion executes security policy in parallel with the app. processor
- Plan to investigate implementing more of the SoC using minion cores + appropriate shims
  - E.g. memory controller
- Will use the PULPino core from Luca Benini's group at ETHZ
36 Open-source HW
- Smaller community, higher barrier to entry
- Fabricating chips is expensive
- Verification effort is significant
- Patching typically can't be done in the same way
- Of course, these are all good reasons to produce an open, known-good SoC design and to promote a community effort
37 Roadmap
- Create untethered version of SoC with tagged memory
- Complete core SoC implementation (no GPU initially)
- First test chips (40 or 28nm)
  - 2 to 4 cores, most probably dual-issue
- Integrate 3rd-party IP, e.g. mem controller, USB, Ethernet
- Support early adopters in creating derivative designs
- Third-party design starts 2017
- Volume fab. run for community dev. board
- Strengthen lowRISC IP offerings
38 Research in the open
- Have lots of ideas, collaborate and share from day one
- Open development helps to attract the best people, even if they contribute remotely (huge amount of goodwill and enthusiasm for these projects if people know what you are trying to do!)
- Make it easy for people to get involved, reproduce, extend and improve (this requires significant effort)
- Work with industry
- Provide a vehicle to evaluate/implement other research ideas
39 Find out more and get involved
- ORCONF 2015: October 9-11th, 2015, Ideasquare, Geneva
- ORCONF began as an annual event for OpenRISC developers; it is now run as a Free and Open Source Silicon (FOSSi) event
- lowRISC workshop on Friday
- Talks on RISC-V
40 Final thoughts
- Exploring two different approaches to achieving energy efficiency through specialisation:
  - Loki: flexible processor array
  - lowRISC: an open-source SoC
- Opportunities to collaborate with others on both projects
- More information about lowRISC at
- See also phab.lowrisc.org
- Sign up to announcement and discussion lists
41 Acknowledgements
- Both lowRISC and Loki are team efforts
- The Loki team currently includes Daniel Bates, Alex Bradbury and Alex Chadwick (recent work on DNNs by Chihang Wang and Sam Tarver; earlier work on configurable L1 memory system by Andreas Koltes)
- The lowRISC team currently includes Wei Song, Alex Bradbury and numerous external contributors. Contributions on tagged memory and minion core I/O shims by Hongyan Xia and Martin Papadopoulos. Recent work on tagged memory architecture and LLVM support by Lucas Sonnabend and Matthew Toseland
- Loki is funded by an ERC starter grant (GA n )
- This work was previously supported by UK EPSRC grant EP/G033110/1
- lowRISC is kindly supported by a private donation and a donation from Google
Thank you for listening!
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationA 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators"
A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W ISC-V Processor with Vector Accelerators" Yunsup Lee 1, Andrew Waterman 1, imas Avizienis 1,! Henry Cook 1, Chen Sun 1,2,! Vladimir Stojanovic 1,2, Krste Asanovic
More informationExperiences Using the RISC- V Ecosystem to Design an Accelerator- Centric SoC in TSMC 16nm
Experiences Using the RISC- V Ecosystem to Design an Accelerator- Centric SoC in TSMC 16nm Tutu Ajayi 2, Khalid Al- Hawaj 1, Aporva Amarnath 2, Steve Dai 1, Scott Davidson 4, Paul Gao 4, Gai Liu 1, Anuj
More informationNative Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization
Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization Mian-Muhammad Hamayun, Frédéric Pétrot and Nicolas Fournel System Level Synthesis
More informationFast architecture prototyping on FPGAs: frameworks, tools, and challenges
Fast architecture prototyping on FPGAs: frameworks, tools, and challenges Philipp Wagner Technische Universität München Lehrstuhl für Integrierte Systeme 10.04.2017 Our Goal: Improving MPSoC Architectures
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 19 Processor Design Overview Special features in microprocessors provide support for parallel processing Already discussed bus snooping Memory latency becoming
More informationChapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST
Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will
More informationMing Ming Wong Jawad Haj-Yahya Anupam Chattopadhyay
Hardware and Architectural Support for Security and Privacy (HASP 18), June 2, 2018, Los Angeles, CA, USA Ming Ming Wong Jawad Haj-Yahya Anupam Chattopadhyay Computing and Engineering (SCSE) Nanyang Technological
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationREVOLUTIONIZING THE COMPUTING LANDSCAPE AND BEYOND.
December 3-6, 2018 Santa Clara Convention Center CA, USA REVOLUTIONIZING THE COMPUTING LANDSCAPE AND BEYOND. https://tmt.knect365.com/risc-v-summit 2018 NETRONOME SYSTEMS, INC. 1 @risc_v MASSIVELY PARALLEL
More informationFrom Application to Technology OpenCL Application Processors Chung-Ho Chen
From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationMulti-threaded processors. Hung-Wei Tseng x Dean Tullsen
Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to
More informationSpecializing Hardware for Image Processing
Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.
More informationThe Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun
The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu
More informationOPERA. Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications
OPERA Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications Co-funded by the Horizon 2020 Framework Programme of the
More informationA Cache Hierarchy in a Computer System
A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationRISC-V Rocket Chip SoC Generator in Chisel. Yunsup Lee UC Berkeley
RISC-V Rocket Chip SoC Generator in Chisel Yunsup Lee UC Berkeley yunsup@eecs.berkeley.edu What is the Rocket Chip SoC Generator?! Parameterized SoC generator written in Chisel! Generates Tiles - (Rocket)
More information