Two routes to specialisation: Loki and lowRISC. Robert Mullins, University of Cambridge. WEEE, September 2015, Espoo, Finland

1 Two routes to specialisation: Loki and lowRISC
Robert Mullins, University of Cambridge
WEEE, September 2015, Espoo, Finland

2 Specialisation
- More transistors but end of Dennard scaling
- Dark silicon, utilisation wall etc.
- Specialisation is an answer, but not without problems

3 Specialisation
- More transistors but end of Dennard scaling
- Dark silicon, utilisation wall etc.
- Specialisation is an answer, but not without problems
Some possible directions
- Many heterogeneous SoCs
  - Tackle complexity with open-source? (lowRISC)
  - Explore how to make SoC designs more flexible (target broader markets)
- Homogeneous sea of resources
  - FPGA -> CGRA -> manycore/MPPA (Loki)
  - Specialise software for each application

4 Loki
- Simple tiled many-core processor
  - 8 cores per tile + 64KB SRAM, 40nm
  - Each core is a complete 32-bit processor
  - 40nm (<2W for 128 cores)
- Message-passing support at ISA level
  - Every instruction can send its result to a remote location on chip
  - Register-mapped FIFOs
  - Fast multicast support within a tile
- No cache coherency support between tiles at present (can share data via L2)
- Configurable on-chip memory system
  - Each tile contains SRAM that may be dedicated as scratchpad, L1 or L2 cache

5 Loki
[Figure: chip layout. Labels: inter-tile routers, 8 cores, local interconnects, 64KB]
Chip-wide networks:
1. L1$ to L2$ requests
2. L2$ to memory requests
3. Core to core data
4. Mem/L2$ responses
5. Credit network

6 A Loki tile

7 A Loki tile
- Sequential consistency is retained within a tile, as operations arrive at each bank in the order they entered the network (crossbar) [see Zhang PDCN 05]

10 Loki's memory system
- Each bank can service a miss and offers hit-under-miss support
- Synchronization/atomics
  - Load-and-OP (AND, OR, XOR, ADD), Exchange
  - LL/SC
- Can access command set at memory banks (sendconfig instruction)
  - Send cache line to another bank
  - Flush, invalidate or prefetch cache lines
  - Bypass L1/L2
  - Memset cache line
- Same mechanism can be used to form packets on core-to-core network
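The synchronisation primitives on this slide share one idea: the operation is applied at the memory location and the previous value comes back to the core. As a rough analogy only, the sketch below expresses those semantics in portable C11 using `<stdatomic.h>`. It is not Loki code: the function names are hypothetical, and the compare-and-swap retry loop merely stands in for an LL/SC sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Load-and-ADD: adds operand in place and returns the value held before. */
static uint32_t load_and_add(_Atomic uint32_t *word, uint32_t operand) {
    assert(word != NULL);
    return atomic_fetch_add(word, operand);
}

/* Exchange: stores a new value and returns the old one. */
static uint32_t exchange_word(_Atomic uint32_t *word, uint32_t value) {
    assert(word != NULL);
    return atomic_exchange(word, value);
}

/* LL/SC-style retry loop, modelled here with compare-and-swap (an
 * assumption for illustration). Increments the word unless it has
 * already reached limit; returns the value observed before the update. */
static uint32_t bounded_increment(_Atomic uint32_t *word, uint32_t limit) {
    assert(word != NULL);
    uint32_t seen = atomic_load(word);
    while (seen < limit &&
           !atomic_compare_exchange_weak(word, &seen, seen + 1)) {
        /* seen was refreshed by the failed exchange; try again */
    }
    return seen;
}
```

The AND, OR and XOR variants follow the same shape with `atomic_fetch_and`, `atomic_fetch_or` and `atomic_fetch_xor`.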

13 Area (approx.)
- Cores: ~50%
- SRAM: 40-45%
- Routers: 4-6%
- Other: ~2-3%

14 Loki pipeline
- Small custom ISA
  - Incl. support for predicated execution
- 6 register-mapped network FIFOs (blocking reads)
- Decoupled loads
- Every instruction can send its result on network
  - Can send instructions too!
- Channel map table
  - Read in decode stage
  - 16-entry table that maps channel names to network addresses

15 Example

    uint32_t updatecrc32(uint8_t ch, uint32_t crc) {
        return table[(crc ^ ch) & 0xff] ^ (crc >> 8);
    }

    setchmapi 1, r15
    [...]
    fetch r10
    xor r11, r14, r13
    lli r12, %lo(table)
    lui r12, %hi(table)
    andi r11, r11, 255
    slli r11, r11, 2
    addu r11, r12, r11
    ldw 0(r11) -> 1
    srli r12, r14, 8
    xor.eop r11, r2, r12
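The slide's `updatecrc32` depends on a lookup table that is not shown. The sketch below fills in a plausible surrounding program so the function can be tried on a host machine. It assumes the common reflected CRC-32 polynomial 0xEDB88320 with the usual initial value and final XOR; the slide does not say which variant the Loki example used, and `init_crc32_table` and `crc32_buffer` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t table[256];

/* Build the byte-wise lookup table. The polynomial is an assumption:
 * 0xEDB88320 is the widely used reflected CRC-32 polynomial. */
static void init_crc32_table(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
}

/* Per-byte update step, as given on the slide. */
static uint32_t updatecrc32(uint8_t ch, uint32_t crc) {
    return table[(crc ^ ch) & 0xff] ^ (crc >> 8);
}

/* Whole-buffer CRC with the conventional 0xFFFFFFFF init and final XOR. */
static uint32_t crc32_buffer(const uint8_t *data, size_t len) {
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = updatecrc32(data[i], crc);
    return crc ^ 0xFFFFFFFFu;
}
```

Call `init_crc32_table()` once before use. Under these assumptions, the ASCII string "123456789" yields the standard check value 0xCBF43926.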

16 L0 I$ / scratchpad
- Fetch stage contains small (64-instruction), fully associative I$
  - Can skip tag checks with in-buffer jmp
  - Instructions just executed in FIFO order until end of packet (don't have an actual PC)
- Execute stage contains small (256-word) local scratchpad

17 Execution patterns (within a tile)
- MIMD
- DLP (SIMD)
- DLP with helper core (scalarization)
  - One core is dedicated to provide common data over multicast bus
  - Enables work done by remaining data-parallel cores to be reduced
- Worker farm
- Task-level pipelines
- Dataflow (single persistent instruction packet per core)
  - Can support a single instruction per core
[See UCAM-CL-TR-846 for full details]

18 Example: JPEG colour conversion [Bates13]

19 DOACROSS loops [Campanoni et al. ISCA 2014]
- Substantial speedup available from exploiting DOACROSS parallelism
- 16 in-order cores ("Atom-like")
- Much improved performance with low-latency communication mechanism ("ring cache", RC) for signals and values

20 Example: ADPCM (encoder)
- We can exploit some DOACROSS parallelism in the case of ADPCM
  - Achieves 2X using 3 cores
- Can do slightly better by simply splitting loop body across two cores
  - Body then fits in core's L0 I$
  - ~2.5X on 2 cores
- Plan to explore simplified HELIX implementation for our Loki LLVM port
  - Fast signals and shared L1 should make Loki a good target

21 ILP Splitting
- Another approach to grouping/fusing cores
- LLVM pass to automatically split a program across N cores in a tile using available ILP; values are communicated over the tile's local core-to-core network
- Early results:
  - Stencil2D (MachSuite): 1.78X (3 cores)
  - Gemm/Blocked (MachSuite): 1.75X (3 cores)
  - Matrix Multiply (2 cores)
    - Initial attempt: 0.72X
    - With use of restrict: 1.41X
    - Exploit commutativity: 1.86X
- Currently exploring optimisations that consider the order in which basic blocks are visited, and some microarchitectural enhancements (input FIFO issues)
[Alex Bradbury]
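The matrix multiply results above turn on the C `restrict` qualifier. The kernel used in the experiment is not shown, so the following is only a minimal sketch of the kind of change involved; the function name and row-major layout are assumptions. Without `restrict`, a compiler must assume a store to `c` might alias `a` or `b`, which forces conservative ordering of loads and stores and leaves less independent work to split across cores.

```c
#include <assert.h>
#include <stddef.h>

/* c = a * b for n x n row-major matrices.
 * restrict promises that the three arrays do not overlap, so loads from
 * a and b are known to be independent of stores to c. */
static void matmul(size_t n,
                   const float *restrict a,
                   const float *restrict b,
                   float *restrict c) {
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The caller is responsible for honouring the promise: passing overlapping arrays to a `restrict`-qualified function is undefined behaviour.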

22 AES case study
- AES-128-CTR mode
- 2 days' work for a recent graduate
- Want to avoid running same code on each core:
  - Would have poor L0 I$ performance
  - Cores would produce less regular memory accesses
- Instead, the AES code is mapped as a task pipeline
- Loki results
  - 5.1 cycles/byte on one tile
  - 2.5 cycles/byte on two tiles
  - 11.5 Gbps at 450MHz for 128 cores
- Comparison: ARM + NEON
  - Bitsliced implementation
  - Lower bound is 13 cycles/byte
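The 128-core figure can be cross-checked from the per-pipeline numbers. The sketch below assumes, as an inference rather than something the slide states, that the chip runs independent copies of the AES pipeline side by side and that cycles/byte is measured per pipeline. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Aggregate throughput in Gbit/s when total_cores are divided into
 * independent pipelines of cores_per_pipeline cores, each consuming one
 * byte every cycles_per_byte cycles at clock_hz. */
static double aes_throughput_gbps(double clock_hz, double cycles_per_byte,
                                  unsigned total_cores,
                                  unsigned cores_per_pipeline) {
    assert(cycles_per_byte > 0.0 && cores_per_pipeline > 0);
    unsigned pipelines = total_cores / cores_per_pipeline;
    double bytes_per_second = clock_hz / cycles_per_byte * pipelines;
    return bytes_per_second * 8.0 / 1e9;
}
```

With two-tile (16-core) pipelines, `aes_throughput_gbps(450e6, 2.5, 128, 16)` gives 8 pipelines x 180 MB/s x 8 bits = 11.52 Gbit/s, which agrees with the quoted 11.5 figure. Single-tile pipelines at 5.1 cycles/byte come out slightly lower, at roughly 11.3 Gbit/s.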

23 AES example: single tile (8-core) mapping
- Cores 1-5 address banks 2-5 using 4 separate channels to save 1 or 2 address manipulation instructions in the loop body

24 Current status
- Loki LLVM compiler implementation
- ISS + complete SystemC model
- SystemVerilog implementation is complete (<30K LOC)
  - Generates 128-core ASIC version and 32-core FPGA implementation
- Test infrastructure, including random program generator
- Promising single-tile and multi-tile results
- Will tape out very soon!
  - 4mm x 4mm die, TSMC 40nm (128 cores, 1MB on-chip cache)
  - Off-chip I/O to FPGA "Northbridge"
  - 4 x 13-bit length-matched, full-duplex, source-synchronous DDR channels

25 Development boards
- Dev. boards will be available next year
- Package (352-ball BGA) and board from Michael Taylor's group at UCSD
- See Community
- Aim to distribute boards to research groups or provide remote access
- Support research in compilers, mapping, applications etc.

26 Subject: Redo BBC Micro (2008)

27 Subject: Development of an open-source SoC (2014)
- Create an open-source SoC capable of running Linux well
- Make it real to encourage contributions and grow community
  - Volume silicon manufacture
  - Ability to purchase in small quantities
  - Low-cost development board
  - Regular updates to SoC
  - Events, training and documentation
- lowRISC C.I.C. (not-for-profit company)

28 Why create an open source SoC?
- Research and teaching
- Serve the open-source community
- Demand from industry
  - Remove constraints on use of processor IP
  - Use lots of cores freely to provide flexible implementation
  - Lower costs: create proven base for derivatives
- Why now?

29 Approach to design
- Aim for simplicity: no backwards compatibility issues, no baggage, clean-sheet design
- Think about security from the start
- Free from commercial influences and release cycles
- Cores are free and customisable (one ISA)
- Aim to maximise functionality and flexibility (no trade-offs to create product range)

30 RISC-V
- RISC-V ISA from UC Berkeley
  - Aim to create open ISA standard for industry
  - Explicitly designed to be extensible
  - Simple base integer ISA (~40 instructions)
  - 32-bit, 64-bit, 128-bit (!) variants
- Rocket SoC: cores, L1, L2 cache, interconnect
  - Silicon proven (45nm and 28nm)
- Chisel (open-source HW construction language)

31 lowRISC SoC

32 Current status

33 General-purpose tagged memory
- Prevent control-flow hijacking attacks
- Accelerate debug tools, use-after-free detection
- Per-word locks, full/empty bits for synchronization
- Control-flow integrity
- Assist garbage collection
- Dynamic information flow tracking (DIFT)
- Capabilities
- Transactional memory
- Provenance tracking

34 General-purpose tagged memory
- LLVM pass has been implemented to tag sensitive pointers, i.e. code pointers, virtual function table pointers, function pointers, ...
- Every load of a sensitive pointer is replaced with a load that expects a particular tag to be read; if this is not the case, an exception is raised
- Prevents classic buffer overflow attacks and return-oriented programming
- Some other related attacks may remain if code has the right/wrong! bugs
- Overheads and future work
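The checked-load idea can be modelled in a few lines of software. The sketch below is a toy model, not the lowRISC implementation: the tag width, the tag value `TAG_CODE_PTR`, the rule that an ordinary store clears the tag, and all function names are assumptions made for illustration. A `false` return stands in for the hardware exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS    16u
#define TAG_NONE     0u
#define TAG_CODE_PTR 1u  /* hypothetical tag for sensitive code pointers */

typedef struct {
    uint64_t value;
    uint8_t  tag;
} tagged_word;

static tagged_word memory[MEM_WORDS];

/* Ordinary store: writes data and (by assumption) clears the tag, as an
 * overflowing buffer write would. */
static void store_plain(size_t addr, uint64_t value) {
    assert(addr < MEM_WORDS);
    memory[addr] = (tagged_word){ value, TAG_NONE };
}

/* Tagged store: what the compiler pass would emit for a sensitive pointer. */
static void store_tagged(size_t addr, uint64_t value, uint8_t tag) {
    assert(addr < MEM_WORDS);
    memory[addr] = (tagged_word){ value, tag };
}

/* Checked load: succeeds only if the stored tag matches the expected one.
 * On mismatch *out is left untouched and false is returned. */
static bool load_checked(size_t addr, uint8_t expected, uint64_t *out) {
    assert(addr < MEM_WORDS && out != NULL);
    if (memory[addr].tag != expected)
        return false;
    *out = memory[addr].value;
    return true;
}
```

In this model, a return address written with `store_tagged` loads normally, but once an attacker overwrites it through `store_plain` the tag is lost and `load_checked` refuses to hand the forged value to the caller.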

35 Minion cores
- Will initially support DMA and programmable I/O
  - Use minions to generate I/O signals, pre-process I/O data etc.
- Would like to also use minions to support tagged memory
  - Particular tags trigger message to minion from application processor
  - Minion executes security policy in parallel with app. processor
- Plan to investigate implementing more of the SoC using minion cores + appropriate shims
  - E.g. memory controller
- Will use the PULPino core from Luca Benini's group at ETHZ

36 Open source HW
- Smaller community, higher barrier to entry
- Fabricating chips is expensive
- Verification effort is significant
- Patching typically can't be done in the same way
- Of course, all good reasons to produce an open, known-good SoC design and to promote a community effort

37 Roadmap
- Create untethered version of SoC with tagged memory
- Complete core SoC implementation (no GPU initially)
- First test chips (40 or 28nm)
- 2 to 4 cores, most probably dual-issue
- Integrate 3rd-party IP, e.g. mem controller, USB, Ethernet
- Support early adopters in creating derivative designs
- Third Party Design Starts 2017
- Volume fab. run for community dev. board
- Strengthen lowRISC IP offerings

38 Research in the open
- Have lots of ideas, collaborate and share from day one
- Open development helps to attract best people, even if they contribute remotely (huge amount of good will and enthusiasm for these projects if people know what you are trying to do!)
- Make it easy for people to get involved, reproduce, extend and improve (this requires significant effort)
- Work with industry
- Provide vehicle to evaluate/implement other research ideas

39 Find out more and get involved
- ORCONF 2015
  - October 9-11th, 2015
  - Ideasquare, Geneva
- ORCONF began as an annual event for OpenRISC developers. Now run as a Free and Open Source Silicon (FOSSi) event.
- lowRISC workshop on Friday
- Talks on RISC-V

40 Final thoughts
- Exploring two different approaches to achieving energy efficiency through specialisation:
  - Loki: flexible processor array
  - lowRISC: an open source SoC
- Opportunities to collaborate with others on both projects
- More information about lowRISC at
- See also phab.lowrisc.org
- Sign up to announcement and discussion lists

41 Acknowledgements
- Both lowRISC and Loki are team efforts
- Loki team currently includes Daniel Bates, Alex Bradbury and Alex Chadwick (recent work on DNNs by Chihang Wang and Sam Tarver; earlier work on configurable L1 memory system by Andreas Koltes)
- lowRISC team currently includes Wei Song, Alex Bradbury and numerous external contributors. Contributions on tagged memory and minion core I/O shims by Hongyan Xia and Martin Papadopoulos. Recent work on tagged memory architecture and LLVM support by Lucas Sonnabend and Matthew Toseland
- Loki is funded by an ERC starter grant (GA n )
- This work was previously supported by UK EPSRC grant EP/G033110/1
- lowRISC is kindly supported by a private donation and a donation from Google
Thank you for listening!


Exploration of Cache Coherent CPU- FPGA Heterogeneous System Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information
