Introduction to the Tegra SoC Family and the ARM Architecture. Kristoffer Robin Stokke, PhD FLIR UAS

Size: px
Start display at page:

Download "Introduction to the Tegra SoC Family and the ARM Architecture. Kristoffer Robin Stokke, PhD FLIR UAS"

Transcription

1 Introduction to the Tegra SoC Family and the ARM Architecture Kristoffer Robin Stokke, PhD FLIR UAS

2 Goals of Lecture To give you something concrete to start on Simple introduction to ARMv8 NEON programming environment Register environment, instruction syntax «Families» of instructions Important for debugging, writing code and general understanding Programming examples Intrinsics Inline assembly Performance analysis using gprof Introduction to GDB debugging

3 Keep This Under Your Pillow GNU compiler intrinsics list: o ARM Infocenter o infocenter.arm.com -> developer guides (..) -> software development -> Cortex A series Programmer s Guide for arm8 This may also be useful Last but not least GDB You will need it

4 5 Tegra Family of SoCs Tegra family of mobile Systems-on-Chip (SoC) Family of mobile SoCs Quite sophisticated Many knobs for tuning, experimentation and research! Wide operational range: runs at 2 W -> 20 W Frequency (clock) scaling Heterogeneous processors ( CPU / GPU ) CPU Tegra K1 Tegra X1 Tegra X2 Tegra Xavier High Performance 4 x ARM Cortex-A15 4 x ARM Cortex-A57 Dual-core Denver 8 core Carmel CPU Low Power 1 x ARM Cortex-A15 4 x ARM Cortex-A53 4 x ARM Cortex-A57 GPU 192-Core Kepler 256-Core Maxwell 256-Core Pascal 512-Core Volta Memory 2 GB (Jetson-TK1) 4 GB (Jetson-TX1) 8 GB LPDDR4? GB LPDDR4 9/10/2018 armv8

5 Tegra-Powered Servers System-on-Module (SoMs) mounted in a 1U server. Very dynamic power usage depending on load: Jetson TX2 / TX1 blade server with 24 SoMs. Based on my own guess.. I would say anything between 80 W to 500 W, depending on demand for processing power

6 Frequency Scaling Moden chips can changing clock frequency Power <-> Performance Tradeoff Processor / memroy utilisation typically varies throughout time Power and powerformance over processor and memory frequencies. The ondemand governor [1] Idea is to reduce CPU frequency while idle or under busy I/O computation Other governors exist for RAM, GPU, buses etc CPU utilisation over time for the Sort application. [2] [1] Pallipadi, Venkatesh, and Alexey Starikovskiy. "The ondemand governor." Proceedings of the Linux Symposium. Vol. 2. No sn, [2] Ibrahim, Shadi, et al. "Governing energy consumption in Hadoop through CPU frequency scaling: An analysis." Future Generation Computer Systems 54 (2016):

7 CPU interrupt controller Coresight (debug hardware support) IRAM and separate ARM7TDMI core Global interrupt control, Clock controllers, timers etc CPU Complex Bus / Interconnect Mainly AXI (highperformance bus), bridges 256-bit CPU interface to RAM I2C / SPI I/O Hardware filters VIC Post-process video Rotation, scaling, comp. VI (Video Input) Raw (YUV, BAYER) processing 2x ISPs (hidden) NVENC/NVDEC Video encode Video decode NVJPG RAM Tegra X1 Block Diagram SATA, USB, MIPI, HDMI etc. I/O

8 Tegra X1 CPU and Memory Hierarchy Each core with various compute capability Integer, floating point NEON (SIMD) Four Cortex-A57 cores, each.. 48 kb instruction L1 cache 32 kb data L1 cache One shared, 2 MB L2 cache Instruction + data Least-Recently Used (LRU) eviction policy 64 byte cache lines 32-bit interface to RAM.. Tegra K1 CPU and memory overview, which is similar to that of the TX

9 Cache Hierarchies and Performance Let s do an experiment! Reading or writing 800 MB Vary the size to read back-to-back E.g. read 24 kb repeatedly from same buffer, until 800 MB have been read CPU Core 33k 24 kb L1 Cache 32 kb Read 800 MB Buffer size detemines location of data Below 32 kb, all reads are cached in L1 Below 2 MB, all reads are cached in L2 For 10 MB... Nothing gets cached 12k 64 kb L2 Cache 2 MB 1k 800 MB RAM 4 GB

10 Time [ms] Code Example and Profiling Compile with pg Run application:./main Run gprof./main gmon.out NB: Prefetch op Prfm <type><target><policy> reg label Type pld (for load) pst (for store) pli (for instruction) Target L1 or L2 (or L3) Policy keep (normal) stream (use once) prfm pldl1keep [x0] (address in x0)

11 ARMv8 Registers 31 x 64-bit general purpose registers X0 X8 x16 x24 32 x 128-bit vector registers V0 V8 V16 V24 SP WSP Stack pointer WZR XZR Zero registers PC

12 The Vector Registers V0-V31: Packing Data in V0-V31 are packed, and you control how they are packed Example: 16 bytes or 8 bytes Example: 8 half-words or 4 half-words Lanes

13 Example: Vector Packing

14 Instruction Syntax

15 Programming With Intrinsics More in a bit!

16 Programming Example: Intrinsics

17 Inline Assembly Mostly harder than using intrinsics However, gives more control (and better performance?) Not always straightforward to figure out what mnemonics to use Tips: disassemble intrinsics and look with objdump or gdb Operand constraints > «m» : memory address > «r» : general purpose register > «f» : floating point register > «i» : immediate ++ more Specify dirty registers and more

18 Programming Example: Inline Assembly

19 Table Lookup Not straightforward to use for any purpose Vector table lookup: vtbl v0, {v1, v2,..., vn}, vm V0: destination vector {v1, v2,..., vn}: source data vm: data selector v v1 v vm

20 Matrix Transpose tbl v0.4s, {v1.4s}, v2.4s a b a c a c b d v0.4s c d b d v2.4s stride Think like this: For each output row, select increasing column a stride b c d v1.4s

21 Code Profiling Compile with pg Run application:./main Run gprof./main gmon.out Time to Finish 100M computations for Matrix Multiply (MM) and Transpose Operations? Transpose, lazy Transpose, NEON assembly MM, NEON intrinsics Series 1 Column1 Column2 MM, NEONassembly

22 GDB Example

23 Tips Build functions to print out macroblocks from vector registers and memory Start small test out independent parts of the code that are easy to verify When in trouble, step through the code, display the relevant registers, verify with output you know is working Many things to investigate Single versus double precision? Different, possibly more ways to implement e.g. transpose? Re-using vector registers across different functional blocks?..but stick to what the assignment says

24 Good Luck! You re going to need it

25 ARMv7 vs. ARMv8 Armv8 uses the same mnemonics as for general purpose registers E.g., in ARMv7, «mul, r0, r0, r1» (normal) and «vmul d0, d0, d1» (SIMD) In ARMv8: «mul x0, x0, x1» (normal) and «mul v0, v0, v1» (SIMD) Simplifies life, but take care to use correct operands ARMv8 has twice as many 128-bit registers bit registers, vs bit registers for ARMv7 Different instruction syntax

26 27 Tegra K1: a Mobile Heterogeneous Multicore SoC Quad-core highperformance CPU ARM Cortex A15 Frequency scaling Dedicated lowpower core ARM Cortex A15 Frequency scaling 192-core, Keplerbased GPU CUDA programmable Frequency scaling SIMD (NEON) multimedia instructions 2 GB DDR3 RAM CPU-GPU shared Frequency scaling

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018 Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks

More information

Embedded Computing without Compromise. Evolution of the Rugged GPGPU Computer Session: SIL7127 Dan Mor PLM -Aitech Systems GTC Israel 2017

Embedded Computing without Compromise. Evolution of the Rugged GPGPU Computer Session: SIL7127 Dan Mor PLM -Aitech Systems GTC Israel 2017 Evolution of the Rugged GPGPU Computer Session: SIL7127 Dan Mor PLM - Systems GTC Israel 2017 Agenda Current GPGPU systems NVIDIA Jetson TX1 and TX2 evaluation Conclusions New Products 2 GPGPU Product

More information

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1

More information

THE LEADER IN VISUAL COMPUTING

THE LEADER IN VISUAL COMPUTING MOBILE EMBEDDED THE LEADER IN VISUAL COMPUTING 2 TAKING OUR VISION TO REALITY HPC DESIGN and VISUALIZATION AUTO GAMING 3 BEST DEVELOPER EXPERIENCE Tools for Fast Development Debug and Performance Tuning

More information

Each Milliwatt Matters

Each Milliwatt Matters Each Milliwatt Matters Ultra High Efficiency Application Processors Govind Wathan Product Manager, CPG ARM Tech Symposia China 2015 November 2015 Ultra High Efficiency Processors Used in Diverse Markets

More information

ECE 598 Advanced Operating Systems Lecture 4

ECE 598 Advanced Operating Systems Lecture 4 ECE 598 Advanced Operating Systems Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Announcements HW#1 was due HW#2 was posted, will be tricky Let me know

More information

EEM870 Embedded System and Experiment Lecture 3: ARM Processor Architecture

EEM870 Embedded System and Experiment Lecture 3: ARM Processor Architecture EEM870 Embedded System and Experiment Lecture 3: ARM Processor Architecture Wen-Yen Lin, Ph.D. Department of Electrical Engineering Chang Gung University Email: wylin@mail.cgu.edu.tw March 2014 Agenda

More information

ECE 471 Embedded Systems Lecture 2

ECE 471 Embedded Systems Lecture 2 ECE 471 Embedded Systems Lecture 2 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 3 September 2015 Announcements HW#1 will be posted today, due next Thursday. I will send out

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

SoC Platforms and CPU Cores

SoC Platforms and CPU Cores SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

Chapter 5. Introduction ARM Cortex series

Chapter 5. Introduction ARM Cortex series Chapter 5 Introduction ARM Cortex series 5.1 ARM Cortex series variants 5.2 ARM Cortex A series 5.3 ARM Cortex R series 5.4 ARM Cortex M series 5.5 Comparison of Cortex M series with 8/16 bit MCUs 51 5.1

More information

Copyright 2016 Xilinx

Copyright 2016 Xilinx Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building

More information

F28HS Hardware-Software Interface: Systems Programming

F28HS Hardware-Software Interface: Systems Programming F28HS Hardware-Software Interface: Systems Programming Hans-Wolfgang Loidl School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh Semester 2 2017/18 0 No proprietary software has

More information

Designing with NXP i.mx8m SoC

Designing with NXP i.mx8m SoC Designing with NXP i.mx8m SoC Course Description Designing with NXP i.mx8m SoC is a 3 days deep dive training to the latest NXP application processor family. The first part of the course starts by overviewing

More information

Universität Dortmund. ARM Architecture

Universität Dortmund. ARM Architecture ARM Architecture The RISC Philosophy Original RISC design (e.g. MIPS) aims for high performance through o reduced number of instruction classes o large general-purpose register set o load-store architecture

More information

ECE 471 Embedded Systems Lecture 2

ECE 471 Embedded Systems Lecture 2 ECE 471 Embedded Systems Lecture 2 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 7 September 2018 Announcements Reminder: The class notes are posted to the website. HW#1 will

More information

ARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture.

ARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture. ARM CORTEX-R52 Course Family: ARMv8-R Cortex-R CPU Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture. Duration: 4 days Prerequisites and related

More information

ARMv8-A Software Development

ARMv8-A Software Development ARMv8-A Software Development Course Description ARMv8-A software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop software for

More information

KeyStone II. CorePac Overview

KeyStone II. CorePac Overview KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB

More information

ARM Cortex-A9 ARM v7-a. A programmer s perspective Part1

ARM Cortex-A9 ARM v7-a. A programmer s perspective Part1 ARM Cortex-A9 ARM v7-a A programmer s perspective Part1 ARM: Advanced RISC Machine First appeared in 1985 as Acorn RISC Machine from Acorn Computers in Manchester England Limited success outcompeted by

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 13

ECE 571 Advanced Microprocessor-Based Design Lecture 13 ECE 571 Advanced Microprocessor-Based Design Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements More on HW#6 When ask for reasons why cache

More information

Design Choices for FPGA-based SoCs When Adding a SATA Storage }

Design Choices for FPGA-based SoCs When Adding a SATA Storage } U4 U7 U7 Q D U5 Q D Design Choices for FPGA-based SoCs When Adding a SATA Storage } Lorenz Kolb & Endric Schubert, Missing Link Electronics Rudolf Usselmann, ASICS World Services Motivation for SATA Storage

More information

A176 Cyclone. GPGPU Fanless Small FF RediBuilt Supercomputer. IT and Instrumentation for industry. Aitech I/O

A176 Cyclone. GPGPU Fanless Small FF RediBuilt Supercomputer. IT and Instrumentation for industry. Aitech I/O The A176 Cyclone is the smallest and most powerful Rugged-GPGPU, ideally suited for distributed systems. Its 256 CUDA cores reach 1 TFLOPS, and it consumes less than 17W at full load (8-10W at typical

More information

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan Processors Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan chanhl@maili.cgu.edu.twcgu General-purpose p processor Control unit Controllerr Control/ status Datapath ALU

More information

Embedded Linux Conference San Diego 2016

Embedded Linux Conference San Diego 2016 Embedded Linux Conference San Diego 2016 Linux Power Management Optimization on the Nvidia Jetson Platform Merlin Friesen merlin@gg-research.com About You Target Audience - The presentation is introductory

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Cortex-R5 Software Development

Cortex-R5 Software Development Cortex-R5 Software Development Course Description Cortex-R5 software development is a three days ARM official course. The course goes into great depth, and provides all necessary know-how to develop software

More information

Multiple Choice Type Questions

Multiple Choice Type Questions Techno India Batanagar Computer Science and Engineering Model Questions Subject Name: Computer Architecture Subject Code: CS 403 Multiple Choice Type Questions 1. SIMD represents an organization that.

More information

HOT CHIPS 2014 NVIDIA S DENVER PROCESSOR. Darrell Boggs, CPU Architecture Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman

HOT CHIPS 2014 NVIDIA S DENVER PROCESSOR. Darrell Boggs, CPU Architecture Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman HOT CHIPS 2014 NVIDIA S DENVER PROCESSOR Darrell Boggs, CPU Architecture Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman TEGRA K1 with Dual Denver CPUs The First 64-bit Android Kepler-Class

More information

AT-501 Cortex-A5 System On Module Product Brief

AT-501 Cortex-A5 System On Module Product Brief AT-501 Cortex-A5 System On Module Product Brief 1. Scope The following document provides a brief description of the AT-501 System on Module (SOM) its features and ordering options. For more details please

More information

I/O Devices. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)

I/O Devices. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) I/O Devices Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) Hardware Support for I/O CPU RAM Network Card Graphics Card Memory Bus General I/O Bus (e.g., PCI) Canonical Device OS reads/writes

More information

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,

More information

Basic Concepts COE 205. Computer Organization and Assembly Language Dr. Aiman El-Maleh

Basic Concepts COE 205. Computer Organization and Assembly Language Dr. Aiman El-Maleh Basic Concepts COE 205 Computer Organization and Assembly Language Dr. Aiman El-Maleh College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals [Adapted from slides of

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES. Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp.

INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES. Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp. INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp. Computer Vision in Mobile Tegra K1 It s time! AGENDA Use cases categories

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

Introduction to the ARM Architecture. or: a loose set of random facts blatantly copied from tech sheets and the Architecture Ref.

Introduction to the ARM Architecture. or: a loose set of random facts blatantly copied from tech sheets and the Architecture Ref. Introduction to the ARM Architecture or: a loose set of random facts blatantly copied from tech sheets and the Architecture Ref. Manual Glance into the past Initial ARM Processor developed by Acorn Computers,

More information

CS 101, Mock Computer Architecture

CS 101, Mock Computer Architecture CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically

More information

Porting BLIS to new architectures Early experiences

Porting BLIS to new architectures Early experiences 1st BLIS Retreat. Austin (Texas) Early experiences Universidad Complutense de Madrid (Spain) September 5, 2013 BLIS design principles BLIS = Programmability + Performance + Portability Share experiences

More information

Real-time image processing and object recognition for robotics applications. Adrian Stratulat

Real-time image processing and object recognition for robotics applications. Adrian Stratulat Real-time image processing and object recognition for robotics applications Adrian Stratulat What is computer vision? Computer vision is a field that includes methods for acquiring, processing, analyzing,

More information

Virtual Memory - Objectives

Virtual Memory - Objectives ECE232: Hardware Organization and Design Part 16: Virtual Memory Chapter 7 http://www.ecs.umass.edu/ece/ece232/ Adapted from Computer Organization and Design, Patterson & Hennessy Virtual Memory - Objectives

More information

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Manycore and GPU Channelisers. Seth Hall High Performance Computing Lab, AUT

Manycore and GPU Channelisers. Seth Hall High Performance Computing Lab, AUT Manycore and GPU Channelisers Seth Hall High Performance Computing Lab, AUT GPU Accelerated Computing GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate

More information

ARM Architecture and Assembly Programming Intro

ARM Architecture and Assembly Programming Intro ARM Architecture and Assembly Programming Intro Instructors: Dr. Phillip Jones http://class.ece.iastate.edu/cpre288 1 Announcements HW9: Due Sunday 11/5 (midnight) Lab 9: object detection lab Give TAs

More information

8/28/12. CSE 820 Graduate Computer Architecture. Richard Enbody. Dr. Enbody. 1 st Day 2

8/28/12. CSE 820 Graduate Computer Architecture. Richard Enbody. Dr. Enbody. 1 st Day 2 CSE 820 Graduate Computer Architecture Richard Enbody Dr. Enbody 1 st Day 2 1 Why Computer Architecture? Improve coding. Knowledge to make architectural choices. Ability to understand articles about architecture.

More information

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems Designing, developing, debugging ARM and heterogeneous multi-processor systems Kinjal Dave Senior Product Manager, ARM ARM Tech Symposia India December 7 th 2016 Topics Introduction System design Software

More information

A176 C clone. GPGPU Fanless Small FF RediBuilt Supercomputer. Aitech

A176 C clone. GPGPU Fanless Small FF RediBuilt Supercomputer.  Aitech The A176 Cyclone is the smallest and most powerful Rugged-GPGPU, ideally suited for distributed systems. Its 256 CUDA cores reach 1 TFLOPS at a remarkable level of energy efficiency, providing all the

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

Hugo Cunha. Senior Firmware Developer Globaltronics

Hugo Cunha. Senior Firmware Developer Globaltronics Hugo Cunha Senior Firmware Developer Globaltronics NB-IoT Product Acceleration Platforms 2018 Speaker Hugo Cunha Project Developper Agenda About us NB IoT Platforms The WIIPIIDO The Gateway FE 1 About

More information

Cortex-A9 MPCore Software Development

Cortex-A9 MPCore Software Development Cortex-A9 MPCore Software Development Course Description Cortex-A9 MPCore software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop

More information

EE 354 Fall 2015 Lecture 1 Architecture and Introduction

EE 354 Fall 2015 Lecture 1 Architecture and Introduction EE 354 Fall 2015 Lecture 1 Architecture and Introduction Note: Much of these notes are taken from the book: The definitive Guide to ARM Cortex M3 and Cortex M4 Processors by Joseph Yiu, third edition,

More information

Scaling the Peak: Maximizing floating point performance on the Epiphany NoC

Scaling the Peak: Maximizing floating point performance on the Epiphany NoC Scaling the Peak: Maximizing floating point performance on the Epiphany NoC Anish Varghese, Gaurav Mitra, Robert Edwards and Alistair Rendell Research School of Computer Science The Australian National

More information

The ARM10 Family of Advanced Microprocessor Cores

The ARM10 Family of Advanced Microprocessor Cores The ARM10 Family of Advanced Microprocessor Cores Stephen Hill ARM Austin Design Center 1 Agenda Design overview Microarchitecture ARM10 o o Memory System Interrupt response 3. Power o o 4. VFP10 ETM10

More information

MYC-C7Z010/20 CPU Module

MYC-C7Z010/20 CPU Module MYC-C7Z010/20 CPU Module - 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor with Xilinx 7-series FPGA logic - 1GB DDR3 SDRAM (2 x 512MB, 32-bit), 4GB emmc, 32MB QSPI Flash - On-board Gigabit

More information

F28HS2 Hardware-Software Interfaces. Lecture 6: ARM Assembly Language 1

F28HS2 Hardware-Software Interfaces. Lecture 6: ARM Assembly Language 1 F28HS2 Hardware-Software Interfaces Lecture 6: ARM Assembly Language 1 CISC & RISC CISC: complex instruction set computer original CPUs very simple poorly suited to evolving high level languages extended

More information

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Arizona State University Dhinakaran Pandiyan(dpandiya@asu.edu) and Carole-Jean Wu(carole-jean.wu@asu.edu

More information

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous

More information

Freescale i.mx6 Architecture

Freescale i.mx6 Architecture Freescale i.mx6 Architecture Course Description Freescale i.mx6 architecture is a 3 days Freescale official course. The course goes into great depth and provides all necessary know-how to develop software

More information

Kevin Meehan Stephen Moskal Computer Architecture Winter 2012 Dr. Shaaban

Kevin Meehan Stephen Moskal Computer Architecture Winter 2012 Dr. Shaaban Kevin Meehan Stephen Moskal Computer Architecture Winter 2012 Dr. Shaaban Contents Raspberry Pi Foundation Raspberry Pi overview & specs ARM11 overview ARM11 cache, pipeline, branch prediction ARM11 vs.

More information

The Memory Hierarchy 10/25/16

The Memory Hierarchy 10/25/16 The Memory Hierarchy 10/25/16 Transition First half of course: hardware focus How the hardware is constructed How the hardware works How to interact with hardware Second half: performance and software

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

Slides for Lecture 6

Slides for Lecture 6 Slides for Lecture 6 ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary 28 January,

More information

Elaborazione dati real-time su architetture embedded many-core e FPGA

Elaborazione dati real-time su architetture embedded many-core e FPGA Elaborazione dati real-time su architetture embedded many-core e FPGA DAVIDE ROSSI A L E S S A N D R O C A P O T O N D I G I U S E P P E T A G L I A V I N I A N D R E A M A R O N G I U C I R I - I C T

More information

LECTURE 12. Virtual Memory

LECTURE 12. Virtual Memory LECTURE 12 Virtual Memory VIRTUAL MEMORY Just as a cache can provide fast, easy access to recently-used code and data, main memory acts as a cache for magnetic disk. The mechanism by which this is accomplished

More information

ARMv8-A CPU Architecture Overview

ARMv8-A CPU Architecture Overview ARMv8-A CPU Architecture Overview Chris Shore Training Manager, ARM ARM Game Developer Day, London 03/12/2015 Chris Shore ARM Training Manager With ARM for 16 years Managing customer training for 15 years

More information

ARM Processor Architecture

ARM Processor Architecture Chapters 1 and 3 ARM Processor Architecture Embedded Systems with ARM Cortext-M Updated: Monday, February 5, 2018 A Little about ARM The company Originally Acorn RISC Machine (ARM) Later Advanced RISC

More information

S2C K7 Prodigy Logic Module Series

S2C K7 Prodigy Logic Module Series S2C K7 Prodigy Logic Module Series Low-Cost Fifth Generation Rapid FPGA-based Prototyping Hardware The S2C K7 Prodigy Logic Module is equipped with one Xilinx Kintex-7 XC7K410T or XC7K325T FPGA device

More information

ECE 471 Embedded Systems Lecture 3

ECE 471 Embedded Systems Lecture 3 ECE 471 Embedded Systems Lecture 3 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 10 September 2018 Announcements New classroom: Stevens 365 HW#1 was posted, due Friday Reminder:

More information

Tile Processor (TILEPro64)

Tile Processor (TILEPro64) Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth

More information

Introduction to Embedded System Design using Zynq

Introduction to Embedded System Design using Zynq Introduction to Embedded System Design using Zynq Zynq Vivado 2015.2 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

. SMARC 2.0 Compliant

. SMARC 2.0 Compliant MSC SM2S-IMX8 NXP i.mx8 ARM Cortex -A72/A53 Description The new MSC SM2S-IMX8 module offers a quantum leap in terms of computing and graphics performance. It integrates the currently most powerful i.mx8

More information

5. ARM 기반모니터프로그램사용. Embedded Processors. DE1-SoC 보드 (IntelFPGA) Application Processors. Development of the ARM Architecture.

5. ARM 기반모니터프로그램사용. Embedded Processors. DE1-SoC 보드 (IntelFPGA) Application Processors. Development of the ARM Architecture. Embedded Processors 5. ARM 기반모니터프로그램사용 DE1-SoC 보드 (IntelFPGA) 2 Application Processors Development of the ARM Architecture v4 v5 v6 v7 Halfword and signed halfword / byte support System mode Thumb instruction

More information

Introduction to OpenMP. Lecture 10: Caches

Introduction to OpenMP. Lecture 10: Caches Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for

More information

Hercules ARM Cortex -R4 System Architecture. Processor Overview

Hercules ARM Cortex -R4 System Architecture. Processor Overview Hercules ARM Cortex -R4 System Architecture Processor Overview What is Hercules? TI s 32-bit ARM Cortex -R4/R5 MCU family for Industrial, Automotive, and Transportation Safety Hardware Safety Features

More information

Small is the New Big: Data Analytics on the Edge

Small is the New Big: Data Analytics on the Edge Small is the New Big: Data Analytics on the Edge An overview of processors and algorithms for deep learning techniques on the edge Dr. Abhay Samant VP Engineering, Hiller Measurements Adjunct Faculty,

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

Intel X86 Assembler Instruction Set Opcode Table

Intel X86 Assembler Instruction Set Opcode Table Intel X86 Assembler Instruction Set Opcode Table x86 Instruction Set Reference. Derived from the September 2014 version of the Intel 64 and IA-32 LGDT, Load Global/Interrupt Descriptor Table Register.

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

RM3 - Cortex-M4 / Cortex-M4F implementation

RM3 - Cortex-M4 / Cortex-M4F implementation Formation Cortex-M4 / Cortex-M4F implementation: This course covers both Cortex-M4 and Cortex-M4F (with FPU) ARM core - Processeurs ARM: ARM Cores RM3 - Cortex-M4 / Cortex-M4F implementation This course

More information

ARM ARCHITECTURE. Contents at a glance:

ARM ARCHITECTURE. Contents at a glance: UNIT-III ARM ARCHITECTURE Contents at a glance: RISC Design Philosophy ARM Design Philosophy Registers Current Program Status Register(CPSR) Instruction Pipeline Interrupts and Vector Table Architecture

More information

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &

More information

SOM PRODUCTS BRIEF. S y s t e m o n M o d u l e. Engicam. SOMProducts ver

SOM PRODUCTS BRIEF. S y s t e m o n M o d u l e. Engicam. SOMProducts ver SOM S y s t e m o n M o d u l e PRODUCTS BRIEF GEA M6425IB ARM9 TM Low cost solution Reduced Time to Market Very small form factor Low cost multimedia solutions Industrial Automotive Consumer Single power

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

Computers and Microprocessors. Lecture 34 PHYS3360/AEP3630

Computers and Microprocessors. Lecture 34 PHYS3360/AEP3630 Computers and Microprocessors Lecture 34 PHYS3360/AEP3630 1 Contents Computer architecture / experiment control Microprocessor organization Basic computer components Memory modes for x86 series of microprocessors

More information

Negotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye

Negotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye Negotiating the Maze Getting the most out of memory systems today and tomorrow Robert Kaye 1 System on Chip Memory Systems Systems use external memory Large address space Low cost-per-bit Large interface

More information

TEGRA K1 AND THE AUTOMOTIVE INDUSTRY. Gernot Ziegler, Timo Stich

TEGRA K1 AND THE AUTOMOTIVE INDUSTRY. Gernot Ziegler, Timo Stich TEGRA K1 AND THE AUTOMOTIVE INDUSTRY Gernot Ziegler, Timo Stich Previously: Tegra in Automotive Infotainment / Navigation Digital Instrument Cluster Passenger Entertainment TEGRA K1 with Kepler GPU GPU:

More information

Zynq-7000 All Programmable SoC Product Overview

Zynq-7000 All Programmable SoC Product Overview Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform

More information

Multimedia SoC System Solutions

Multimedia SoC System Solutions Multimedia SoC System Solutions Presented By Yashu Gosain & Forrest Picket: System Software & SoC Solutions Marketing Girish Malipeddi: IP Subsystems Marketing Agenda Zynq Ultrascale+ MPSoC and Multimedia

More information

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

Deep Learning: Transforming Engineering and Science The MathWorks, Inc. Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE A THE NEW RISE ERA OF OF GPU COMPUTING 3 NVIDIA A IS NEW THE WORLD S ERA

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Computer Organization and Design, 5th Edition: The Hardware/Software Interface Computer Organization and Design, 5th Edition: The Hardware/Software Interface 1 Computer Abstractions and Technology 1.1 Introduction 1.2 Eight Great Ideas in Computer Architecture 1.3 Below Your Program

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 3

ECE 571 Advanced Microprocessor-Based Design Lecture 3 ECE 571 Advanced Microprocessor-Based Design Lecture 3 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 January 2018 Homework #1 was posted Announcements 1 Microprocessors Also

More information

x86 Architectures; Assembly Language Basics of Assembly language for the x86 and x86_64 architectures

x86 Architectures; Assembly Language Basics of Assembly language for the x86 and x86_64 architectures x86 Architectures; Assembly Language Basics of Assembly language for the x86 and x86_64 architectures topics Preliminary material a look at what Assembly Language works with - How processors work»a moment

More information

Operating Systems, Fall

Operating Systems, Fall Input / Output & Real-time Scheduling Chapter 5.1 5.4, Chapter 7.5 1 I/O Software Device controllers Memory-mapped mapped I/O DMA & interrupts briefly I/O Content I/O software layers and drivers Disks

More information

Last 2 Classes: Introduction to Operating Systems & C++ tutorial. Today: OS and Computer Architecture

Last 2 Classes: Introduction to Operating Systems & C++ tutorial. Today: OS and Computer Architecture Last 2 Classes: Introduction to Operating Systems & C++ tutorial User apps OS Virtual machine interface hardware physical machine interface An operating system is the interface between the user and the

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

A Framework for Memory Hierarchies

A Framework for Memory Hierarchies Associativity schemes Scheme Number of sets Blocks per set Direct mapped Number of blocks in cache 1 Set associative Blocks in cache / Associativity Associativity (2-8) Fully associative 1 Number Blocks

More information