The ARM10 Family of Advanced Microprocessor Cores

Similar documents
ARM Processors for Embedded Applications

Contents of this presentation: Some words about the ARM company

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan

ARM Processors ARM ISA. ARM 1 in 1985 By 2001, more than 1 billion ARM processors shipped Widely used in many successful 32-bit embedded systems

PowerPC 740 and 750

The Nios II Family of Configurable Soft-core Processors

ARM ARCHITECTURE. Contents at a glance:

Chapter 15 ARM Architecture, Programming and Development Tools

Intel XScale Microarchitecture

SH-Mobile3: Application Processor for 3G Cellular Phones on a Low-Power SoC Design Platform

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

Computer and Digital System Architecture

Chapter 4. Enhancing ARM7 architecture by embedding RTOS

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

Digital Semiconductor. StrongARMARM

CS146 Computer Architecture. Fall Midterm Exam

Chapter 5. Introduction ARM Cortex series

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

SH4 RISC Microprocessor for Multimedia

Jim Keller. Digital Equipment Corp. Hudson MA

15CS44: MICROPROCESSORS AND MICROCONTROLLERS. QUESTION BANK with SOLUTIONS MODULE-4

Next Generation Multi-Purpose Microprocessor

ARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture.

Cortex-R5 Software Development

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

LEON4: Fourth Generation of the LEON Processor

Digital Semiconductor Alpha Microprocessor Product Brief

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7

Age nda. Intel PXA27x Processor Family: An Applications Processor for Phone and PDA applications

A 1-GHz Configurable Processor Core MeP-h1

AMULET3i an Asynchronous System-on-Chip

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43

Copyright 2016 Xilinx

PowerPC TM 970: First in a new family of 64-bit high performance PowerPC processors

RM3 - Cortex-M4 / Cortex-M4F implementation

ARM Processor Architecture (II)

Multithreaded Processors. Department of Electrical Engineering Stanford University

Hercules ARM Cortex -R4 System Architecture. Processor Overview

COMPUTER ORGANIZATION AND DESI

RM4 - Cortex-M7 implementation

EEM870 Embedded System and Experiment Lecture 3: ARM Processor Architecture

Introducing the Superscalar Version 5 ColdFire Core

1. PowerPC 970MP Overview

Modular ARM System Design

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

RA3 - Cortex-A15 implementation

Cortex A8 Processor. Richard Grisenthwaite ARM Ltd

NXP Unveils Its First ARM Cortex -M4 Based Controller Family

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

32- bit Microprocessor-Intel 80386

Please state clearly any assumptions you make in solving the following problems.

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

ARM Processor Architecture

Course Introduction. Purpose: Objectives: Content: Learning Time:

ECE 471 Embedded Systems Lecture 2

ARM920T. Technical Reference Manual. (Rev 1) Copyright 2000, 2001 ARM Limited. All rights reserved. ARM DDI 0151C

Pipelining and Vector Processing

ARM-Based Embedded Processor Device Overview

VLSI Signal Processing

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

General Purpose Signal Processors

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

Topics in computer architecture

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Excalibur ARM-Based. Embedded Processors PLDs. Hardware Reference Manual January 2001 Version 1.0

CPU Project in Western Digital: From Embedded Cores for Flash Controllers to Vision of Datacenter Processors with Open Interfaces

The ARM Architecture. Outline. History. Introduction. Seng Lin Shee 20 th May 2004

LECTURE 3: THE PROCESSOR

Hardware and Software Architecture. Chapter 2

LECTURE 10. Pipelining: Advanced ILP

Sam Naffziger. Gary Hammond. Next Generation Itanium Processor Overview. Lead Circuit Architect Microprocessor Technology Lab HP Corporation

Digital Signal Processor Core Technology

Camellia Getting Started with ARM922T

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

COMPUTER ORGANIZATION AND DESIGN

COMP 4211 Seminar Presentation

Module 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1

SA-1500: A 300 MHz RISC CPU with Attached Media Processor*

The new Intel Xscale Microarchitecture

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

LOW-COST SIMD. Considerations For Selecting a DSP Processor Why Buy The ADSP-21161?

A hardware operating system kernel for multi-processor systems

Niagara-2: A Highly Threaded Server-on-a-Chip. Greg Grohoski Distinguished Engineer Sun Microsystems

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

ARMv8-A Software Development

The ARM Architecture

1. (10) True or False: (1) It is possible to have a WAW hazard in a 5-stage MIPS pipeline.

Processor (IV) - advanced ILP. Hwansoo Han

Fatima Michael College of Engineering & Technology

MICROPROCESSORS AND MICROCONTROLLERS 15CS44 MODULE 4 ARM EMBEDDED SYSTEMS & ARM PROCESSOR FUNDAMENTALS ARM EMBEDDED SYSTEMS

1.1 MPC860T PowerQUICC Key Features

The Processor: Instruction-Level Parallelism

ECE331: Hardware Organization and Design

64-Bit Data Bus. Instruction Address IF. Instruction Data ID2 AC1 AC2 EX WB. X Data Byte Instruction Line Cache. Y Pipe.

ARM processor organization

Embedded Systems Ch 15 ARM Organization and Implementation

Design and Implementation of a FPGA-based Pipelined Microcontroller

Structure of Computer Systems

Transcription:

The ARM10 Family of Advanced Microprocessor Cores Stephen Hill ARM Austin Design Center 1

Agenda Design overview Microarchitecture ARM10 o o Memory System Interrupt response 3. Power o o 4. VFP10 ETM10 Summary Dynamic power Power down modes 2

ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um LV MIPS/MHz: 1.25 500 MIPS @ 400MHz Dhrystone 2.1 Active power consumption: 0.51mA/MIPS Room Temp / Typical / 1.1V Average when running Dhrystone 2.1 Area ARM1022E (2x16KB): 6.9mm 2 ARM1020E (2x32KB): 10.3mm 2 3

ARM10200 System Overview Bridge SDRAM CTL ARM10200 ETM10 I-Side AHB D-Side AHB BIU VFP10 BIU WriteBuffer I-Side MMU ARM10E D-Side MMU 16 or 32k Data 16 or 32k Cache Cache 16 or 32k Data Cache ARM1020E 64-bit data 4

ARM10E Microarchitecture 64-bit instruction and data interfaces Static branch prediction with branch folding Parallel load/store pipeline Dedicated machine for LDM/STM execution; all but the first cycle of these instructions are hidden if no dependencies are encountered Parallel execution of multi-cycle coprocessor operations Multiply 16 bits per cycle 1-3 cycle throughput and 2-4 cycle latency No data-dependent MUL cycle counts 5

ARM7 Pipeline versus ARM10 ARM7TDMI Fetch Thumb ARM decompress ARM decode Reg Select Register Read Shift ALU Reg Write ARM10 FETCH DECODE EXECUTE Branch Predictor Address Generator Fetch ARM or Thumb Decode Coprocessor Issue Register Read + Result Forward + Scoreboard Data + Branch Address Generator Shift + ALU Multiply Data Cache Interface Coprocessor Data Interface Multiply Add Reg Write FETCH ISSUE DECODE EXECUTE MEMORY WRITE 6

ARM9 Pipeline versus ARM10 ARM9TDMI Fetch Reg Decode Reg Read Shift + ALU Memory Access Reg Write FETCH DECODE EXECUTE MEMORY WRITE ARM10 ARM or Thumb Inst Decode Branch Predictor Address Generator Fetch ARM or Thumb Decode Coprocessor Issue Register Read + Result Forward + Scoreboard Data + Branch Address Generator Shift + ALU Multiply Data Cache Interface Coprocessor Data Interface Multiply Add Reg Write FETCH ISSUE DECODE EXECUTE MEMORY WRITE 7

ARM1020E Memory System & Data Caches 32Kbyte and Data caches Virtually addressed, 64-way set-associative, 32-byte lines, 64-bit R/W Configurable for Write Through or Write Back operation Lockable by line (1/64 of the cache) MMUs Two fully associative (I and D) 64-entry TLBs Lockable by entry Support for software loadable TLBs Write Buffer Eight 64-bit entries, plus 32-byte cache line castout buffer AHB Bus Interface 64-bit wide data transfers, split transactions Multi-layer AHB support (separate I and D-side system interfaces) 8

ARM1020E Memory System Performance features: Critical word first Non-blocking data cache Hit-under-miss (H-U-M) Data cache streaming (forwarding) from linefills Data cache store merging into linefills 9

ARM1020E Interrupts Interrupts taken in Execute stage Fast interrupt mode: First load miss stops further memory ops but not other instructions Limit write buffer depth Recommended measures for fast interrupt response: Lock handler code into Caches and TLB Set data cache to write-though (no cast outs) Limit LDM length to 9 registers (spans only 2 cache lines) 10

ARM1020E Interrupts Worst case Interrupt response time to enter handler (G:H Clock 1:1) Worst case of outstanding memory operations (LDM just started) 3 table walks needed (unless TLB locked down) Write buffer full, bus not granted by default CYCLES (approx) 171 Fast Interrupt mode Max Regs in LDM 16 Write through D cache TLB locked down 148 16 129 9 63 16 48 9 11

ARM1020E Dynamic Power ARM1020E Dhrystone 2.1 Data Cache 24% MMU 5% Cache 22% Data MMU 5% BIU 5% Clocks 6% Buffers 1% ARM10E Core 32% (PowerMill simulation) 12

ARM10E Dynamic Power ARM10E Core Dhrystone 2.1 Prefetch 15% Decode and Sequencer 33% LSU 10% Multiplier 2% Other 12% Clock 8% Execute 7% RegBank 5% ETM Interface 3% Forwarding 3% CP Interface 2% (PowerMill simulation) 13

Power Down Resume CPU executing (Fast/Normal/Slow) Active Power RUN >10 0 cycles CPU clock stopped. Wake on interrupt or debug event. STANDBY >10 1 cycles CPU state lost. Cache state preserved. DORMANT >10 2 cycles CPU & Cache state lost. All core power removed. SHUTDOWN Cycle Delay to Resume >10 4 cycles 14

Power Down RAM ROM Power Management Controller PM Layer PM Layer Interrupt Controller ARM High Performance Bus PM Layer DSP Core Debug Hardware PM Layer PM Layer PM Layer Clocks CORE and SCC Clocks CORE and SCC PLL PM Layer PM Layer Cache Cache 15

VFP10 Full IEEE 754 compliant (with SW support) Performance: 236 MFLOPS Linpack (SAxPY) @ 400MHz 400M FIR Taps (800 Peak MFLOPS) @ 400MHz Functions supported in hardware Multiply, add, multiply-add, subtract, multiply-subtract, negate, negate multiply, negate multiply-add, negate multiply-subtract, absolute value, compare, convert, divide and square root, conversions Most IEEE 754 exceptions handled in hardware RunFast mode No trapping enabled (Denormals flush to +0) NaN fractions not propagated (not typical) 16

VFP10 7 Stage pipeline Fetch - Issue - Decode - Execute (E 1) - E 2 - E 3 - E 4/WB 32 Single precision / 16 Double precision registers High performance short vector operations Register banks operate as hardware circular queues and can be addressed as short vectors (up to 8 values) Separate divide/square root unit Supports load/store, and arithmetic operation in parallel with divide/square root operation Separate load/store unit Load/store operations may be done in parallel with data processing operations 64-bit unidirectional data interfaces Area: ~1.6mm 2 in 0.13um 17

ETM10 Embedded Trace Macrocell ARM1020E EmbeddedICE Logic TAP Data Control Address ETM10 Multi-ICE JTAG Port SOC Trace Port Trace Port Analyzer 18

ETM10 Full real-time instruction and data tracing Monitors the core s internal buses Zero performance overhead Supports high frequency trace with demux-port Configurable synthesis for optimum: area features pin count Programmed non-intrusively through JTAG 19

ARM1020E Family Summary ARM1020E: 500 DMIPS @400 MHz 0.51 ma/mips 10.3mm 2 / 6.9mm 2 VFP10 VFP10: 236 MFLOPS @400MHz IEEE 754 Compatible IMMU Integer Unit DMMU ETM10: Full speed, real time embedded trace I-Cache D-Cache (ARM10200r1) 20