EEM870 Embedded System and Experiment
Lecture 3: ARM Processor Architecture
Wen-Yen Lin, Ph.D.
Department of Electrical Engineering, Chang Gung University
Email: wylin@mail.cgu.edu.tw
March 2014
Agenda
- Introduction
- ARM Processor Overview
- ARM Architecture Version
- ARM Processor Pipeline Design
  - ARM7TDMI & ARM9TDMI
  - ARM10 vs. ARM11
  - Cortex-A8
- ARM Programmer's Model
- ARM Instruction Set (to be continued)
Embedded System and Experiment, 102/2, EE/CGU, W.Y. Lin
Introduction - ARM
- Advanced RISC Machines (ARM): the world's first commercial RISC processor, developed by the Acorn Computer Group in 1985 and later spun out as a separate company
- The ARM Instruction Set
  - Used as the example in chapters 2 and 3
  - Most popular 32-bit instruction set in the world (www.arm.com)
  - 4 billion shipped in 2008
  - Large share of embedded core market
  - Applications include mobile phones, consumer electronics, network/storage equipment, cameras, printers, ...
  - Typical of many modern RISC ISAs
  - See ARM Assembler instructions, their encoding and instruction cycle timings in appendixes B1, B2 and B3 (CD-ROM)
[Figure: chart by processor family (Other, SPARC, Hitachi SH, PowerPC, Motorola 68K, MIPS, IA-32, ARM) for 1998 to 2002; vertical axis from 0 to 1400]
ARM Ltd
- Founded in November 1990
  - Spun out of Acorn Computers
  - Initial funding from Apple, Acorn and VLSI
- Designs the ARM range of RISC processor cores
- Licenses ARM core designs to semiconductor partners, who fabricate and sell to their customers
  - ARM does not fabricate silicon itself
- Also develops technologies to assist with the design-in of the ARM architecture
  - Software tools, boards, debug hardware
  - Application software
  - Bus architectures
  - Peripherals, etc.
ARM's Activities
Huge Range of Applications
Intellectual Property (IP)
- ARM provides hard and soft views to licensees
  - RTL and synthesis flows <- soft view
  - GDSII layout <- hard view
- Licensees have the right to use hard or soft views of the IP
  - Soft views include gate-level netlists
  - Hard views are DSMs (Design Simulation Models)
- OEMs must use hard views
  - To protect ARM IP
ARM Core Family
- ARMv8 is a 64-bit architecture; as of this lecture (March 2014), few commercial products based on it are available.
ARM Architecture Versions
Development of the ARM architecture
- Processor Architecture = Instruction Set + Programmer's Model
ARM Architecture v7 Profiles
- Application profile (ARMv7-A)
  - Memory management support (MMU)
  - Highest performance at low power
  - Influenced by multi-tasking OS requirements
  - TrustZone and Jazelle-RCT for a safe, extensible system
  - e.g. Cortex-A5, Cortex-A9
- Real-time profile (ARMv7-R)
  - Protected memory (MPU)
  - Low latency and predictability for real-time needs
  - Evolutionary path for traditional embedded business
  - e.g. Cortex-R4
- Microcontroller profile (ARMv7-M, ARMv7E-M, ARMv6-M)
  - Lowest gate count entry point
  - Deterministic and predictable behavior a key priority
  - Deeply embedded use
  - e.g. Cortex-M3
ARM Processor Overview
- Apple A5
Product Code Demystified
ARM Processor Cores
ARM Architecture Versions (information from Wikipedia)
Relative Performance
Application Processors
- Application processors are defined by their ability to execute complex operating systems, such as Linux, Android, Microsoft Windows (CE/Mobile), and Symbian
- Applications: smartphones, feature phones, netbooks, e-readers, digital TV, set-top boxes, etc.
Embedded Processors
- Embedded processors focus primarily on delivering highly deterministic real-time behavior in a wide range of power-sensitive applications, often executing an RTOS along with user applications
- Applications: merchant microcontrollers, automotive control systems, motor control systems, wireless and wired sensor networks, mass storage controllers, printers, etc.
Real-time Processors
- ARM Cortex-R real-time processors offer high-performance computing solutions for deeply embedded systems with demanding real-time response constraints
- Target applications are:
  - Mobile handset processing in smartphones and baseband modems
  - Enterprise systems such as hard disk drives, networking and printing
  - Home consumer electronics, set-top boxes, digital TV, media players, cameras
  - Embedded microcontrollers for dependable systems in medical, industrial and automotive
Cortex Family
Agenda
- Introduction
- ARM Processor Overview
- ARM Architecture Version
- ARM Processor Pipeline Design
  - ARM7TDMI & ARM9TDMI
  - ARM10 vs. ARM11
  - Cortex-A8
- ARM Programmer's Model
- ARM Instruction Set (to be continued)
5-Stage Pipeline Organization
Pipeline Changes for ARM9TDMI
- ARM7TDMI (3 stages)
  - FETCH: Instruction Fetch
  - DECODE: Thumb-to-ARM decompress, ARM decode, Reg Select
  - EXECUTE: Reg Read, Shift, ALU, Reg Write
- ARM9TDMI (5 stages)
  - FETCH: Instruction Fetch
  - DECODE: ARM or Thumb Inst Decode, Reg Decode, Reg Read
  - EXECUTE: Shift + ALU
  - MEMORY: Memory Access
  - WRITE: Reg Write
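To make the stage lists concrete, here is a minimal sketch of an idealized, stall-free pipeline. It is an illustration only: the function names are hypothetical, and real cores stall on hazards, branches and memory accesses, none of which are modeled here.

```python
# Idealized pipeline model (assumption: one instruction enters per cycle, no stalls).

ARM7TDMI = ["FETCH", "DECODE", "EXECUTE"]
ARM9TDMI = ["FETCH", "DECODE", "EXECUTE", "MEMORY", "WRITE"]


def ideal_cycles(stages: int, instructions: int) -> int:
    """Cycles needed to retire `instructions` in a stall-free pipeline."""
    if instructions == 0:
        return 0
    # The first instruction needs `stages` cycles; each later one retires a cycle apart.
    return stages + instructions - 1


def occupancy(stage_names, instructions):
    """For each cycle, map stage name -> index of the instruction in it (None if empty)."""
    table = []
    for cycle in range(ideal_cycles(len(stage_names), instructions)):
        row = {}
        for depth, name in enumerate(stage_names):
            i = cycle - depth  # instruction i enters stage `depth` at cycle i + depth
            row[name] = i if 0 <= i < instructions else None
        table.append(row)
    return table


if __name__ == "__main__":
    for name, stages in (("ARM7TDMI", ARM7TDMI), ("ARM9TDMI", ARM9TDMI)):
        print(name, ideal_cycles(len(stages), 10), "cycles for 10 instructions")
```

In this model the deeper pipeline needs slightly more cycles for the same instruction count. The usual motivation for adding stages is that each stage does less work, so the clock period can be shorter.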
ARM10 vs. ARM11 Pipelines
- ARM10 (6 stages)
  - FETCH: Branch Prediction, Instruction Fetch
  - ISSUE: ARM or Thumb Instruction Decode
  - DECODE: Reg Read
  - EXECUTE: Shift + ALU, Multiply
  - MEMORY: Memory Access, Multiply Add
  - WRITE: Reg Write
- ARM11 (8 stages)
  - Fetch 1, Fetch 2, Decode, Issue
  - Then one of three parallel paths:
    - Shift, ALU, Saturate
    - MAC 1, MAC 2, MAC 3
    - Address, Data Cache 1, Data Cache 2
  - Write back
8-Stage Pipeline (v6 Architecture)
Cortex-A8 Block Diagram
ARM Cortex-A Architecture
Full Cortex-A8 Pipeline Diagram
What is NEON?
- NEON is a wide SIMD data processing architecture
  - Extension of the ARM instruction set (v7-A)
  - 32 x 64-bit wide registers (can also be used as 16 x 128-bit wide registers)
- NEON instructions perform "Packed SIMD" processing
  - Registers are considered as vectors of elements of the same data type
  - Data types available: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single-precision float
  - Instructions usually perform the same operation in all lanes
[Figure: elements of source registers Dn and Dm are combined lane by lane by the operation into destination register Dd]
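The lane-wise behavior can be sketched in software. The example below is only an illustration, not NEON code: it models a 64-bit D register as a Python integer and performs a wrap-around integer add in every lane. The helper names are hypothetical.

```python
# Software model of packed SIMD addition on a 64-bit register (illustrative only).

def split_lanes(value: int, lane_bits: int, reg_bits: int = 64):
    """Split a register value into lanes, lane 0 being the least significant."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(reg_bits // lane_bits)]


def join_lanes(lanes, lane_bits: int) -> int:
    """Pack a list of lanes back into one register value."""
    mask = (1 << lane_bits) - 1
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & mask) << (i * lane_bits)
    return value


def packed_add(dn: int, dm: int, lane_bits: int, reg_bits: int = 64) -> int:
    """Add corresponding lanes of dn and dm; carries never cross a lane boundary."""
    mask = (1 << lane_bits) - 1
    lanes = [
        (a + b) & mask
        for a, b in zip(split_lanes(dn, lane_bits, reg_bits),
                        split_lanes(dm, lane_bits, reg_bits))
    ]
    return join_lanes(lanes, lane_bits)


if __name__ == "__main__":
    dn = 0x01020304050607FF
    dm = 0x0101010101010101
    print(hex(packed_add(dn, dm, 8)))   # eight 8-bit lanes
    print(hex(packed_add(dn, dm, 16)))  # four 16-bit lanes
```

The same two register values give different results for 8-bit and 16-bit lanes, because the lane width decides where carries are cut off.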
Agenda
- Introduction
- ARM Processor Overview
- ARM Architecture Version
- ARM Processor Pipeline Design
  - ARM7TDMI & ARM9TDMI
  - ARM10 vs. ARM11
  - Cortex-A8
- ARM v7a Programmer's Model
- ARM Instruction Set (to be continued)
Data Size and Instruction Sets
- The ARM is a 32-bit architecture
- When used in relation to the ARM:
  - Byte means 8 bits
  - Halfword means 16 bits (two bytes)
  - Word means 32 bits (four bytes)
- Most ARM cores implement two instruction sets
  - 32-bit ARM Instruction Set
  - 16-bit Thumb Instruction Set
- Jazelle cores can also execute Java bytecode
ARM and Thumb Performance
The Thumb-2 Instruction Set
- Variable-length instructions
  - ARM instructions are a fixed length of 32 bits
  - Thumb instructions are a fixed length of 16 bits
  - Thumb-2 instructions can be either 16-bit or 32-bit
- Thumb-2 gives approximately 26% improvement in code density over ARM
- Thumb-2 gives approximately 25% improvement in performance over Thumb
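A decoder can tell the two Thumb-2 lengths apart from the first halfword alone: if bits [15:11] are 0b11101, 0b11110 or 0b11111, the halfword starts a 32-bit instruction, otherwise it is a complete 16-bit instruction. The sketch below walks a stream of halfwords using that rule. The function names and the sample halfword values are chosen for illustration only.

```python
# Length decoding for a Thumb-2 instruction stream (illustrative sketch).

def thumb2_length(first_halfword: int) -> int:
    """Return the instruction length in bytes (2 or 4) given its first halfword."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def instruction_lengths(halfwords):
    """Walk a list of 16-bit halfwords and return the length of each instruction."""
    lengths = []
    i = 0
    while i < len(halfwords):
        n = thumb2_length(halfwords[i])
        lengths.append(n)
        i += n // 2  # skip the second halfword of a 32-bit instruction
    return lengths


if __name__ == "__main__":
    stream = [0x4608, 0xF04F, 0x0001, 0xBF00]  # sample halfwords, assumed for the demo
    print(instruction_lengths(stream))
```

Because the length is known after reading one halfword, 16-bit and 32-bit encodings can be mixed freely in the same code.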
Cortex-A8 Processor Modes
Cortex-A8 Register File
Cortex-A8 Exception Handling
Cortex-A8 Program Status Register
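As a rough illustration of how the program status register is laid out, the sketch below decodes a CPSR value into its main fields. It assumes the ARMv7-A bit positions (N, Z, C, V in bits 31 to 28; I, F, T in bits 7, 6, 5; mode in bits 4 to 0) and covers only a subset of the fields. The function name and the sample value are hypothetical.

```python
# Decode selected CPSR fields (assumed ARMv7-A layout; subset of fields only).

MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10110: "Monitor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}


def decode_cpsr(cpsr: int) -> dict:
    """Return the condition flags, interrupt masks, Thumb bit and mode name."""
    def bit(n: int) -> int:
        return (cpsr >> n) & 1

    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "I": bit(7),   # IRQ disabled when set
        "F": bit(6),   # FIQ disabled when set
        "T": bit(5),   # Thumb state when set
        "mode": MODES.get(cpsr & 0x1F, "reserved"),
    }


if __name__ == "__main__":
    print(decode_cpsr(0x600000D3))  # sample value, assumed for the demo
```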
Conditional Execution and Flags
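The link between flags and conditional execution can be sketched as follows. This is a simplified software model, assuming 32-bit registers and the standard ARM condition code definitions: it computes the N, Z, C and V flags that a flag-setting add would produce, then evaluates a condition code against them. The function and table names are hypothetical.

```python
# Model of flag setting by a 32-bit add and of condition code evaluation (sketch).

MASK32 = 0xFFFFFFFF


def adds_flags(a: int, b: int):
    """Return (result, flags) for a 32-bit add that updates N, Z, C and V."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "N": result >> 31,                               # sign bit of the result
        "Z": int(result == 0),                           # result is zero
        "C": int(full > MASK32),                         # unsigned carry out
        "V": ((a ^ result) & (b ^ result) & MASK32) >> 31,  # signed overflow
    }
    return result, flags


CONDITIONS = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "CS": lambda f: f["C"] == 1,
    "CC": lambda f: f["C"] == 0,
    "MI": lambda f: f["N"] == 1,
    "PL": lambda f: f["N"] == 0,
    "VS": lambda f: f["V"] == 1,
    "VC": lambda f: f["V"] == 0,
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,
    "LS": lambda f: f["C"] == 0 or f["Z"] == 1,
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "LE": lambda f: f["Z"] == 1 or f["N"] != f["V"],
    "AL": lambda f: True,
}


def condition_passes(cond: str, flags: dict) -> bool:
    """True if an instruction with condition `cond` would execute under `flags`."""
    return CONDITIONS[cond](flags)


if __name__ == "__main__":
    result, flags = adds_flags(0x7FFFFFFF, 1)
    print(hex(result), flags, condition_passes("VS", flags))
```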
Memory Types
- Each defined memory region specifies a memory type
- The memory type controls the following:
  - Memory access ordering rules
  - Caching and buffering behaviour
- There are 3 mutually exclusive memory types:
  - Normal
  - Device
  - Strongly Ordered
- Normal and Device memory allow additional attributes for specifying:
  - The cache policy
  - Whether the region is Shared
- Normal memory allows you to separately configure Inner and Outer cache policies (discussed in the Caches and TCMs module)
L1 and L2 Caches
[Figure: memory system block diagram with ARM Core, I-Cache RAM, D-Cache RAM, MMU/MPU, L2 Cache, BIU, On-chip SRAM and Off-chip Memory, grouped into levels L1, L2 and L3]
- A typical memory system can have multiple levels of cache
  - Level 1 memory system typically consists of L1 caches, MMU/MPU and TCMs
  - Level 2 memory system (and beyond) depends on the system design
- Memory attributes determine cache behavior at different levels
  - Controlled by the MMU/MPU (discussed later)
  - Inner Cacheable attributes define memory access behavior in the L1 memory system
  - Outer Cacheable attributes define memory access behavior in the L2 memory system (if external) and beyond (as signals on the bus)
- Before caches can be used, software setup must be performed
ARM Cache Features
- Harvard implementation for L1 caches
  - Separate Instruction and Data caches
- Cache Lockdown
  - Prevents line eviction from a specified cache way (discussed later)
- Pseudo-random and Round-robin replacement strategies
  - Unused lines can be allocated before considering replacement
- Non-blocking data cache
  - Cache lookup can hit before a linefill is complete (also checks linefill buffer)
- Streaming, Critical-Word-First
  - Cache data is forwarded to the core as soon as the requested word is received in the linefill buffer
  - Any word in the cache line can be requested first using a WRAP burst on the bus
- ECC or parity checking
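Two of these features lend themselves to a small software model. The sketch below shows the order in which a wrapping burst would deliver the words of a line when the requested word goes first, and a round-robin victim choice that fills unused ways before evicting. The line size, word size, and all names are assumptions made for the illustration, not values taken from a particular core.

```python
# Sketch of critical-word-first burst order and round-robin replacement.
# Assumed for illustration: 32-byte lines, 4-byte words.

def wrap_burst_order(addr: int, line_bytes: int = 32, word_bytes: int = 4):
    """Word addresses of a linefill that starts at the requested word and wraps."""
    base = addr & ~(line_bytes - 1)          # start of the cache line
    words = line_bytes // word_bytes
    first = (addr - base) // word_bytes      # index of the requested word
    return [base + ((first + i) % words) * word_bytes for i in range(words)]


class RoundRobinSet:
    """One cache set with round-robin replacement among its ways."""

    def __init__(self, ways: int):
        self.tags = [None] * ways
        self.next_victim = 0

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss, allocate a way for `tag`."""
        if tag in self.tags:
            return True
        if None in self.tags:
            way = self.tags.index(None)      # unused lines are allocated first
        else:
            way = self.next_victim           # otherwise evict in rotating order
            self.next_victim = (way + 1) % len(self.tags)
        self.tags[way] = tag
        return False


if __name__ == "__main__":
    print([hex(a) for a in wrap_burst_order(0x1014)])
```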
Example 32KB ARM Cache
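The figure for this slide did not survive, so the following is only a sketch of how a 32-bit address could be split for a 32KB cache. The organization used here (4 ways, 64-byte lines) is an assumption for the example, and the function names are hypothetical.

```python
# Address breakdown for a set-associative cache.
# Assumed for illustration: 32 KB, 4-way, 64-byte lines, 32-bit addresses.

def cache_geometry(size_bytes: int, ways: int, line_bytes: int, addr_bits: int = 32):
    """Derive the number of sets and the tag/index/offset field widths."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2, line size is a power of two
    index_bits = sets.bit_length() - 1          # log2, set count is a power of two
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": addr_bits - index_bits - offset_bits,
    }


def split_address(addr: int, geom: dict):
    """Return (tag, set index, byte offset) for an address."""
    offset = addr & ((1 << geom["offset_bits"]) - 1)
    index = (addr >> geom["offset_bits"]) & ((1 << geom["index_bits"]) - 1)
    tag = addr >> (geom["offset_bits"] + geom["index_bits"])
    return tag, index, offset


if __name__ == "__main__":
    geom = cache_geometry(32 * 1024, ways=4, line_bytes=64)
    print(geom)
    print([hex(x) for x in split_address(0x80001234, geom)])
```

Under these assumptions the cache has 128 sets, so an address splits into a 19-bit tag, a 7-bit set index and a 6-bit byte offset.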
Cortex-A8 Memory Management
- Memory Protection
- Memory Allocation
- Memory Management
Agenda
- Introduction
- ARM Processor Overview
- ARM Architecture Version
- ARM Processor Pipeline Design
  - ARM7TDMI & ARM9TDMI
  - ARM10 vs. ARM11
  - Cortex-A8
- ARM v7a Programmer's Model
- ARM Instruction Set (to be continued)