SH-X3 Flexible SuperH Multi-core for High-performance and Low-power Embedded Systems

1 SH-X3 Flexible SuperH Multi-core for High-performance and Low-power Embedded Systems
Shinichi Shibahara (1), Masashi Takada (2), Tatsuya Kamei (1), Kiyoshi Hayase (1), Yutaka Yoshida (1), Osamu Nishii (1), Toshihiro Hattori (1)
(1) Renesas Technology Corp. (2) Hitachi Ltd.
Hot Chips 19, 2007/8/20

2 Requirement for Embedded Systems
Trend
- Total system scale is increasing with the introduction of advanced features.
- Example (mobile phone): 2D / 3D, Mail, W-CDMA, GSM, Camera, Video, Sound, JAVA
Requirement
- High performance (for advanced features)
- Small area (for smaller gadgets)
- Low power (for long battery life)
Solution: On-chip multi-processor
- Process technology now makes on-chip multi-processors easy to produce.
- MP can be performance and power effective.

3 Multi-processor Approaches for Embedded Systems
Features
- Application specific: optimize hardware and software for each system
- Low power: less than 1 W (in the battery-run case)
Approaches
- Heterogeneous / homogeneous AMP: integration of sub-systems, deterministic behavior
- Homogeneous SMP: relatively easy programming model, performance oriented
- Hybrid (mixed system of AMP and SMP): automatic parallelizing compiler
[Figure: three example systems on CPU0 to CPU3 under the AMP, SMP, and hybrid (AMP + SMP) models. Labels: Video, JAVA, GPS, GSM, Car Navigation, RTOS, GPOS, SMP OS]

4 SuperH Processor Core Roadmap
- SH-4, SH3-DSP: 5-stage pipeline, up to 266 MHz
- SH-X (1.8 MIPS/MHz): superscalar, 7-stage pipeline; released in 2003; products: SH-Mobile3, SH-Navi1
- SH-X2 (1.8 MIPS/MHz): superscalar, 8-stage pipeline; released in 2005; products: SH-Mobile G1, G2, SH-Navi2
- SH-X3 (7.2 MIPS/MHz with 4 CPUs): MP-ready (up to 4 CPUs), each CPU superscalar with an 8-stage pipeline; first target: car information systems
[Figure: roadmap chart; each generation carries the labels G and LP]
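The 7.2 MIPS/MHz figure quoted for the 4-CPU SH-X3 matches four cores at 1.8 MIPS/MHz each. The sketch below reproduces that arithmetic under the assumption of ideal linear scaling; the slide does not state how the total was derived, and the function name is hypothetical.

```python
def aggregate_mips_per_mhz(per_cpu_mips_per_mhz: float, num_cpus: int) -> float:
    """Ideal aggregate MIPS/MHz, assuming throughput scales linearly with CPU count."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    return per_cpu_mips_per_mhz * num_cpus


# Per-CPU figure from the roadmap slide, four CPUs as in the SH-X3 entry.
total = aggregate_mips_per_mhz(1.8, 4)
print(f"{total:.1f} MIPS/MHz")  # prints "7.2 MIPS/MHz"
```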

5 SH-X3 Block Diagram
[Figure: block diagram of CPU #0 to CPU #3 on the on-chip system bus (SuperHyway Bus(TM) on-chip interconnect). Blocks shown: Clock Controller, Interrupt Controller, On-chip Debugger, CPU Fetch, CPU Exec, FPU / DSP, UTLB, I$, D$, I-LRAM, D-LRAM, URAM, Data Transfer Unit, Bus I/F Controller, Snoop Controller. Buses shown: Inst. Bus, Data Bus, Cache-RAM Bus, Snoop Bus]
D-LRAM: also called XY-RAM in SH4AL-DSP

6 Pipeline Structure
Eight-stage dual-issue superscalar pipeline (inherited from SH-X2)
[Figure: pipeline stages]
- Front end: I1 I2 I3 ID (instruction fetch, pre-decode, decode & issue)
- ALU pipe: E1 E2 E3 WB (arithmetic execution with forwarding, write back)
- Load-store pipe: M1 M2 M3 WB (address calculation, memory access, align, write back)
- FPU pipe: F1 F2 F3 F4 F5 F6 WB (register read, operation, write back)
- DSP pipe: D1 D2 D3 D4 WB (decode, register read, operation, write back)

7 Specification Features
Efficient for both SMP and AMP
- Cache coherency (snoop controller) for SMP (today's topic)
- Local memories (LRAM, URAM) and data transfer unit for AMP
- Realization of hybrid MP model
Fine power management for each CPU
- Low-power modes according to workload (sleep, light sleep, standby, etc.)
- Flexible clock ratio (CPU clock : system bus clock = m:n (m>n), 1:n)
- Hierarchical clock gating
Configurable and synthesizable
- Number of CPUs (up to 4), co-processor (DSP, FPU)
- Cache (8KB~64KB/4way)
- Local memory (LRAM: 4KB~128KB, URAM: 128KB~1MB)
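The configurable ranges listed above can be expressed as a simple range check. This is only a sketch: the `CoreConfig` type, its field names, and the lower bound of one CPU are assumptions made for illustration, and the slide does not say whether the cache and LRAM ranges apply per instruction/data side.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class CoreConfig:
    """Hypothetical container for the configurable items named on the slide."""
    num_cpus: int
    cache_bytes: int   # size of one 4-way cache
    lram_bytes: int    # size of one LRAM
    uram_bytes: int


def validate(cfg: CoreConfig) -> list:
    """Return a list of range violations; an empty list means the config is in range."""
    errors = []
    if not 1 <= cfg.num_cpus <= 4:
        errors.append("num_cpus must be between 1 and 4")
    if not 8 * KB <= cfg.cache_bytes <= 64 * KB:
        errors.append("cache size must be 8KB to 64KB")
    if not 4 * KB <= cfg.lram_bytes <= 128 * KB:
        errors.append("LRAM size must be 4KB to 128KB")
    if not 128 * KB <= cfg.uram_bytes <= 1 * MB:
        errors.append("URAM size must be 128KB to 1MB")
    return errors


# Values taken from the RP1 slide: 32KB caches, 16KB D-LRAM, 128KB URAM, 4 CPUs.
rp1_like = CoreConfig(num_cpus=4, cache_bytes=32 * KB, lram_bytes=16 * KB, uram_bytes=128 * KB)
print(validate(rp1_like))
```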

8 Cache Coherency for Embedded Systems
Problems in applying bus snooping (used in HPC servers)
- Performance degradation by system bus occupation
- Unnecessary power dissipation by snooping activity
- Fixed write-cache mode: MESI (copy-back) or ESI (write-through)
  MESI fixed: a HW accelerator on the system bus cannot access the latest data.
  ESI fixed: the CPU cannot run at its best performance due to store accesses.
Solution
- Separation of system bus and snoop bus (reduces bus occupation)
- Centralized coherency control by snoop controller (reduces bus activity)
- Support of mixed cache coherency protocol (each CPU can select its mode)

9 Cache Coherency Maintenance (Conventional: Bus Snooping)
[Figure: CPU 0 to CPU 3, each with a data bus, cache controller, operand cache, and bus I/F controller, connected to memory over the system bus. Cache operations (fill, WB, ...) go onto the system bus, and the controllers monitor all transactions (bus snooping).]
All transactions appear on the bus: bus occupation, wasted power.

10 Cache Coherency Maintenance (MESI Protocol)
[Figure: CPU 0 and CPU 3, each with a cache controller, operand cache, and BIC, connected to the snoop controller over the snoop bus. The snoop controller holds entries for CPU0 to CPU3 and also receives snoop requests from the system bus (PCI Express etc.).]
Steps labeled in the figure
(1) Operand access
(2) Search
(3) Fill request
(4) Search & update (in the snoop controller, for CPU0 to CPU3)
(5) Request
(6) Search & update
(7) Response
(8) Dirty transfer, response, WB request
(9) Update
Effect: reduced bus occupation, no bus monitoring
Legend: BIC: Bus I/F Controller. The figure also marks the duplicated address arrays.
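The point of the centralized scheme is that a snoop request goes only to CPUs whose duplicated address array shows a copy of the line, instead of every CPU watching every bus transaction. The toy model below illustrates that filtering only. It is a sketch under stated assumptions: the class and method names are hypothetical, MESI states and dirty transfers are not modeled, and each duplicated address array is reduced to a set of line addresses.

```python
class SnoopController:
    """Toy model: one duplicated address array (DAA) per CPU, kept as a set of line addresses."""

    def __init__(self, num_cpus: int):
        self.daa = [set() for _ in range(num_cpus)]
        self.snoops_sent = 0

    def fill_request(self, requester: int, line_addr: int) -> list:
        """Search the other CPUs' DAAs, record the new line for the requester,
        and return the CPUs that need a snoop request."""
        holders = [cpu for cpu, lines in enumerate(self.daa)
                   if cpu != requester and line_addr in lines]
        self.snoops_sent += len(holders)
        self.daa[requester].add(line_addr)
        return holders


snc = SnoopController(num_cpus=4)
print(snc.fill_request(0, 0x1000))  # no other CPU holds the line
print(snc.fill_request(3, 0x1000))  # only CPU 0 needs to be asked
print(snc.snoops_sent)
```

With plain bus snooping, each of these two fills would have been observed by all three other CPUs; here a single targeted request is issued.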

11 Snoop Latency Optimization (ESI Protocol)
[Figure: cache controller 0 and cache controller 3, each with an operand cache and BIC, connected to the snoop controller and the system bus; plus a timing chart of the optimized sequence]
Steps labeled in the figure
(1) WT request from cache controller 0
(2) Search & update in the snoop controller
(3) WT request to the system bus and WT response; erase request to cache controller 3 (erase the stale copy) and its response
- The snoop has no effect on the WT request; it executes in the background.
- Optimized timing: after the search latency, the WT request is issued without waiting for the snoop response.
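The latency benefit can be illustrated with a back-of-the-envelope model. This is a sketch, not the chip's timing: the cycle counts are invented, the function names are hypothetical, and the "waiting" variant is an assumed baseline in which the write-through request is held until the erase response returns.

```python
def wt_latency_waiting(search: int, erase_round_trip: int, bus_write: int) -> int:
    """Store latency if the WT request waits for the erase (snoop) response first."""
    return search + erase_round_trip + bus_write


def wt_latency_overlapped(search: int, erase_round_trip: int, bus_write: int) -> int:
    """Store latency when the WT request is issued right after the search,
    with the erase request running in the background."""
    return search + bus_write


def coherency_done(search: int, erase_round_trip: int, bus_write: int) -> int:
    """Time until both the write and the background erase have finished."""
    return search + max(erase_round_trip, bus_write)


# Invented cycle counts, for illustration only.
SEARCH, ERASE, WRITE = 2, 6, 10
print(wt_latency_waiting(SEARCH, ERASE, WRITE))
print(wt_latency_overlapped(SEARCH, ERASE, WRITE))
print(coherency_done(SEARCH, ERASE, WRITE))
```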

12 Mixed Coherency Protocol
Need to consider the latest data in other CPUs.
[Figure: CPU0 in write-through mode and CPU3 in copy-back mode holding a dirty line, each with cache controller, operand cache, and BIC, connected through the snoop controller (entries for CPU0 to CPU3) to the system bus]
Steps labeled in the figure
(1) WT request
(2) Request
(3) Response (with dirty data)
The snoop controller merges the dirty data with the write-through data and issues a WB request to the system bus.
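The merge step can be pictured as overlaying the bytes of the write-through store on the dirty line returned by the copy-back CPU, then writing the combined line back. The sketch below assumes a 32-byte line (consistent with the index bits [12:5] shown on the synonym slide); the function name and the byte-level interface are hypothetical.

```python
LINE_BYTES = 32  # assumed line size, inferred from index bits [12:5]


def merge_write_through(dirty_line: bytes, offset: int, store_data: bytes) -> bytes:
    """Overlay a write-through store on a dirty line from another CPU's cache.

    The result is the line that would be written back to memory: the newest
    bytes from the store, and the dirty bytes everywhere else.
    """
    if len(dirty_line) != LINE_BYTES:
        raise ValueError("dirty_line must be exactly one cache line")
    if offset < 0 or offset + len(store_data) > LINE_BYTES:
        raise ValueError("store does not fit inside the line")
    merged = bytearray(dirty_line)
    merged[offset:offset + len(store_data)] = store_data
    return bytes(merged)


dirty = bytes([0xAA]) * LINE_BYTES             # dirty line held by the copy-back CPU
merged = merge_write_through(dirty, 4, b"\x01\x02\x03\x04")  # 4-byte WT store
print(merged.hex())
```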

13 Difficulty of SMP OS Development
Cache operation after process migration (caused by time-sharing processing)
- Conventional measure: flush the previously accessed cache entries via an inter-processor interrupt after the process releases memory or the virtual-physical address map is changed.
Synonym problem (caused by more than one virtual-physical address map)
- Conventional measures: flush the cache of the synonym page during page allocation; prevent synonyms from occurring by using page coloring.
Problems in the conventional measures
- Complicated to implement in software
- Large software overhead
Solution
- Broadcast of operand cache operating instructions (OCBI, OCBP, OCBWB)
- Hardware implementation of synonym detection and eviction

14 Operand Cache Operating Instructions (Conventional)
Conventional specification
- Operates only on the executing CPU's own cache line
- Cannot operate on other CPUs' cache lines
- Needs an inter-processor interrupt to operate on others
Example: process migration
[Figure: Process X migrates between CPU #0 and CPU #3 after MOV.L has executed, leaving a dirty line in one D$. Figure labels: "VOID", "Not executed yet", "Cannot write back the dirty line", and "Cannot access the latest" for the DMAC memory access after Process X. Blocks: D$, Snoop Controller, System Bus, Memory, DMAC]
OCBWB: Write-back Cache Block

15 Broadcast of Operand Cache Operating Instructions
Extended specification
- Operates on all CPUs' caches by broadcast
- No inter-processor interrupt needed
- Reduces the software overhead of using interrupts
Example: process migration
[Figure: Process X has executed MOV.L and a dirty line remains in one D$. The cache operation is broadcast via the snoop controller, the dirty line is written back over the system bus, and the DMAC memory access after Process X sees the latest data. Blocks: CPU #0, CPU #3, D$, Snoop Controller, System Bus, Memory, DMAC]
OCBWB: Write-back Cache Block
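The difference between the conventional and the broadcast behavior can be shown with a small model of the process-migration case. It is a sketch under stated assumptions: `Cache`, `ocbwb_local`, and `ocbwb_broadcast` are hypothetical names, caches are plain dictionaries, and which CPU holds the dirty line (CPU 0 here) is chosen for illustration.

```python
class Cache:
    """Minimal copy-back cache: address -> [data, dirty flag]."""

    def __init__(self):
        self.lines = {}

    def write(self, addr, data):
        self.lines[addr] = [data, True]

    def write_back(self, addr, memory) -> bool:
        """Write the line to memory if present and dirty; report whether it happened."""
        entry = self.lines.get(addr)
        if entry is not None and entry[1]:
            memory[addr] = entry[0]
            entry[1] = False
            return True
        return False


def ocbwb_local(caches, cpu, addr, memory) -> bool:
    """Conventional behavior: only the executing CPU's own cache is operated on."""
    return caches[cpu].write_back(addr, memory)


def ocbwb_broadcast(caches, addr, memory) -> bool:
    """Extended behavior: the operation reaches every CPU's cache."""
    results = [cache.write_back(addr, memory) for cache in caches]
    return any(results)


memory = {0x100: "old"}
caches = [Cache() for _ in range(4)]
caches[0].write(0x100, "new")            # Process X stores on CPU 0, then migrates to CPU 3

ocbwb_local(caches, 3, 0x100, memory)    # CPU 3 has no copy, so nothing is written back
print(memory[0x100])                     # still "old": a DMAC would read stale data
ocbwb_broadcast(caches, 0x100, memory)   # the dirty line on CPU 0 is written back
print(memory[0x100])                     # "new"
```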

16 Synonym Detection and Eviction (In the case of 4KB/Page)
Address fields: VPN[31:12], PPN[31:12]
D$: 32KB/4way (virtual index, physical tag): V-Index[12:5], P-Tag[31:10], Way 0 to Way 3
Second address array (physical index, physical tag): P-Index[12:5], P-Tag[31:10], V[12], Way 0 to Way 3
[Figure: contents of both arrays at indexes 0x00 and 0x80 after each step]
1. Read miss Adr(A)
2. Read miss Adr(B). Suppose physical address (A) = physical address (B): a synonym (the addresses differ in VPN[12]). The same address tag must not be registered twice.
3. Request from SNC: the SNC detects that the tags are the same and sends a purge request; the entry is deleted.
SNC: SNoop Controller
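The index values 0x00 and 0x80 in the figure follow from the geometry: 32KB over 4 ways with index bits [12:5] gives 256 sets of 32-byte lines, and bit 12 lies above the 4KB page offset, so it comes from the virtual page number. The sketch below works through that arithmetic; the example addresses and helper names are hypothetical.

```python
LINE_SHIFT = 5    # index field starts at bit 5 (32-byte lines)
INDEX_BITS = 8    # bits [12:5]: 32KB / 4 ways / 32 B = 256 sets
PAGE_SHIFT = 12   # 4KB pages: bits [11:0] are the page offset


def cache_index(addr: int) -> int:
    """Set index taken from address bits [12:5]."""
    return (addr >> LINE_SHIFT) & ((1 << INDEX_BITS) - 1)


def bit12(addr: int) -> int:
    """The one index bit that is translated rather than taken from the page offset."""
    return (addr >> PAGE_SHIFT) & 1


# Two virtual pages assumed to map to the same physical page (a synonym).
va_a = 0x0000_0000
va_b = 0x0000_1000

print(hex(cache_index(va_a)), hex(cache_index(va_b)))  # prints "0x0 0x80"
print(bit12(va_a) != bit12(va_b))                      # prints "True"
```

Because the two virtual addresses select different sets, the same physical line could sit in the cache twice unless it is detected and evicted.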

17 RP1: Experimental Chip
[Figure: block diagram and chip layout. Four SH-4A CPUs (each including FPU, with I$ 32K, D$ 32K, and local RAM), SNC, shared memory 128K, DDR I/F, SRAM I/F, IPs, bus bridge, I/O, PCIe, and CPG on the SuperHyway Bus(TM) on-chip interconnect]
Process technology: 90-nm, 8-layer, triple-Vth, generic CMOS, 1.0 V
Area: 3.88 mm^2 (each CPU excluding all memories), 7.28 mm^2 (each CPU)
I/D cache: 32KB/4way set-associative (each)
Local memory: I-LRAM 8KB, D-LRAM 16KB, URAM 128KB (each CPU)
Performance: 1.8 MIPS/MHz/CPU (Dhrystone 2.1); [...] MHz (4-CPU total)
Power consumption: [...] MHz

18 Application of Synonym-related Function to SMP OS
OS enhancement for SMP
- Applied to: Linux kernel (not MP-ready for SuperH multi-core)
- Original measure: when the kernel detects a synonym page, it flushes all entries of the page.
Experiment (on evaluation board)
- Enhanced for the hardware implementation of the synonym-related function
- Executed the shell command find on each CPU in parallel (the whole is stored in DDR)
[Tables: execution time (sec) per CPU for the non-enhanced kernel (using the original synonym measure) and for the enhanced kernel, with the resulting % performance improvement; values not recoverable]

19 Summary
SH-X3: SuperH multi-core for high-performance and low-power systems
- Efficient for both SMP and AMP
Specification features for cache coherency
- Separation of system bus and snoop bus
- Centralized cache coherency by snoop controller
- Support of mixed cache coherency protocol
Specification features for SMP OS development
- Broadcast of operand cache operating instructions (OCBI, OCBP, OCBWB)
- Hardware implementation for the synonym problem (53% performance improvement)
Acknowledgement
This work was supported by NEDO (New Energy and Industrial Technology Development Organization) P03022, a joint project of Renesas Technology Corp., Hitachi Ltd., and Waseda University.

21 Backup Slides

22 Synonym Detection and Eviction (In the case of 4KB/Page)
Address fields: VPN[31:12], PPN[31:12]
D$: 32KB/4way (virtual index, physical tag): V-Index[12:5], P-Tag[31:10], Way X
Second address array (physical index, physical tag): P-Index[12:5], P-Tag[31:10], V[12], Way X
[Figure: contents of both arrays at indexes 0x00 and 0x80 after each step]
1. Read miss Adr(A)
2. Read miss Adr(B): B overwrites an entry; the two arrays must keep 1:1 correspondence.
3. Request from SNC: the SNC detects that the entries are different and sends a purge request.
SNC: SNoop Controller

23 Synonym Detection and Eviction (In the case of 4KB/Page)
Address fields: VPN[31:12], PPN[31:12]
D$: 32KB/4way (virtual index, physical tag): V-Index[12:5], P-Tag[31:10], Way X, P[12] (in P-Tag)
Second address array (physical index, physical tag): P-Index[12:5], P-Tag[31:10], Way X
[Figure: contents of both arrays at indexes 0x00 and 0x80 after each step]
1. Read miss Adr(A)
2. Read miss Adr(B): B overwrites A; the two arrays must keep 1:1 correspondence.
3. Request from SNC: the cache controller detects that the entries are different and sends an invalidate request to the SNC.
SNC: SNoop Controller

24 Application of Synonym-related Function to SMP OS
OS enhancement for SMP
- Applied to: Linux kernel (not MP-ready for SuperH multi-core)
- Original measure: when the kernel detects a synonym page, it flushes all entries of the page.
Experiment (on evaluation board)
- Enhanced for the hardware implementation of the synonym-related function
- Freq. = (# of HW activations) / (# of requests to SNC)
- Executed the shell command find on each CPU in parallel (the whole is stored in DDR)
- The function is activated by page mappings with VPN[12] != PPN[12].

Not enhanced (using the original synonym measure)
CPU    Activation frequency of function    Execution time (sec)
CPU0   0.06%                               [...]
CPU1   0.05%                               [...]
CPU2   0.05%                               [...]
CPU3   0.05%                               [...]

Enhanced
CPU    Activation frequency of function    Execution time (sec)
CPU0   0.16%                               [...]
CPU1   0.13%                               [...]
CPU2   0.17%                               [...]
CPU3   0.13%                               [...]

[...] % performance improvement
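The activation frequency defined above, and the condition that triggers the hardware function, can both be written down directly. This is a sketch: the function names are hypothetical, the counts in the example are invented to land on the order of magnitude shown in the tables, and bit 12 is read from full virtual and physical addresses.

```python
def activation_frequency_percent(hw_activations: int, snc_requests: int) -> float:
    """Freq. = (# of HW activations) / (# of requests to SNC), as a percentage."""
    if snc_requests <= 0:
        raise ValueError("snc_requests must be positive")
    return 100.0 * hw_activations / snc_requests


def mapping_activates_function(vaddr: int, paddr: int) -> bool:
    """True when bit 12 differs between the virtual and physical address,
    i.e. the page mapping has VPN[12] != PPN[12]."""
    return ((vaddr ^ paddr) >> 12) & 1 == 1


# Invented counts: 16 activations out of 10,000 requests to the SNC.
print(f"{activation_frequency_percent(16, 10_000):.2f}%")   # prints "0.16%"
print(mapping_activates_function(0x0000_1000, 0x0800_0000))  # prints "True"
```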


More information

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem Cache Coherence Bryan Mills, PhD Slides provided by Rami Melhem Cache coherence Programmers have no control over caches and when they get updated. x = 2; /* initially */ y0 eventually ends up = 2 y1 eventually

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture. Chapter Overview.

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture. Chapter Overview. Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Slides prepared by Kip R. Irvine Revision date: 09/25/2002 Chapter corrections (Web) Printing

More information

Interconnecting Components

Interconnecting Components Interconnecting Components Need interconnections between CPU, memory, controllers Bus: shared communication channel Parallel set of wires for data and synchronization of data transfer Can become a bottleneck

More information

Modern CPU Architectures

Modern CPU Architectures Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes

More information

EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor

EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration I/O MultiProcessor Summary 2 Virtual memory benifits Using physical memory efficiently

More information

The ARM10 Family of Advanced Microprocessor Cores

The ARM10 Family of Advanced Microprocessor Cores The ARM10 Family of Advanced Microprocessor Cores Stephen Hill ARM Austin Design Center 1 Agenda Design overview Microarchitecture ARM10 o o Memory System Interrupt response 3. Power o o 4. VFP10 ETM10

More information

Chapter 15 ARM Architecture, Programming and Development Tools

Chapter 15 ARM Architecture, Programming and Development Tools Chapter 15 ARM Architecture, Programming and Development Tools Lesson 07 ARM Cortex CPU and Microcontrollers 2 Microcontroller CORTEX M3 Core 32-bit RALU, single cycle MUL, 2-12 divide, ETM interface,

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions. Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Computer Architecture. Lecture 6.1: Fundamentals of

Computer Architecture. Lecture 6.1: Fundamentals of CS3350B Computer Architecture Winter 2015 Lecture 6.1: Fundamentals of Instructional Level Parallelism Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Cache Introduction [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user with as much

More information

Purpose This course provides an overview of the SH-2A 32-bit RISC CPU core built into newer microcontrollers in the popular SH-2 series

Purpose This course provides an overview of the SH-2A 32-bit RISC CPU core built into newer microcontrollers in the popular SH-2 series Course Introduction Purpose This course provides an overview of the SH-2A 32-bit RISC CPU core built into newer microcontrollers in the popular SH-2 series Objectives Acquire knowledge about the CPU s

More information

Lecture 4: RISC Computers

Lecture 4: RISC Computers Lecture 4: RISC Computers Introduction Program execution features RISC characteristics RISC vs. CICS Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

ECE 485/585 Midterm Exam

ECE 485/585 Midterm Exam ECE 485/585 Midterm Exam Time allowed: 100 minutes Total Points: 65 Points Scored: Name: Problem No. 1 (12 points) For each of the following statements, indicate whether the statement is TRUE or FALSE:

More information

Technology Trends Presentation For Power Symposium

Technology Trends Presentation For Power Symposium Technology Trends Presentation For Power Symposium 2006 8-23-06 Darryl Solie, Distinguished Engineer, Chief System Architect IBM Systems & Technology Group From Ingenuity to Impact Copyright IBM Corporation

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

Lecture 11 Cache. Peng Liu.

Lecture 11 Cache. Peng Liu. Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O 6.823, L21--1 Cache Coherence Protocols: Implementation Issues on SMP s Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Coherence Issue in I/O 6.823, L21--2 Processor Processor

More information

SA-1500: A 300 MHz RISC CPU with Attached Media Processor*

SA-1500: A 300 MHz RISC CPU with Attached Media Processor* and Bridges Division SA-1500: A 300 MHz RISC CPU with Attached Media Processor* Prashant P. Gandhi, Ph.D. and Bridges Division Computing Enhancement Group Intel Corporation Santa Clara, CA 95052 Prashant.Gandhi@intel.com

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2015 Lecture 15 LAST TIME! Discussed concepts of locality and stride Spatial locality: programs tend to access values near values they have already accessed

More information

PC I/O. May 7, Howard Huang 1

PC I/O. May 7, Howard Huang 1 PC I/O Today wraps up the I/O material with a little bit about PC I/O systems. Internal buses like PCI and ISA are critical. External buses like USB and Firewire are becoming more important. Today also

More information

Handout 4 Memory Hierarchy

Handout 4 Memory Hierarchy Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced

More information

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures

More information

Memory Hierarchy Requirements. Three Advantages of Virtual Memory

Memory Hierarchy Requirements. Three Advantages of Virtual Memory CS61C L12 Virtual (1) CS61CL : Machine Structures Lecture #12 Virtual 2009-08-03 Jeremy Huddleston Review!! Cache design choices: "! Size of cache: speed v. capacity "! size (i.e., cache aspect ratio)

More information

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo Semiconductor System

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information

Topics to be covered. EEC 581 Computer Architecture. Virtual Memory. Memory Hierarchy Design (II)

Topics to be covered. EEC 581 Computer Architecture. Virtual Memory. Memory Hierarchy Design (II) EEC 581 Computer Architecture Memory Hierarchy Design (II) Department of Electrical Engineering and Computer Science Cleveland State University Topics to be covered Cache Penalty Reduction Techniques Victim

More information

Assembly Language for x86 Processors 7 th Edition. Chapter 2: x86 Processor Architecture

Assembly Language for x86 Processors 7 th Edition. Chapter 2: x86 Processor Architecture Assembly Language for x86 Processors 7 th Edition Kip Irvine Chapter 2: x86 Processor Architecture Slides prepared by the author Revision date: 1/15/2014 (c) Pearson Education, 2015. All rights reserved.

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

Contents of this presentation: Some words about the ARM company

Contents of this presentation: Some words about the ARM company The architecture of the ARM cores Contents of this presentation: Some words about the ARM company The ARM's Core Families and their benefits Explanation of the ARM architecture Architecture details, features

More information