William Stallings, Computer Organization and Architecture, 8th Edition. Chapter 18: Multicore Computers


Hardware Performance Issues
- Microprocessors have seen an exponential increase in performance
  - Improved organization
  - Increased clock frequency
- Increase in parallelism
  - Pipelining
  - Superscalar
  - Simultaneous multithreading (SMT)
- Diminishing returns
  - More complexity requires more logic
  - Increasing chip area goes to coordination and signal-transfer logic
  - Harder to design, fabricate and debug

Alternative Chip Organizations

Intel Hardware Trends

Increased Complexity
- Power requirements grow exponentially with chip density and clock frequency
- Can use more chip area for cache instead
  - Memory transistors are smaller and have power requirements an order of magnitude lower than logic
- By 2015: 100 billion transistors on a 300 mm² die
  - Cache of 100 MB
  - 1 billion transistors for logic
- Pollack's rule: performance is roughly proportional to the square root of the increase in complexity
  - Doubling complexity gives about 40% more performance
- Multicore has potential for near-linear improvement
- Unlikely that one core can use such a large cache effectively
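Pollack's rule can be checked with a quick calculation. This is only an illustrative sketch (the helper name `pollack_performance` is invented here), contrasting one double-complexity core with two simple cores:

```python
import math

def pollack_performance(complexity_ratio: float) -> float:
    """Pollack's rule of thumb: relative performance grows roughly
    with the square root of the increase in core complexity."""
    return math.sqrt(complexity_ratio)

# Doubling the complexity of a single core:
single_core_gain = pollack_performance(2.0)   # ~1.41, i.e. ~40% faster

# Spending the same transistor budget on a second identical core instead
# gives (ideally) near-linear improvement on parallel workloads:
dual_core_gain = 2.0

print(f"Double-complexity core: {single_core_gain:.2f}x")
print(f"Two simple cores (ideal): {dual_core_gain:.2f}x")
```

This is the economic argument for multicore on the slide: the same complexity budget buys about 1.4x as one big core, but up to 2x as two simple cores.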

Power and Memory Considerations

Chip Utilization of Transistors

Software Performance Issues
- Performance benefits depend on effective exploitation of parallel resources
- Even small amounts of serial code impact performance
  - 10% inherently serial code on an 8-processor system gives only a 4.7x speedup
- Overheads: communication, distribution of work, cache coherence
- Some applications do effectively exploit multicore processors
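The 4.7x figure follows from Amdahl's law; a quick check (the helper name is illustrative):

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Amdahl's law: overall speedup is limited by the fraction of
    the workload that is inherently serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# 10% inherently serial code on an 8-processor system:
print(f"{amdahl_speedup(0.10, 8):.1f}x")  # prints "4.7x", not 8x
```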

Effective Applications for Multicore Processors
- Database servers handling independent transactions
- Multi-threaded native applications
  - Lotus Domino, Siebel CRM
- Multi-process applications
  - Oracle, SAP, PeopleSoft
- Java applications
  - The Java VM is multi-threaded, with its own scheduling and memory management
  - Sun's Java Application Server, BEA's WebLogic, IBM WebSphere, Tomcat
- Multi-instance applications
  - One application running multiple times
  - E.g. Valve game software

Multicore Organization
Main design variables:
- Number of core processors on the chip
- Number of levels of cache on the chip
- Amount of shared cache
The next slide gives an example of each organization:
(a) ARM11 MPCore
(b) AMD Opteron
(c) Intel Core Duo
(d) Intel Core i7

Multicore Organization Alternatives

Advantages of Shared L2 Cache
- Constructive interference reduces the overall miss rate
- Data shared by multiple cores is not replicated at the cache level
- With proper frame-replacement algorithms, the amount of shared cache allocated to each core is dynamic
  - Threads with less locality can have more cache
- Easy inter-process communication through shared memory
- Cache coherency confined to L1
- Dedicated L2 cache gives each core more rapid access
  - Good for threads with strong locality
- Shared L3 cache may also improve performance

Individual Core Architecture
- Intel Core Duo uses superscalar cores
- Intel Core i7 uses simultaneous multithreading (SMT)
  - Scales up the number of threads supported
  - A chip with 4 SMT cores, each supporting 4 threads, appears to the application as a 16-core chip

Intel x86 Multicore Organization - Core Duo (1)
- Introduced 2006
- Two x86 superscalar cores, shared L2 cache
- Dedicated L1 cache per core
  - 32KB instruction and 32KB data
- Thermal control unit per core
  - Manages chip heat dissipation
  - Maximizes performance within thermal constraints
  - Improved ergonomics
- Advanced Programmable Interrupt Controller (APIC)
  - Inter-process interrupts between cores
  - Routes interrupts to the appropriate core
  - Includes a timer so the OS can interrupt a core

Intel x86 Multicore Organization - Core Duo (2)
- Power management logic
  - Monitors thermal conditions and CPU activity
  - Adjusts voltage and power consumption
  - Can switch individual logic subsystems on or off
- 2MB shared L2 cache
  - Dynamic allocation
  - MESI support for the L1 caches
  - Extended to support multiple Core Duo chips in an SMP configuration
  - L2 data shared between local cores or external
- Bus interface

Intel x86 Multicore Organization - Core i7
- Introduced November 2008
- Four x86 SMT processors
- Dedicated L2 cache per core, shared L3 cache
- Speculative pre-fetch for caches
- On-chip DDR3 memory controller
  - Three 8-byte channels (192 bits) giving 32GB/s
  - No front-side bus
- QuickPath Interconnect (QPI)
  - Cache-coherent point-to-point link
  - High-speed communication between processor chips
  - 6.4G transfers per second, 16 bits per transfer
  - Dedicated bi-directional pairs
  - Total bandwidth 25.6GB/s
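The bandwidth figures on this slide can be verified with a little arithmetic:

```python
# Memory controller: three 8-byte (64-bit) DDR3 channels.
channels = 3
bytes_per_channel = 8
bus_width_bits = channels * bytes_per_channel * 8
print(bus_width_bits)  # 192 bits, as stated on the slide

# QuickPath Interconnect: 6.4 GT/s, 16 bits (2 bytes) per transfer,
# with dedicated pairs in each direction.
qpi_transfers_per_s = 6.4e9
qpi_bytes_per_transfer = 2
one_direction = qpi_transfers_per_s * qpi_bytes_per_transfer / 1e9  # 12.8 GB/s
total = 2 * one_direction                                            # 25.6 GB/s
print(f"{one_direction:.1f} GB/s per direction, {total:.1f} GB/s total")
```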

ARM11 MPCore
- Up to 4 processors, each with its own L1 instruction and data caches
- Distributed interrupt controller
- Timer per CPU
- Watchdog
  - Warning alerts for software failures
  - Counts down from a predetermined value; issues a warning at zero
- CPU interface
  - Interrupt acknowledgement, masking and completion acknowledgement
- CPU
  - A single ARM11 core, called an MP11
- Vector floating-point unit
  - FP co-processor
- L1 cache
- Snoop control unit
  - Maintains L1 cache coherency

ARM11 MPCore Block Diagram

ARM11 MPCore Interrupt Handling
- The Distributed Interrupt Controller (DIC) collates interrupts from many sources and provides:
  - Masking
  - Prioritization
  - Distribution to the target MP11 CPUs
  - Status tracking
  - Software interrupt generation
- The number of interrupts is independent of the MP11 CPU design
- Memory mapped
  - Accessed by the CPUs via a private interface through the SCU
- Can route interrupts to a single CPU or to multiple CPUs
- Provides inter-process communication
  - A thread on one CPU can cause activity by a thread on another CPU

DIC Routing
- Direct to a specific CPU
- To a defined group of CPUs
- To all CPUs
- The OS can generate an interrupt to:
  - All but self
  - Self
  - Another specific CPU
- Typically combined with shared memory for inter-process communication
- 16 interrupt IDs available for inter-process communication

Interrupt States
- Inactive
  - Non-asserted, or completed by that CPU but still pending or active in others
- Pending
  - Asserted, but processing has not started on that CPU
- Active
  - Processing has started on that CPU but is not complete
  - Can be pre-empted by a higher-priority interrupt
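The three states above form a per-CPU life cycle: inactive, pending, active, and back to inactive on completion. A minimal sketch of that state machine (an illustration only, not ARM's actual hardware logic; it also omits pre-emption):

```python
class InterruptLine:
    """Toy model of one interrupt's state as seen by one CPU."""

    def __init__(self):
        self.state = "inactive"

    def assert_interrupt(self):
        # A source asserts the line: inactive -> pending.
        if self.state == "inactive":
            self.state = "pending"

    def start_processing(self):
        # The target CPU begins handling it: pending -> active.
        if self.state == "pending":
            self.state = "active"

    def complete(self):
        # The CPU signals completion: active -> inactive.
        if self.state == "active":
            self.state = "inactive"

irq = InterruptLine()
irq.assert_interrupt()
irq.start_processing()
irq.complete()
print(irq.state)  # prints "inactive"
```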

Interrupt Sources
- Inter-process interrupts (IPI)
  - Private to each CPU
  - IDs 0-15
  - Software triggered
  - Priority depends on the target CPU, not the source
- Private timer and/or watchdog interrupts
  - IDs 29 and 30
- Legacy FIQ line
  - Legacy FIQ pin, one per CPU, bypasses the interrupt distributor
  - Directly drives interrupts to the CPU
- Hardware interrupts
  - Triggered by programmable events on the associated interrupt lines
  - Up to 224 lines
  - Start at ID 32

ARM11 MPCore Interrupt Distributor

Cache Coherency
- The Snoop Control Unit (SCU) resolves most shared-data bottleneck issues
- L1 cache coherency based on MESI
- Direct data intervention
  - Copies clean entries between L1 caches without accessing external memory
  - Reduces read-after-write traffic from L1 to L2
  - Can resolve a local L1 miss from a remote L1 rather than from L2
- Duplicated tag RAMs
  - Cache tags implemented as a separate block of RAM, the same length as the number of lines in the cache
  - Duplicates used by the SCU to check data availability before sending coherency commands
  - Commands sent only to CPUs that must update their coherent data cache
- Migratory lines
  - Allow moving dirty data between CPUs without writing to L2 and reading back from external memory
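Direct data intervention can be sketched in software as follows. This is a simplified, assumed model (the names `L1Cache` and `scu_read` are invented here; the real SCU is hardware, and this sketch ignores dirty-line writeback and the migratory-line optimization): a read miss on one CPU is satisfied by copying the line from a remote L1 instead of going to L2.

```python
class L1Cache:
    """Toy per-CPU L1 cache: address -> (MESI state, data)."""

    def __init__(self):
        self.lines = {}

    def lookup(self, addr):
        return self.lines.get(addr)

def scu_read(caches, requester, addr, l2):
    """Toy SCU resolving a read on CPU `requester`."""
    hit = caches[requester].lookup(addr)
    if hit and hit[0] != "I":
        return hit[1], "local L1"
    # Duplicated tags let the SCU check remote L1s before touching L2.
    for i, cache in enumerate(caches):
        if i == requester:
            continue
        remote = cache.lookup(addr)
        if remote and remote[0] in ("M", "E", "S"):
            cache.lines[addr] = ("S", remote[1])              # holder becomes a sharer
            caches[requester].lines[addr] = ("S", remote[1])  # intervention copy
            return remote[1], "remote L1 (intervention)"
    # No remote copy: fall back to L2 / external memory.
    data = l2[addr]
    caches[requester].lines[addr] = ("E", data)
    return data, "L2"

cpu0, cpu1 = L1Cache(), L1Cache()
cpu1.lines[0x40] = ("E", 123)  # CPU1 holds the line Exclusive and clean
l2 = {0x40: 123}
data, source = scu_read([cpu0, cpu1], requester=0, addr=0x40, l2=l2)
print(source)  # prints "remote L1 (intervention)" - L2 was never consulted
```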

Recommended Reading
- Stallings, Chapter 18
- ARM web site

Intel Core i7 Block Diagram

Intel Core Duo Block Diagram

Performance Effect of Multiple Cores

Recommended Reading
- Multicore Association web site
- ARM web site