Negotiating the Maze: Getting the most out of memory systems today and tomorrow
Robert Kaye

System on Chip Memory Systems

Systems use external memory:
- Large address space
- Low cost-per-bit
- Large interface bandwidth required
- Access to memory depends on accesses from other processing elements

Challenges:
- Manage the flow of data to and from external memory
- Provide the best bandwidth and latency characteristics to each processing element
- Choose the memory solution that best matches the system requirements

Performance, Power & Cost

Memory choice is driven by performance, power and cost; the relative importance of these factors varies between systems. Available memories offer trade-offs: there isn't a panacea offering a single solution that is best for all scenarios.

[Diagram: trade-off triangle with axes High Performance, Low Cost and Low Power]

DRAM Choices Today

- DDR2: today's benchmark
- DDR3: lower power than DDR2, higher performance; price cross-over forecast: early 2011?
- LPDDR2: equivalent performance to DDR2, lower power/pin count than DDR2, price premium over DDR2/3

[Chart: the three memories plotted against Performance, Cost and Power axes (not to scale!)]

MODELLING THE MEMORY INTERFACE
Predicting utilisation and bandwidth

Memory Topology 101

DRAMs are divided into BANKS: typically 4 or 8 in current devices (older devices may have 2, future devices more). Each bank is subdivided into ROWs, and each row contains multiple COLUMNS. To access a particular location, the DRAM is sent a BRC (Bank, Row, Column) address. Before the data can be accessed, the bank and row must be prepared.

[Diagram: a memory bank, with storage rows (ROW A, B, C) and a row buffer of columns Col 0 to Col 7]
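To make BRC addressing concrete, here is a minimal Python sketch of how a flat address might be split into a (bank, row, column) tuple. The bit widths and field ordering are illustrative assumptions, not taken from any particular DRAM device or controller.

```python
# Illustrative BRC (Bank, Row, Column) address decode.
# Bit widths are assumptions for a small example device:
# 8 banks, 16K rows, 1K columns.
BANK_BITS, ROW_BITS, COL_BITS = 3, 14, 10

def decode_brc(addr: int) -> tuple[int, int, int]:
    """Split a flat address into (bank, row, column).

    Interleaving banks on low-order bits (just above the column)
    helps consecutive accesses land in different banks.
    """
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return bank, row, col

if __name__ == "__main__":
    for a in (0x0000, 0x0400, 0x0800):  # successive 1 KiB strides
        print(hex(a), "->", decode_brc(a))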

Efficiency Measurement

A bank access cycle is characterised by a number of timing parameters (tRCD, tWR/tRTP, tRP, tRAS, tRC) and the variable transfer amount.

[Timing diagram: Activate, Access, Precharge, Activate, with tRCD, tWR/tRTP and tRP spanning the steps, tRAS covering Activate to Precharge, and tRC covering the full Activate-to-Activate cycle]

To ensure continuous streams of data, banks are accessed in an interleaved fashion.

[Timing diagram: overlapped Activate/Access/Precharge cycles on Banks 0 to 3 keeping the data bus continuously occupied]

As access length increases, efficiency improves.

Utilisation Prediction

A theoretical model is used to predict utilisation and bandwidth.

Inputs:
- Memory type & speed grade
- Interface width
- Interface clock frequency
- Packet access length
- Bank distribution
- Read/write distribution

Outputs:
- Max. theoretical utilisation
- Max. theoretical bandwidth
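A first-order flavour of such a model can be sketched in a few lines of Python: each activate occupies a bank for the bank cycle time tRC but only drives the data bus while its bytes transfer, and interleaving across banks hides the gap until the bus saturates. The parameter values are illustrative assumptions (roughly x32 LPDDR2-800-like); a real model would also account for read/write turnaround, refresh and the other timing constraints.

```python
# First-order utilisation model: a sketch, not a full DRAM simulator.
# Parameter defaults are illustrative (roughly x32 LPDDR2-800).

def max_utilisation(bytes_per_activate: float,
                    n_banks: int = 8,
                    t_rc_ns: float = 60.0,      # bank cycle time (assumed)
                    bus_width_bytes: int = 4,   # x32 interface
                    data_rate_mtps: float = 800.0) -> float:
    """Upper bound on data-bus utilisation.

    Each activate ties up one bank for t_rc_ns but only occupies the
    data bus for the transfer itself; interleaving n_banks banks can
    hide the bank cycle time until the bus saturates.
    """
    peak_bytes_per_ns = bus_width_bytes * data_rate_mtps / 1000.0
    transfer_ns = bytes_per_activate / peak_bytes_per_ns
    return min(1.0, n_banks * transfer_ns / t_rc_ns)

if __name__ == "__main__":
    lengths = (32, 64, 128, 256)
    for banks in (1, 2, 4, 8):
        row = "  ".join(f"{max_utilisation(b, n_banks=banks):6.1%}"
                        for b in lengths)
        print(f"{banks} bank(s): {row}   (bytes/activate: {lengths})")
```

Even this toy reproduces the two headline effects of the next slide: utilisation climbs with bytes per activate, and with the number of banks the accesses are spread over.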

Bank Access & Hit Length Effects on Utilisation

- Increasing the bytes transferred per activate improves utilisation
- A better distribution of accesses across banks improves utilisation

Other Factors Affecting Utilisation

- Choice of memory, including interface width: different memories have different timing parameters, and wider interfaces & faster memories require more bytes per activate to achieve maximum efficiency
- Managing (minimising) the DQ turnaround overhead
- The predictive model assumes the memory controller can always order accesses for maximum efficiency; in practice this is almost never possible: hazards, barriers and system prioritisation of requests will interfere
- A measure of memory controller quality: the percentage of predicted utilisation it can achieve across all system traffic profiles

SYSTEM REQUIREMENTS
Real world impact on memory interface efficiency

Quality of Service (QoS) Contracts

Minimize Latency (aka Best Effort)
Contract: process requests to memory with the minimum latency possible. Reduced latency = increased performance.

Minimum Bandwidth
Contract: provide the master with a minimum bandwidth. Low priority unless the bandwidth is not being achieved; provide additional bandwidth IF system loading permits.

Maximum Latency
Contract: provide the master with a specified bandwidth within a stated latency. Low priority until the maximum latency approaches.
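One way to picture the three contracts is as the data a master would present to the system. The structure and field names below are purely illustrative assumptions, not an AMBA-defined format:

```python
# Illustrative QoS contract descriptions; field names are assumptions,
# not an AMBA-defined format.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Contract(Enum):
    MINIMIZE_LATENCY = auto()   # best effort: as fast as possible
    MINIMUM_BANDWIDTH = auto()  # batch: guaranteed throughput
    MAXIMUM_LATENCY = auto()    # real time: bandwidth within a deadline

@dataclass
class QoSContract:
    kind: Contract
    min_bandwidth_mbps: Optional[float] = None  # bandwidth contracts
    max_latency_clocks: Optional[int] = None    # latency-bounded contracts

# Hypothetical masters in a mobile SoC:
cpu = QoSContract(Contract.MINIMIZE_LATENCY)
lcd = QoSContract(Contract.MAXIMUM_LATENCY,
                  min_bandwidth_mbps=400, max_latency_clocks=120)
dma = QoSContract(Contract.MINIMUM_BANDWIDTH, min_bandwidth_mbps=200)
```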

QoS Objectives

Allocate system capacity (latency and bandwidth) to each master to meet its contract, dynamically varying priority to react to changes in bus traffic.

If there is excess capacity:
- Allocate the excess where it can offer the most improvement, usually reducing CPU latency
- Allocate the excess to masters that can use it now and so reduce their demand later

If there is insufficient capacity:
- Remove capacity from the masters where it has the least impact on system performance

Task of the memory controller: meet these objectives whilst minimising the impact on memory efficiency, performance and power.

System Latency

Latency is added throughout the system in two forms:
- Static latency: the delay through pipeline stages; constant and specific to the path from master to slave
- Queuing latency: the delay at arbitration points in the system

A transaction's delay depends on the number of transactions ahead of it in the queue and the rate at which they are processed. Queue length is dictated by the capacity of the slave (memory type and efficiency) and the desired throughput. Memory interface efficiency is a function of queue length, burst length, read-write mix and address distribution.

[Histogram: population of transactions vs. latency/clocks, showing the static latency offset followed by the queuing latency distribution]

Memory Controller Queue & Topology

A queue is implemented in the memory controller, allowing re-ordering of transactions to maximize efficiency (a toy sketch of such re-ordering follows below). If the queue fills, it extends back through the interconnect; interconnect arbitration only operates when the queue extends through the interconnect. For effective QoS, the arbitration policy should be consistent throughout the system.

System topology influences performance:
- CPU and LCD controller placed close to the memory controller: lower latency
- Graphics and DMA on a separate level
- Hierarchy improves performance

[Diagram: Video, Graphics, DMA Ctrl, LCD Ctrl and CPU connected through an Interconnect to the Memory Ctrl and Peripherals]
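Here is that toy sketch, in the spirit of an open-page, first-ready policy: among the pending requests, the controller prefers one that hits the row already open in its bank, avoiding a precharge/activate cycle. This is an assumption-laden illustration, not the policy of any particular ARM controller; real schedulers also honour QoS, hazards and the DRAM timing rules.

```python
# Toy open-page scheduler: prefer requests that hit the open row.
# A sketch of first-ready (FR-FCFS style) re-ordering only.
from collections import deque

open_rows = {}  # bank -> currently open row

def pick_next(queue: deque):
    """Pick the oldest row-buffer hit if any, else the oldest request."""
    for req in queue:            # queue holds (bank, row, col) tuples
        bank, row, _ = req
        if open_rows.get(bank) == row:
            queue.remove(req)    # row hit: no activate needed
            return req
    return queue.popleft()       # no hit: fall back to oldest

def run(requests):
    queue = deque(requests)
    while queue:
        bank, row, col = pick_next(queue)
        open_rows[bank] = row    # activate (or keep) this row
        print(f"service bank {bank} row {row} col {col}")

# The two bank-0/row-5 accesses are serviced back to back,
# ahead of the older bank-1 request:
run([(0, 5, 1), (1, 7, 0), (0, 5, 2), (0, 9, 0)])
```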

Memory Controller and QoS

The memory controller utilizes QoS information passed by the system. Stratagem:
- Requests from minimum latency (best effort) contracts enter the queue with the highest priority: higher priority = less time in the queue = lowest latency
- Requests from maximum latency contracts (typically stream processing or real time) enter the queue with mid-range priority
- Requests from minimum bandwidth contracts enter the queue with the lowest priority

[Diagram: queue entry priority, highest to lowest: Minimum Latency / Best-effort; Maximum Latency / Real Time / Stream Processing; Minimum Bandwidth / Batch Processing]

Real Time Masters

For real-time masters, adding latency does not affect performance while the latency remains below the maximum. Real-time requests therefore carry maximum-latency (time-out) information into the controller queue. If a transaction is still waiting after the time-out period, it is promoted to the highest priority: it is only prioritised above best-effort masters when necessary.

[Chart: survival percentage of real-time transactions vs. latency/clocks, with the adjusted priority stepping up at the time-out point, before the maximum latency]
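A minimal sketch of this time-out promotion, with assumed priority levels and an assumed promotion headroom:

```python
# Illustrative time-out promotion for real-time (maximum-latency)
# requests. Priority levels are assumptions: lower number = served first.
BEST_EFFORT, REAL_TIME_ENTRY, PROMOTED = 1, 2, 0

class RtRequest:
    def __init__(self, max_latency_clocks: int):
        self.deadline = max_latency_clocks
        self.age = 0
        self.priority = REAL_TIME_ENTRY  # mid-range priority on entry

    def tick(self, headroom_clocks: int = 20):
        """Advance one clock; promote when the time-out nears."""
        self.age += 1
        if self.deadline - self.age <= headroom_clocks:
            self.priority = PROMOTED     # now beats best-effort traffic

req = RtRequest(max_latency_clocks=100)
for _ in range(85):
    req.tick()
print(req.priority)  # 0: promoted once within 20 clocks of the deadline
```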

Batch Processing Masters

Average latency, bandwidth and queue length are related by Little's Law: E[L] = λ·E[S]. Hold the queue length constant, measure the average latency, and control the priority: priority controls latency, which controls bandwidth.
- Excess bandwidth is used when available
- Priority only exceeds that of other masters when insufficient bandwidth is being obtained
- This minimizes the number of transactions prioritized over best-effort masters

[Chart: bandwidth (MB/s) vs. latency/clocks, with the adjusted priority stepping up at the time-out point]
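With the queue occupancy E[L] held constant, Little's Law gives bandwidth directly from latency: λ = E[L]/E[S]. So raising a master's priority, which lowers its latency, raises its bandwidth. A minimal sketch of that feedback loop follows; the queue length, transaction size and target numbers are illustrative assumptions.

```python
# Little's Law: E[L] = lambda * E[S]  =>  bandwidth = queue_len / latency.
# Minimal feedback loop holding queue occupancy constant and nudging
# priority when the achieved bandwidth misses the contract.

QUEUE_LEN = 8            # transactions held in the controller queue
BYTES_PER_TXN = 64

def bandwidth_gbps(avg_latency_ns: float) -> float:
    """lambda = E[L] / E[S], scaled to bytes (bytes/ns == GB/s)."""
    return QUEUE_LEN * BYTES_PER_TXN / avg_latency_ns

def adjust_priority(priority: int, avg_latency_ns: float,
                    target_gbps: float) -> int:
    """Raise priority only while the minimum-bandwidth contract is missed."""
    if bandwidth_gbps(avg_latency_ns) < target_gbps:
        return priority + 1          # shorter latency -> more bandwidth
    return max(priority - 1, 0)      # otherwise drift back toward lowest

prio = 0
for measured_latency in (400.0, 350.0, 300.0, 250.0):
    prio = adjust_priority(prio, measured_latency, target_gbps=2.0)
    print(f"latency {measured_latency:5.1f} ns -> "
          f"{bandwidth_gbps(measured_latency):.2f} GB/s, priority {prio}")
```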

Memory Controller Design Considerations

There are many design considerations in achieving the best possible efficiency while meeting system QoS requirements. Examples:
- Buffering and queuing
- System feedback
- Versatile re-ordering: what can be re-ordered without impacting utilisation? When can utilisation considerations override QoS?
- Where, when and how to arbitrate between requests
- Read/write prioritisation
- Memory request scheduling
- Measurement against the efficiency model: assess alternatives; feedback on quality

Memory Controller Efficiency Example

Results compare observed efficiency with the predicted theoretical maximum:
- Planar plot shows ideal (red grid) vs. observed efficiency
- Temperature map shows the achieved percentage of ideal performance; this example considers >90% a pass (green) condition

Test conditions: 50/50 read/write mixture on a x32 LPDDR2-800 test system.

Memory Interface & AMBA 4

AMBA 4 changes the access patterns and the controller requirements:
- Coherency & snooping: reduces the number of off-chip accesses; increases the average hit length (higher proportion of cache fills and evictions)
- QoS and Virtual Networks
- Long bursts
- Barriers: must be managed efficiently by the controller, or through blocking in the interconnect (less efficient)

[Charts: utilisation vs. sample bank distribution; utilisation vs. sample hit length (bytes)]

Cortex-A15 in a CoreLink Subsystem

- System coherency scaling to many cores; I/O coherency reduces cache maintenance
- Low latency CPU to memory, and high bandwidth GPU to memory
- New highly efficient memory controller: 1/2/4 channels, up to 1066 MHz, >90% interface utilization, with a roadmap to future memory interfaces
- System MMU supports I/O virtualization: efficient hardware access for multiple OSes

[Block diagram: quad Cortex-A15 clusters with GIC-400, Mali Graphics, Video and I/O devices behind MMU-400s, connected via the CCI-400 AMBA 4 Cache Coherent Interconnect and NIC-400 Network Interconnects to the DMC-400 Dynamic Memory Controller, which drives DDR3/LPDDR2 through two PHYs; LCD and other slaves hang off the NIC-400]

DMC-400 Dynamic Memory Controller

- Fourth-generation ARM/AMBA dynamic memory controller
- High efficiency interface from AMBA 3 and AMBA 4 systems to shared off-chip LPDDR2 and DDR3
- Delivers the high bandwidth and low latency needed by high performance multimedia systems implementing Cortex-A15 and Mali-T604
- Effective management of power in the memory sub-system
- Complete master-to-memory traffic management for AMBA 4 systems: system-wide QoS, QoS Virtual Networks, barrier management
- Designed and verified with the AXI Network Interconnect and CCI
- Compatible with AXI3 interconnects
- DMC-400 architecture scales to future DRAM standards

[Block diagram: as on the previous slide]

FUTURE MOBILE MEMORIES
New memory architectures are proposed to provide 4X the available bandwidth without expending additional energy

Future Mobile Memory Candidates

- Wide-IO: low speed (200 MHz SDR), wide (4 x 128b) interface to a stacked memory die (1Gb to 32Gb) through direct chip-to-chip connects
- Serial (aka Low Power Serial IO): high speed (5-8 GHz), full duplex, multi-channel links to an LPDDR2 core. Standardisation stalled: conflicting PHY specs, no consensus, no strong pull; replaced by:
- Hybrid (aka Low Power Dual Mode Memory): standard LPDDR2 at low bandwidth, high speed serial at high bandwidth over the LPDDR2 bus
- LPDDR3: proposed evolutionary development of the LPDDR2 standard parallel DDR; spec development in progress

Wide-IO

- Achieves higher bandwidth through a wider data bus: x512 I/O arranged as 4 x 128b channels
- Lower power through lower frequency (200 MHz), SDR
- Stacking challenges: manufacturing, thermal
- Controller and system design for multi-channel operation & a large I/O count
- Spec finalisation target: 2011; production forecast: 2012-2013
- Initial implementations: 1-die stack; production: 4-die stacks

Low Power Dual Mode Memory (LPDMM)

Hybrid parallel/serial interface using the LPDDR2 pin-out:
- Parallel mode: LPDDR2 start-up latency and low power consumption in low bandwidth environments (single channel up to 1.6 GB/s)
- Serial mode: high bandwidth (up to 6.4 GB/s) over the same pins

LPDMM challenges:
- Hybrid PHY design / power vs. LPDDR2
- Managing the parallel/serial switchover

Spec finalisation target: 2011. SPMT (Serial Port Memory Technology) is developing the solution.

[Chart: power vs. bandwidth for single channel LPDDR2 (parallel mode) and single channel SPMT (serial mode)]

Comparing Wide-IO & LPDMM with LPDDR2

[Chart: LPDDR2, Wide-IO and LPDMM plotted against Performance, Cost and Power axes (not to scale!)]

CONCLUSIONS

Conclusions

- Understanding memory interface efficiency is complex: there are many variables and constraints, and getting the right solution is critical to system power, performance and cost
- System requirements can adversely affect efficiency; the memory controller can mitigate this when implemented as part of a system-wide solution
- There are many choices of memory, and future memory proposals will bring new design challenges
- ARM is committed to delivering memory controller solutions to enable optimal memory subsystems today and tomorrow

Thank You

Please visit www.arm.com for ARM related technical details.
For any queries contact <Salesinfo-IN@arm.com>