MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS


MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS
INSTRUCTOR: Dr. MUHAMMAD SHAABAN
PRESENTED BY: MOHIT SATHAWANE, AKSHAY YEMBARWAR

WHAT IS A MULTICORE SYSTEM?
A multi-core processor architecture places two or more execution cores within a single package, each with private and/or shared caches. Multi-core chips do more work per clock cycle, can run at a lower frequency, and can improve the user experience in several ways, for example by delivering better performance for compute- and bandwidth-intensive activities.

The Move to Multi-Core Architecture
The doubling of transistor counts every couple of years has been maintained for many years, but this scaling cannot continue indefinitely because of the laws of physics. Example: a 1993-era Intel Pentium processor had around 3 million transistors, while today's Intel Itanium 2 processor has nearly 1 billion transistors. If this rate continued, Intel processors would soon be producing more heat per square centimeter than the surface of the sun, which is why heat is already setting hard limits on frequency increases. Despite these challenges, users continue to demand increased performance.

TYPES OF MEMORY
Cache memory: a very high-speed semiconductor memory that speeds up the CPU. It acts as a buffer between the CPU and main memory and holds the parts of the data and program that the CPU uses most frequently.
Primary memory (main memory): holds only the data and instructions on which the computer is currently working. It has limited capacity, loses its contents when power is switched off, and is generally built from semiconductor devices. It is slower than cache; the data and instructions to be processed must reside in main memory.
Secondary memory: also known as external or non-volatile memory. It is the slowest kind of memory and is used to store data and information permanently.

TYPES OF CACHE LEVELS
Level 1 (L1) cache is extremely fast but relatively small, and is usually embedded in the processor chip (CPU).
Level 2 (L2) cache is larger than L1; it may be located on the CPU or on a separate chip or coprocessor, with a high-speed alternative system bus connecting it to the CPU so that it is not slowed by traffic on the main system bus.
Level 3 (L3) cache is typically specialized memory that works to improve the performance of L1 and L2. It can be significantly slower than L1 or L2, but is usually double the speed of RAM.
In a multicore processor, each core may have its own dedicated L1 and/or L2 cache while sharing a common L3 cache (L2 is shared if L3 is not present).

ADVANTAGES OF SHARED CACHE IN MULTICORE SYSTEMS
- Efficient usage of the last-level cache: if one core idles, the other core can take all of the shared cache, reducing resource underutilization.
- Flexibility for programmers: more data-sharing opportunities for threads running on separate cores that share the cache; one core can pre-/post-process data for the other core, enabling alternative communication mechanisms between cores.
- Reduced cache-coherency complexity: less false sharing because of the shared cache, and less work to maintain coherency compared to a private-cache architecture.
- Reduced data-storage redundancy: the same data needs to be stored only once.
- Reduced front-side-bus traffic: effective data sharing between cores allows data requests to be resolved at the shared-cache level instead of going all the way to system memory.

PROBLEM IN SHARED CACHE
Example: in a simple dual-core processor configuration, each core has its own CPU and L1 cache, and both cores share an L2 cache. In this configuration, applications executing on Core 0 compete for the entire L2 cache with applications executing on Core 1. If application A on Core 0 uses data that maps to the same cache line(s) as application B on Core 1, a collision occurs.
Solution: to overcome this problem, the shared cache can be partitioned among the cores, either statically or dynamically.
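
As an illustration of why such collisions happen (this example is not from the slides; the cache geometry below is an assumption), the set index of a physically indexed L2 is derived from the address, so two applications on different cores whose data maps to the same sets will evict each other's lines:

#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry for a shared L2; real caches vary. */
#define LINE_SIZE   64u                         /* bytes per cache line  */
#define CACHE_SIZE  (1024u * 1024u)             /* 1 MB shared L2        */
#define ASSOC       8u                          /* 8-way set associative */
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * ASSOC))   /* 2048 sets    */

/* Set index of a physical address in a physically indexed cache. */
static unsigned set_index(uint64_t paddr)
{
    return (unsigned)((paddr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    uint64_t a = 0x10000000u;                          /* touched by app A on core 0 */
    uint64_t b = a + (uint64_t)LINE_SIZE * NUM_SETS;   /* touched by app B on core 1 */

    /* Both addresses fall into the same set, so the two cores compete
       for the same ways of the shared L2: a cache-line collision.      */
    printf("set(A)=%u set(B)=%u\n", set_index(a), set_index(b));
    return 0;
}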

STATIC CACHE PARTITIONING
In a simple dual-core processor configuration, each core has its own CPU and L1 cache, and both cores share an L2 cache. The OS partitions the L2 cache so that each core has its own segment of L2: data used by applications on Core 0 is cached only in Core 0's L2 partition, and data used by applications on Core 1 is cached only in Core 1's L2 partition. This partitioning eliminates the possibility of applications on different cores interfering with one another via L2 collisions. However, depending on the workload on each core, some cores may over-utilize their cache partition while others under-utilize theirs. To use the cache more efficiently, the partitioning can be done dynamically.
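
One way an OS can realize such a static split is page coloring; the mechanism and constants below are illustrative assumptions (4 KB pages and the same 1 MB, 8-way, 64-byte-line L2 as above), not taken from the slides. Half of the page colors are reserved for each core, so their L2 footprints never overlap:

#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 4 KB pages, 1 MB 8-way L2 with 64-byte lines. */
#define PAGE_SIZE       4096u
#define LINE_SIZE       64u
#define NUM_SETS        2048u
#define LINES_PER_PAGE  (PAGE_SIZE / LINE_SIZE)       /* 64 lines per page */
#define NUM_COLORS      (NUM_SETS / LINES_PER_PAGE)   /* 32 page colors    */

/* Color of a physical page: which contiguous slice of L2 sets its lines occupy. */
static unsigned page_color(uint64_t phys_page_number)
{
    return (unsigned)(phys_page_number % NUM_COLORS);
}

/* Static partition: core 0 only receives pages with colors 0..15 and
   core 1 only receives colors 16..31.                                 */
static bool page_allowed_for_core(uint64_t phys_page_number, int core)
{
    unsigned color = page_color(phys_page_number);
    return core == 0 ? color < NUM_COLORS / 2
                     : color >= NUM_COLORS / 2;
}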

DYNAMIC CACHE PARTITIONING: DISTANCE-AWARE CACHE PARTITIONING
Many dynamic cache partitioning schemes result in high cache access latency and more off-chip accesses. Distance-Aware Round-Robin mapping is an OS-managed policy that addresses the trade-off between cache access latency and the number of off-chip accesses. It does this by mapping the pages accessed by a core to that core's closest (local) bank. It also introduces an upper bound on the deviation of the distribution of memory pages among cache banks, which lessens the number of off-chip accesses. This trade-off is addressed without requiring any extra hardware structures.

When a core requests a new page, the OS assigns it a physical address that maps the page to that core's closest (local) cache bank. The OS stores a counter for each cache bank, which is incremented whenever a new physical page is assigned to that bank; banks with more physical pages assigned to them therefore have higher counter values. To minimize the number of off-chip accesses, an upper bound is placed on the deviation of the distribution of pages among cache banks, controlled by a threshold value. If the counter of the bank to which a page would map has reached the threshold value, the page is assigned to another bank.

The algorithm starts by checking the counters of the banks at one hop from the initial placement, and the bank with the smallest value is chosen. If all banks at one hop have reached the threshold value, the banks at a distance of two hops are checked. The algorithm iterates until a bank whose counter is under the threshold is found. The policy ensures that at least one bank always has a value smaller than the threshold by decrementing all counters by one whenever all of them are non-zero. A small threshold value reduces the number of off-chip accesses, while a high threshold reduces the average latency of cache accesses.
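
A minimal sketch of this bank-selection loop, assuming a 4x4 mesh of cache banks, a Manhattan-distance hop count, and an arbitrary threshold of 64 pages per bank (these constants and helper names are assumptions for illustration, not code from the paper):

#include <limits.h>

#define NUM_BANKS  16       /* assumed 4x4 mesh of L2 banks           */
#define MESH_DIM   4
#define THRESHOLD  64       /* assumed upper bound on pages per bank  */

static unsigned page_counter[NUM_BANKS];    /* pages assigned per bank */

/* Manhattan distance in hops between two banks on the 4x4 mesh. */
static int hop_distance(int a, int b)
{
    int dx = a % MESH_DIM - b % MESH_DIM;
    int dy = a / MESH_DIM - b / MESH_DIM;
    return (dx < 0 ? -dx : dx) + (dy < 0 ? -dy : dy);
}

/* Keep at least one bank under the threshold: when every counter is   */
/* non-zero, decrement them all by one (their differences are kept).   */
static void normalize_counters(void)
{
    for (int b = 0; b < NUM_BANKS; b++)
        if (page_counter[b] == 0)
            return;
    for (int b = 0; b < NUM_BANKS; b++)
        page_counter[b]--;
}

/* Distance-aware round-robin mapping: use the local bank if it is     */
/* under the threshold; otherwise widen the search one hop at a time   */
/* and, among the banks at that distance, take the least-loaded one.   */
static int select_bank(int local_bank)
{
    int target = -1;

    if (page_counter[local_bank] < THRESHOLD) {
        target = local_bank;
    } else {
        int max_dist = 2 * (MESH_DIM - 1);            /* 6 hops on a 4x4 mesh */
        for (int dist = 1; dist <= max_dist && target < 0; dist++) {
            unsigned best = UINT_MAX;
            for (int b = 0; b < NUM_BANKS; b++) {
                if (hop_distance(local_bank, b) == dist &&
                    page_counter[b] < THRESHOLD && page_counter[b] < best) {
                    target = b;
                    best = page_counter[b];
                }
            }
        }
        if (target < 0)
            target = local_bank;   /* should not happen: normalization keeps
                                      at least one bank under the threshold */
    }
    page_counter[target]++;
    normalize_counters();
    return target;
}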

ACCESS-AWARE CACHE PARTITIONING: CACHEBALANCER
CacheBalancer consists of two parts: an access-rate-based memory allocator and a pain-driven task mapper.

CACHEBALANCER: Access-Rate-Based Memory Allocation
CacheBalancer uses a metric called access rate to determine the relative utilization of cache banks. The access rate of a bank is defined as the ratio of the number of accesses made to that cache bank to the total number of shared-cache accesses in the system. A high access rate indicates a frequently accessed cache bank, i.e. a hot cache; conversely, a low access rate indicates a relatively rarely accessed bank, or a cold cache.

CacheBalancer uses two access-rate thresholds to determine whether cache banks are hot or cold. These are computed as:

t_hot = α × MR
t_cold = MR − β × HC(p_i, m_j)

Here t_hot and t_cold are the access-rate thresholds for a cache bank, MR is the mid-range value of the access rate across the N cache banks in the system, and HC(p_i, m_j) is the hop count along the interconnect path between processing element i and cache bank j.

The α and β parameters are integer constants that control the spread of CacheBalancer's memory allocations. α influences the memory allocator's sensitivity to hot caches. β, on the other hand, influences the spread of data by limiting candidate cache banks based on their distance from the allocating PE: a high value forces the localization of data, while a low value enables the allocation of data across a larger number of cache banks. Because of these characteristics, the parameters could be used within a runtime manager to adapt memory allocations based on system utilization, power, and reliability.
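
A sketch of the threshold computation, under the reconstruction of the formulas given above and assuming access rates expressed as percentages of total shared-cache accesses; the example values for α and β and all names are illustrative only:

#define NUM_BANKS 16

/* Illustrative integer tuning constants. */
static const int ALPHA = 2;   /* sensitivity of the allocator to hot caches  */
static const int BETA  = 1;   /* per-hop penalty that localizes allocations  */

/* Mid-range of the measured access rates (in percent) across all banks. */
static double mid_range(const double rate[NUM_BANKS])
{
    double lo = rate[0], hi = rate[0];
    for (int b = 1; b < NUM_BANKS; b++) {
        if (rate[b] < lo) lo = rate[b];
        if (rate[b] > hi) hi = rate[b];
    }
    return (lo + hi) / 2.0;
}

/* t_hot = alpha * MR: banks whose access rate exceeds this are "hot". */
static double t_hot(const double rate[NUM_BANKS])
{
    return ALPHA * mid_range(rate);
}

/* t_cold = MR - beta * HC(p_i, m_j): the further bank j is from the   */
/* allocating PE i, the lower its cold threshold, so distant banks are */
/* selected less often and data tends to stay local.                   */
static double t_cold(const double rate[NUM_BANKS], int hop_count)
{
    return mid_range(rate) - BETA * hop_count;
}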

Access rates are measured using two 8-bit access counters per cache bank. To prevent temporal fluctuations in access rates from causing oscillations in memory-allocation targets, access-rate samples are filtered using a walking-average filter: the counters are sampled every 500 cycles and averaged with a finite number of previous samples to determine the current access rate. The calculated value is then stored in a dedicated section of the address space accessible through the memory hierarchy. To allocate memory, the memory-allocator function is called from tasks executing on PEs. This function first computes the t_hot and t_cold thresholds from the cache-bank access rates and then invokes the selection algorithm (Algorithm 1 in the CacheBalancer paper) to obtain a target cache bank. Memory is subsequently allocated in a section that corresponds to this target cache bank.
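
Algorithm 1 itself is not reproduced in these slides; the following is an assumed reconstruction of the selection step, preferring the closest bank that is colder than its t_cold threshold and falling back to the least-used bank (the function and parameter names are hypothetical):

#include <limits.h>

#define NUM_BANKS 16

/* Pick a target cache bank for an allocation request, given each      */
/* bank's filtered access rate, its hop count from the allocating PE,  */
/* and its precomputed t_cold threshold. Assumed reconstruction only.  */
static int select_target_bank(const double rate[NUM_BANKS],
                              const int    hops[NUM_BANKS],
                              const double cold_threshold[NUM_BANKS])
{
    /* First choice: the cold bank closest to the allocating PE. */
    int best = -1;
    int best_hops = INT_MAX;
    for (int b = 0; b < NUM_BANKS; b++) {
        if (rate[b] < cold_threshold[b] && hops[b] < best_hops) {
            best = b;
            best_hops = hops[b];
        }
    }
    if (best >= 0)
        return best;

    /* Fallback: no bank qualifies as cold; take the least-used bank. */
    best = 0;
    for (int b = 1; b < NUM_BANKS; b++)
        if (rate[b] < rate[best])
            best = b;
    return best;
}

Memory would then be allocated from the address-space section that maps to the returned bank, as described above.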

Pain-driven Task Mapping: Memory-intensive tasks are more likely to utilize cache banks heavily than their counterparts with smaller data sets. The performance of tasks can become interdependent when they share a cache, especially when their execution characteristics vary significantly. It is therefore essential for task characteristics to be taken into consideration during task mapping in order to uniformly utilize system resources and minimize inter-task interference.

CacheBalancer uses an adapted version of the Pain metric to guide the mapping of tasks onto PEs. The metric describes, for each executing task, the performance degradation it would experience if a certain new task were mapped onto a particular PE. Pain is measured in terms of three parameters:
1. Cache intensity: indicates how aggressively a task uses a cache bank.
2. Cache sensitivity: indicates how sensitive a task is to contention in shared cache banks.
3. Communication pain: the cost a task incurs from sharing the system interconnect with other tasks.
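
The slides do not give the formula that combines these three parameters, so the sketch below is only one plausible shape: the pain between two co-located tasks is taken as the victim's cache sensitivity times the newcomer's cache intensity plus a communication term, and the mapper greedily places a new task on the PE where the total pain is smallest. All structure fields, weights, and names here are assumptions.

#define MAX_TASKS 32

/* Per-task profile; fields and units are illustrative assumptions. */
struct task_profile {
    double cache_intensity;    /* shared-cache accesses per kilo-instruction */
    double cache_sensitivity;  /* slowdown per unit of co-runner intensity   */
    double comm_volume;        /* bytes/cycle this task puts on the NoC      */
};

/* Pain inflicted on `victim` if `candidate` runs alongside it: a cache */
/* term (victim's sensitivity x candidate's intensity) plus a term for  */
/* sharing the interconnect.                                            */
static double pain(const struct task_profile *victim,
                   const struct task_profile *candidate)
{
    double cache_pain = victim->cache_sensitivity * candidate->cache_intensity;
    double comm_pain  = victim->comm_volume * candidate->comm_volume;
    return cache_pain + comm_pain;
}

/* Map the candidate onto the PE whose resident tasks (and the newcomer */
/* itself) suffer the least total pain: a greedy pain-driven mapping.   */
static int map_task(const struct task_profile *candidate,
                    const struct task_profile residents[][MAX_TASKS],
                    const int resident_count[], int num_pes)
{
    int best_pe = 0;
    double best_pain = -1.0;
    for (int pe = 0; pe < num_pes; pe++) {
        double total = 0.0;
        for (int t = 0; t < resident_count[pe]; t++) {
            total += pain(&residents[pe][t], candidate);   /* pain on resident */
            total += pain(candidate, &residents[pe][t]);   /* pain on newcomer */
        }
        if (best_pain < 0.0 || total < best_pain) {
            best_pain = total;
            best_pe = pe;
        }
    }
    return best_pe;
}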

ANALYSIS
CacheBalancer is evaluated using a cycle-accurate SystemC model of a 3D-stacked, network-on-chip-based tiled multiprocessor array. To emulate worst-case conditions, CacheBalancer was tested with a synthetic workload that uses parallel tasks to repeatedly allocate and write to memory pages, thus stressing the shared caches.

RESULTS
Distance-Aware partitioning: impact of the changes in the indexing of the private L1 caches for the workloads evaluated.

CACHEBALANCER

CONCLUSION
CacheBalancer is a runtime resource-management scheme for chip multiprocessors that uses the access-rate and Pain metrics to guide memory allocation and task mapping. The scheme achieved up to 60% lower contention by improving the utilization of the available cache banks, and reduced execution time by up to 22% compared to competing proposals. The improved cache utilization further resulted in an energy-density reduction of up to 50%.

REFERENCES
1. Alberto Ros, Marcelo Cintra, Manuel E. Acacio, and José M. García (2009). Distance-Aware Round-Robin Mapping for Large NUCA Caches. School of Informatics, University of Edinburgh.
2. Jurrien de Klerk, Sumeet Kumar, and René van Leuken (2014). CacheBalancer: Access Rate and Pain Based Resource Management for Chip Multiprocessors. Delft University of Technology, The Netherlands.
3. https://software.intel.com/en-us/articles/frequently-asked-questions-intel-multi-coreprocessor-architecture
4. http://www.tutorialspoint.com/computer_fundamentals/computer_memory.htm