Abhishek Pandey Aman Chadha Aditya Prakash

Similar documents
ECE Advanced Computer Architecture I

Energy-centric DVFS Controlling Method for Multi-core Platforms

Modeling CPU Energy Consumption for Energy Efficient Scheduling

Understanding The Effects of Wrong-path Memory References on Processor Performance

A Comparison of Capacity Management Schemes for Shared CMP Caches

Power Measurement Using Performance Counters

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Power Measurements using performance counters

COL862 Programming Assignment-1

Shared Cache Aware Task Mapping for WCRT Minimization

Energy Models for DVFS Processors

CS3350B Computer Architecture CPU Performance and Profiling

TerraSwarm. An Integrated Simulation Tool for Computer Architecture and Cyber-Physical Systems. Hokeun Kim 1,2, Armin Wasicek 3, and Edward A.

MCD: A Multiple Clock Domain Microarchitecture

NetCache: Balancing Key-Value Stores with Fast In-Network Caching

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

NetCache: Balancing Key-Value Stores with Fast In-Network Caching

MAGPIE TUTORIAL. Configuration and usage. Abdoulaye Gamatié, Pierre-Yves Péneau. LIRMM / CNRS-UM, Montpellier

XEEMU: An Improved XScale Power Simulator

Design of Experiments - Terminology

A2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications

Microarchitecture Overview. Performance

Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications

Managing Hardware Power Saving Modes for High Performance Computing

ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Memory Consistency and Multiprocessor Performance

Chip-Multithreading Systems Need A New Operating Systems Scheduler

A Study on Performance Benefits of Core Morphing in an Asymmetric Multicore Processor

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Understanding Reduced-Voltage Operation in Modern DRAM Devices

Workload Prediction for Adaptive Power Scaling Using Deep Learning. Steve Tarsa, Amit Kumar, & HT Kung Harvard, Intel Labs MRL May 29, 2014 ICICDT 14

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Power and Energy Management. Advanced Operating Systems, Semester 2, 2011, UNSW Etienne Le Sueur

Power and Energy Management

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

Co-Optimization of Memory Access and Task Scheduling on MPSoC Architectures with Multi-Level Memory

CHAPTER 7 IMPLEMENTATION OF DYNAMIC VOLTAGE SCALING IN LINUX SCHEDULER

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

Introduction to gem5. Nizamudheen Ahmed Texas Instruments

COL862 - Low Power Computing

A Framework for Modeling GPUs Power Consumption

Data Criticality in Network-On-Chip Design. Joshua San Miguel Natalie Enright Jerger

Network-Driven Processor (NDP): Energy-Aware Embedded Architectures & Execution Models. Princeton University

Comparing Memory Systems for Chip Multiprocessors

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Managing Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems

XPU A Programmable FPGA Accelerator for Diverse Workloads

Accurate and Stable Empirical CPU Power Modelling for Multi- and Many-Core Systems

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Multithreaded Value Prediction

A Simple Model for Estimating Power Consumption of a Multicore Server System

Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

CoolCloud: improving energy efficiency in virtualized data centers

Strober: Fast and Accurate Sample-Based Energy Simulation Framework for Arbitrary RTL

Exploration of Cache Coherent CPU- FPGA Heterogeneous System

ibench: Quantifying Interference in Datacenter Applications

CSE 560M Computer Systems Architecture I

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand

Variability in Architectural Simulations of Multi-threaded

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez

ECE 486/586. Computer Architecture. Lecture # 3

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters

Intel released new technology call P6P

dist-gem5 Nam Sung Kim, Mohammad Alian, Daehoon Kim UIUC Stephan Diestelhorst, Gabor Dozsa ARM

Predicting Performance Impact of DVFS for Realistic Memory Systems

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

Smart Port Allocation in Adaptive NoC Routers

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Microarchitecture Overview. Performance

Alternate definition: Instruction Set Architecture (ISA) What is Computer Architecture? Computer Organization. Computer structure: Von Neumann model

Virtualized and Flexible ECC for Main Memory

Using Dynamic Voltage Frequency Scaling and CPU Pinning for Energy Efficiency in Cloud Compu1ng. Jakub Krzywda Umeå University

UNDERCLOCKED SOFTWARE PREFETCHING: MORE CORES, LESS ENERGY

Cache/Memory Optimization. - Krishna Parthaje

Nonblocking Memory Refresh. Kate Nguyen, Kehan Lyu, Xianze Meng, Vilas Sridharan, Xun Jian

Performance Characteristics. i960 CA SuperScalar Microprocessor

TDT4255 Computer Design. Lecture 1. Magnus Jahre

Introduction Architecture overview. Multi-cluster architecture Addressing modes. Single-cluster Pipeline. architecture Instruction folding

Memory Ordering Mechanisms for ARM? Tao C. Lee, Marc-Alexandre Boéchat CS, EPFL

Cache-Aware Utilization Control for Energy-Efficient Multi-Core Real-Time Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

A Hybrid Adaptive Feedback Based Prefetcher

CS146: Computer Architecture Spring 2004 Homework #2 Due March 10, 2003 (Wednesday) Evening

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency

Static and Dynamic Frequency Scaling on Multicore CPUs

Transcription:

Abhishek Pandey Aman Chadha Aditya Prakash

System: Building Blocks Motivation: Problem: Determining when to scale down the frequency at runtime is an intricate task. Proposed Solution: Use Machine learning algorithm to predict intervals where memory intensive tasks occur. Phase 1: gem5 + McPAT Phase 2: ILP + Offline DVFS Phase 3: Machine Learning

System: Overview gem5 + McPAT Integrated Model ILP + Offline DVFS MLA gem5 <XML> McPAT <Power, Delay> Modified Offline DVFS <Frequency> Machine Learning Algorithm Mode <L1/L2 Miss Stats, ALU/FPU Accesses, CPU Idle Cycles, L2 Miss Latency> corresponding to the selected frequency gem5 + McPAT Integrated Model gem5 <XML> McPAT <Power, Delay> Monitor <Frequency> MLA Machine Learning Algorithm <Parameters> Power Execution Time

Testbed System Specifications Parameter Specification ISA ALPHA Execution Mode Out of Order Execution CPU Base Frequency 2 GHz L1 I-Cache and D-Cache Size 32 kb L2 Cache Size 256 kb Number of cores 1 Other parameters at their default values

Phase 1: gem5 + McPAT integration Architecture Specification Energy per interval for all intervals, for each frequency gem5 BaseMcPAT class Event Scheduler Stats Parser McPAT Freq, N cycles per interval ILP

Phase 1: gem5 + McPAT SimObject feedintervalstats() parsegem5statstoxml( ) BaseMcPAT schedule( ) computeenergy( ) GEM5 displayenergy( ) McPAT hooks

Phase 1: gem5 + McPAT differentiating parameters chosen We chose the following parameters to profile memory characteristics of a workload Basically, we aimed at identifying the memory-boundedness of the workload in a particular interval of execution Parameter L2 Miss Stats Functional Unit Busy Rate # of FPU Accesses Miss Latency for L2 # of ALU Accesses CPU Idle Cycles

Phase 2: ILP + Offline DVFS Input Set Modified Offline DVFS Output Set Power value for all intervals and for all frequencies Delay value for all intervals for all frequencies Performance constraint = δ Execution time = 1/(Performance) P ij where i is the interval number and j is the frequency number Hence, P 11 means interval #1 and frequency number #1 ( = 800 MHz) P 12 means interval #1 and frequency number #2 ( = 1 GHz) Considering 2 intervals I1 and I2: I1: P 11, P 12, P 13, P 14 I2: P 21, P 22, P 23, P 24 I1: D 11, D 12, D 13, D 14 I2: D 21, D 22, D 23, D 24 Select P 11 for I1. x 11 = 1. x 12 = x 13 = x 14 = 0. Select P 23 for I2. x 23 = 1. x 21 = x 22 = x 24 = 0.... where x ij is a binary variable.

Phase 2: ILP + Offline DVFS We adopt the offline DVFS algorithm proposed by Kim et. al. in System Level Analysis of Fast, Per-Core DVFS using On-Chip Switching Regulators (HPCA 08) The DVFS control problem is formulated as an Integer Linear Programming (ILP) optimization problem We seek to reduce the total power consumption of the processor within specific performance constraints (δ) The application runtime is divided into N intervals A total of L = 2 frequency levels are considered For each runtime interval i and frequency j, the power consumption, P ij, is calculated. The delay for each interval and V/F level, D ij is also calculated The equations below formulate the ILP, which is solved using Gurobi Optimizer min( Px ) N i= 1 j= 1 j= 1 N i= 1 j= 1 L ( Dx) < δ L L ij ij ij x = N i = 1 to N ij ij... where x ij is a binary variable having values 1 or 0.

Phase 3: Machine Learning TRAINING Stage #1 Stage #2 Input Parameters and Frequency Machine Learning Algorithm TESTING Stage #1 Input Parameters Stage #2 Machine Learning Algorithm Stage #3 Machine Learning Algorithm Frequency Interval #0 Interval #1 Interval #2 Interval #3 Interval #4 Interval #5 Interval #6 Interval #7 Z 10, Z 20, Z 30 Z 11, Z 21, Z 31.................. Z ij... where i is the parameter number (there are n such parameters) and j is the interval

Phase 3: Machine Learning learning and classification 1 FEATURE EXTRACTION 2 CLASSIFICATION 3 FREQUENCY OUTPUT f 1 = f 2 = f 3 = f 4 = Input Space Feature Space

Contributions from our end... ILP Formulation and Solving Python implementation for solving the ILP formulation using Gurobi Optimizer Gem5 modifications McPAT integration Dynamic Frequency Scaling in gem5 Integrating the SVM Model with gem5 Developed scripts Parsing the output of gem5 in a format acceptable to the Machine Learning Algorithm C implementation to fetch execution statistics corresponding to the down-scaled frequency obtained after performing Dynamic Frequency Scaling In progress is a single script to automate the entire process

Experimental Results power 1.2 1 Normalized Power 0.8 0.6 0.4 No constraint 10% Constraint 20% Constraint 0.2 0 ocean mcf gcc Benchmarks

Experimental Results delay 2.5 2 Normalized Delay 1.5 1 No constraint 10% Constraint 20% Constraint 0.5 0 ocean mcf gcc Benchmarks

Future Work Avenues for future optimization: Finding the optimum interval size and number of intervals for training Other MLAs can be evaluated and a comparative analysis of all algorithms can be presented Several steps of frequencies can be introduced System performance can be evaluated with the permutation-combination of training and testing with some more benchmarks Additional memory/cpu parameters can be explored which can help to better train the MLA Memory profile of each benchmark can be analyzed to perform selective training for the MLA

Thank You...... questions are welcome