Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks. Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu

Consumer Devices. Consumer devices are everywhere! Energy consumption is a first-class concern in consumer devices.

Popular Google Consumer Workloads. Chrome: Google's web browser. TensorFlow Mobile: Google's machine learning framework. Video Playback and Video Capture: Google's video codec.

Energy Cost of Data Movement. 1st key observation: 62.7% of the total system energy is spent on data movement. [Figure: an SoC with four CPU cores and L1/L2 caches, with data moving between the compute units and DRAM.] Potential solution: processing-in-memory (PIM), i.e., move computation close to data. Challenge: the limited area and energy budget.

Using PIM to Reduce Data Movement. 2nd key observation: a significant fraction of data movement often comes from simple functions. We can design lightweight logic to implement these simple functions in memory: either a small embedded low-power core (PIM core) or small fixed-function accelerators (PIM accelerators). Offloading to PIM logic reduces energy by 55.4% and improves performance by 54.2% on average.

Goals: 1) Understand the data-movement-related bottlenecks in modern consumer workloads. 2) Analyze opportunities to mitigate data movement by using processing-in-memory (PIM). 3) Design PIM logic that can maximize energy efficiency given the limited area and energy budget in consumer devices.

Outline: Introduction, Background, Analysis Methodology, Workload Analysis, Evaluation, Conclusion.

Potential Solution to Address Data Movement: Processing-in-Memory (PIM). Idea: move computation close to data. This reduces data movement, exploits the large in-memory bandwidth, and exploits the shorter access latency to memory. PIM is enabled by recent advances in 3D-stacked memory.

Workload Analysis Methodology. Workload characterization: a Chromebook with an Intel Celeron SoC and 2GB of DRAM; we extensively use the performance counters within the SoC. Energy model: the sum of the energy consumed within the CPU, all caches, off-chip interconnects, and DRAM.

PIM Logic Implementation. The PIM logic resides in the logic layer of 3D-stacked DRAM. PIM core: a customized embedded general-purpose core with no aggressive ILP techniques and a 256-bit SIMD unit. PIM accelerators: N small fixed-function accelerators, i.e., multiple copies of a customized in-memory logic unit.

Workload Analysis. We analyze four workloads: Chrome (Google's web browser), TensorFlow (Google's machine learning framework), and video playback and video capture (Google's video codec). We begin with Chrome.

How Chrome Renders a Web Page. Loading and parsing: the HTML parser and CSS parser process the page source and build the render tree. Layout then calculates the visual elements and the position of each object; rasterization paints those objects and generates the bitmaps; compositing assembles all layers into a final screen image.

Browser Analysis. To satisfy the user experience, the browser must provide fast loading of webpages, smooth scrolling of webpages, and quick switching between browser tabs. We focus on two important user interactions: 1) page scrolling and 2) tab switching. Both include page loading.

Scrolling

What Happens During Scrolling? Within the rendering pipeline (HTML parser and CSS parser → render tree → layout → rasterization → compositing), rasterization uses color blitters to convert the basic primitives into bitmaps, and texture tiling lets the graphics driver reorganize the bitmaps to minimize cache misses during compositing.

Scrolling Energy Analysis. [Chart: fraction of total energy spent on texture tiling, color blitting, and other work while scrolling Google Docs, Gmail, Google Calendar, WordPress, Twitter, and an animation.] 41.9% of page scrolling energy is spent on texture tiling and color blitting.

Scrolling a Google Docs Web Page. [Charts: total energy (pJ) of texture tiling, color blitting, and other work, broken down across the CPU, L1, LLC, interconnect, memory controller, and DRAM, and the fraction of total energy split into data movement vs. compute.] A significant portion of total data movement comes from texture tiling and color blitting: together they account for 37.7% of total system energy, and 77% of their energy consumption goes to data movement. Can we use PIM to mitigate the data movement cost of texture tiling and color blitting?

Can We Use PIM for Texture Tiling? [Timeline: with CPU-only execution, the CPU rasterizes a linear bitmap, reads the bitmap back from memory, performs the tiling conversion, writes the texture tiles back, and then invokes compositing, incurring high data movement; with CPU + PIM, the CPU rasterizes the bitmap and invokes compositing, while PIM logic performs the conversion in memory.] Major sources of data movement: 1) poor data locality, and 2) the large rasterized bitmap size (e.g., 4MB). Texture tiling is a good candidate for PIM execution.

Can We Implement Texture Tiling in PIM Logic? Converting a linear bitmap into a tiled texture requires only simple primitives: memcpy, bitwise operations, and simple arithmetic. A PIM core uses 9.4% of the area available for PIM logic; a PIM accelerator uses 7.1%. Both the PIM core and the PIM accelerator are feasible for implementing texture tiling in memory.
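
As a concrete illustration, the sketch below converts a linear (row-major) bitmap into a tiled layout using nothing but memcpy and index arithmetic. This is a minimal sketch in C, not the paper's implementation; the 32x32 tile size, the function name, and the assumption that the bitmap dimensions are multiples of the tile size are all ours.

    #include <stdint.h>
    #include <string.h>

    #define TILE_W 32  /* assumed tile width, in pixels  */
    #define TILE_H 32  /* assumed tile height, in pixels */

    /* Copy a linear (row-major) 32-bit bitmap into a tiled layout in
     * which each TILE_W x TILE_H tile is stored contiguously. Assumes
     * width and height are multiples of the tile dimensions. Only
     * memcpy and simple index arithmetic are needed -- the kind of
     * primitives that fit in a small PIM core or accelerator. */
    void tile_bitmap(const uint32_t *linear, uint32_t *tiled,
                     int width, int height)
    {
        int tiles_per_row = width / TILE_W;
        for (int y = 0; y < height; y++) {
            int tile_row = y / TILE_H;
            for (int tx = 0; tx < tiles_per_row; tx++) {
                /* start of the destination tile, plus the offset of row y inside it */
                uint32_t *dst = tiled
                    + (tile_row * tiles_per_row + tx) * TILE_W * TILE_H
                    + (y % TILE_H) * TILE_W;
                memcpy(dst, linear + y * width + tx * TILE_W,
                       TILE_W * sizeof(uint32_t));
            }
        }
    }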

Color Blitting Analysis. Color blitting generates a large amount of data movement, accounting for 19.1% of the total system energy during scrolling. It is a good candidate for PIM execution: it requires only low-cost operations (memset, simple arithmetic, and shift operations), and it is feasible to implement in both the PIM core and the PIM accelerator.
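
For intuition, here is a minimal sketch of an alpha-blending blitter for one row of 32-bit ARGB pixels. This is not Chrome's actual blitter; the function name and pixel format are assumptions. Each pixel needs only shifts, masks, additions, and multiplications, yet every pixel must be read and written, which is why the kernel is dominated by data movement.

    #include <stdint.h>

    /* Blend one row of 32-bit ARGB source pixels over a destination row.
     * A simplified sketch: per-channel blending uses only shifts, masks,
     * additions, and multiplications -- the low-cost operations the
     * analysis identifies. */
    void blit_row_blend(uint32_t *dst, const uint32_t *src, int n)
    {
        for (int i = 0; i < n; i++) {
            uint32_t s = src[i], d = dst[i];
            uint32_t a = s >> 24;        /* source alpha, 0..255 */
            uint32_t inv = 255 - a;
            uint32_t r = (((s >> 16) & 0xFF) * a + ((d >> 16) & 0xFF) * inv) / 255;
            uint32_t g = (((s >>  8) & 0xFF) * a + ((d >>  8) & 0xFF) * inv) / 255;
            uint32_t b = ((s & 0xFF) * a + (d & 0xFF) * inv) / 255;
            dst[i] = (0xFFu << 24) | (r << 16) | (g << 8) | b;
        }
    }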

Scrolling Wrap-Up. Texture tiling and color blitting account for a significant portion (41.9%) of energy consumption, and 37.7% of total system energy goes to data movement generated by these functions. 1) Both functions can benefit significantly from PIM execution. 2) Both functions are feasible to implement as PIM logic.

Tab Switching

What Happens During Tab Switching? Chrome employs a multi-process architecture: each tab is a separate process. The main operations during tab switching are a context switch and loading the new page.

Memory Consumption. Primary concerns during tab switching: how fast a new tab loads and becomes interactive, and memory consumption. Chrome uses compression to reduce each tab's memory footprint: the CPU compresses inactive tabs into a DRAM-based pool (ZRAM) and decompresses them when they become active again.

Data Movement Study. To study data movement during tab switching, we emulate a user switching through 50 tabs. We make two key observations: 1) compression and decompression contribute 18.1% of the total system energy, and 2) 19.6 GB of data moves between the CPU and ZRAM.

Can We Use PIM to Mitigate the Cost? [Timeline: with CPU-only execution, when N pages are swapped out, the CPU reads the uncompressed pages, compresses them, and writes them back to ZRAM, incurring high data movement before it can return to other tasks; with CPU + PIM, the CPU simply initiates the swap and continues with other tasks while PIM logic compresses the pages in memory, with no off-chip data movement.] Both the PIM core and the PIM accelerator are feasible for implementing in-memory compression/decompression.
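
To make the offloaded computation concrete: Linux's ZRAM compresses memory at page granularity with a lightweight codec such as LZO or LZ4. The sketch below shows the host-side work for one 4 KB page using the LZ4 library; it illustrates the kind of kernel the paper proposes moving into PIM logic, and is not code from the paper.

    #include <stdio.h>
    #include "lz4.h"  /* liblz4; ZRAM supports LZ4 among other codecs */

    #define PAGE_SIZE 4096

    /* Compress one 4 KB page, as ZRAM does when Chrome swaps out an
     * inactive tab. Returns the compressed size, or -1 on failure.
     * The caller provides an output buffer of at least
     * LZ4_COMPRESSBOUND(PAGE_SIZE) bytes, the worst-case size. */
    int compress_page(const char *page, char *out)
    {
        int n = LZ4_compress_default(page, out, PAGE_SIZE,
                                     LZ4_COMPRESSBOUND(PAGE_SIZE));
        if (n <= 0)
            return -1;
        printf("page compressed: %d -> %d bytes\n", PAGE_SIZE, n);
        return n;
    }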

Tab Switching Wrap-Up. A large amount of data movement happens during tab switching as Chrome compresses and decompresses tabs. Both functions can benefit from PIM execution and can be implemented as PIM logic.

Workload Analysis (continued). We now turn to TensorFlow, Google's machine learning framework.

TensorFlow Mobile. During inference, 57.3% of the inference energy is spent on data movement, and 54.4% of the data movement energy comes from packing/unpacking and quantization.

Packing. Packing reorders the elements of matrices to minimize cache misses during matrix multiplication, turning a matrix into a packed matrix. It consumes up to 40% of the inference energy and 31% of the inference execution time, and packing's data movement alone accounts for up to 35.3% of the inference energy. It is a simple data reorganization process that requires only simple arithmetic.
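
A minimal sketch of what packing does, in the spirit of the gemmlowp-style packing used for mobile inference: copy a row-major matrix into panels of consecutive rows stored column-by-column, so that the GEMM inner kernel streams through packed data sequentially. The panel height of 4 and the function name are assumptions.

    #include <stdint.h>

    #define PANEL 4  /* assumed panel height, in rows */

    /* Pack a row-major uint8 matrix into panels of PANEL consecutive
     * rows, stored column-by-column. Only copies and index arithmetic
     * are involved -- no real computation. */
    void pack_rows(const uint8_t *a, uint8_t *packed, int rows, int cols)
    {
        for (int r0 = 0; r0 < rows; r0 += PANEL)
            for (int c = 0; c < cols; c++)
                for (int r = r0; r < r0 + PANEL && r < rows; r++)
                    *packed++ = a[r * cols + c];
    }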

Quantization. Quantization converts 32-bit floating point values to 8-bit integers to improve inference execution time and energy consumption. It consumes up to 16.8% of the inference energy and 16.1% of the inference execution time, and the majority of quantization energy comes from data movement. It is a simple data conversion operation that requires only shift, addition, and multiplication operations. Based on our analysis, we conclude that both packing and quantization are good candidates for PIM execution, and it is feasible to implement them in PIM logic.
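
A minimal sketch of affine quantization over a tensor's [min, max] range, the style used by TensorFlow Mobile. This simplified version works in floating point for clarity (production kernels typically use fixed-point multiplies and shifts); the function name is an assumption.

    #include <math.h>
    #include <stdint.h>

    /* Map each float in [min, max] to an 8-bit integer in [0, 255].
     * One subtract, one multiply, a round, and a clamp per element --
     * so the kernel's cost is dominated by moving the data. */
    void quantize(const float *in, uint8_t *out, int n, float min, float max)
    {
        float inv_scale = 255.0f / (max - min);
        for (int i = 0; i < n; i++) {
            float q = roundf((in[i] - min) * inv_scale);
            if (q < 0.0f)   q = 0.0f;
            if (q > 255.0f) q = 255.0f;
            out[i] = (uint8_t)q;
        }
    }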

Video Playback and Capture. Playback: compressed video → VP9 decoder → display. Capture: captured video → VP9 encoder → compressed video. The majority of energy is spent on data movement, and the majority of data movement comes from simple functions in the decoding and encoding pipelines.

Evaluation Methodology. We study each target in isolation, emulate each separately, and run them in our simulator. System configuration (gem5 simulator): SoC: 4 OoO cores, 8-wide issue, 64kB L1 cache, 2MB L2 cache. PIM core: 1 core per vault, 1-wide issue, 4-wide SIMD, 32kB L1 cache. 3D-stacked memory: 2GB cube, 16 vaults per cube, 256 GB/s internal bandwidth, 32 GB/s off-chip channel bandwidth. Baseline memory: LPDDR3, 2GB, FR-FCFS scheduler.

Normalized Energy. [Chart: energy of CPU-Only, PIM-Core, and PIM-Acc, normalized to CPU-Only, for texture tiling, color blitting, compression, and decompression (Chrome browser); packing and quantization (TensorFlow Mobile); and sub-pixel interpolation, deblocking filter, and motion estimation (video playback and capture).] 77.7% and 82.6% of the energy reduction for texture tiling and packing, respectively, comes from eliminating data movement. The PIM core and PIM accelerator reduce energy consumption by 49.1% and 55.4% on average, respectively.

Normalized Runtime. [Chart: runtime of CPU-Only, PIM-Core, and PIM-Acc, normalized to CPU-Only, for texture tiling, color blitting, compression, and decompression (Chrome browser); sub-pixel interpolation, deblocking filter, and motion estimation (video playback and capture); and TensorFlow (TensorFlow Mobile).] Offloading these kernels to the PIM core and the PIM accelerator improves performance by 44.6% and 54.2% on average, respectively.

Conclusion. Energy consumption is a major challenge in consumer devices. We conduct an in-depth analysis of popular Google consumer workloads and find that 62.7% of the total system energy is spent on data movement, and that most of the data movement often comes from simple functions that consist of simple operations. We use PIM to reduce this data movement cost: we design lightweight logic (a PIM core or PIM accelerators) that implements these simple operations in DRAM, reducing total energy by 55.4% and execution time by 54.2% on average.

Backup

Video Playback. Compressed video → VP9 decoder → display. 63.5% of the system energy is spent on data movement, and 80.4% of the data movement energy comes from sub-pixel interpolation and the deblocking filter. Sub-pixel interpolation interpolates the value of pixels at non-integer locations; the deblocking filter is a simple low-pass filter that attempts to remove discontinuity in pixels. Based on our analysis, we conclude that both functions can benefit from PIM execution and can be implemented in PIM logic.
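
To illustrate sub-pixel interpolation, the sketch below computes horizontal half-pixel values by averaging neighboring pixels. VP9's actual interpolation filters are longer (up to 8 taps); this 2-tap average is a deliberate simplification, and the function name is an assumption.

    #include <stdint.h>

    /* Produce the pixel values halfway between horizontal neighbors,
     * with rounding. Real codecs apply longer filter kernels, but the
     * operations are the same: loads, adds, and shifts over every pixel. */
    void half_pel_horizontal(const uint8_t *src, uint8_t *dst,
                             int width, int height, int stride)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width - 1; x++)
                dst[y * stride + x] = (uint8_t)
                    ((src[y * stride + x] + src[y * stride + x + 1] + 1) >> 1);
    }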

Video Capture. Captured video → VP9 encoder → compressed video. 59.1% of the system energy is spent on data movement, and the majority of the data movement energy comes from motion estimation, which accounts for 21.3% of total system energy. Motion estimation compresses the frames by exploiting temporal redundancy between them. It is a good candidate for PIM execution and is feasible to implement in PIM logic.
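
A minimal sketch of the block-matching kernel at the heart of motion estimation: compute the sum of absolute differences (SAD) between a 16x16 block of the current frame and candidate blocks in the reference frame, keeping the best match. Real encoders use smarter search orders than this exhaustive search; the block size, the +/-8 search window, and the function names are assumptions.

    #include <stdint.h>
    #include <limits.h>

    /* SAD between a 16x16 block in the current frame and a candidate
     * block in the reference frame. */
    static int sad16(const uint8_t *cur, const uint8_t *ref, int stride)
    {
        int sad = 0;
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++) {
                int d = cur[y * stride + x] - ref[y * stride + x];
                sad += d < 0 ? -d : d;
            }
        return sad;
    }

    /* Exhaustively search a +/-8 pixel window for the best-matching
     * block. The caller must ensure the window stays inside the frame. */
    void best_match(const uint8_t *cur, const uint8_t *ref, int stride,
                    int *best_dx, int *best_dy)
    {
        int best = INT_MAX;
        for (int dy = -8; dy <= 8; dy++)
            for (int dx = -8; dx <= 8; dx++) {
                int s = sad16(cur, ref + dy * stride + dx, stride);
                if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
            }
    }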

TensorFlow Mobile. 57.3% of the inference energy is spent on data movement, and 54.4% of the data movement energy comes from packing/unpacking and quantization. Packing is a simple data reorganization process that reorders the elements of matrices; quantization is a simple data conversion operation that converts 32-bit floating point values to 8-bit integers. Based on our analysis, we conclude that both functions are good candidates for PIM execution, and it is feasible to implement them in PIM logic.

Data Movement Study. To study data movement during tab switching, we run an experiment in which a user opens 50 tabs (the most-accessed websites), scrolls through each for a few seconds, and then switches to the next tab. We make two key observations: 1) compression and decompression contribute 18.1% of the total system energy, and 2) 19.6 GB of data is swapped in and out of ZRAM.

Other Details and Results in the Paper. A detailed discussion of the other workloads (TensorFlow Mobile, video playback, and video capture); a detailed discussion of the VP9 hardware decoder and encoder; and system integration (software interface, coherence).

Color Blitting Analysis. Color blitting generates a large amount of data movement, accounting for 19.1% of the total system energy used during page scrolling; 63.9% of the energy consumed by color blitting is due to data movement. The majority of the data movement comes from its streaming access pattern and the large sizes of the bitmaps (e.g., 1024x1024 pixels). Color blitting requires only low-cost computation: memset, simple arithmetic for alpha blending, and shift operations. Based on our analysis, we conclude that color blitting is a good candidate for PIM execution, and it is feasible to implement in PIM logic.

Data Movement During Tab Switching. [Chart: data swapped out to ZRAM and swapped in from ZRAM (MB/s) over a 220-second run of switching among 50 tabs.] 11.7 GB of data is compressed and swapped out to ZRAM, at a rate of up to 201 MB/s; 7.8 GB of data is swapped in from ZRAM and decompressed, at a rate of up to 227 MB/s. When switching between 50 tabs, compression and decompression contribute 18.1% of the total system energy.

Methodology: Implementing PIM Targets. [Figure: within the DRAM stack, each vault's logic layer contains either a PIM core with a cache or N PIM accelerators, next to the vault's memory controller.] General-purpose PIM core: a customized low-power single-issue core that runs a single thread of the PIM target and includes a 4-wide SIMD unit. Fixed-function PIM accelerator: a customized in-memory logic unit. We assume a 3.5-4.4 mm^2 area budget for the PIM logic.

Cost Analysis: Texture Tiling. The slide's code sketch (elisions as in the original):

    TextureTiling(int* src, int* dst, ...) {
      for (y = k * columnwidth; y < maxwidth; y++) {
        int swizzle = swizzle1;
        mem_copy(dst + (y * swizzle), src + x, k...);
        for (x = x1; x < x2; x += tile) {
          mem_copy((x + y) ^ swizzle, src + x, x3 - ...);
          swizzle ^= swizzle_bit;
        }
      }
    }

Texture tiling requires only simple primitives: memcopy, bitwise operations, and simple arithmetic. On the 4-wide-SIMD PIM core it occupies 0.33 mm^2 (9.4% of the area available per vault); as a PIM accelerator (multiple in-memory tiling units) it occupies 0.25 mm^2 (7.1% of the area available per vault).

Memory Consumption. A potential solution is to kill inactive background tabs and, when the user accesses those tabs again, reload the pages from disk. Downsides: 1) high latency and 2) loss of data. Instead, Chrome uses compression to reduce each tab's memory footprint: it compresses inactive pages and places them into a DRAM-based memory pool (ZRAM). When the user switches to a previously inactive tab, Chrome loads the data from ZRAM and decompresses it.

Methodology: Identifying PIM Targets. We pick a function as a PIM target candidate if it: consumes the most energy out of all functions, has a significant data movement cost (MPKI > 10), and is bounded by data movement, not computation. We drop any candidate that incurs any performance loss when run on PIM logic, or that requires more area than is available in the logic layer of 3D-stacked memory.