Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand
1 Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu
2 Consumer Devices — Consumer devices are everywhere! Energy consumption is a first-class concern in consumer devices.
3 Popular Google Consumer Workloads — Chrome (Google's web browser), TensorFlow Mobile (Google's machine learning framework), Video Playback (Google's video codec), Video Capture (Google's video codec).
4 Energy Cost of Data Movement — 1st key observation: 62.7% of the total system energy is spent on data movement. [Figure: SoC with four CPUs, L1/L2 caches, and a compute unit, connected to DRAM; data movement highlighted.] Potential solution: Processing-in-Memory (PIM), i.e., move computation close to the data. Challenge: the limited area and energy budget.
5 Using PIM to Reduce Data Movement — 2nd key observation: a significant fraction of data movement often comes from simple functions. We can design lightweight logic to implement these simple functions in memory: a small embedded low-power core (PIM core), or small fixed-function accelerators (PIM accelerators). Offloading to PIM logic reduces energy by 55.4% and improves performance by 54.2% on average.
6 Goals — 1) Understand the data-movement-related bottlenecks in modern consumer workloads. 2) Analyze opportunities to mitigate data movement by using processing-in-memory (PIM). 3) Design PIM logic that can maximize energy efficiency given the limited area and energy budget in consumer devices.
7 Outline — Introduction, Background, Analysis Methodology, Workload Analysis, Evaluation, Conclusion.
8 Potential Solution to Address Data Movement — Processing-in-Memory (PIM): move computation close to the data. This reduces data movement, exploits the large in-memory bandwidth, and exploits the shorter access latency to memory. Enabled by recent advances in 3D-stacked memory.
9 Outline — Introduction, Background, Analysis Methodology, Workload Analysis, Evaluation, Conclusion.
10 Workload Analysis Methodology — Workload characterization: a Chromebook with an Intel Celeron SoC and 2 GB of DRAM; we extensively use the performance counters within the SoC. Energy model: the sum of the energy consumption within the CPU, all caches, off-chip interconnects, and DRAM.
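The energy model above is a weighted sum over counted events. A minimal sketch in C, where the per-event energy constants are purely hypothetical placeholders (the study derives real values from SoC performance counters and component power models):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-event energies in picojoules (illustrative only). */
#define E_CPU_OP_PJ      1.0
#define E_L1_PJ          2.0
#define E_L2_PJ         10.0
#define E_OFFCHIP_PJ    20.0
#define E_DRAM_PJ      100.0

/* Total system energy = CPU + caches + off-chip interconnect + DRAM,
 * each term driven by an event count read from a performance counter. */
double total_energy_pj(uint64_t cpu_ops, uint64_t l1_accesses,
                       uint64_t l2_accesses, uint64_t offchip_xfers,
                       uint64_t dram_accesses)
{
    return cpu_ops * E_CPU_OP_PJ + l1_accesses * E_L1_PJ +
           l2_accesses * E_L2_PJ + offchip_xfers * E_OFFCHIP_PJ +
           dram_accesses * E_DRAM_PJ;
}
```

With these placeholder constants, the DRAM term dominates quickly, which is the intuition behind the 62.7% data-movement figure.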
11 PIM Logic Implementation — The PIM logic sits in the logic layer of 3D-stacked DRAM, alongside the SoC. PIM core: a customized embedded general-purpose core with no aggressive ILP techniques and a 256-bit SIMD unit. PIM accelerators: N small fixed-function accelerators, i.e., multiple copies of a customized in-memory logic unit.
12 Workload Analysis — Chrome (Google's web browser), TensorFlow Mobile (Google's machine learning framework), Video Playback (Google's video codec), Video Capture (Google's video codec).
14 How Chrome Renders a Web Page — Loading and parsing: the HTML parser and CSS parser process the page and build the render tree. Painting: layout, rasterization, and compositing.
15 How Chrome Renders a Web Page — Layout calculates the visual elements and the position of each object; rasterization paints those objects and generates the bitmaps; compositing assembles all layers into a final screen image.
16 Browser Analysis — To satisfy user experience, the browser must provide: fast loading of webpages, smooth scrolling of webpages, and quick switching between browser tabs. We focus on two important user interactions: 1) page scrolling and 2) tab switching. Both include page loading.
17 Scrolling
18 What Happens During Scrolling? — After the HTML/CSS parsers build the render tree and layout runs, rasterization uses color blitters to convert the basic primitives into bitmaps. Then, to minimize cache misses during compositing, the graphics driver reorganizes the bitmaps via texture tiling.
19 Scrolling Energy Analysis — [Chart: fraction of total energy spent on texture tiling, color blitting, and other, for Google Docs, Gmail, Google Calendar, WordPress, and a Twitter animation, plus the average.] 41.9% of page scrolling energy is spent on texture tiling and color blitting.
20 Scrolling a Google Docs Web Page — [Chart: total energy (pJ) across CPU, L1, LLC, interconnect, memory controller, and DRAM for texture tiling, color blitting, and other.] A significant portion of total data movement comes from texture tiling and color blitting: they account for 37.7% of total system energy, and 77% of their energy consumption goes to data movement rather than compute.
21 Can we use PIM to mitigate the data movement cost of texture tiling and color blitting?
22 Can We Use PIM for Texture Tiling? — [Timeline: CPU-only vs. CPU + PIM.] CPU-only: the CPU rasterizes a linear bitmap, reads it back, performs the tiling conversion itself, and writes the texture tiles to memory before invoking compositing — high data movement. CPU + PIM: the CPU rasterizes the linear bitmap and invokes compositing, while the PIM logic performs the conversion from linear bitmap to texture tiles in memory. Major sources of data movement: 1) poor data locality and 2) the large rasterized bitmap size (e.g., 4 MB). Texture tiling is a good candidate for PIM execution.
23 Can We Implement Texture Tiling in PIM Logic? — Converting a linear bitmap into a tiled texture requires only simple primitives: memcpy, bitwise operations, and simple arithmetic. PIM core: 9.4% of the area available for PIM logic. PIM accelerator: 7.1% of the area available for PIM logic. Both the PIM core and the PIM accelerator are feasible for implementing in-memory texture tiling.
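The linear-to-tiled conversion can be sketched as a blocked copy in C. The 4x4 tile size and direct index arithmetic below are illustrative assumptions, not the actual parameters of Chrome's tiler or the paper's PIM logic:

```c
#include <stddef.h>
#include <stdint.h>

#define TILE 4  /* illustrative tile edge; real tilers are HW-specific */

/* Convert a row-major (linear) bitmap of 32-bit pixels into a tiled
 * layout: each TILE x TILE block is stored contiguously, so later
 * compositing touches one small, cache-friendly region per tile.
 * Assumes width and height are multiples of TILE for brevity. */
void linear_to_tiled(const uint32_t *src, uint32_t *dst,
                     size_t width, size_t height)
{
    size_t tiles_per_row = width / TILE;
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            size_t tx = x / TILE, ty = y / TILE;      /* which tile    */
            size_t ix = x % TILE, iy = y % TILE;      /* offset inside */
            size_t tile_base = (ty * tiles_per_row + tx) * TILE * TILE;
            dst[tile_base + iy * TILE + ix] = src[y * width + x];
        }
    }
}
```

Note the source side is a pure streaming read while the destination side scatters into tiles — the poor locality the slide identifies.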
24 Color Blitting Analysis — Color blitting generates a large amount of data movement and accounts for 19.1% of the total system energy during scrolling. It requires only low-cost operations: memset, simple arithmetic, and shift operations. Color blitting is a good candidate for PIM execution, and it is feasible to implement in the PIM core and the PIM accelerator.
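The per-pixel work of a color blitter can be sketched as an alpha blend over one color channel. The rounding constant and row-at-a-time structure here are illustrative, not the actual blitter code in Chrome's graphics stack (which typically replaces the divide with shift/add tricks):

```c
#include <stddef.h>
#include <stdint.h>

/* One channel of an alpha blend: out = (src*a + dst*(255-a)) / 255,
 * using only multiplies, adds, and one divide (+127 rounds to nearest). */
uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    unsigned v = (unsigned)src * alpha + (unsigned)dst * (255u - alpha);
    return (uint8_t)((v + 127u) / 255u);
}

/* Blit one row of pixels: the streaming read-src/read-dst/write-dst
 * pattern is what generates the data movement measured in the talk. */
void blit_row(const uint8_t *src, uint8_t *dst, size_t n, uint8_t alpha)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = blend_channel(src[i], dst[i], alpha);
}
```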
25 Scrolling Wrap-Up — Texture tiling and color blitting account for a significant portion (41.9%) of energy consumption; 37.7% of total system energy goes to the data movement generated by these functions. 1) Both functions can benefit significantly from PIM execution. 2) Both functions are feasible to implement as PIM logic.
26 Tab Switching
27 What Happens During Tab Switching? — Chrome employs a multi-process architecture: each tab is a separate process. [Figure: Chrome process plus Tab 1 … Tab N processes.] Main operations during tab switching: a context switch and loading the new page.
28 Memory Consumption — Primary concerns during tab switching: how fast a new tab loads and becomes interactive, and memory consumption. Chrome uses compression to reduce each tab's memory footprint: the CPU compresses inactive tabs into a DRAM-based pool (ZRAM) and decompresses them on reactivation.
29 Data Movement Study — To study data movement during tab switching, we emulate a user switching through 50 tabs. We make two key observations: 1) compression and decompression contribute 18.1% of the total system energy; 2) 19.6 GB of data moves between the CPU and ZRAM.
30 Can We Use PIM to Mitigate the Cost? — [Timeline: CPU-only vs. CPU + PIM.] CPU-only: to swap out N pages, the CPU reads the N uncompressed pages, compresses them, and writes them back to ZRAM — high data movement while other tasks wait. CPU + PIM: the CPU initiates the swap-out and continues with other tasks, while the PIM logic compresses the pages into ZRAM with no off-chip data movement. The PIM core and PIM accelerator are feasible for implementing in-memory compression/decompression.
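Linux's zram swap device uses LZ-class compressors (commonly LZO or LZ4). As a stand-in, the toy run-length coder below shows the compress/decompress roundtrip that the PIM logic would perform near memory — the algorithm is illustrative, only the roundtrip structure matters:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy run-length coder standing in for ZRAM's LZ-class compressor:
 * output is (count, byte) pairs. Returns the compressed size. */
size_t rle_compress(const uint8_t *src, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        uint8_t b = src[i];
        size_t run = 1;
        while (i + run < n && src[i + run] == b && run < 255) run++;
        out[o++] = (uint8_t)run;
        out[o++] = b;
        i += run;
    }
    return o;
}

/* Inverse transform, used when the user switches back to the tab. */
size_t rle_decompress(const uint8_t *src, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        for (uint8_t k = 0; k < src[i]; k++)
            out[o++] = src[i + 1];
    return o;
}
```

Run on the CPU, both directions move every page across the off-chip channel twice; run in the logic layer, the uncompressed bytes never leave the memory stack.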
31 Tab Switching Wrap-Up — A large amount of data movement happens during tab switching as Chrome attempts to compress and decompress tabs. Both functions can benefit from PIM execution and can be implemented as PIM logic.
32 Workload Analysis — Chrome (Google's web browser), TensorFlow Mobile (Google's machine learning framework), Video Playback (Google's video codec), Video Capture (Google's video codec).
34 TensorFlow Mobile — During inference (prediction), 57.3% of the inference energy is spent on data movement; 54.4% of the data movement energy comes from packing/unpacking and quantization.
35 Packing — Matrix packing reorders the elements of matrices to minimize cache misses during matrix multiplication. It is a simple data reorganization process that requires only simple arithmetic, yet it accounts for up to 40% of the inference energy and 31% of inference execution time, and packing's data movement alone accounts for up to 35.3% of the inference energy.
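Packing can be sketched as copying a row-major matrix into contiguous column panels so the matrix-multiply inner loop streams through memory instead of striding across rows. The panel width of 4 and the int8 element type are illustrative assumptions; TensorFlow Mobile's low-precision GEMM library picks its layout per architecture:

```c
#include <stddef.h>
#include <stdint.h>

#define PANEL 4  /* illustrative panel width */

/* Pack a row-major K x N matrix into panels: each group of PANEL
 * columns is stored contiguously, row by row. Assumes N is a
 * multiple of PANEL for brevity. */
void pack_panels(const int8_t *src, int8_t *dst, size_t k, size_t n)
{
    size_t o = 0;
    for (size_t p = 0; p < n; p += PANEL)        /* each column panel  */
        for (size_t row = 0; row < k; row++)     /* walk all rows      */
            for (size_t c = 0; c < PANEL; c++)   /* PANEL cols at once */
                dst[o++] = src[row * n + p + c];
}
```

The transform does no arithmetic on the values at all — it is pure data movement, which is why it is such a natural PIM target.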
36 Quantization — Quantization converts 32-bit floating point values to 8-bit integers to improve inference execution time and energy consumption. It is a simple data conversion operation that requires only shift, addition, and multiplication operations, yet it accounts for up to 16.8% of the inference energy and 16.1% of inference execution time, and the majority of quantization energy comes from data movement.
37 Based on our analysis, we conclude that both packing and quantization are good candidates for PIM execution, and it is feasible to implement them in PIM logic.
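The float-to-int8 conversion can be sketched as affine quantization. The fixed scale/zero-point interface below is a simplification (TensorFlow derives them from each tensor's observed min/max range), and the helper names are illustrative:

```c
#include <stdint.h>

/* Affine quantization: q = round(x / scale) + zero_point,
 * clamped to the uint8 range [0, 255]. */
uint8_t quantize(float x, float scale, int zero_point)
{
    long q = (long)(x / scale + (x >= 0 ? 0.5f : -0.5f)) + zero_point;
    if (q < 0)   q = 0;
    if (q > 255) q = 255;
    return (uint8_t)q;
}

/* Inverse map back to floating point for inspection/debugging. */
float dequantize(uint8_t q, float scale, int zero_point)
{
    return ((int)q - zero_point) * scale;
}
```

Each element is touched exactly once, so the cost is dominated by streaming the tensor through the memory hierarchy — the data-movement bound the slide describes.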
38 Video Playback and Capture — Playback: compressed video → VP9 decoder → display. Capture: captured video → VP9 encoder → compressed video. The majority of energy is spent on data movement, and the majority of data movement comes from simple functions in the decoding and encoding pipelines.
39 Outline — Introduction, Background, Analysis Methodology, Workload Analysis, Evaluation, Conclusion.
40 Evaluation Methodology — We study each target in isolation, emulating each separately and running it in our simulator. System configuration (gem5 simulator): SoC: 4 out-of-order cores, 8-wide issue, 64 kB L1 cache, 2 MB L2 cache. PIM core: 1 core per vault, 1-wide issue, 4-wide SIMD, 32 kB L1 cache. 3D-stacked memory: 2 GB cube, 16 vaults per cube; internal bandwidth 256 GB/s; off-chip channel bandwidth 32 GB/s. Baseline memory: LPDDR3, 2 GB, FR-FCFS scheduler.
41 Normalized Energy — [Chart: normalized energy for CPU-Only, PIM-Core, and PIM-Acc across texture tiling, color blitting, compression, and decompression (Chrome browser); packing and quantization (TensorFlow Mobile); and sub-pixel interpolation, deblocking filter, and motion estimation (video playback and capture).] 77.7% and 82.6% of the energy reduction for texture tiling and packing comes from eliminating data movement. The PIM core and PIM accelerator reduce energy consumption on average by 49.1% and 55.4%, respectively.
42 Normalized Runtime — [Chart: normalized runtime for CPU-Only, PIM-Core, and PIM-Acc across texture tiling, color blitting, compression, and decompression (Chrome browser); sub-pixel interpolation, deblocking filter, and motion estimation (video playback and capture); and the TensorFlow Mobile kernels.] Offloading these kernels to the PIM core and PIM accelerator improves performance on average by 44.6% and 54.2%, respectively.
43 Conclusion — Energy consumption is a major challenge in consumer devices. We conduct an in-depth analysis of popular Google consumer workloads: 62.7% of the total system energy is spent on data movement, and most of the data movement often comes from simple functions that consist of simple operations. We use PIM to reduce the data movement cost, designing lightweight logic (a PIM core and PIM accelerators) that implements these simple operations in DRAM. Our PIM logic reduces total energy by 55.4% on average and reduces execution time by 54.2% on average.
44 Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu
45 Backup
46 Video Playback — Compressed video → VP9 decoder → display. 63.5% of the system energy is spent on data movement; 80.4% of the data movement energy comes from sub-pixel interpolation and the deblocking filter. Sub-pixel interpolation interpolates the value of pixels at non-integer locations; the deblocking filter is a simple low-pass filter that attempts to remove discontinuities in pixels.
47 Based on our analysis, we conclude that both functions can benefit from PIM execution and can be implemented in PIM logic.
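Sub-pixel interpolation can be sketched with a two-tap bilinear half-pixel filter. VP9's real sub-pixel filters use longer taps (up to 8), so this is only an illustration of the access pattern, not the codec's actual filter:

```c
#include <stddef.h>
#include <stdint.h>

/* Two-tap half-pixel interpolation along a row: each output pixel is
 * the rounded average of two neighbors. Every output reads a small
 * neighborhood of inputs, so the kernel streams the reference frame. */
void half_pel_row(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        dst[i] = (uint8_t)((src[i] + src[i + 1] + 1) >> 1);
}
```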
48 Video Capture — Captured video → VP9 encoder → compressed video. 59.1% of the system energy is spent on data movement; the majority of the data movement energy comes from motion estimation, which accounts for 21.3% of total system energy. Motion estimation compresses frames by exploiting the temporal redundancy between them.
49 Motion estimation is a good candidate for PIM execution, and it is feasible to implement it in PIM logic.
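The core of motion estimation is a block-matching cost such as the sum of absolute differences (SAD). The 4x4 block size below is an illustrative assumption (VP9 uses larger, variable block sizes), but it shows why the search streams so much reference-frame data:

```c
#include <stdint.h>
#include <stdlib.h>

#define BLK 4  /* illustrative block edge */

/* Sum of absolute differences between a BLK x BLK block of the current
 * frame and a candidate block of the reference frame. Motion estimation
 * evaluates many candidate offsets per block, rereading the reference
 * frame each time — the dominant data movement in the encoder. */
unsigned sad(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned s = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            s += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return s;
}
```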
50 TensorFlow Mobile — During inference (prediction), 57.3% of the inference energy is spent on data movement; 54.4% of the data movement energy comes from packing/unpacking and quantization. Packing is a simple data reorganization process that reorders the elements of matrices; quantization is a simple data conversion operation that converts 32-bit floating point values to 8-bit integers.
51 Based on our analysis, we conclude that both functions are good candidates for PIM execution, and it is feasible to implement them in PIM logic.
52 Data Movement Study — To study data movement during tab switching, we ran an experiment: a user opens 50 tabs (the most-accessed websites), scrolls through each for a few seconds, and switches to the next tab. We make two key observations: 1) compression and decompression contribute 18.1% of the total system energy; 2) 19.6 GB of data is swapped in and out of ZRAM.
53 Other Details and Results in the Paper — Detailed discussion of the other workloads: TensorFlow Mobile, video playback, and video capture. Detailed discussion of the VP9 hardware decoder and encoder. System integration: the software interface and coherence.
54 Color Blitting Analysis — Color blitting generates a large amount of data movement, accounting for 19.1% of the total system energy used during page scrolling; 63.9% of the energy consumed by color blitting is due to data movement. The majority of the data movement comes from the streaming access pattern and the large sizes of the bitmaps (e.g., 1024x1024 pixels). Color blitting requires only low-cost computation operations: memset, simple arithmetic for alpha blending, and shift operations. Based on our analysis, we conclude that color blitting is a good candidate for PIM execution, and it is feasible to implement color blitting in PIM logic.
55 Data Movement During Tab Switching — [Charts: data swapped out to ZRAM (MB/s) and data swapped in from ZRAM (MB/s) over time.] Data is compressed and swapped out to ZRAM at rates of up to 201 MB/s, and swapped in from ZRAM and decompressed at rates of up to 227 MB/s; in total, 19.6 GB of data is swapped in and out of ZRAM. When switching between tabs, compression and decompression contribute 18.1% of the total system energy.
56 Methodology: Implementing PIM Targets — [Figure: SoC connected to 3D-stacked DRAM; each vault's logic layer holds either a PIM core with a cache or PIM accelerators 1…N, behind per-vault memory controllers.] General-purpose PIM core: a customized low-power, single-issue in-memory core that runs a single thread of the PIM target and includes a 4-wide SIMD unit. Fixed-function PIM accelerator: a customized in-memory logic unit. Both must fit within the limited mm² area budget for the PIM logic.
57 Cost Analysis: Texture Tiling — The texture tiling kernel requires only simple primitives: memcpy, bitwise operations, and simple arithmetic:

  TextureTiling(int* src, int* dst, ...) {
    for (y = k * column_width; y < max_width; y++) {
      int swizzle = swizzle1;
      mem_copy(dst + (y * swizzle), src + x, ...);
      for (x = x1; x < x2; x += tile) {
        mem_copy(dst + ((x + y) ^ swizzle), src + x, ...);
        swizzle ^= swizzle_bit;
      }
    }
  }

PIM core (with a 4-wide SIMD unit): 0.33 mm² (9.4% of the area available per vault). PIM accelerator: multiple in-memory tiling units (7.1% of the area available per vault).
58 Memory Consumption — A potential solution: kill inactive background tabs and reload the tab pages from disk when the user accesses those tabs again. Downsides: 1) high latency and 2) loss of data. Instead, Chrome uses compression to reduce each tab's memory footprint: it compresses inactive pages and places them into a DRAM-based memory pool (ZRAM). When the user switches to a previously-inactive tab, Chrome loads the data from ZRAM and decompresses it.
59 Methodology: Identifying PIM Targets — We pick a function as a PIM target candidate if it: consumes the most energy out of all functions, has a significant data movement cost (MPKI > 10), and is bounded by data movement, not computation. We drop any candidate that incurs a performance loss when run on PIM logic, or that requires more area than is available in the logic layer of 3D-stacked memory.
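The selection rule above can be sketched as a filter over profiled per-function statistics. The struct layout and the 10% energy-share cutoff are illustrative assumptions (the slide only says "consumes the most energy"):

```c
#include <stdbool.h>

/* Candidate statistics for one function, as gathered during profiling. */
struct func_stats {
    double energy_share;  /* fraction of total energy, 0..1            */
    double mpki;          /* last-level cache misses per kilo-instr.   */
    bool   memory_bound;  /* bounded by data movement, not compute     */
    bool   pim_slowdown;  /* loses performance when run on PIM logic   */
    double area_mm2;      /* area needed in the logic layer            */
};

/* Pick energy-hungry, data-movement-bound functions (MPKI > 10), then
 * drop any that slow down on PIM or exceed the logic-layer area budget. */
bool is_pim_target(const struct func_stats *f, double area_budget_mm2)
{
    if (f->mpki <= 10.0 || !f->memory_bound) return false;
    if (f->pim_slowdown)                     return false;
    if (f->area_mm2 > area_budget_mm2)       return false;
    return f->energy_share > 0.1;  /* hypothetical cutoff */
}
```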
More informationChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality
ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce
More informationArchitectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1
Architectures Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1 Overview of today s lecture The idea is to cover some of the existing graphics
More informationOptimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research
Optimizing Games for ATI s IMAGEON 2300 Aaftab Munshi 3D Architect ATI Research A A 3D hardware solution enables publishers to extend brands to mobile devices while remaining close to original vision of
More informationA Row Buffer Locality-Aware Caching Policy for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu
A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Overview Emerging memories such as PCM offer higher density than
More informationDesign, Implementation and Evaluation of a Task-parallel JPEG Decoder for the Libjpeg-turbo Library
Design, Implementation and Evaluation of a Task-parallel JPEG Decoder for the Libjpeg-turbo Library Jingun Hong 1, Wasuwee Sodsong 1, Seongwook Chung 1, Cheong Ghil Kim 2, Yeongkyu Lim 3, Shin-Dug Kim
More informationCharacterization of Native Signal Processing Extensions
Characterization of Native Signal Processing Extensions Jason Law Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 jlaw@mail.utexas.edu Abstract Soon if
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationInternational Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)
A Technical Analysis Towards Digital Video Compression Rutika Joshi 1, Rajesh Rai 2, Rajesh Nema 3 1 Student, Electronics and Communication Department, NIIST College, Bhopal, 2,3 Prof., Electronics and
More information0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV
0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV Rajeev Jayavant Cyrix Corporation A National Semiconductor Company 8/18/98 1 0;L$UFKLWHFWXUDO)HDWXUHV ¾ Next-generation Cayenne Core Dual-issue pipelined
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationTETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural
More informationWorking with Metal Overview
Graphics and Games #WWDC14 Working with Metal Overview Session 603 Jeremy Sandmel GPU Software 2014 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission
More informationGPUs and GPGPUs. Greg Blanton John T. Lubia
GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationPVRTC & Texture Compression. User Guide
Public Imagination Technologies PVRTC & Texture Compression Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty
More informationManycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.
phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this
More informationVulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques
Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques Yu Cai, Saugata Ghose, Yixin Luo, Ken Mai, Onur Mutlu, Erich F. Haratsch February 6, 2017
More informationPROFESSIONAL. WebGL Programming DEVELOPING 3D GRAPHICS FOR THE WEB. Andreas Anyuru WILEY. John Wiley & Sons, Ltd.
PROFESSIONAL WebGL Programming DEVELOPING 3D GRAPHICS FOR THE WEB Andreas Anyuru WILEY John Wiley & Sons, Ltd. INTRODUCTION xxl CHAPTER 1: INTRODUCING WEBGL 1 The Basics of WebGL 1 So Why Is WebGL So Great?
More informationCOMP 3221: Microprocessors and Embedded Systems
COMP 3: Microprocessors and Embedded Systems Lectures 7: Cache Memory - III http://www.cse.unsw.edu.au/~cs3 Lecturer: Hui Wu Session, 5 Outline Fully Associative Cache N-Way Associative Cache Block Replacement
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More information15-740/ Computer Architecture, Fall 2011 Midterm Exam II
15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem
More informationAnatomy of AMD s TeraScale Graphics Engine
Anatomy of AMD s TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationManaging GPU Concurrency in Heterogeneous Architectures
Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationGPU Memory Model. Adapted from:
GPU Memory Model Adapted from: Aaron Lefohn University of California, Davis With updates from slides by Suresh Venkatasubramanian, University of Pennsylvania Updates performed by Gary J. Katz, University
More informationRasterization and Graphics Hardware. Not just about fancy 3D! Rendering/Rasterization. The simplest case: Points. When do we care?
Where does a picture come from? Rasterization and Graphics Hardware CS559 Course Notes Not for Projection November 2007, Mike Gleicher Result: image (raster) Input 2D/3D model of the world Rendering term
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationComputer Architecture Lecture 24: Memory Scheduling
18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM
More informationGraphics Performance Optimisation. John Spitzer Director of European Developer Technology
Graphics Performance Optimisation John Spitzer Director of European Developer Technology Overview Understand the stages of the graphics pipeline Cherchez la bottleneck Once found, either eliminate or balance
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,
More informationOpenGL Status - November 2013 G-Truc Creation
OpenGL Status - November 2013 G-Truc Creation Vendor NVIDIA AMD Intel Windows Apple Release date 02/10/2013 08/11/2013 30/08/2013 22/10/2013 Drivers version 331.10 beta 13.11 beta 9.2 10.18.10.3325 MacOS
More informationBifrost - The GPU architecture for next five billion
Bifrost - The GPU architecture for next five billion Hessed Choi Senior FAE / ARM ARM Tech Forum June 28 th, 2016 Vulkan 2 ARM 2016 What is Vulkan? A 3D graphics API for the next twenty years Logical successor
More informationOptimizing DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most
More informationNeural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das M-Bits Research
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationMobile HW and Bandwidth
Your logo on white Mobile HW and Bandwidth Andrew Gruber Qualcomm Technologies, Inc. Agenda and Goals Describe the Power and Bandwidth challenges facing Mobile Graphics Describe some of the Power Saving
More informationPageForge: A Near-Memory Content- Aware Page-Merging Architecture
PageForge: A Near-Memory Content- Aware Page-Merging Architecture Dimitrios Skarlatos, Nam Sung Kim, and Josep Torrellas University of Illinois at Urbana-Champaign MICRO-50 @ Boston Motivation: Server
More informationMeet the Walkers! Accelerating Index Traversals for In-Memory Databases"
Meet the Walkers! Accelerating Index Traversals for In-Memory Databases Onur Kocberber Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, Parthasarathy Ranganathan Our World is Data-Driven! Data resides
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationTexture Compression in Memory- and Performance- Constrained Embedded Systems. Jens Ogniewski Information Coding Group Linköpings University
Texture Compression in Memory- and Performance- Constrained Embedded Systems Jens Ogniewski Information Coding Group Linköpings University Outline Background / Motivation DXT / PVRTC texture compression
More informationPARCEL: Proxy Assisted BRowsing in Cellular networks for Energy and Latency reduction
PARCEL: Proxy Assisted BRowsing in Cellular networks for Energy and Latency reduction Ashiwan Sivakumar 1, Shankaranarayanan PN 1, Vijay Gopalakrishnan 2, Seungjoon Lee 3*, Sanjay Rao 1 and Subhabrata
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More information