Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand


1 Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu

2 Consumer Devices Consumer devices are everywhere! Energy consumption is a first-class concern in consumer devices 2

3 Popular Google Consumer Workloads Chrome Google's web browser TensorFlow Mobile Google's machine learning framework Video Playback Google's video codec Video Capture Google's video codec

4 Energy Cost of Data Movement 1st key observation: 62.7% of the total system energy is spent on data movement [Figure: data moves between the SoC (CPUs, L1, L2, compute units) and DRAM] Potential solution: move computation close to data (processing-in-memory, PIM) Challenge: limited area and energy budget

5 Using PIM to Reduce Data Movement 2nd key observation: a significant fraction of data movement often comes from simple functions We can design lightweight logic to implement these simple functions in memory: a small embedded low-power core (PIM Core), or small fixed-function accelerators (PIM Accelerators) Offloading to PIM logic reduces energy by 55.4% and improves performance by 54.2% on average

6 Goals 1) Understand the data-movement-related bottlenecks in modern consumer workloads 2) Analyze opportunities to mitigate data movement by using processing-in-memory (PIM) 3) Design PIM logic that can maximize energy efficiency given the limited area and energy budget in consumer devices

7 7 Outline Introduction Background Analysis Methodology Workload Analysis Evaluation Conclusion

8 8 Potential Solution to Address Data Movement Processing-in-Memory (PIM) A potential solution to reduce data movement Idea: move computation close to data Reduces data movement Exploits large in-memory bandwidth Exploits shorter access latency to memory Enabled by recent advances in 3D-stacked memory

9 9 Outline Introduction Background Analysis Methodology Workload Analysis Evaluation Conclusion

10 Workload Analysis Methodology Workload Characterization: Chromebook with an Intel Celeron SoC and 2 GB of DRAM; extensive use of performance counters within the SoC Energy Model: sum of the energy consumption within the CPU, all caches, off-chip interconnects, and DRAM

11 PIM Logic Implementation (in the logic layer of 3D-stacked DRAM) PIM Core: customized embedded general-purpose core, no aggressive ILP techniques, 256-bit SIMD unit PIM Accelerator: small fixed-function accelerators, multiple copies of a customized in-memory logic unit

12 Workload Analysis Chrome Google's web browser TensorFlow Google's machine learning framework Video Playback Google's video codec Video Capture Google's video codec

13 Workload Analysis Chrome Google's web browser TensorFlow Google's machine learning framework Video Playback Google's video codec Video Capture Google's video codec

14 How Chrome Renders a Web Page [Pipeline: Loading and Parsing (HTML, CSS -> HTML Parser, CSS Parser -> Render Tree) -> Layout -> Painting (Rasterization -> Compositing)]

15 How Chrome Renders a Web Page [Pipeline: Loading and Parsing -> Layout -> Painting] Layout calculates the visual elements and position of each object, Rasterization paints those objects and generates the bitmaps, and Compositing assembles all layers into a final screen image

16 16 Browser Analysis To satisfy user experience, the browser must provide: Fast loading of webpages Smooth scrolling of webpages Quick switching between browser tabs We focus on two important user interactions: 1) Page Scrolling 2) Tab Switching Both include page loading

17 Scrolling 17

18 What Happens During Scrolling? Rasterization uses color blitters to convert the basic primitives into bitmaps; to minimize cache misses during compositing, the graphics driver reorganizes the bitmaps through texture tiling [Pipeline: HTML, CSS -> HTML Parser, CSS Parser -> Render Tree -> Layout -> Rasterization (Color Blitting) -> Texture Tiling -> Compositing]

19 Scrolling Energy Analysis [Chart: fraction of total energy spent on texture tiling, color blitting, and other work while scrolling Google Docs, Gmail, Google Calendar, WordPress, and a Twitter animation, plus the average] 41.9% of page scrolling energy is spent on texture tiling and color blitting

20 Scrolling a Google Docs Web Page [Charts: total energy (pJ) of texture tiling, color blitting, and other work across the CPU, L1, LLC, interconnect, memory controller, and DRAM; fraction of total energy split into data movement vs. compute for texture tiling and color blitting] A significant portion of total data movement comes from texture tiling and color blitting: 37.7% of total system energy; 77% of the energy consumed by these functions goes to data movement

21 Scrolling a Google Docs Web Page [Same charts as the previous slide] A significant portion of total data movement comes from texture tiling and color blitting: 37.7% of total system energy; 77% of the energy consumed by these functions goes to data movement Can we use PIM to mitigate the data movement cost for texture tiling and color blitting?

22 Can We Use PIM for Texture Tiling? [Timeline comparison: CPU-Only: after rasterization, the CPU reads the linear bitmap, performs the conversion to texture tiles (high data movement), writes back the tiles, and then invokes compositing; CPU + PIM: the CPU rasterizes and invokes compositing while the in-memory logic performs the linear-bitmap-to-texture-tile conversion, leaving the CPU idle in between] Major sources of data movement: 1) poor data locality 2) large rasterized bitmap size (e.g., 4 MB) Texture tiling is a good candidate for PIM execution

23 Can We Implement Texture Tiling in PIM Logic? Converting a linear bitmap into a tiled texture requires simple primitives: memcopy, bitwise operations, and simple arithmetic operations PIM Core: 9.4% of the area available for PIM logic PIM Accelerator: 7.1% of the area available for PIM logic Texture tiling is feasible to implement in memory with either the PIM core or the PIM accelerator
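To make the kind of computation concrete, below is a minimal C sketch of linear-to-tiled conversion, assuming a 4x4-pixel tile, 32-bit pixels, and dimensions that are multiples of the tile size; it only illustrates the primitives involved (memcpy plus simple index arithmetic) and is not the paper's actual kernel, which uses address swizzling.

    #include <stdint.h>
    #include <string.h>

    #define TILE 4  /* illustrative tile edge, in pixels */

    /* Copy a row-major linear bitmap (width x height, 32-bit pixels) into a
       tiled layout where each TILE x TILE block is stored contiguously.
       width and height are assumed to be multiples of TILE. */
    void tile_bitmap(const uint32_t *linear, uint32_t *tiled, int width, int height)
    {
        int tiles_per_row = width / TILE;
        for (int y = 0; y < height; y++) {
            for (int tx = 0; tx < tiles_per_row; tx++) {
                /* destination: tile index * pixels-per-tile + row inside the tile */
                uint32_t *dst = tiled + ((y / TILE) * tiles_per_row + tx) * TILE * TILE
                                      + (y % TILE) * TILE;
                /* source: one TILE-wide run of the current linear scanline */
                const uint32_t *src = linear + y * width + tx * TILE;
                memcpy(dst, src, TILE * sizeof(uint32_t));
            }
        }
    }

Each iteration touches only a short, strided piece of the linear bitmap, which is why the access pattern has poor cache locality on the CPU but maps naturally onto logic placed next to the DRAM.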

24 24 Color Blitting Analysis Generates a large amount of data movement Accounts for 19.1% of the total system energy during scrolling Color blitting is a good candidate for PIM execution Requires low-cost operations: Memset, simple arithmetic, and shift operations It is feasible to implement color blitting in PIM core and PIM accelerator
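As a rough illustration of why color blitting only needs memset-style fills, simple arithmetic, and shifts, here is a minimal C sketch that fills a bitmap with a solid color using alpha blending; the 8-bit BGRA pixel layout and the shift-based divide-by-255 approximation are illustrative assumptions, not Chrome's actual blitter.

    #include <stdint.h>

    /* Fill num_pixels of a BGRA8888 destination bitmap with a solid source
       color (sr, sg, sb) at opacity sa, blending over the existing pixels. */
    void blit_solid_bgra(uint8_t *dst, int num_pixels,
                         uint8_t sr, uint8_t sg, uint8_t sb, uint8_t sa)
    {
        if (sa == 255) {                     /* fully opaque: a plain fill */
            for (int i = 0; i < num_pixels; i++) {
                uint8_t *p = dst + 4 * i;
                p[0] = sb; p[1] = sg; p[2] = sr; p[3] = 255;
            }
            return;
        }
        for (int i = 0; i < num_pixels; i++) {
            uint8_t *p = dst + 4 * i;
            /* multiply, add, and shift (approximate division by 255) */
            p[0] = (uint8_t)((sb * sa + p[0] * (255 - sa)) >> 8);
            p[1] = (uint8_t)((sg * sa + p[1] * (255 - sa)) >> 8);
            p[2] = (uint8_t)((sr * sa + p[2] * (255 - sa)) >> 8);
            p[3] = 255;
        }
    }

The kernel streams through the whole bitmap once, so almost all of its cost comes from moving pixel data rather than from arithmetic.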

25 25 Scrolling Wrap Up Texture tiling and color blitting account for a significant portion (41.9%) of energy consumption 37.7% of total system energy goes to data movement generated by these functions 1 Both functions can benefit significantly from PIM execution 2 Both functions are feasible to implement as PIM logic

26 Tab Switching 26

27 27 What Happens During Tab Switching? Chrome employs a multi-process architecture Each tab is a separate process Chrome Process Tab 1 Process Tab 2 Process Tab N Process Main operations during tab switching: Context switch Load the new page

28 Memory Consumption Primary concerns during tab switching: how fast a new tab loads and becomes interactive, and memory consumption Chrome uses compression to reduce each tab's memory footprint [Figure: the CPU compresses inactive tabs into a DRAM-based ZRAM pool and decompresses them when they become active again]

29 Data Movement Study To study data movement during tab switching, we emulate a user switching through 50 tabs We make two key observations: 1) compression and decompression contribute to 18.1% of the total system energy 2) 19.6 GB of data moves between the CPU and ZRAM

30 Can We Use PIM to Mitigate the Cost? [Timeline comparison for compression: CPU-Only: to swap out N pages, the CPU reads the uncompressed pages, compresses them, and writes them back to ZRAM, causing high data movement; CPU + PIM: the CPU swaps out the N pages and moves on to other tasks while the PIM logic compresses them into ZRAM with no off-chip data movement] PIM core and PIM accelerator are feasible for implementing in-memory compression/decompression
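A minimal sketch of how this offload flow might look in C follows, assuming a hypothetical request descriptor handed to the PIM logic; compress_page() is only a stub standing in for the real compressor (Chrome's ZRAM path uses an LZ-family algorithm), and none of these names are real Chrome or Linux interfaces.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Stub compressor so the sketch compiles; a real implementation would run
       an LZ-style algorithm over the page and usually return fewer bytes. */
    static size_t compress_page(const uint8_t *src, uint8_t *dst)
    {
        for (size_t i = 0; i < PAGE_SIZE; i++)
            dst[i] = src[i];
        return PAGE_SIZE;
    }

    /* Hypothetical request descriptor: the CPU lists inactive pages that are
       already resident in DRAM; the PIM core compresses them into a
       ZRAM-style pool in place, so uncompressed page data never crosses the
       off-chip channel. */
    struct swap_out_req {
        const uint8_t **pages;    /* inactive pages to compress           */
        int             count;
        uint8_t        *pool;     /* destination ZRAM-style pool in DRAM  */
        size_t          pool_off; /* current write offset into the pool   */
    };

    /* Runs on the PIM core; the CPU only builds and submits the descriptor. */
    void pim_swap_out(struct swap_out_req *req)
    {
        for (int i = 0; i < req->count; i++)
            req->pool_off += compress_page(req->pages[i],
                                           req->pool + req->pool_off);
    }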

31 Tab Switching Wrap Up A large amount of data movement happens during tab switching as Chrome attempts to compress and decompress tabs Both functions can benefit from PIM execution and can be implemented as PIM logic

32 Workload Analysis Chrome Google's web browser TensorFlow Google's machine learning framework Video Playback Google's video codec Video Capture Google's video codec

33 Workload Analysis Chrome Google's web browser TensorFlow Google's machine learning framework Video Playback Google's video codec Video Capture Google's video codec

34 34 TensorFlow Mobile Inference Prediction 57.3% of the inference energy is spent on data movement 54.4% of the data movement energy comes from packing/unpacking and quantization

35 Packing [Figure: Matrix -> Packing -> Packed Matrix] Packing reorders elements of matrices to minimize cache misses during matrix multiplication It accounts for up to 40% of the inference energy and 31% of inference execution time Packing's data movement accounts for up to 35.3% of the inference energy It is a simple data reorganization process that requires simple arithmetic
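As an illustration of the kind of data reorganization packing performs, here is a minimal C sketch that copies a row-major matrix into column blocks laid out contiguously; the block width KC and the 8-bit element type are illustrative assumptions and do not match the exact layout used by TensorFlow Mobile's matrix multiply library.

    #include <stdint.h>

    #define KC 8  /* illustrative block width, in columns */

    /* Pack a row-major M x K matrix so that each block of KC columns is
       stored contiguously, letting the matrix-multiply inner loop stream
       through memory instead of striding across full rows. */
    void pack_rows(const uint8_t *src, uint8_t *dst, int M, int K)
    {
        int out = 0;
        for (int k0 = 0; k0 < K; k0 += KC) {              /* one column block */
            int kc = (K - k0 < KC) ? (K - k0) : KC;
            for (int m = 0; m < M; m++)                   /* every row        */
                for (int k = 0; k < kc; k++)              /* contiguous copy  */
                    dst[out++] = src[m * K + k0 + k];
        }
    }

The work is pure index arithmetic and copying, which is why most of its cost on the CPU is data movement rather than computation.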

36 36 Quantization floating point Quantization integer Converts 32-bit floating point to 8-bit integers to improve inference execution time and energy consumption Up to 16.8% of the inference energy and 16.1% of inference execution time Majority of quantization energy comes from data movement A simple data conversion operation that requires shift, addition, and multiplication operations
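To show the kind of arithmetic involved, a minimal C sketch of affine float-to-8-bit quantization follows; it captures the general idea (scale, zero point, rounding, clamping), but the exact rounding and range handling in TensorFlow Mobile differ, and min_val < max_val is assumed.

    #include <stdint.h>
    #include <math.h>

    /* Map 32-bit floats in [min_val, max_val] onto 8-bit integers using an
       affine scale and zero point, clamping to the representable range. */
    void quantize_f32_to_u8(const float *in, uint8_t *out, int n,
                            float min_val, float max_val)
    {
        float scale = (max_val - min_val) / 255.0f;
        float zero_point = roundf(-min_val / scale);
        for (int i = 0; i < n; i++) {
            float q = roundf(in[i] / scale) + zero_point;
            if (q < 0.0f)   q = 0.0f;     /* clamp to the uint8 range */
            if (q > 255.0f) q = 255.0f;
            out[i] = (uint8_t)q;
        }
    }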

37 Quantization [Figure: floating point -> Quantization -> integer] Converts 32-bit floating point to 8-bit integers to improve inference execution time and energy consumption Up to 16.8% of the inference energy and 16.1% of inference execution time The majority of quantization energy comes from data movement A simple data conversion operation that requires shift, addition, and multiplication operations Based on our analysis, we conclude that: both functions are good candidates for PIM execution, and it is feasible to implement them in PIM logic

38 38 Video Playback and Capture Compressed video VP9 Decoder Display Captured video VP9 Encoder Compressed video Majority of energy is spent on data movement Majority of data movement comes from simple functions in decoding and encoding pipelines

39 39 Outline Introduction Background Analysis Methodology Workload Analysis Evaluation Conclusion

40 Evaluation Methodology System Configuration (gem5 Simulator) SoC: 4 OoO cores, 8-wide issue, 64 kB L1 cache, 2 MB L2 cache PIM Core: 1 core per vault, 1-wide issue, 4-wide SIMD, 32 kB L1 cache 3D-Stacked Memory: 2 GB cube, 16 vaults per cube, internal bandwidth 256 GB/s, off-chip channel bandwidth 32 GB/s Baseline Memory: LPDDR3, 2 GB, FR-FCFS scheduler We study each target in isolation, emulate each separately, and run them in our simulator

41 Normalized Energy [Chart: normalized energy of CPU-Only, PIM-Core, and PIM-Acc for texture tiling, color blitting, compression, and decompression (Chrome browser); packing and quantization (TensorFlow Mobile); and sub-pixel interpolation, deblocking filter, and motion estimation (video playback and capture)] 77.7% and 82.6% of the energy reduction for texture tiling and packing comes from eliminating data movement PIM core and PIM accelerator reduce energy consumption on average by 49.1% and 55.4%, respectively

42 Normalized Runtime [Chart: normalized runtime of CPU-Only, PIM-Core, and PIM-Acc for texture tiling, color blitting, compression, and decompression (Chrome browser); sub-pixel interpolation, deblocking filter, and motion estimation (video playback and capture); and TensorFlow Mobile] Offloading these kernels to the PIM core and PIM accelerator improves performance on average by 44.6% and 54.2%, respectively

43 Conclusion Energy consumption is a major challenge in consumer devices We conduct an in-depth analysis of popular Google consumer workloads: 62.7% of the total system energy is spent on data movement, and most of the data movement often comes from simple functions that consist of simple operations We use PIM to reduce the data movement cost by designing lightweight logic (a PIM core or PIM accelerators) to implement these simple operations in DRAM This reduces total energy by 55.4% and execution time by 54.2% on average

44 Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu

45 Backup

46 Video Playback [Pipeline: compressed video -> VP9 Decoder -> Display] 63.5% of the system energy is spent on data movement 80.4% of the data movement energy comes from sub-pixel interpolation and the deblocking filter Sub-pixel interpolation: interpolates the value of pixels at non-integer locations Deblocking filter: a simple low-pass filter that attempts to remove discontinuities in pixels
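As a rough illustration of sub-pixel interpolation, the C sketch below estimates pixel values at horizontal half-pixel positions with a 2-tap rounded average; VP9's real interpolation filters are longer multi-tap filters, so this only shows the flavor of the arithmetic (adds and shifts applied across whole frames).

    #include <stdint.h>

    /* Rounded average of two neighboring pixels: the value "between" them. */
    static inline uint8_t half_pel(uint8_t a, uint8_t b)
    {
        return (uint8_t)((a + b + 1) >> 1);
    }

    /* Interpolate one row of width w at horizontal half-pixel offsets. */
    void interp_row_half_x(const uint8_t *row, uint8_t *out, int w)
    {
        for (int x = 0; x < w - 1; x++)
            out[x] = half_pel(row[x], row[x + 1]);
        out[w - 1] = row[w - 1];   /* no right neighbor: copy the edge pixel */
    }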

47 Video Playback [Pipeline: compressed video -> VP9 Decoder -> Display] 63.5% of the system energy is spent on data movement 80.4% of the data movement energy comes from sub-pixel interpolation and the deblocking filter Sub-pixel interpolation: interpolates the value of pixels at non-integer locations Deblocking filter: a simple low-pass filter that attempts to remove discontinuities in pixels Based on our analysis, we conclude that: both functions can benefit from PIM execution and can be implemented in PIM logic

48 Video Capture [Pipeline: captured video -> VP9 Encoder -> compressed video] 59.1% of the system energy is spent on data movement The majority of the data movement energy comes from motion estimation, which accounts for 21.3% of total system energy Motion estimation: compresses the frames using the temporal redundancy between them
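To make the data-movement intensity of motion estimation concrete, here is a minimal C sketch of a full-search block match using the sum of absolute differences (SAD) over 16x16 blocks; VP9's actual search is hierarchical and more elaborate, and the caller is assumed to keep the search window inside the frame.

    #include <stdint.h>
    #include <stdlib.h>
    #include <limits.h>

    #define BLK 16  /* block size, in pixels */

    /* Sum of absolute differences between a current block and a reference
       block; stride is the frame width in pixels. */
    static int sad16(const uint8_t *cur, const uint8_t *ref, int stride)
    {
        int sum = 0;
        for (int y = 0; y < BLK; y++)
            for (int x = 0; x < BLK; x++)
                sum += abs(cur[y * stride + x] - ref[y * stride + x]);
        return sum;
    }

    /* Exhaustively search a +/-R window around block (bx, by) in the
       reference frame and return the best motion vector. */
    void full_search(const uint8_t *cur, const uint8_t *ref, int stride,
                     int bx, int by, int R, int *best_dx, int *best_dy)
    {
        int best = INT_MAX;
        *best_dx = 0;
        *best_dy = 0;
        for (int dy = -R; dy <= R; dy++)
            for (int dx = -R; dx <= R; dx++) {
                int cost = sad16(cur + by * stride + bx,
                                 ref + (by + dy) * stride + (bx + dx), stride);
                if (cost < best) {
                    best = cost;
                    *best_dx = dx;
                    *best_dy = dy;
                }
            }
    }

Every candidate position rereads an entire 16x16 reference block, which is why motion estimation streams so much frame data.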

49 Video Capture [Pipeline: captured video -> VP9 Encoder -> compressed video] 59.1% of the system energy is spent on data movement The majority of the data movement energy comes from motion estimation, which accounts for 21.3% of total system energy Motion estimation: compresses the frames using the temporal redundancy between them Motion estimation is a good candidate for PIM execution and is feasible to implement in PIM logic

50 TensorFlow Mobile Inference Prediction 57.3% of the inference energy is spent on data movement 54.4% of the data movement energy comes from packing/unpacking and quantization Packing: a simple data reorganization process that reorders elements of matrices Quantization: a simple data conversion operation that converts 32-bit floating point to 8-bit integers

51 TensorFlow Mobile Inference Prediction 57.3% of the inference energy is spent on data movement 54.4% of the data movement energy comes from packing/unpacking and quantization Packing: a simple data reorganization process that reorders elements of matrices Quantization: a simple data conversion operation that converts 32-bit floating point to 8-bit integers Based on our analysis, we conclude that: both functions are good candidates for PIM execution, and it is feasible to implement them in PIM logic

52 Data Movement Study To study data movement during tab switching, we did an experiment: a user opens 50 tabs (most-accessed websites), scrolls through each for a few seconds, and switches to the next tab We make two key observations: 1) compression and decompression contribute to 18.1% of the total system energy 2) 19.6 GB of data is swapped in and out of ZRAM

53 Other Details and Results in the Paper Detailed discussion of the other workloads: TensorFlow Mobile, Video Playback, Video Capture Detailed discussion of the VP9 hardware decoder and encoder System integration: software interface, coherence

54 Color Blitting Analysis Color blitting generates a large amount of data movement: it accounts for 19.1% of the total system energy used during page scrolling, and 63.9% of the energy consumed by color blitting is due to data movement The majority of data movement comes from the streaming access pattern and the large sizes of the bitmaps (e.g., 1024x1024 pixels) Color blitting requires low-cost computation operations: memset, simple arithmetic for alpha blending, and shift operations Based on our analysis, we conclude that: color blitting is a good candidate for PIM execution, and it is feasible to implement color blitting in PIM logic

55 Data Movement During Tab Switching [Charts: data swapped out to ZRAM (MB/s) and data swapped in from ZRAM (MB/s) over time, in seconds] Data is swapped in from ZRAM and decompressed at a rate of up to 227 MB/s, and compressed and swapped out to ZRAM at a rate of up to 201 MB/s When switching between tabs, compression and decompression contribute to 18.1% of the total system energy

56 Methodology: Implementing PIM Targets [Figure: the SoC connects to 3D-stacked DRAM; each vault's slice of the logic layer holds PIM logic next to the vault memory controller, implemented either as a PIM core with a cache or as one or more PIM accelerators] General-purpose PIM core: a customized low-power single-issue core with a 4-wide SIMD unit that performs a single thread of the PIM target Fixed-function PIM accelerator: a customized in-memory logic unit, within the limited per-vault area budget for the PIM logic

57 Cost Analysis: Texture Tiling PIM core: 4-wide SIMD Texture tiling kernel (fragment, as shown on the slide):
    Texture_Tiling(int* src, int* dst, ...) {
      for (y = k * columnwidth; y < maxwidth; y++) {
        int swizzle = swizzle1;
        mem_copy(dst + (y * swizzle), src + x, ...);
        for (x = x1; x < x2; x += tile) {
          mem_copy((x + y) ^ swizzle, src + x, ...);
          swizzle ^= swizzle_bit;
        }
      }
    }
Requires simple primitives: memcopy, bitwise operations, and simple arithmetic operations PIM core: 0.33 mm^2 (9.4% of the area available per vault) PIM accelerator: multiple in-memory tiling units (7.1% of the area available per vault)

58 Memory Consumption Potential solution: kill inactive background tabs, and when the user accesses those tabs again, reload the tab pages from disk Downsides: 1) high latency 2) loss of data Chrome instead uses compression to reduce each tab's memory footprint: it compresses inactive pages and places them into a DRAM-based memory pool (ZRAM) When the user switches to a previously-inactive tab, Chrome loads the data from ZRAM and decompresses it

59 Methodology: Identifying PIM Targets We pick a function as a PIM target candidate if it: consumes the most energy out of all the functions, has a significant data movement cost (MPKI > 10), and is bounded by data movement, not computation We drop any candidate if it: incurs any performance loss when run on PIM logic, or requires more area than is available in the logic layer of 3D-stacked memory
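A minimal C sketch of this selection filter follows; the struct fields, the way "consumes the most energy" is encoded, and the area-budget parameter are illustrative assumptions, while the MPKI > 10 threshold comes from the slide above.

    #include <stdbool.h>

    /* Profiling summary for one candidate function, assumed to come from
       the performance-counter characterization. */
    struct candidate {
        double energy_share;        /* fraction of the workload's energy     */
        double mpki;                /* LLC misses per kilo-instruction       */
        bool   data_movement_bound; /* bounded by data movement, not compute */
        double pim_speedup;         /* runtime(CPU) / runtime(PIM logic)     */
        double area_mm2;            /* area of the in-memory implementation  */
    };

    bool is_pim_target(const struct candidate *c,
                       double max_energy_share, double area_budget_mm2)
    {
        if (c->energy_share < max_energy_share) return false; /* not the hottest function  */
        if (c->mpki <= 10.0)                    return false; /* little data movement      */
        if (!c->data_movement_bound)            return false; /* compute bound             */
        if (c->pim_speedup < 1.0)               return false; /* would lose performance    */
        if (c->area_mm2 > area_budget_mm2)      return false; /* exceeds logic-layer budget */
        return true;
    }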
