Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms
1 Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms
Dhinakaran Pandiyan and Carole-Jean Wu, Arizona State University
October 28, 2014, IEEE International Symposium on Workload Characterization
2 Smartphones
- 2 GB RAM, 8 cores, full HD display, hardware accelerators
- Space and weight constraints limit battery capacity
- Need to reduce power consumption wherever possible
- Can consume as much as 3 W during regular use
3 Data movement is a power hog!
[Figure: power consumption (W) over time for LOAD without data dependency, LOAD with data dependency, ADD single issue, and ADD dual issue]
- LOADs consume more power than ADDs
4 Memory hierarchy
- SoC components share data through the DRAM
[Diagram: four Cortex-A9 cores, each with I/D TLB, L1 I-cache, and L1 D-cache; cache snoop controller; AXI interface; unified L2 cache; LPDDR memory; GPU, DSP, ISP, ...]
5 Problem statement
- How do we measure the data movement energy on a commercial smartphone?
- What is the impact of data movement energy on the total energy consumption of smartphone workloads?
6 Presentation Outline
- Introduction
- Methodology
- Experimental setup
- Energy measurement for workloads
- Results
- Conclusion
7 Data movement characterization
- Identify the types of data movement
- Find the associated energy costs
- Characterize representative workloads
[Diagram: Register, L1 cache, L2 cache, Memory, with energies E_reg, E_L1, E_L2, E_mem and increments ΔEnergy_L1, ΔEnergy_L2, ΔEnergy_mem]
8-10 Micro-benchmarks
1. E_DRAMtoReg
2. E_L2toReg
3. E_L1toReg
4. E_L1toReg,no-dep
5. E_Add
6. E_NOP
[Diagram: REG, L1 $, L2 $, DRAM]
11 Data movement micro-benchmark design
Benchmark template:
    Initialize()
    for (i = 1 .. iterations/x) do
        <operation>    // repeat x times
    end for
    Cleanup()
Data movement:
- Initialize(): allocate memory, initialize array, start timer
- Operation: ptr = *(void **)ptr;
- Cleanup(): free memory, end timer
Pointer chain example: x = &a, a = &b, b = &c, c = &d
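The pointer-chasing operation on this slide can be sketched in C as follows. This is a minimal illustration, not the authors' benchmark: the function names `build_chain` and `chase` are hypothetical, and the shuffled visiting order is an assumption added here to keep consecutive loads from being sequential. The timer calls and the choice of working-set size (which decides whether loads hit in L1, L2, or DRAM) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Build a circular pointer chain over n slots: each slot stores the
 * address of the next slot to visit. The visiting order is shuffled
 * (Fisher-Yates, fixed seed) so that consecutive loads are not sequential.
 * Returns NULL if allocation fails; the caller frees the result. */
void **build_chain(size_t n, unsigned seed)
{
    assert(n > 0);
    void **slots = malloc(n * sizeof *slots);
    size_t *order = malloc(n * sizeof *order);
    if (slots == NULL || order == NULL) {
        free(slots);
        free(order);
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    /* Link slot order[k] to slot order[k + 1], wrapping around at the end. */
    for (size_t k = 0; k < n; k++)
        slots[order[k]] = &slots[order[(k + 1) % n]];
    free(order);
    return slots;
}

/* Follow the chain for `steps` dependent loads and return the final pointer.
 * Returning the pointer gives the loop an observable result. */
void *chase(void *start, size_t steps)
{
    void *ptr = start;
    for (size_t i = 0; i < steps; i++)
        ptr = *(void **)ptr;    /* the operation from the slide */
    return ptr;
}
```

Because every load address comes from the previous load, the loads cannot overlap, which is the data dependency the slides rely on. A real benchmark would probably also space the slots at least one cache line apart; that detail is not shown here.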
12 Programming concerns
- Prefetching
- Data locality
- Compiler optimization
- Loop unrolling
- Task priority
- Task migration
- CPU frequency governor
13 Stall cycle energy
- Pointer chasing causes data dependencies
- The processor stalls while waiting for data from memory
- E.g., 3 stall cycles for each load in $E_{\text{L1toReg}}$
- $E_{\text{stall}} = (E_{\text{L1toReg}} - E_{\text{L1toReg,no-dep}}) / N_{\text{stalls}}$
- Eliminating stall cycle energy: $E_{\text{L2toReg}} = (E_{\text{L2toReg-total}} - N_{\text{stalls}} \times E_{\text{stall}}) / N_{\text{mem-accesses}}$
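The two stall-energy formulas on this slide amount to a subtraction and a division each. The sketch below writes them as C functions; the function and parameter names are hypothetical, and it assumes that all energies share one unit (for example nJ).

```c
#include <assert.h>

/* Energy of one stall cycle: the gap between a dependent L1 load and an
 * independent L1 load, divided by the number of stall cycles incurred. */
double stall_energy(double e_l1_dep, double e_l1_nodep, double n_stalls)
{
    assert(n_stalls > 0.0);
    return (e_l1_dep - e_l1_nodep) / n_stalls;
}

/* Per-access energy with the stall contribution removed:
 * (total energy - stall cycles * energy per stall) / memory accesses. */
double per_access_energy(double e_total, double n_stalls,
                         double e_stall, double n_accesses)
{
    assert(n_accesses > 0.0);
    return (e_total - n_stalls * e_stall) / n_accesses;
}
```

With made-up inputs, `stall_energy(1.0, 0.25, 3.0)` gives 0.25, and `per_access_energy(100.0, 40.0, 0.25, 10.0)` gives 9.0. These numbers are illustrative only and are not measurements from the talk.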
14 Experimental platform
- Samsung Galaxy S3 I9300 smartphone
- Exynos SMDK 4412 Quad
- 4 ARM Cortex-A9 cores
- L1 I-cache: 32 KB
- L1 D-cache: 32 KB
- L2 shared cache: 1 MB
- DRAM: 1 GB
- Android 4.3 (Jelly Bean)
- Kernel configured to access performance counters
15 Experimental setup
- A DAQ is used to measure the current drawn by the smartphone
- NI SignalExpress records voltage, current, and power
16 Energy measurements
- $P_{\text{benchmark}} = P_{\text{daq-reading}} - P_{\text{idle}}$
- $E_{\text{benchmark}} = \int P_{\text{benchmark}} \, dt$
[Figure: power trace with labels Benchmark, Idle, ΔBenchmark, ΔIdle]
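One way to turn a recorded power trace into benchmark energy is to subtract the idle power from each sample and integrate numerically. The sketch below assumes evenly spaced DAQ samples, a single idle power value, and the trapezoidal rule; none of these choices is stated in the slides, and the function name `benchmark_energy_j` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Integrate (P_daq - P_idle) over time with the trapezoidal rule.
 * power_w: n power samples in watts, taken every dt_s seconds.
 * idle_w:  idle power in watts, subtracted from every sample.
 * Returns the benchmark energy in joules. */
double benchmark_energy_j(const double *power_w, size_t n,
                          double dt_s, double idle_w)
{
    assert(power_w != NULL && n >= 2 && dt_s > 0.0);
    double energy = 0.0;
    for (size_t i = 1; i < n; i++) {
        double p_prev = power_w[i - 1] - idle_w;
        double p_curr = power_w[i] - idle_w;
        energy += 0.5 * (p_prev + p_curr) * dt_s;
    }
    return energy;
}
```

For the illustrative samples 1.5, 2.5, 2.5, 1.5 W at 0.5 s spacing and 0.5 W idle power, the function returns 2.5 J.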
17 Energy cost of data movement
[Table: energy cost (nJ) per operation for NOP, ADD, LOAD L1 -> Reg, LOAD L2 -> Reg, LOAD RAM -> Reg, and a stall cycle]
- Moving data could cost 115 times as much as an ADD operation
18 Micro-benchmark validation
- Compare the measured energy with the value estimated from the data movement energy costs
- $E_{\text{est}} = N_{\text{L1}} E_{\text{L1toReg}} + N_{\text{L2}} E_{\text{L2toL1}} + N_{\text{DRAM}} E_{\text{DRAMtoL2}} + N_{\text{ADD}} E_{\text{ADD}} + N_{\text{NOP}} E_{\text{NOP}}$
[Figure: error rate (%) for ADD + L1, NOP + L1, NOP + L1 + ADD, ADD + L2, and ADD + RAM]
- Average error rate is 3.4%; maximum error rate is 8.6%
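The validation estimate is a weighted sum of event counts and per-event energy costs. The sketch below computes it and compares it against a measurement. The struct and function names are hypothetical, and the error rate is assumed here to be the absolute difference relative to the measured energy, since the slides do not define it.

```c
#include <assert.h>

/* Event counts for one validation run (hypothetical layout). */
struct counts { double n_l1, n_l2, n_dram, n_add, n_nop; };

/* Per-event energy costs, all in the same unit (for example nJ). */
struct costs { double e_l1_to_reg, e_l2_to_l1, e_dram_to_l2, e_add, e_nop; };

/* Estimated energy: sum over event types of count * per-event cost. */
double estimate_energy(const struct counts *n, const struct costs *e)
{
    return n->n_l1 * e->e_l1_to_reg
         + n->n_l2 * e->e_l2_to_l1
         + n->n_dram * e->e_dram_to_l2
         + n->n_add * e->e_add
         + n->n_nop * e->e_nop;
}

/* Assumed error metric: |estimated - measured| / measured, in percent. */
double error_rate_percent(double measured, double estimated)
{
    assert(measured > 0.0);
    double diff = estimated - measured;
    if (diff < 0.0)
        diff = -diff;
    return 100.0 * diff / measured;
}
```

With illustrative counts {10, 4, 2, 8, 16} and costs {0.5, 2.0, 8.0, 0.25, 0.125}, the estimate is 33.0. Against a measured value of 40.0, that gives an error rate of 17.5%. These inputs are invented for the example and are not data from the talk.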
19 Smartphone workloads
- Educational Web Browsing (EWB)*: models reading documents along with browsing
- Video playback*: HD video
- Photo Viewing*: high-resolution images
- General Web Browsing (GWB)*: offline, automated benchmark
- Realistic General Web Browsing (R-GWB)*: represents realistic browsing behavior (scroll-up, horizontal scroll, and random delays)
- Frozen Bubble: interactive game
* Pandiyan et al., MobileBench, IISWC 2013; Gutierrez et al., BBench, IISWC '11; Huang et al., Moby, ISPASS '14
20 Profiling workloads
- Hardware performance counters
- CPU: instruction count, L1 misses, L1 accesses, stalls, etc.
- L2 cache controller: L2 misses, L2 accesses, prefetches
- ARM DS-5 Streamline
21 Total energy breakdown
[Figure: device energy breakdown (%) into Data Movement, Stalls, and Others]
- 35.6% of the energy is spent in data movement on average
- 23.5% of the energy is spent during stall cycles
22 Data movement energy breakdown
[Figure: one chart each for GWB, VideoPlayback, and PhotoView, with slices L1 -> Reg, L2 -> L1 Data, L2 -> L1 Instruction, Mem -> L2, and Prefetches. Slice labels as extracted: 39%, 27%, 47%, 30%, 33%, 24%, 3%, 18%, 10%, 3%, 30%, 7%, 21%, 6%, 2%]
23 Data movement energy breakdown
[Figure: one chart each for EWB-Blackboard, RWB, and Frozen Bubble, with slices L1 -> Reg, L2 -> L1 Data, L2 -> L1 Instruction, Mem -> L2, and Prefetches. Slice labels as extracted: 28%, 44%, 31%, 39%, 29%, 37%, 15%, 10%, 3%, 16%, 11%, 3%, 24%, 8%, 2%]
24 Problem statement
- How do we measure the data movement energy on a commercial smartphone?
- What is the impact of data movement energy on the total energy consumption of smartphone workloads?
25 Conclusion
- This is the first work to propose a detailed methodology for quantifying the instruction-level data movement energy cost in modern smartphones
- The detailed data movement energy characterization shows that, for interactive mobile workloads:
  - 35.6% of total energy consumption is spent in data movement
  - 23.5% of total energy consumption is spent during stall cycles
26 Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms
Dhinakaran Pandiyan and Carole-Jean Wu
Thank you!
This work was partially supported by the NSF I/UCRC Center for Embedded Systems (NSF grant # ) and by Science Foundation of Arizona under the Bisgrove Early Career Scholarship.
More information