Contour Detection on Mobile Platforms
|
|
- Bernice Conley
- 6 years ago
- Views:
Transcription
1 Contour Detection on Mobile Platforms Bor-Yiing Su, Prof. Kurt Keutzer, Parallel Computing Lab, University of California, Berkeley 1/26
2 Diagnosing Power/Performance Correctness Category of This Work Personal Health Image Hearing, Speech Parallel Retrieval Music Browser Design Patterns/Motifs Composition & Coordination Language (C&CL) C&CL Compiler/Interpreter Static Verification Parallel Libraries Parallel Frameworks Type Systems Efficiency Sketching Languages Autotuners Legacy Communication & Schedulers Code Synch. Primitives Efficiency Language Compilers OS Libraries & Services Legacy OS Hypervisor Multicore/GPGPU 2 RAMP Manycore Directed Testing Dynamic Checking Debugging with Replay 2/26
3 Outline Motivation Revisiting the Image Contour Detector: Damascene Parallelizing Damascene on Multiple Platforms Conclusion 3/26
4 What will future visual applications look like? Interact with images Computer needs to understand images to provide image interaction services Summarize information from the images Location aware services Optical character recognition Interact with image parts through touchpad interfaces Augmented reality 4/26
5 CBIR on a Mobile Platform Goal: meeting end user s requirements Improving detection rate (accuracy), response time (speed), and battery life (energy) Our work: using Damascene as a case study to understand the capability of modern mobile platforms in terms of: Execution time Power consumption Choice of computation platform 5/26
6 Outline Motivation Revisiting the Image Contour Detector: Damascene Parallelizing Damascene on Multiple Platforms Conclusion 6/26
7 Image Contour Detection State-of-the-art algorithm which achieves the highest accuracy: gpb M. Maire, P. Arbeláez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. CVPR, pp. 1 8, June Image Human Generated Contours Machine Generated Contours 7/26
8 Computation Flow of Damascene Contours Combine Non-max Suppression Intervening Contour Combine, Normalize 8/26
9 Outline Motivation Revisiting the Image Contour Detector: Damascene Parallelizing Damascene on Multiple Platforms Targeting Platforms Performance on Desktop Platforms On Nvidia GTX480 GPU On Intel Nehalem Core i7 920 Quad-Core CPU Performance on Mobile Platforms On ARM v7 Quad-Core Processor On MSM8655 Single-Core Snapdragon with NEON Support Execution Time Analysis Power Analysis Conclusion 9/26
10 Desktop Platforms Fermi GTX 480 Nelalem Core i7 920 Specifications GTX 480 CUDA Cores 480 Processor Clock SP FLOPS Memory Bandwidth Memory Size L1 Cache + Shared Memory L2 Cache 1.4 GHz 1.35 T GB/s 1.5 GB 64 kb per SM 768 kb Specifications Core i7 920 Cores 4 Processor Clock 2.66 GHz SP FLOPS G Memory Bandwidth 25.6 GB/s L1 Cache 64 kb per core L2 Cache 1MB per core L3 Cache 8 MB 10/26
11 Mobile Platforms ARM Cortex A9 Snapdragon MSM8655 Specifications Cortex A9 Cores 4 Processor Clock 800 MHz SIMD Width (NEON) 128 bits DMIPS 8000 L1 Cache 32 kb L2 Cache 512 kb Memory Bandwidth N.A. Specifications MSM 8655 Cores 1 Processor Clock 1 GHz SIMD Width (NEON) 128 bits DMIPS 2100 L1 Cache 32 kb L2 Cache 256 kb Memory Bandwidth N.A. 11/26
12 Damascene on GTX 480 The starting implementation Image size used for benchmarking: 321 x 481 pixels Routine Execution Time (ms) Preprocess Textons 129 Localcues 733 Intervening Lanczos Solver 800 Poseprocess 7.96 Total 1706 Bryan Catanzaro, Bor-Yiing Su, Narayanan Sundaram, Yunsup Lee, Mark Murphy, and Kurt Keutzer, "Efficient, High-Quality Image Contour Detection," in International Conference on Computer Vision (ICCV-2009), pp , Kyoto Japan, September /26
13 Porting to the CPU Platform Porting the original CUDA implementation to serial C implementation Using our revised algorithms from the CUDA implementation Using pthread to parallelize routines that execute more than 1 second on a 321 x 481 image Exploring coarse grained parallelism on CPU instead of fine grained parallelism on GPU Setting compiler flags to vectorize the code, but not using SSE intrinsics manually Resulting in 10%-20% performance gain Image size used for benchmarking The same as the image used in GPU benchmarking 321 x 481 pixels 13/26
14 Performance on Core i7 920 Not all kernels scale linearly on the number of threads Some computations are bounded by memory bandwidth Compiler vectorization works better on single thread Not all routines in Damascene are parallelized Routines Number of Used Threads Routines Number of Threads texton localcues intervene Lanczos Total texton localcues intervene Lanczos Total Execution Time (s) Scalability on Number of Threads 14/26
15 Porting to the ARM Cortex A9 Quad-Core Processor Code base The same as the CPU pthread code Compiler flags None: not using any compiler flags fpu: -march=armv7-a -mfloat-abi=softfp -mfpu=neon All: -O3 -ftree-vectorize Neonize: using the NEON intrinsics manually 15/26
16 Performance on the ARM Cortex A9 Quad-Core Processor Image size used for benchmarking : 205 x 154 pixels Due to memory limitations Scalability is very close to linear on the number of threads Computation tends to be compute bounded on a slower processor Number of Used Threads Number of Threads Routines Routines texton texton localcues localcues intervene intervene Lanczos Lanczos Total Total Execution Time (s) Scalability on Number of Threads 16/26
17 Porting to the Single Core Snapdragon Processor Development environment: Android NDK Code base The same as the serial CPU implementation: Not using pthread Compiler flags None: not using any compiler flags fpu: -march=armv7-a -mfloat-abi=softfp -mfpu=neon All: -O3 -ftree-vectorize Neonize: using the NEON intrinsics manually 17/26
18 Benchmarking the Lanczos Eigensolver The Lanczos iterative solver is composed of spmv, saxpy, sdot and snrm2 kernels All these kernels are memory bounded Manually NEONize the kernels result in a performance gain of 7%-3x Sparse Matrix Pattern 18/26
19 Performance on the Snapdragon Processor Image size used for benchmarking : 205 x 154 pixels Due to memory limitations Setting all compiler flags delivers a 6x speedup on the contour detection computation It is essential to get the compiler flags right on floating point computations Routine Flag_None Flag_All Ratio texton localcues intervene Lanczos Total Execution Time (s) Compiler Flags None vs. All 19/26
20 Execution Time Analysis Image size used for benchmarking : 205 x 154 pixels Implementations Snapdragon: single core, 4-way SIMD, compiler vectorization Cortex A9: quad-core, 4-way SIMD, compiler vectorization Nehalem: quad-core, 4-way SIMD, compiler vectorization Fermi: 15 cores, 16-way SIMD, dual issue, manual vectorization Routin e Snapdr agon Cortex A9 Nehale m Fermi texton localcues interven e Lanczos Total Routin e Snapdr agon Cortex A9 Nehale m Fermi texton localcues interven e Lanczos Total Execution Time (s) Speedup Ratio 20/26
21 Execution Time Observations The performance gap between snapdragon and Core i7 920 is about 50x Frequency difference: 1G vs. 2.66G Memory bandwidth difference: 1G vs 25.6 G Bytes/s A quad-core snapdragon with perfect scalability will close the gap to 12x Platform choice of computation Transferring the entire image to the cloud and using the CUDA implementation will be faster than performing the computation locally on the mobile platform Routin e Snapdr agon Cortex A9 Nehale m Fermi texton localcues interven e Lanczos Total Routin e Snapdr agon Cortex A9 Nehale m Fermi texton localcues interven e Lanczos Total Execution Time (s) Speedup Ratio 21/26
22 Power Analysis Image size used for benchmarking : 205 x 154 pixels Snapdragon power numbers Using the Trepn profiler Sampling CPU current every second Nehalem and Fermi power numbers Using the Kill-a-watt meter Measuring the rough system power Routine Max Power (W) Min Power (W) Total Energy (Joule) Snapdrago n Nehalem (1 thread) Nehalem (2 threads) Nehalem (4 threads) Fermi Ratio /26
23 Power Observations Race to halt principle is well applied on the desktop platforms Mobile platforms are about 400x more power efficient than desktop platforms Although the execution time on the mobile platform is about 50x (compared to CPU) to 250x (compared to GPU) slower, it still consumes the least amount of energy As long as the image uploading and the results downloading energy is smaller than the computation energy, we should keep the computation on the cloud Routine Max Power (W) Min Power (W) Total Energy (Joule) Snapdrago n Nehalem (1 thread) Nehalem (2 threads) Nehalem (4 threads) Fermi Ratio /26
24 Outline Motivation Revisiting the Image Contour Detector: Damascene Parallelizing Damascene on Multiple Platforms Conclusion 24/26
25 Conclusion Execution time Our hand optimized CUDA code still delivers the best performance Our CPU implementation achieves linear scalability on ARM Cortex A9, but not on Core i7 920 The computation tends to be more compute bounded on the slower ARM Cortex A9, but more memory bounded on the faster Core i7 920 The gap between single core snapdragon and quad core Core i7 920 is about 50x Power Race to halt principle is well applied on the desktop platforms Snapdragon is about 400x more power efficient than desktop platforms Although Snapdragon is about 50x 250x slower than the desktop platforms, it still consumes less energy Platform choice of computation Damascene is very computationally intensive and has lots of data parallelism Uploading the entire image to the cloud will be better than performing the computations locally on the mobile platform in terms of both total execution time and energy consumption With the advancement of mobile technologies, this decision might be changed in the future 25/26
26 Future Work Include network bandwidth scenarios Data transmission latency model Data transmission power consumption model Include input pre-processing analysis Image down sampling Image compression rate Port the entire object recognition system onto mobile platforms Client-cloud co-design Revisit the object recognition system from the perspective of clientcloud co-design Try out other object recognition systems that are less computationally intensive on the mobile platforms 26/26
27 Thank You 27/26
Patterns and Computer Vision
Patterns and Computer Vision Michael Anderson, Bryan Catanzaro, Jike Chong, Katya Gonina, Kurt Keutzer, Tim Mattson, Mark Murphy, David Sheffield, Bor-Yiing Su, Narayanan Sundaram and the rest of the PALLAS
More informationIntegrating the Par Lab Stack Running Damascene on SEJITS/ROS/RAMP Gold
Integrating the Par Lab Stack Running on /ROS/RAMP Gold Kevin Klues, Yunsup Lee, Andrew Waterman Par Lab Winter Retreat 2010 Overall Goal of the Par Lab: Create Productive, Efficient, Correct, Portable
More informationCode Generators for Stencil Auto-tuning
Code Generators for Stencil Auto-tuning Shoaib Kamil with Cy Chan, Sam Williams, Kaushik Datta, John Shalf, Katherine Yelick, Jim Demmel, Leonid Oliker Diagnosing Power/Performance Correctness Where this
More informationHardware Acceleration for Tagging
Parallel Hardware Parallel Applications IT industry (Silicon Valley) Parallel Software Users Hardware Acceleration for Tagging Sarah Bird, David McGrogan, John Kubiatowicz, Krste Asanovic June 5, 2008
More informationParallel Webpage Layout
Parallel Webpage Layout Leo Meyerovich, Chan Siu Man, Chan Siu On, Heidi Pan Krste Asanovic, Rastislav Bodik and many others from the UPCRC Berkeley project UC Berkeley Par Lab Research Overview Diagnosing
More informationMaking Performance Understandable: Towards a Standard for Performance Counters on Manycore Architectures
Parallel Hardware Parallel Applications IT industry (Silicon Valley) Parallel Software Users Making Performance Understandable: Towards a Standard for Performance Counters on Manycore Architectures Sarah
More informationLarge Displacement Optical Flow & Applications
Large Displacement Optical Flow & Applications Narayanan Sundaram, Kurt Keutzer (Parlab) In collaboration with Thomas Brox (University of Freiburg) Michael Tao (University of California Berkeley) Parlab
More informationSEJITS: High Performance for Productivity Applications
SEJITS: High Performance for Productivity Applications Shoaib Kamil & many others Parlab End of Project Celebration May 30, 2013 Correctness Where Did We Start? Personal Health Image Retrieval Hearing,
More informationActive Testing for Concurrent Programs
Active Testing for Concurrent Programs Pallavi Joshi, Mayur Naik, Chang-Seo Park, Koushik Sen 12/30/2008 ROPAS Seminar ParLab, UC Berkeley Intel Research Overview ParLab The Parallel Computing Laboratory
More informationCode Generators for Stencil Auto-tuning
Code Generators for Stencil Auto-tuning Shoaib Kamil with Cy Chan, John Shalf, Sam Williams, Kaushik Datta, Katherine Yelick, Jim Demmel, Leonid Oliker Diagnosing Power/Performance Correctness Where this
More informationActive Testing for Concurrent Programs
Active Testing for Concurrent Programs Pallavi Joshi Mayur Naik Chang-Seo Park Koushik Sen 1/8/2009 ParLab Retreat ParLab, UC Berkeley Intel Research Overview Checking correctness of concurrent programs
More informationThird party names are the property of their owners.
Disclaimer READ THIS its very important The views expressed in this talk are those of the speaker and not his employer. This is an academic style talk and does not address details of any particular Intel
More informationPresenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationThe Content-Based Image Retrieval Project
Chapter 3 The Content-Based Image Retrieval Project Bryan Catanzaro and Kurt Keutzer 1 Introduction The Content Based Image Retrieval Project was one of Par Lab s five motivating applications. In this
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationTR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut
TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationHYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE
HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationTurbostream: A CFD solver for manycore
Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationCUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationAn Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection
An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationOptimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and
Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging
More informationEnd of Parlab talk. May 29, 2013 Gans Srinivasa Intel Corp
End of Parlab talk May 29, 2013 Gans Srinivasa Intel Corp 1 A Journey: Year 2005; 2008; 2011; Potential Parallel Speedup 1000 100 MIMD*SIMD (32 b) MIMD*SIMD (64 b) SIMD (32b) SIMD (64b) MIMD 10 To Multicore
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationIMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM
IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information
More informationParallel Application Library for Object Recognition
Parallel Application Library for Object Recognition Bor-Yiing Su Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2012-199 http://www.eecs.berkeley.edu/pubs/techrpts/2012/eecs-2012-199.html
More informationA Retrospective on Par Lab Architecture Research
A Retrospective on Par Lab Architecture Research Par Lab SIGARCH Krste Asanovic, Rimas Avizienis, Jonathan Bachrach, Scott Beamer, Sarah Bird, Alex Bishara, Chris Celio, Henry Cook, Ben Keller, Yunsup
More informationComparison of High-Speed Ray Casting on GPU
Comparison of High-Speed Ray Casting on GPU using CUDA and OpenGL November 8, 2008 NVIDIA 1,2, Andreas Weinlich 1, Holger Scherl 2, Markus Kowarschik 2 and Joachim Hornegger 1 1 Chair of Pattern Recognition
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationEhsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas
Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas 2 Increasing number of transistors on chip Power and energy limited Single- thread performance limited => parallelism Many opeons: heavy mulecore,
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationReal-time covariance tracking algorithm for embedded systems
Real-time covariance tracking algorithm for embedded systems A. Roméro, L. Lacassagne, A. Zahraee, M. Gouiffès www.lri.fr/~lacas LRI, LIMSI & IEF - University Paris-Sud Covariance matching techniques are
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationAccelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University
Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationOptimisation Myths and Facts as Seen in Statistical Physics
Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationCUB. collective software primitives. Duane Merrill. NVIDIA Research
CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationhigh performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationARE WE OPTIMIZING HARDWARE FOR
ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS Juan M. Cebrián 1 1 Depart. of Computer and Information Science
More informationCoordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin
Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification
More informationLecture 1: Gentle Introduction to GPUs
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationSummary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts. Abdeslem DJAOUI PPD Rutherford Appleton Laboratory
Summary of CERN Workshop on Future Challenges in Tracking and Trigger Concepts Abdeslem DJAOUI PPD Rutherford Appleton Laboratory Background to Workshop Organised by CERN OpenLab and held in IT Auditorium,
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationUsing GPUs to compute the multilevel summation of electrostatic forces
Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of
More informationTEGRA K1 AND THE AUTOMOTIVE INDUSTRY. Gernot Ziegler, Timo Stich
TEGRA K1 AND THE AUTOMOTIVE INDUSTRY Gernot Ziegler, Timo Stich Previously: Tegra in Automotive Infotainment / Navigation Digital Instrument Cluster Passenger Entertainment TEGRA K1 with Kepler GPU GPU:
More informationPerformance of a multi-physics code on Cavium ThunderX2
Performance of a multi-physics code on Cavium ThunderX2 User Productivity Enhancement, Technology Transfer, and Training (PETTT) Presented by John G. Wohlbier (PETTT/Engility), Keith Obenschain, Gopal
More informationPortability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures
Photos placed in horizontal position with even amount of white space between photos and header Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Christopher Forster,
More informationPARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort
PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort rob@cs.vu.nl Schedule 2 1. Introduction, performance metrics & analysis 2. Many-core hardware 3. Cuda class 1: basics 4. Cuda class
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationPorting BLIS to new architectures Early experiences
1st BLIS Retreat. Austin (Texas) Early experiences Universidad Complutense de Madrid (Spain) September 5, 2013 BLIS design principles BLIS = Programmability + Performance + Portability Share experiences
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationLecture 5. Performance programming for stencil methods Vectorization Computing with GPUs
Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,
More informationPerformance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi Erik Saule 1, Kamer Kaya 1 and Ümit V. Çatalyürek 1,2 esaule@uncc.edu, {kamer,umit}@bmi.osu.edu 1 Department of Biomedical
More informationReal-Time Ray Tracing Using Nvidia Optix Holger Ludvigsen & Anne C. Elster 2010
1 Real-Time Ray Tracing Using Nvidia Optix Holger Ludvigsen & Anne C. Elster 2010 Presentation by Henrik H. Knutsen for TDT24, fall 2012 Om du ønsker, kan du sette inn navn, tittel på foredraget, o.l.
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationDebunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU The myth 10x-1000x speed up on GPU vs CPU Papers supporting the myth: Microsoft: N. K. Govindaraju, B. Lloyd, Y.
More informationNvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018
Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks
More informationThe Mont-Blanc approach towards Exascale
http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationMixed Precision Methods
Mixed Precision Methods Mixed precision, use the lowest precision required to achieve a given accuracy outcome " Improves runtime, reduce power consumption, lower data movement " Reformulate to find correction
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationScalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009
Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationPreliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths
Preliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths Y. Kodama, T. Odajima, M. Matsuda, M. Tsuji, J. Lee and M. Sato RIKEN AICS (Advanced Institute for Computational
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More information