Parallel Processing of Multimedia Data in a Heterogeneous Computing Environment
Heegon Kim, Sungju Lee, Yongwha Chung, Daihee Park, and Taewoong Jeon

Dept. of Computer and Information Science, Korea University, Sejong, Korea
{khg86,peacfeel,ychungy,dhpark,jeon}@korea.ac.kr

Corresponding author.
James J. (Jong Hyuk) Park et al. (eds.), Multimedia and Ubiquitous Engineering, Lecture Notes in Electrical Engineering 308, DOI: / _4, Springer-Verlag Berlin Heidelberg

Abstract. Many multimedia applications can now be parallelized on multicore platforms such as CPUs and GPUs. In this paper, we propose a parallel processing approach for a multimedia application that uses both the CPU and the GPU. Instead of distributing the parallelizable workload to either the CPU or the GPU (i.e., homogeneous computing), we distribute the workload to both the CPU and the GPU simultaneously (i.e., heterogeneous computing) by using OpenCL. Based on experimental results with a photomosaic application, we confirm that the proposed approach provides better performance than the typical parallel processing approach by making maximal use of the given resources.

Keywords: CPU, GPU, Heterogeneous Computing, OpenCL.

1 Introduction

As multicore processors are used in handheld devices as well as PCs and servers, parallel processing approaches have been developed for many applications [1-2]. For example, many approaches have been reported for parallelizing multimedia applications [3-4]. Furthermore, as handheld devices such as smartphones become more powerful, many users create their own content on these devices.

In this paper, we focus on parallelizing multimedia applications by using both the CPU and the GPU. These applications have sufficient parallelism, and many parallel processing results have been reported [5-7] using general-purpose programming on the GPU, such as Nvidia's CUDA [8], in addition to Pthread [9] on the CPU. Recently, OpenCL [10] has been defined as a standard for heterogeneous parallel computing. It provides a cross-platform framework for writing software that can run on different kinds of devices, from multicore CPUs to GPUs. That is, a parallel program written with OpenCL can be executed on either the CPU or the GPU [11].

Generally, a GPU provides better performance than a CPU for multimedia applications. However, a current multicore CPU is also a powerful processor and, when used together with the GPU, can reduce the total execution time. We propose a load balancing approach that overcomes the performance limit of either CPU-only or GPU-only execution. We first parallelize a given multimedia application with OpenCL and measure its execution time on the CPU and on the GPU. Then, we partition the parallelized workload into two parts, based on the relative performance of the GPU over the CPU. Finally, we assign the GPU portion of the workload to the GPU by using a non-blocking command, and then assign the remaining parallel portion to the CPU without waiting for a result from the GPU. By reducing the idle time on both the CPU and the GPU, we overlap the GPU execution with the CPU execution as much as possible.

The rest of the paper is structured as follows. Section 2 explains OpenCL [10] and the multimedia application Photomosaic [12]. Section 3 describes our proposed load balancing approach. The experimental results are given in Section 4, and conclusions are provided in Section 5.

2 Background

2.1 OpenCL

OpenCL [10] is an open standard that provides a programming environment for heterogeneous architectures. In particular, OpenCL (shown in Fig. 1) allows computational workloads to be executed on various multicore processors. As such processors become more widely available, OpenCL plays a crucial role in enabling portable applications to access a wide range of computational resources. To this end, the OpenCL model introduces several levels of abstraction.

Platform abstracts the number and type of computing devices in a hardware platform. At this level, developers are given routines to query and manage the computing devices and to create the contexts and work queues used to submit sets of instructions called kernels.

Execution is based on the concept of a kernel, a collection of instructions executed on a computing device (a multicore CPU or a GPU) called an OpenCL device. An OpenCL application is divided into two programs: host and kernel. The host program is executed on the CPU. It defines the context for the kernels and manages their execution. In particular, when the host submits a kernel for execution, an index space is defined, and an instance of the kernel executes for each point in this index space. This kernel instance is called a work-item and is identified by its point in the index space, which provides a global ID for the work-item. Each work-item executes the same code on different data. Work-items are further organized into work-groups, which provide a more coarse-grained decomposition of the index space.

Language describes the syntax and programming interface for writing kernels (sets of instructions that execute on computing devices such as multicore CPUs or GPUs).
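The relationship between the index space, work-items, and work-groups described above can be illustrated with a small sketch. It is written in plain Python rather than OpenCL, it walks a one-dimensional index space sequentially (a real device runs work-items in parallel), and the names `enumerate_work_items`, `kernel`, and `run` are hypothetical helpers introduced only for this illustration. It also assumes the global size is a multiple of the work-group size and that the index space starts at zero.

```python
from collections import defaultdict


def enumerate_work_items(global_size, local_size):
    """Yield (global_id, group_id, local_id) for a 1-D index space.

    Assumption for this sketch: global_size is a multiple of local_size
    and the index space has no offset.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    for global_id in range(global_size):
        group_id = global_id // local_size   # which work-group
        local_id = global_id % local_size    # position inside the group
        yield global_id, group_id, local_id


def kernel(global_id, data, out):
    # Every work-item runs the same code; only the element it touches,
    # selected by its global ID, differs.
    out[global_id] = data[global_id] * 2


def run(data, local_size):
    """Execute the kernel once per point of the index space (sequentially)."""
    out = [0] * len(data)
    groups = defaultdict(list)
    for global_id, group_id, _ in enumerate_work_items(len(data), local_size):
        kernel(global_id, data, out)
        groups[group_id].append(global_id)
    return out, dict(groups)


# Eight work-items split into two work-groups of four.
out, groups = run([1, 2, 3, 4, 5, 6, 7, 8], local_size=4)
```

Here `groups` records which global IDs fall into each work-group, which is the coarse-grained decomposition the text refers to.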
Fig. 1. Platform model of OpenCL

2.2 Photomosaic

Photomosaic [12] is a compound word of "photograph" and "mosaic". A photomosaic divides a large image into several small parts, each of which is replaced with a small tile image of similar color.

Fig. 2. Result of the photomosaic

Fig. 2 shows the similarity between the original image and the photomosaic result, which is composed of many smaller tile images. In this paper, the photomosaic iterates the image-conversion loop 5 times per pixel.

3 Parallel Photomosaic

Each CPU core provides better performance than a GPU core, whereas a CPU has fewer cores than a GPU. A GPU, which has hundreds of cores, is more advantageous when the computation consists of many iterations of the same operation. Many studies of GPU-equipped environments that use only the GPU for parallel processing have been published [13-14].

The photomosaic has no data dependency among the tile images. Therefore, a parallel photomosaic in OpenCL is processed by assigning the tile images to compute units. In a typical OpenCL program, the host program waits while the kernel function executes, because the host and the device are synchronized in blocking mode (see Fig. 3).

Fig. 3. Typical parallel processing of photomosaic using GPU

For heterogeneous computing environments, we propose an approach that improves performance by using not only the GPU but also the CPU, thereby reducing the CPU idle time (i.e., waiting time). OpenCL allows asynchronous processing in non-blocking mode, and we use this mode to reduce the CPU idle time. In non-blocking mode, both CPU and GPU resources can be used simultaneously, as shown in Fig. 4. Since the idle time is reduced, the proposed approach can achieve a higher speedup than typical parallel processing.

Fig. 4. Proposed parallel processing of photomosaic using both GPU and CPU

4 Experimental Results

To evaluate the proposed approach, we used an AMD Phenom II X4 955 Processor, a GeForce GTX 285, and the target image with resolution . The number of tile images is . The AMD Phenom II X4 955 Processor has four cores, and the GeForce GTX 285 has 240 cores. However, a GPU core provides lower performance than a CPU core. Also, many typical parallel processing studies with GPUs have focused on the GPU only.

First, the execution time of the photomosaic was measured to evaluate the speedup of the parallel OpenCL implementation. The photomosaic was measured in three ways: sequential, parallel using GPU-only by OpenCL, and parallel using multicore CPU-only by OpenCL. Table 1 shows the sequential and parallel execution times of the photomosaic application. Multicore CPU-only was measured using the multicore CPU, and GPU-only was measured using the GPU. Multicore CPU(x%) + GPU(y%) was measured using both the multicore CPU and the GPU, with the multicore CPU processing x% of the workload and the GPU processing y%. The results show that CPU-only execution by OpenCL provides a superlinear speedup (i.e., a 4-core CPU achieves a speedup of 17). The reason is that the cache-hit ratio improved greatly as the number of cores increased.
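The CPU(x%) + GPU(y%) configurations described above can be sketched as follows. This is only a control-flow illustration under stated assumptions, not the authors' OpenCL host code: a single-worker thread pool stands in for the GPU command queue, `convert_tile` is a hypothetical stand-in for the per-tile image conversion, and `suggest_gpu_percent` uses an idealized model (share proportional to measured speed, ignoring transfer and launch overhead) that the paper does not state explicitly. In real OpenCL host code the corresponding steps would typically be an enqueue call such as `clEnqueueNDRangeKernel` followed later by a wait such as `clFinish`.

```python
from concurrent.futures import ThreadPoolExecutor


def suggest_gpu_percent(cpu_time, gpu_time):
    """Idealized GPU share in percent from CPU-only and GPU-only times.

    Assumption: the faster device receives a share proportional to its
    speed, so both devices finish at about the same moment.
    """
    return round(100 * cpu_time / (cpu_time + gpu_time))


def split_tiles(num_tiles, gpu_percent):
    """Return (gpu_tiles, cpu_tiles) as index ranges over the tiles."""
    gpu_count = num_tiles * gpu_percent // 100
    return range(0, gpu_count), range(gpu_count, num_tiles)


def convert_tile(tile_index):
    # Hypothetical placeholder for the real tile conversion.
    return tile_index * tile_index


def process(tiles):
    return [convert_tile(t) for t in tiles]


def photomosaic(num_tiles, gpu_percent):
    gpu_tiles, cpu_tiles = split_tiles(num_tiles, gpu_percent)
    with ThreadPoolExecutor(max_workers=1) as gpu_queue:
        # Non-blocking submission: control returns to the host at once.
        gpu_future = gpu_queue.submit(process, gpu_tiles)
        # The host works on its own portion instead of idling.
        cpu_result = process(cpu_tiles)
        # Synchronize only when the GPU portion is actually needed.
        gpu_result = gpu_future.result()
    return gpu_result + cpu_result


# Hypothetical timings: a GPU three times faster than the CPU gives 75%.
gpu_percent = suggest_gpu_percent(cpu_time=3.0, gpu_time=1.0)
result = photomosaic(num_tiles=8, gpu_percent=gpu_percent)
```

Because Python threads share one interpreter lock, this sketch does not itself run faster than a sequential loop; it only shows where the host stops waiting. The timings passed to `suggest_gpu_percent` are made up for illustration and are not measurements from the paper.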
Table 1. Sequential and parallel execution times of the photomosaic

                                                         Execution time (sec)
Sequential processing
Parallel processing    Multicore CPU-only
                       GPU-only
                       Multicore CPU(50%) + GPU(50%)
                       Multicore CPU(25%) + GPU(75%)     8.40

Next, we measured the execution time of the photomosaic with the workload divided into two parts, one performed by the CPU and the other by the GPU. As Table 1 shows, the photomosaic divided into a 25% CPU portion and a 75% GPU portion provides better performance than the one using multicore CPU-only or GPU-only. These portions are constrained by the index space; therefore, depending on the GPU and CPU performance, some proportions are not possible (e.g., CPU(33%) + GPU(66%)). The proposed approach achieves a speedup of 40 and yields 25% better performance than the one using GPU-only by OpenCL. However, if the two-part division is chosen inappropriately, the proposed approach provides lower performance than GPU-only. Fig. 5 shows the speedups with OpenCL achieved by the four different ways of parallel processing.

Fig. 5. Speedup with OpenCL

5 Conclusions

We have proposed an efficient heterogeneous parallel processing approach that reduces CPU idle time. The approach, which uses both the CPU and the GPU through OpenCL, decreases the total execution time.
Experiments using both the CPU and the GPU for parallel processing have demonstrated that our approach can provide a speedup of 40 and, if the load is properly balanced between the CPU and the GPU, 25% better performance than the commonly used parallel approach using the GPU only.

Acknowledgement. This research was supported by Basic Science Research Program through the National Research Foundation of Korea (funded by the Ministry of Education, Science and Technology, 2012R1A1A ) and BK21 Plus Program.

References

1. Held, J., Bautista, J., Koehl, S.: From a Few Cores to Many: A Tera-Scale Computing Research Overview. Intel White Paper (2006)
2. Levy, M., Conte, T.: Embedded Multicore Processors and Systems. IEEE Micro 29, 7-9 (2009)
3. Sihn, K., Baik, H., Kim, J., Bae, S., Song, J.: Novel Approaches to Parallel H.264 Decoder on Symmetric Multicore Systems. In: Proc. of International Conference on Acoustics, Speech, and Signal Processing, pp (2009)
4. Chen, W., Hang, H.: H.264/AVC Motion Estimation Implementation on CUDA. In: Proc. of International Multimedia and Expo Conf., pp (2008)
5. Shams, R., Sadeghi, P., Kennedy, R., Hartley, R.: A Survey of Medical Image Registration on Multicore and the GPU. IEEE Signal Processing Magazine 27(2), (2010)
6. Bienia, C., Kumar, S., Singh, J., Li, K.: The PARSEC Benchmark Suite: Characterization and Architectural Implications. In: Proc. of International Conference on Parallel Architectures and Compilation Techniques, pp (2008)
7. Kim, H., Lee, S., Chung, Y., Pan, S.: Parallelizing H.264 and AES Collectively. KSII Tr. Internet & Info. Systems 7(9), (2013)
8. NVidia: NVidia CUDA Compute Unified Device Architecture Programming Guide. NVidia (2008)
9. Akhter, S., Roberts, J.: Multi-Core Programming - Increasing Performance through Software Multi-Threading. Intel Press, Hillsboro (2006)
10. Stone, J., Gohara, D., Shi, G.: OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science and Engineering 12(3), (2010)
11. Gaetano, R., Pesquet-Popescu, B.: OpenCL Implementation of Motion Estimation for Cloud Video Processing. In: Proc. of International Symposium on Multimedia Signal Processing, pp. 1-6 (2011)
12. Silvers, R., Hawley, M.: Photomosaics. Henry Holt, New York (1997)
13. Cao, J., Xie, X.-f., Liang, J., Li, D.-d.: GPU Accelerated Target Tracking Method. In: Jin, D., Lin, S. (eds.) Advances in MSEC Vol. 1. AISC, vol. 128, pp Springer, Heidelberg (2011)
14. Davendra, D., Zelinka, I.: GPU Based Enhanced Differential Evolution Algorithm: A Comparison between CUDA and OpenCL. Intelligent Systems Reference Library, vol. 38, pp (2013)
2014 IEEE International Conference on Cloud Computing Energy and Performance-Aware Task Scheduling in a Mobile Cloud Computing Environment Xue Lin, Yanzhi Wang, Qing Xie, Massoud Pedram Department of Electrical
More informationParallel Neural Network Training with OpenCL
Parallel Neural Network Training with OpenCL Nenad Krpan, Domagoj Jakobović Faculty of Electrical Engineering and Computing Unska 3, Zagreb, Croatia Email: nenadkrpan@gmail.com, domagoj.jakobovic@fer.hr
More informationComputer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015
18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April
More informationVector Quantization. A Many-Core Approach
Vector Quantization A Many-Core Approach Rita Silva, Telmo Marques, Jorge Désirat, Patrício Domingues Informatics Engineering Department School of Technology and Management, Polytechnic Institute of Leiria
More informationGenerating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationReliable Transmission for Remote Device Management (RDM) Protocol in Lighting Control Networks
Reliable Transmission for Remote Device Management (RDM) Protocol in Lighting Control Networks Sang-Il Choi 1, Sanghun Lee 1, Seok-Joo Koh 1, Sang-Kyu Lim 2, Insu Kim 2, and Tae-Gyu Kang 2 1 Kyungpook
More informationPERFORMANCE OF CACHE MEMORY SUBSYSTEMS FOR MULTICORE ARCHITECTURES
PERFORMANCE OF CACHE MEMORY SUBSYSTEMS FOR MULTICORE ARCHITECTURES N. Ramasubramanian 1, Srinivas V.V. 2 and N. Ammasai Gounden 3 1, 2 Department of Computer Science and Engineering, National Institute
More informationhsgm: Hierarchical Pyramid Based Stereo Matching Algorithm
hsgm: Hierarchical Pyramid Based Stereo Matching Algorithm Kwang Hee Won and Soon Ki Jung School of Computer Science and Engineering, College of IT Engineering, Kyungpook National University, 1370 Sankyuk-dong,
More informationAn Efficient Load-Sharing and Fault-Tolerance Algorithm in Internet-Based Clustering Systems
An Efficient Load-Sharing and Fault-Tolerance Algorithm in Internet-Based Clustering Systems In-Bok Choi and Jae-Dong Lee Division of Information and Computer Science, Dankook University, San #8, Hannam-dong,
More informationHSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!
Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationSurvey on Heterogeneous Computing Paradigms
Survey on Heterogeneous Computing Paradigms Rohit R. Khamitkar PG Student, Dept. of Computer Science and Engineering R.V. College of Engineering Bangalore, India rohitrk.10@gmail.com Abstract Nowadays
More informationStreaming-Oriented Parallelization of Domain-Independent Irregular Kernels?
Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels? J. Lobeiras, M. Amor, M. Arenaz, and B.B. Fraguela Computer Architecture Group, University of A Coruña, Spain {jlobeiras,margamor,manuel.arenaz,basilio.fraguela}@udc.es
More informationAccelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin
Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationReal-time target tracking using a Pan and Tilt platform
Real-time target tracking using a Pan and Tilt platform Moulay A. Akhloufi Abstract In recent years, we see an increase of interest for efficient tracking systems in surveillance applications. Many of
More informationArchitecture of Request Distributor for GPU Clusters
2012 Third Workshop on Applications for Multi-Core Architecture Architecture of Request Distributor for GPU Clusters Mani Shafaat Doost, S. Masoud Sadjadi School of Computing and Information Sciences Florida
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationA Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU
for 3D Texture-based Volume Visualization on GPU Won-Jong Lee, Tack-Don Han Media System Laboratory (http://msl.yonsei.ac.k) Dept. of Computer Science, Yonsei University, Seoul, Korea Contents Background
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationParallel Variable-Length Encoding on GPGPUs
Parallel Variable-Length Encoding on GPGPUs Ana Balevic University of Stuttgart ana.balevic@gmail.com Abstract. Variable-Length Encoding (VLE) is a process of reducing input data size by replacing fixed-length
More informationCharacter Segmentation and Recognition Algorithm of Text Region in Steel Images
Character Segmentation and Recognition Algorithm of Text Region in Steel Images Keunhwi Koo, Jong Pil Yun, SungHoo Choi, JongHyun Choi, Doo Chul Choi, Sang Woo Kim Division of Electrical and Computer Engineering
More informationOpenCL for programming shared memory multicore CPUs
OpenCL for programming shared memory multicore CPUs Akhtar Ali, Usman Dastgeer, and Christoph Kessler PELAB, Dept. of Computer and Information Science, Linköping University, Sweden akhal935@student.liu.se
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationCor Meenderinck, Ben Juurlink Nexus: hardware support for task-based programming
Cor Meenderinck, Ben Juurlink Nexus: hardware support for task-based programming Conference Object, Postprint version This version is available at http://dx.doi.org/0.479/depositonce-577. Suggested Citation
More informationGPU Architecture and Function. Michael Foster and Ian Frasch
GPU Architecture and Function Michael Foster and Ian Frasch Overview What is a GPU? How is a GPU different from a CPU? The graphics pipeline History of the GPU GPU architecture Optimizations GPU performance
More informationSTAFF: State Transition Applied Fast Flash Translation Layer
STAFF: State Transition Applied Fast Flash Translation Layer Tae-Sun Chung, Stein Park, Myung-Jin Jung, and Bumsoo Kim Software Center, Samsung Electronics, Co., Ltd., Seoul 135-893, KOREA {ts.chung,steinpark,m.jung,bumsoo}@samsung.com
More informationPerformance Estimation of Parallel Face Detection Algorithm on Multi-Core Platforms
Performance Estimation of Parallel Face Detection Algorithm on Multi-Core Platforms Subhi A. Bahudaila and Adel Sallam M. Haider Information Technology Department, Faculty of Engineering, Aden University.
More information