High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs
Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne
Key Laboratory of Computer System and Architecture, ICT, CAS, China

Outline
- GPU computation model
- Our sorting algorithm: a new bitonic-based merge sort, named Warpsort
- Experimental results
- Conclusion

GPU computation model
- Massively multi-threaded, data-parallel many-core architecture
- Important features:
  - SIMT execution model: branch divergence must be avoided
  - Warp-based scheduling: implicit hardware synchronization among the threads within a warp
  - Access pattern: coalesced vs. non-coalesced

Why merge sort?
- Similar situation to external sorting: limited on-chip shared memory vs. limited main memory
- Sequential memory access: easy to meet the coalescing requirement

Why a bitonic-based merge sort?
- Massively fine-grained parallelism
- Because of its relatively high complexity, a bitonic network is not good at sorting large arrays; we use it only to sort small subsequences
- Again, it meets the coalesced-memory-access requirement

Problems in a naive bitonic network implementation
- Block-based bitonic network, one element per thread
- Within each stage: n elements produce only n/2 compare-and-swap operations, so half of the n threads are idle; threads form both ascending and descending pairs
- Between stages: block-level synchronization is required
- Too many branch divergences and synchronization operations

What do we use? A warp-based bitonic network
- Each bitonic network is assigned to an independent warp, instead of a block
- Barrier-free: no synchronization needed between stages
- The 32 threads of a warp perform 32 distinct compare-and-swap operations in the same order, avoiding branch divergence
- At least 128 elements per warp
- On top of this we build a complete comparison-based sorting algorithm: GPU-Warpsort
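As a rough illustration, the warp-based network can be sketched as a CPU simulation (our sketch, not the authors' CUDA kernel): 32 virtual lanes sweep a 128-element tile, every lane performing compare-and-swap work in every step, with pair directions resolved arithmetically rather than by divergent branches.

```cpp
#include <algorithm>

constexpr int TILE  = 128;  // elements per warp, as on the slide
constexpr int LANES = 32;   // threads per warp

// CPU sketch of a warp-based bitonic sorting network. The TILE/2 = 64
// compare-and-swaps of each step are spread over 32 lanes, so every lane
// performs 2 operations and none sits idle; on the GPU the lanes execute
// in lockstep, so no __syncthreads() is required between steps.
void warp_bitonic_sort(int* a) {
    for (int k = 2; k <= TILE; k <<= 1) {        // bitonic block size
        for (int j = k >> 1; j > 0; j >>= 1) {   // comparator distance
            for (int lane = 0; lane < LANES; ++lane) {
                for (int w = 0; w < TILE / 2 / LANES; ++w) {
                    int v = lane + w * LANES;            // virtual comparator id
                    int i = (v / j) * 2 * j + (v % j);   // lower index of the pair
                    int p = i + j;                       // partner index
                    bool asc = ((i & k) == 0);           // direction of this pair
                    int lo = std::min(a[i], a[p]);
                    int hi = std::max(a[i], a[p]);
                    a[i] = asc ? lo : hi;                // branchless select via
                    a[p] = asc ? hi : lo;                // min/max, no if-swap
                }
            }
        }
    }
}
```

On the real hardware, each lane's sequence of operations is identical and data-independent, which is what makes the network both barrier-free and divergence-free.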

Overview of GPU-Warpsort
- Divide the input sequence into small tiles, and sort each with a warp-based bitonic sort
- Merge until the parallelism is insufficient
- Split the results into small subsequences
- Merge them to form the output

Step 1: barrier-free bitonic sort
- Divide the input array into equal-sized tiles
- Each tile is sorted by a warp-based bitonic network
- 128+ elements per tile to avoid branch divergence
- No need for __syncthreads()
- Ascending pairs + descending pairs
- Use max() and min() to replace if-swap pairs
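The max()/min() replacement mentioned on the slide can be shown with a tiny branch-free compare-and-swap (our sketch):

```cpp
#include <algorithm>

// Branch-free compare-and-swap: max()/min() replace the usual
// "if (x > y) swap(x, y)" pattern, so every thread of a warp executes
// the same instruction sequence regardless of its data.
inline void cas_asc(int& x, int& y) {   // leaves x <= y
    int lo = std::min(x, y);
    int hi = std::max(x, y);
    x = lo;
    y = hi;
}
```

Because both outcomes execute the same instructions, lanes of a warp never diverge on the comparison result.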

Step 2: bitonic-based merge sort
- t-element merge sort:
  - Allocate a t-element buffer in shared memory
  - Load the t/2 smallest elements from sequences A and B, respectively
  - Merge them
  - Output the lower t/2 elements
  - Load the next t/2 smallest elements from A or B, and repeat
- t = 8 in the slide's example
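A CPU sketch of this buffered merge with t = 8 (an illustration of the scheme, not the authors' kernel; the real implementation keeps the buffer in shared memory and runs the merge network with one warp, and the INT_MAX sentinel padding here is our simplification for ragged tails):

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// Merge two sorted sequences through a t-element buffer: each round
// bitonic-merges the buffer, emits its lower half, keeps its upper half,
// and refills the lower half from whichever input's next element is smaller.
std::vector<int> merge_t(const std::vector<int>& A0, const std::vector<int>& B0) {
    constexpr int t = 8, h = t / 2;        // buffer size and half size
    std::vector<int> A = A0, B = B0;
    A.insert(A.end(), t, INT_MAX);         // generous sentinel padding so
    B.insert(B.end(), t, INT_MAX);         // every fetch reads a full chunk

    int buf[t];
    size_t ia = 0, ib = 0;
    // Prime the buffer: h smallest of A (reversed, i.e. descending) followed
    // by h smallest of B (ascending) -- a bitonic "valley".
    for (int i = 0; i < h; ++i) buf[h - 1 - i] = A[ia++];
    for (int i = 0; i < h; ++i) buf[h + i]     = B[ib++];

    std::vector<int> out;
    const size_t total = A0.size() + B0.size();
    while (out.size() < total) {
        // Bitonic merge network: log2(t) compare-and-swap steps turn the
        // bitonic buffer into a fully ascending one.
        for (int j = h; j > 0; j >>= 1)
            for (int i = 0; i < t; ++i)
                if ((i & j) == 0 && buf[i] > buf[i + j])
                    std::swap(buf[i], buf[i + j]);

        // The lower half is final output; the upper half stays buffered.
        for (int i = 0; i < h && out.size() < total; ++i) out.push_back(buf[i]);
        if (out.size() >= total) break;

        // Refill the lower half (reversed) from the sequence whose next
        // element is smaller; the buffer becomes bitonic again.
        bool fromA = A[ia] <= B[ib];
        const std::vector<int>& S = fromA ? A : B;
        size_t& is = fromA ? ia : ib;
        for (int i = 0; i < h; ++i) buf[h - 1 - i] = S[is++];
    }
    return out;
}
```

The head-comparison refill rule is what keeps the lower half final: every buffered element came from an already-loaded, sorted prefix, so it is no larger than both sequences' next elements.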

Step 3: split into small tiles
- Problem with merge sort: the number of sequence pairs decreases geometrically, which cannot keep this massively parallel platform busy
- Method: divide the large sequences into independent small tiles which satisfy:

Step 3: split into small tiles (cont.)
- How do we get the splitters? Sample the input sequence randomly.
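A sketch of the random-sampling idea (function and parameter names are our own, and the fixed seed and oversampling factor are assumptions for illustration): sample the input, sort the sample, take evenly spaced samples as splitters, then cut each sorted sequence at the splitters by binary search so the resulting tiles can be merged independently.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Pick splitters by sorting a random sample of the (non-empty) input and
// taking every oversample-th element.
std::vector<int> pick_splitters(const std::vector<int>& input,
                                int num_splitters, int oversample = 8) {
    std::mt19937 rng(42);  // fixed seed for reproducibility
    std::uniform_int_distribution<size_t> pick(0, input.size() - 1);
    std::vector<int> sample;
    for (int i = 0; i < num_splitters * oversample; ++i)
        sample.push_back(input[pick(rng)]);
    std::sort(sample.begin(), sample.end());
    std::vector<int> splitters;
    for (int i = 1; i <= num_splitters; ++i)
        splitters.push_back(sample[i * oversample - 1]);
    return splitters;
}

// Cut one sorted sequence at the splitters; returns tile boundary indices.
std::vector<size_t> cut_points(const std::vector<int>& sorted_seq,
                               const std::vector<int>& splitters) {
    std::vector<size_t> cuts{0};
    for (int s : splitters)
        cuts.push_back(std::lower_bound(sorted_seq.begin(), sorted_seq.end(), s)
                       - sorted_seq.begin());
    cuts.push_back(sorted_seq.size());
    return cuts;
}
```

Because all sorted sequences are cut at the same splitter values, tile i of every sequence holds keys from the same range, so those tiles can be merged in parallel without interaction.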

Step 4: final merge sort
- Subsequences (0,i), (1,i), ..., (l-1,i) are merged into S_i
- Then S_0, S_1, ..., S_l are assembled into a totally sorted array

Experimental setup
- Host: AMD Opteron 880 @ 2.4 GHz, 2 GB RAM
- GPU: GeForce 9800 GTX+, 512 MB
- Input sequences:
  - Key-only and key-value configurations, with 32-bit keys and values
  - Sizes from 1M to 16M elements
  - Distributions: Zero, Sorted, Uniform, Bucket, and Gaussian

Performance comparison
- Mergesort: the fastest comparison-based sorting algorithm on the GPU (Satish, IPDPS '09); implementations already compared by Satish are not included
- Quicksort: Cederman, ESA '08
- Radixsort: the fastest sorting algorithm on the GPU (Satish, IPDPS '09)
- Warpsort: our implementation

Performance results
- Key-only: 70% higher performance than quicksort
- Key-value: 20%+ higher performance than mergesort, and 30%+ for large sequences (>4M)

Results under different distributions
- The Uniform, Bucket, and Gaussian distributions achieve almost the same performance
- The Zero distribution is the fastest
- Warpsort does not excel on the Sorted distribution, due to load imbalance

Conclusion
- We present an efficient comparison-based sorting algorithm for many-core GPUs that carefully maps its tasks to the GPU architecture:
  - A warp-based bitonic network eliminates barriers
  - Each thread receives sufficient homogeneous parallel operations, avoiding thread idling and thread divergence
  - Global memory accesses are fully coalesced when fetching and storing sequence elements
- The results demonstrate up to 30% higher performance than previously optimized comparison-based algorithms

Thanks