AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors


AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory September 18 th, 2007 PACT2007 at Brasov, Romania

Goal Develop a fast sorting algorithm by exploiting the following features of today's processors: SIMD instructions and multiple cores (thread-level parallelism). Outperform existing algorithms, such as Quick sort, by exploiting SIMD instructions on a single thread. Scale well by using the thread-level parallelism of multiple cores.

Benefit and limitation of SIMD instructions Benefit: SIMD selection instructions (e.g. the select-minimum instruction) can accelerate sorting by parallelizing comparisons and avoiding unpredictable conditional branches. [Diagram: inputs A0 A1 A2 A3 and B0 B1 B2 B3 pass through four parallel MIN operations to produce output C0 C1 C2 C3.] Limitation: SIMD load/store instructions are effective only when they access contiguous 128 bits of data (e.g. four 32-bit values) aligned on a 128-bit boundary.
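A minimal sketch of the select-minimum idea, using Python lists as stand-ins for 128-bit vector registers (the names vmin and vmax are illustrative, not from the paper):

```python
def vmin(x, y):
    """Element-wise select-minimum: one SIMD instruction on real hardware,
    four comparisons with no data-dependent conditional branches."""
    return [min(a, b) for a, b in zip(x, y)]

def vmax(x, y):
    """Element-wise select-maximum, the counterpart of vmin."""
    return [max(a, b) for a, b in zip(x, y)]

# One vector compare replaces four scalar compare-and-branch steps:
lo = vmin([5, 9, 12, 10], [3, 4, 1, 6])   # [3, 4, 1, 6]
hi = vmax([5, 9, 12, 10], [3, 4, 1, 6])   # [5, 9, 12, 10]
```

On VMX these correspond to vec_min/vec_max; the point is that min and max together form a branch-free compare-exchange.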

SIMD instructions are effective for some existing sorting algorithms, but those algorithms are slower than Quick sort. Examples: Bitonic merge sort, Odd-even merge sort, and GPUTeraSort [Govindaraju '05]. They are slower than Quick sort for sorting a large number (N) of elements because their computational complexity is O(N (log N)^2). [Diagram: a step of Bitonic merge sort comparing 1 5 6 8 against 7 4 3 2 with MIN and MAX operations.]

Presentation Outline Motivation AA-sort: Our new sorting algorithm Experimental results Summary

What is Comb sort? Comb sort [Lacey '91] is an extension of Bubble sort: it compares two non-adjacent elements. Example input: 2 7 6 5 8 4 1 3. It starts by comparing two elements separated by a large distance (distance = 6), then decreases the distance before each pass by dividing it by a constant called the shrink factor (1.3): distance = 6, 4, 3, 2, 1. Passes with distance = 1 repeat until all data are sorted. Average complexity: O(N log N).
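The scalar algorithm described above can be sketched as follows (a Python sketch; variable names are illustrative):

```python
def comb_sort(a, shrink=1.3):
    """Comb sort [Lacey '91]: Bubble sort with a shrinking comparison
    distance. The distance is divided by the shrink factor (1.3) each
    pass; passes with distance 1 repeat until no swap occurs."""
    a = list(a)
    n = len(a)
    gap = n
    swapped = True
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))   # shrink the distance each pass
        swapped = False
        for i in range(n - gap):
            if a[i] > a[i + gap]:         # compare two non-adjacent elements
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a
```

For the 8-element example above, the first pass indeed uses distance int(8 / 1.3) = 6, matching the slides.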

Our technique to SIMDize the Comb sort SIMD instructions are not effective for the plain Comb sort due to unaligned memory accesses and loop-carried dependencies. [Example: input data 5 3 2 11 9 4 12 1 10 6 8 7 must become the sorted output data 1 2 3 4 5 6 7 8 9 10 11 12.]

Our technique to SIMDize the Comb sort Reorder the input into the Transposed order, sort it there, and reorder back afterwards. Assuming four elements per vector, the input (5 3 2 11 9 4 12 1 10 6 8 7) is sorted in the Transposed order as the vectors 1 4 7 10 / 2 5 8 11 / 3 6 9 12, then reordered into the sorted output 1 2 3 4 5 6 7 8 9 10 11 12. In the Transposed order the comparison steps have no unaligned accesses and no loop-carried dependencies, so SIMD instructions become effective.

Analysis of our SIMDized Comb sort

                                       our SIMDized       naively SIMDized   original (scalar)
                                       Comb sort          Comb sort          Comb sort
  Unpredictable conditional branches   almost 0           almost 0           O(N log N)
  Unaligned memory accesses            almost 0           O(N log N)         N/A
  Computational complexity             O(N log N)         O(N log N)         O(N log N)
                                       + O(N) reordering

Overview of the AA-Sort It consists of two phases. Phase 1: divide the input data into small blocks that fit into the (L2) cache and sort each block using the SIMDized Comb sort. Phase 2: merge the sorted blocks using the SIMDized Merge sort.
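The two-phase structure can be sketched as follows, with Python's sorted() and heapq.merge standing in for the SIMDized Comb sort and SIMDized merge (a structural sketch only, not the vectorized implementation):

```python
import heapq

def aa_sort(a, block=4096):
    """Two-phase structure of AA-sort (with scalar stand-ins):
    Phase 1 sorts cache-sized blocks; Phase 2 merges them pairwise."""
    # Phase 1: divide into blocks and sort each one independently
    blocks = [sorted(a[i:i + block]) for i in range(0, len(a), block)]
    # Phase 2: pairwise merge passes until one sorted array remains
    while len(blocks) > 1:
        merged = []
        for i in range(0, len(blocks), 2):
            if i + 1 < len(blocks):
                merged.append(list(heapq.merge(blocks[i], blocks[i + 1])))
            else:
                merged.append(blocks[i])    # odd block out, carried forward
        blocks = merged
    return blocks[0] if blocks else []
```

The block size is chosen so each Phase-1 sort stays cache-resident; the independent blocks and independent merges are also what Phase 1 and early Phase 2 parallelize across threads.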

Our approach to SIMDize merge operations SIMD instructions are effective for Bitonic merge and Odd-even merge, but their computational complexity is higher than that of the usual merge operation. Our solution: integrate the Odd-even merge into the usual merge operation, taking advantage of SIMD instructions while keeping the computational complexity of the usual merge.

Odd-even merge for values in two vector registers Input: two vector registers, each containing four presorted values (1 4 7 8 and 2 3 5 6). The Odd-even merge runs in three stages with nine comparisons in total; each stage takes one SIMD comparison plus shuffle operations, with no conditional branches. Output: the eight values in the two vector registers are now sorted (1 2 3 4 and 5 6 7 8).
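The three stages can be modeled with scalar compare-exchanges. The concrete comparator pairs below follow Batcher's standard odd-even merge network for two sorted 4-element sequences (an assumption based on the standard network; the slides give only the stage count):

```python
def odd_even_merge8(va, vb):
    """Merge two sorted 4-element 'vector registers' via Batcher's
    odd-even merge network: 3 stages, 9 compare-exchanges, and no
    data-dependent branches. On real SIMD hardware each stage is one
    vector min/max pair plus shuffles."""
    c = list(va) + list(vb)
    for i, j in [(0, 4), (1, 5), (2, 6), (3, 7),   # stage 1
                 (2, 4), (3, 5),                   # stage 2
                 (1, 2), (3, 4), (5, 6)]:          # stage 3
        c[i], c[j] = min(c[i], c[j]), max(c[i], c[j])
    return c[:4], c[4:]
```

On the slide's example, odd_even_merge8([1, 4, 7, 8], [2, 3, 5, 6]) yields the registers [1, 2, 3, 4] and [5, 6, 7, 8].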

Our technique to integrate Odd-even merge into the usual merge Example with two sorted arrays: array 1 = 1 4 7 9 10 11 12 14 21 23 and array 2 = 2 3 5 6 8 13 15 16 17 18.
1. Load the first four values of each array into vector registers (1 4 7 9 and 2 3 5 6) and apply the Odd-even merge; the registers now hold 1 2 3 4 and 5 6 7 9.
2. Output the smaller four values (1 2 3 4) to the merged array and keep 5 6 7 9 in a register. Use one scalar comparison of the arrays' next elements to select which array to load the next four values from.
3. Load the next four values from array 2 (8 13 15 16, since 8 < 10) and apply the Odd-even merge with 5 6 7 9; output the smaller four (5 6 7 8), keeping 9 13 15 16 in the register.
4. Repeat until both arrays are consumed; each step outputs four sorted values to the merged array.
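The walkthrough above can be sketched in scalar Python (assuming both input lengths are multiples of four; helper names are illustrative). Each loop iteration performs one branch-free odd-even merge plus a single scalar comparison:

```python
def merge_sorted(arr1, arr2):
    """Merge two sorted arrays four elements at a time. The only
    data-dependent branch is the scalar comparison choosing which
    array supplies the next vector (one branch per output vector)."""
    def oem8(va, vb):  # Batcher odd-even merge of two sorted 4-vectors
        c = list(va) + list(vb)
        for i, j in [(0, 4), (1, 5), (2, 6), (3, 7),
                     (2, 4), (3, 5),
                     (1, 2), (3, 4), (5, 6)]:
            c[i], c[j] = min(c[i], c[j]), max(c[i], c[j])
        return c[:4], c[4:]

    p1, p2, out = 4, 4, []
    lo, hi = oem8(arr1[0:4], arr2[0:4])
    while p1 < len(arr1) and p2 < len(arr2):
        out += lo                          # smaller four go to the output
        # one scalar comparison selects the array to load from
        if arr1[p1] <= arr2[p2]:
            nxt, p1 = arr1[p1:p1 + 4], p1 + 4
        else:
            nxt, p2 = arr2[p2:p2 + 4], p2 + 4
        lo, hi = oem8(nxt, hi)
    out += lo
    # one array is exhausted: drain the rest against the held register
    rest = arr1[p1:] + arr2[p2:]
    while rest:
        lo, hi = oem8(rest[:4], hi)
        out += lo
        rest = rest[4:]
    out += hi
    return out
```

Total work stays O(N) as in a scalar merge, but comparisons happen four at a time inside the branch-free network.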

Comparing merge operations

                                       our integrated        odd-even merge          usual (scalar)
                                       merge operation       implemented with SIMD   merge operation
  Unpredictable conditional branches   1 per output vector   0                       1 per output element
  Computational complexity             O(N)                  O(N log N)              O(N)

Parallelizing AA-sort among multiple threads Phase 1: each thread sorts independent blocks using the SIMDized Comb sort (e.g. thread 1 sorts blocks 1 and 2, thread 2 sorts blocks 3 and 4). Phase 2: each thread first executes independent merge operations; for the final merge, multiple threads cooperate on one merge operation.

Presentation Outline Motivation AA-sort: Our new sorting algorithm Experimental results Summary

Environment for Performance Evaluation We used two processors for the performance evaluation: PowerPC 970 using VMX instructions (up to 4 cores), and Cell BE using SPE cores (up to 16 SPE cores). We compared the performance of four implementations: AA-sort, GPUTeraSort [Govindaraju '05], ESSL (IBM's optimized library), and the STL delivered with GCC (open-source library).

Single-thread performance on PowerPC 970 for various input sizes [Chart: execution time in seconds (0.01 to 100, log scale) versus number of elements (1M to 128M) for AA-sort, GPUTeraSort, ESSL, and STL; lower is faster.] Sorting 32-bit random integers on one core of PowerPC 970.

Single-thread performance on PowerPC 970 [Bar chart: execution time per algorithm; relative to AA-sort, GPUTeraSort is 3.3x slower, ESSL 1.7x slower, and STL 3.0x slower.] Sorting 32M elements of random 32-bit integers on PowerPC 970.

Scalability with multiple cores on PowerPC 970 [Chart: throughput relative to AA-sort on one core, versus number of cores (1 to 4); AA-sort reaches 2.6x with 4 cores, GPUTeraSort 2.1x with 4 cores.] Sorting 32M elements of random 32-bit integers on PowerPC 970.

Scalability with multiple cores on Cell BE [Chart: throughput relative to AA-sort on one core, versus number of SPE cores (1 to 16); AA-sort reaches 7.4x with 8 cores and 12.2x with 16 cores.] Sorting 32M elements of random 32-bit integers on Cell BE.

Summary We proposed a new sorting algorithm called AA-sort, which takes advantage of SIMD instructions and multiple cores (thread-level parallelism). We evaluated AA-sort on PowerPC 970 and Cell BE. Using only one core of PowerPC 970, AA-sort outperformed IBM's ESSL by 1.7x and GPUTeraSort by 3.3x. On Cell BE, AA-sort showed good scalability: 7.4x with 8 cores and 12.2x with 16 cores.

Thank you for your attention!

SIMD instructions are not effective for Quick sort Quick sort requires element-wise and unaligned memory accesses. [Diagram: a step of Quick sort (pivot = 7) on 5 3 2 11 9 4 12 1 10 6 8 13: search from the left for an element larger than the pivot, search from the right for an element smaller than the pivot, then swap; these accesses are element-wise and unaligned.]

Transposed order View the values to sort as a two-dimensional array: a row is an array of vectors, a column is the elements within a vector. With four slots per vector, N values occupy n = N/4 vectors.

Original order (row major): adjacent values are stored in one vector.
  vector 0:    0    1    2    3
  vector 1:    4    5    6    7
  vector 2:    8    9    10   11
  ...
  last vector: N-4  N-3  N-2  N-1

Transposed order (column major): adjacent values are stored in the same slot of adjacent vectors.
  vector 0:    0    n    2n   3n
  vector 1:    1    n+1  2n+1 3n+1
  vector 2:    2    n+2  2n+2 3n+2
  ...
  last vector: n-1  2n-1 3n-1 N-1
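For concreteness, the column-major mapping can be sketched as follows (assuming four slots per vector; the function name is illustrative):

```python
def transpose_layout(a, vlen=4):
    """Lay out a flat array in Transposed (column-major) order:
    vector r holds elements r, r+n, r+2n, r+3n, where n = len(a)//vlen
    is the number of vectors."""
    n = len(a) // vlen                     # number of vector registers
    return [[a[c * n + r] for c in range(vlen)] for r in range(n)]
```

On the 12-element example from the SIMDized Comb sort slides, transpose_layout(range(1, 13)) produces the vectors [1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12].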