GNU libstdc++ parallel mode: Algorithms
The GNU libstdc++ parallel mode: Algorithms
Institute for Theoretical Computer Science, University of Karlsruhe
Talk Outline
- Introduction
- Library Overview
- Algorithms
- Conclusion
Motivation
How to benefit from multi-core systems?
- automatic parallelization is not sufficient
- manual/explicit parallelization is needed, but expensive and beyond the qualification of most programmers
Our approach
- provide a parallelized library of basic algorithms for shared-memory systems
- provide implementations of all worthwhile STL algorithms
- included with GCC under the name libstdc++ parallel mode, formerly known as the Multi-Core Standard Template Library (MCSTL)
- make the usage of (data-)parallel algorithms very easy: the actual parallelism is not visible to the user, but encapsulated
- build on an established base
- bring multi-core performance to the end user in every program
Basic Approach
Make the usage of (data-)parallel algorithms as easy as winking.
Starting point
- provide the functionality of the C++ Standard Template Library
- run the algorithms in parallel
Why STL?
- many useful algorithms and data structures included
- simple interface, very well known among developers
- recompilation of existing programs may suffice
- C++ is an accepted and efficient language, and standardized
Goals
Ease of use
- no new language, no language extension
- no (binary) library to be installed on the target system
- just a few compiler options
Good performance
- some speedup already for small inputs (scale down)
- full speedup for larger inputs (scale up)
- co-exist with other forms of parallelization; respect machine load
- dynamic load balancing
Competitors
STAPL
- abstracts from the memory model/communication, so it must incorporate distributed-memory issues
- no code publicly available; interface only similar to STL
Intel Threading Building Blocks
- mostly on a more abstract level, a parallel programming framework
- the only parallel combinatorial generic algorithm is the sorter
Technical Foundations
- based on OpenMP (fork-join parallelism)
- switching parallelism on/off both at compile time and at run time
Software stack (top to bottom): applications; STL interface; serial STL algorithms alongside the parallel STL algorithms of the MCSTL; OpenMP extensions and atomic operations; OS thread scheduling; multi-core hardware
Atomic Operations
- a few operations are executed atomically, without any chance of interference
- fetch_and_add(x, i): t := x; x := x + i; return t
  - allows concurrent iteration over a sequence
- compare_and_swap(x, c, r): if x = c then { x := r; return c [true] } else { return r [false] }
  - secure state transition; can emulate fetch_and_add and others by using it in a loop
- slower than the usual operation, in particular when concurrent
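These primitives map directly onto std::atomic in modern C++. A minimal sketch (here compare_and_swap returns a success flag rather than the old value; the function names mirror the slide's pseudocode, not a library API):

```cpp
#include <atomic>
#include <cassert>

// fetch_and_add(x, i): atomically read x, add i, and return the old value.
int fetch_and_add(std::atomic<int>& x, int i) {
    return x.fetch_add(i);
}

// compare_and_swap(x, c, r): if x == c, set x := r; returns whether it succeeded.
bool compare_and_swap(std::atomic<int>& x, int c, int r) {
    return x.compare_exchange_strong(c, r);
}

// fetch_and_add emulated with compare_and_swap in a retry loop, as the slide notes.
int fetch_and_add_via_cas(std::atomic<int>& x, int i) {
    int old = x.load();
    while (!x.compare_exchange_strong(old, old + i)) {
        // on failure, old now holds the current value of x; retry
    }
    return old;
}
```

The CAS-in-a-loop emulation illustrates the slide's point that compare_and_swap subsumes the other primitives, at the cost of retries under contention.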
Overview of Important Algorithms
Strictly STL (mostly <algorithm>)
- for_each and friends (embarrassingly parallel)
- find
- partial_sum (prefix sum)
- partition, partial_sort
- merge
- sort
- random_shuffle
- bulk construction and bulk insert for set and map [1]
Extensions to STL
- multiway_merge
- multiseq_partition (helper)
MCSTL Development Status
(Each entry: algorithm class: function call(s); status; with load balancing (w/ lb); without load balancing (w/o lb).)
- Embarrassingly parallel: for_each, generate(_n), fill(_n), count(_if), transform, replace(_if), min_element, max_element, adjacent_difference, unique_copy; implemented; w/ lb: yes; w/o lb: yes
- Find: find(_if), find_first_of, adjacent_find, mismatch, equal, lexicographical_compare; implemented; w/ lb: yes; w/o lb: not worthwhile
- Search: search(_n); implemented; w/ lb: yes; w/o lb: not worthwhile
- Numerical algorithms: accumulate, partial_sum, inner_product; implemented; w/ lb: planned; w/o lb: yes
- Partition: partition, stable_partition; implemented; w/ lb: yes; w/o lb: not worthwhile
- Merge: merge, multiway_merge, inplace_merge; implemented; w/ lb: planned; w/o lb: yes
- Partial sort: nth_element, partial_sort; implemented; w/ lb: yes; w/o lb: planned
- Sort: sort, stable_sort; implemented; w/ lb: yes; w/o lb: yes
- Random permutation: random_shuffle; implemented; w/ lb: yes; w/o lb: not worthwhile
- Dictionaries: (multi_)map/set bulk operations; w/ lb: yes; w/o lb: yes
- Complex set operations: set_union, set_intersection, set_(symmetric_)difference, ...; implemented; w/ lb: no; w/o lb: yes
- Vector arithmetic: valarray operations; worked on; w/ lb: yes; w/o lb: yes
- Heap construction: make_heap, sort_heap; planned
- Priority queues: amortized update operations; planned
for_each
Problem definition
- execute a certain function on a range of elements
- many similar functions like transform, generate
- parallelization is easy only for uniform execution time per element on an exclusive machine
for_each: Implementation
Multiple implementations, depending on the purpose.
Static load balancing
- divide work into parts of almost equal size
- used for accumulate, since the ends of chunks can easily be spliced (not commutative)
Dynamic load balancing
- initially divide work into parts of almost equal size
- allow unemployed threads to take work from others (work-stealing)
for_each: Work-Stealing
- additional synchronization is done only by threads that are out of work
- steal from a random victim; steal half of the remaining jobs
- the atomic operation fetch_and_add is used to reserve job(s); it is efficiently supported by today's hardware
- the chunk size C allows a compromise between the two worst cases: uniformly distributed, little work vs. skewedly distributed, much work
- maximal slowdown 10 for C = 1 and no work; neutral for C = 10 and no work
- full speedup for hard work, no matter what the distribution
- a logarithmic number of steals suffices with high probability
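The chunk-reservation step can be illustrated with std::atomic. This single-threaded sketch shows only how a worker claims C jobs at a time with one fetch_and_add on a shared counter; the stealing between threads is omitted, and dynamic_for_each is an illustrative name, not the library's:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Each worker repeatedly claims the next `chunk` jobs with a single
// fetch_and_add on a shared counter; no locks are required. In the library,
// every OpenMP thread runs such a claiming loop (and may additionally steal
// half the remaining jobs of a random victim); here the loop runs
// single-threaded for clarity.
template <typename Func>
void dynamic_for_each(std::vector<int>& data, Func f, long chunk = 10) {
    std::atomic<long> next(0);
    const long n = static_cast<long>(data.size());
    for (;;) {
        long begin = next.fetch_add(chunk);  // atomically reserve a chunk
        if (begin >= n) break;
        long end = std::min(begin + chunk, n);
        for (long i = begin; i < end; ++i) f(data[i]);
    }
}
```

Because the counter is atomic, several threads could run the same loop concurrently and each index would still be processed exactly once.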
for_each: Performance Results
[Figure: speedup of a Mandelbrot computation (at most 1000 iterations per pixel) on a 4-way Opteron vs. number of pixels, for 2-4 threads with balanced and unbalanced work, against sequential.]
find
Problem definition
- find the first position in a sequence that matches/satisfies a predicate
- find_if also immediately covers find, adjacent_find, mismatch, equal, lexicographical_compare
Considerations
- the sequential algorithm needs O(m) time if the first hit is at position m
- a naive parallel algorithm needs Ω(n/p) = Ω(m) time if m = n/p - 1 (worst case)
- parallelization overhead makes the situation even worse for small m
find: Algorithm
Solution
- start sequentially up to position m0
- only then start assigning blocks to the p threads, dynamically load-balancing using the fetch-and-add primitive
- the first thread that is successful signals success to all others by grabbing the remaining part
Tradeoff
- small blocks lower the termination latency, but increase overhead
- solution: exponentially grow the block size from the starting point m0 up to m
[Diagram: a sequential prefix up to m0, followed by blocks assigned in parallel to threads p0, p1, p2, ...]
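The block schedule described above can be sketched sequentially. The parameters m0, s0 and the growth factor are illustrative, not the library's actual tuning, and in the parallel version the blocks would be claimed by threads via fetch_and_add:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the index of the first element satisfying p, or v.size() if none:
// a sequential prefix of m0 elements, then blocks whose size grows
// exponentially, bounding both termination latency (small early blocks)
// and per-block overhead (few large late blocks).
template <typename Pred>
long find_blocked(const std::vector<int>& v, Pred p,
                  long m0 = 4, long s0 = 2, double growth = 2.0) {
    const long n = static_cast<long>(v.size());
    for (long i = 0; i < m0 && i < n; ++i)      // 1. sequential start
        if (p(v[i])) return i;
    long pos = std::min(m0, n);
    double size = static_cast<double>(s0);
    while (pos < n) {                           // 2. exponentially growing blocks
        long end = std::min(pos + static_cast<long>(size), n);
        for (long i = pos; i < end; ++i)
            if (p(v[i])) return i;
        pos = end;
        size *= growth;
    }
    return n;                                   // not found, as with std::find
}
```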
find: Performance Results
[Figure: speedup vs. sequence length (up to 10^7) for 2-32 threads, comparing growing blocks (gb), fixed-size blocks (fsb), and the naive algorithm against sequential.]
merge, multiway_merge
Problem definitions
- merge: combine two sorted sequences into one sorted sequence
- multiway merge: combine k > 2 sorted sequences into one sorted sequence; important for (external memory) sorting
How to divide the problem?
- find slabs, i.e. consistent sets of ranges from the sequences
- two possibilities: (randomized) splitting by sampling, or exact partitioning into slabs of equal size (using multi-sequence selection)
merge, multiway_merge: Sequence Partitioning
Exact splitting vs. sampling
- performance guarantee: no bad inputs such as long sequences of equal elements
- complicated algorithm for multi-sequence selection [4]
- first generic implementation provided; explicitly handles degenerate inputs
Algorithm
- divide the sequences into p slabs of almost equal size
- determine the target positions; in parallel, merge the slabs to the target positions
- total running time O((m/p) log k + k log k log max_j |S_j|), where m = sum_j |S_j| is the accumulated length of all sequences
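As an illustration of the merge step itself, here is a plain sequential k-way merge using a min-heap of (value, sequence) pairs. This is not the library's implementation; the parallel algorithm first splits all sequences into p equal-size slabs via multi-sequence selection and runs such a merge per slab:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Sequential k-way merge: the heap always holds the current front element
// of every non-exhausted sequence, so each pop yields the globally
// smallest remaining element.
std::vector<int> multiway_merge(const std::vector<std::vector<int>>& seqs) {
    using Item = std::pair<int, std::size_t>;        // (value, sequence index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> pos(seqs.size(), 0);
    for (std::size_t k = 0; k < seqs.size(); ++k)
        if (!seqs[k].empty()) heap.push({seqs[k][0], k});
    std::vector<int> out;
    while (!heap.empty()) {
        auto [val, k] = heap.top();
        heap.pop();
        out.push_back(val);
        if (++pos[k] < seqs[k].size())
            heap.push({seqs[k][pos[k]], k});         // advance that sequence
    }
    return out;
}
```

Each of the m output elements costs O(log k) heap work, matching the m/p * log k term in the running time above.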
merge, multiway_merge: Diagram
[Diagram: k sorted sequences partitioned into slabs, one slab per thread t0, ..., t3.]
sort, stable_sort
Parallel multiway mergesort
+ less communication necessary
+ stable variant easy to derive
- needs twice the space
Parallel load-balanced quicksort
+ in-place
± dynamic load balancing to compensate for unequal splitting
- concurrent access to memory
- not stable
Both variants are implemented in the MCSTL; the user can choose.
Parallel Multiway Mergesort
Procedure
1. divide the sequence into p parts of equal size
2. in parallel, sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy the result back to the original position
[Diagram: threads t0-t3 sorting their parts locally, then merging.]
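The four steps can be sketched sequentially. This is only a sketch of the structure: the local sorts would run on one thread per part, and the parallel p-way merge of step 3 is replaced here by repeated two-way std::inplace_merge calls for brevity:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Split into p parts of (almost) equal size, sort each part, then merge
// the sorted parts. Returns the sorted copy (mirroring step 4's copy-back).
std::vector<int> multiway_mergesort(std::vector<int> v, std::size_t p = 4) {
    const std::size_t n = v.size();
    if (n == 0 || p == 0) return v;
    const std::size_t part = (n + p - 1) / p;        // 1. part boundaries
    std::vector<std::size_t> bounds;
    for (std::size_t b = 0; b < n; b += part) {
        std::size_t e = std::min(b + part, n);
        std::sort(v.begin() + b, v.begin() + e);     // 2. local sorts
        bounds.push_back(e);
    }
    std::size_t merged = bounds[0];                  // 3. merge the parts
    for (std::size_t i = 1; i < bounds.size(); ++i) {
        std::inplace_merge(v.begin(), v.begin() + merged, v.begin() + bounds[i]);
        merged = bounds[i];
    }
    return v;
}
```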
partition
Partition the sequence into elements < pivot and elements > pivot.
Sequential algorithm
- scan from both ends
- swap to the desired order when contrary elements are found
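The sequential two-ended scan can be sketched as follows (the helper name partition_pivot is illustrative):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Two-ended scan: the left cursor advances over elements already smaller
// than the pivot, the right cursor retreats over elements already >= pivot,
// and each contrary pair found is swapped. Returns the split point.
// In the parallel version, threads run this scan over blocks of size B
// claimed from the two ends with fetch_and_add.
std::size_t partition_pivot(std::vector<int>& v, int pivot) {
    std::size_t lo = 0, hi = v.size();
    for (;;) {
        while (lo < hi && v[lo] < pivot) ++lo;
        while (lo < hi && v[hi - 1] >= pivot) --hi;
        if (lo >= hi) return lo;                // fully partitioned
        std::swap(v[lo], v[hi - 1]);            // contrary pair
    }
}
```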
Parallel Partitioning [Tsigas, Zhang 2003]
1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the middle
3. recurse on the middle part
- time complexity O(n/p + B log p)
[Diagram: threads p0, p1, p2 scanning blocks from both ends of the input, swapping in parallel; the rest is handled recursively or sequentially.]
partition: Example
[Worked example: 3 processors, B = 3, pivot 50, no special cases.]
Partitioning of 32-bit integers on a Sun T1
[Figure: speedup vs. n for 1-32 threads against sequential.]
Parallel Balanced Quicksort
Procedure [3]
1. split the sequence using parallel partition; descend recursively with the appropriate number of threads
2. as soon as there is only one processor left per partition, start local sorting
3. each processor sorts locally and pushes parts to be processed later into a lock-free deque
4. other processors can steal parts when out of work
[Diagram: the input is partitioned in parallel by p0-p2; then each processor sorts sequentially, and idle processors steal queued parts.]
Sorting Performance Results
[Figure: speedup sorting pairs of 64-bit integers on the Sun T1 vs. number of elements, for 2-32 threads, comparing multiway mergesort (mwms) and balanced quicksort (bqs) against sequential.]
Sorting Performance Results
[Figure: speedup of multiway mergesort for 32-bit integers on 2 quad-core Xeons vs. input size, for 1-8 threads, against sequential.]
Random Permutation (random_shuffle)
Standard sequential algorithm (e.g. STL)
- for 0 <= i < n: swap(a[i], a[rand(i, n - 1)])
Cache-efficient (parallel) algorithm
1. distribute the elements randomly to (local) buckets
1b. (copy local buckets to global buckets)
2. permute each bucket
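The standard sequential algorithm above is the Fisher-Yates shuffle; a sketch using <random> (the function name and the choice of std::mt19937 are ours, not the library's):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Fisher-Yates: position i is swapped with a uniformly random position
// in [i, n-1], which yields a uniformly random permutation.
void fisher_yates(std::vector<int>& a, std::mt19937& rng) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i + 1 < n; ++i) {
        std::uniform_int_distribution<std::size_t> d(i, n - 1);
        std::swap(a[i], a[d(rng)]);
    }
}
```

The cache-efficient parallel variant avoids these scattered single-element swaps by first distributing elements into buckets and then permuting bucket-locally.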
Random Permutation (random_shuffle)
- time complexity O(n/p + p), global communication volume n
- cache efficiency is very important (a factor of 2)
[Figure: speedup of cache-aware random shuffling of integers on a 4-way Opteron vs. n (up to 10^8), for 1-4 threads, against sequential.]
Dictionary Bulk Operations
Algorithmic problem
- construct/insert into a red-black tree
- complicated splitting and balancing of the work
- the bulk algorithm already brings a sequential speedup
- not yet in the parallel mode, only in the MCSTL
Memory management
- memory allocation takes a considerable share of the time
- C++ does not allow asymmetric allocation/deallocation, i.e. allocating several nodes at once and later deallocating them one by one
Dictionary Bulk Operations Performance
[Figure: speedup of insertion on a 2-way quad-core Xeon vs. number of inserted elements (in thousands), for 1-8 threads, against sequential.]
Effect of Core Mapping (Tree Construction)
[Figure: speedup vs. number of inserted elements (in thousands) for 2 threads on different sockets, on the same socket but different dice, and on the same die, against 1 thread and sequential.]
Usage Example Code

  #include <algorithm>
  #include <vector>

  std::vector<double> v;
  // ... fill v ...
  std::random_shuffle(v.begin(), v.end());

Recompiling with the parallel mode enabled (in GCC: -D_GLIBCXX_PARALLEL -fopenmp) makes such calls run in parallel.
Applications
- in combination with STXXL (the external memory STL): sort, multiway merge, suffix array construction
- additional integer sort routine; merge, for_each
Release to the Public
- integration into libstdc++ has started; will ship as a part of GCC
- GPL with runtime exception
- open source, everybody can contribute
Conclusions
Benefits
- the MCSTL provides an easy way to incorporate data parallelism into programs on an algorithmic level
- fully generic
- performance is excellent for large inputs
- speedup is at hand for small inputs as well, depending on circumstances
- could transparently support new paradigms, e.g. transactional memory
- a repository for parallel algorithm implementations
Use more (MC)STL!
Demands on Language Spec, OS, and Hardware
OS
- allow specifying affinity between threads, e.g. "threads 0, 2 and 4 should run close together (shared cache), otherwise widely separated (maximum bandwidth)"
- define penalties for switching cores (cache locality)
Hardware
- more memory bandwidth
- faster communication
- larger shared caches
Future Work
- performance estimation, automatic switching point detection
  [Figure: preliminary results for switching the number of threads in balanced quicksort; speedup vs. input size for sequential and 1-8 threads.]
- more application studies
- updates of complex data structures like priority queues
Performance Estimation Issues
- circumstances can hardly be determined at compile time
- execution times of functors, comparators, overloaded assignment operators, and copy constructors are crucial for performance
References
[1] L. Frias and J. Singler. Parallelization of Bulk Operations for STL Dictionaries. In Workshop on Highly Parallel Processing on a Chip (HPPC).
[2] J. Singler, P. Sanders, and F. Putze. MCSTL: The Multi-Core Standard Template Library. In Euro-Par 2007: Parallel Processing, volume 4641 of LNCS. Springer-Verlag.
[3] P. Tsigas and Y. Zhang. A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. In 11th Euromicro Conference on Parallel, Distributed and Network-Based Processing, page 372, 2003.
[4] P. J. Varman, S. D. Scheufler, B. R. Iyer, and G. R. Ricard. Merging Multiple Lists on Hierarchical-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 12(2), 1991.
Chap. 5 Part 2 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 Static work allocation Where work distribution is predetermined, but based on what? Typical scheme Divide n size data into P
More informationRecursive Sorts. Recursive Sorts. Divide-and-Conquer. Divide-and-Conquer. Divide-and-conquer paradigm:
Recursive Sorts Recursive Sorts Recursive sorts divide the data roughly in half and are called again on the smaller data sets. This is called the Divide-and-Conquer paradigm. We will see 2 recursive sorts:
More informationThe Limits of Sorting Divide-and-Conquer Comparison Sorts II
The Limits of Sorting Divide-and-Conquer Comparison Sorts II CS 311 Data Structures and Algorithms Lecture Slides Monday, October 12, 2009 Glenn G. Chappell Department of Computer Science University of
More informationNUMA replicated pagecache for Linux
NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations
More informationCS 470 Spring Mike Lam, Professor. Advanced OpenMP
CS 470 Spring 2018 Mike Lam, Professor Advanced OpenMP Atomics OpenMP provides access to highly-efficient hardware synchronization mechanisms Use the atomic pragma to annotate a single statement Statement
More informationCS 470 Spring Mike Lam, Professor. Advanced OpenMP
CS 470 Spring 2017 Mike Lam, Professor Advanced OpenMP Atomics OpenMP provides access to highly-efficient hardware synchronization mechanisms Use the atomic pragma to annotate a single statement Statement
More information17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer
Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are
More informationData Structures and Algorithms
Data Structures and Algorithms Autumn 2018-2019 Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Quicksort Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Quicksort Quicksort
More informationPerformance Tools for Technical Computing
Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology
More information8 Introduction to Distributed Computing
CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 8, 4/26/2017. Scribed by A. Santucci. 8 Introduction
More informationCOMP Data Structures
COMP 2140 - Data Structures Shahin Kamali Topic 5 - Sorting University of Manitoba Based on notes by S. Durocher. COMP 2140 - Data Structures 1 / 55 Overview Review: Insertion Sort Merge Sort Quicksort
More informationAlgorithms: Design & Practice
Algorithms: Design & Practice Deepak Kumar Bryn Mawr College Spring 2018 Course Essentials Algorithms Design & Practice How to design Learn some good ones How to implement practical considerations How
More informationIn-place Super Scalar. Tim Kralj. Samplesort
In-place Super Scalar Tim Kralj Samplesort Outline Quicksort Super Scalar Samplesort In-place Super Scalar Samplesort (IPS 4 o) Analysis Results Further work/questions Quicksort Finds pivots in the array
More informationAlgorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II
Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language
More informationKey question: how do we pick a good pivot (and what makes a good pivot in the first place)?
More on sorting Mergesort (v2) Quicksort Mergesort in place in action 53 2 44 85 11 67 7 39 14 53 87 11 50 67 2 14 44 53 80 85 87 14 87 80 50 29 72 95 2 44 80 85 7 29 39 72 95 Boxes with same color are
More informationBalanced Binary Search Trees. Victor Gao
Balanced Binary Search Trees Victor Gao OUTLINE Binary Heap Revisited BST Revisited Balanced Binary Search Trees Rotation Treap Splay Tree BINARY HEAP: REVIEW A binary heap is a complete binary tree such
More informationRandomized Algorithms, Quicksort and Randomized Selection
CMPS 2200 Fall 2017 Randomized Algorithms, Quicksort and Randomized Selection Carola Wenk Slides by Carola Wenk and Charles Leiserson CMPS 2200 Intro. to Algorithms 1 Deterministic Algorithms Runtime for
More informationMemory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005
Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.
More informationBuffer Heap Implementation & Evaluation. Hatem Nassrat. CSCI 6104 Instructor: N.Zeh Dalhousie University Computer Science
Buffer Heap Implementation & Evaluation Hatem Nassrat CSCI 6104 Instructor: N.Zeh Dalhousie University Computer Science Table of Contents Introduction...3 Cache Aware / Cache Oblivious Algorithms...3 Buffer
More informationIn multiprogramming systems, processes share a common store. Processes need space for:
Memory Management In multiprogramming systems, processes share a common store. Processes need space for: code (instructions) static data (compiler initialized variables, strings, etc.) global data (global
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationChapter 9 Memory Management
Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual
More informationSorting. Data structures and Algorithms
Sorting Data structures and Algorithms Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++ Goodrich, Tamassia and Mount (Wiley, 2004) Outline Bubble
More informationMerge Sort
Merge Sort 7 2 9 4 2 4 7 9 7 2 2 7 9 4 4 9 7 7 2 2 9 9 4 4 Divide-and-Conuer Divide-and conuer is a general algorithm design paradigm: n Divide: divide the input data S in two disjoint subsets S 1 and
More informationRun Times. Efficiency Issues. Run Times cont d. More on O( ) notation
Comp2711 S1 2006 Correctness Oheads 1 Efficiency Issues Comp2711 S1 2006 Correctness Oheads 2 Run Times An implementation may be correct with respect to the Specification Pre- and Post-condition, but nevertheless
More informationParallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 16 Treaps; Augmented BSTs
Lecture 16 Treaps; Augmented BSTs Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) Lectured by Margaret Reid-Miller 8 March 2012 Today: - More on Treaps - Ordered Sets and Tables
More informationPrinciple Of Parallel Algorithm Design (cont.) Alexandre David B2-206
Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction
More informationSorting. Bubble Sort. Pseudo Code for Bubble Sorting: Sorting is ordering a list of elements.
Sorting Sorting is ordering a list of elements. Types of sorting: There are many types of algorithms exist based on the following criteria: Based on Complexity Based on Memory usage (Internal & External
More informationScan Primitives for GPU Computing
Scan Primitives for GPU Computing Shubho Sengupta, Mark Harris *, Yao Zhang, John Owens University of California Davis, *NVIDIA Corporation Motivation Raw compute power and bandwidth of GPUs increasing
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationSorting. Divide-and-Conquer 1
Sorting Divide-and-Conquer 1 Divide-and-Conquer 7 2 9 4 2 4 7 9 7 2 2 7 9 4 4 9 7 7 2 2 9 9 4 4 Divide-and-Conquer 2 Divide-and-Conquer Divide-and conquer is a general algorithm design paradigm: Divide:
More informationThe Cost of Address Translation
The Cost of Address Translation Tomasz Jurkiewicz Kurt Mehlhorn Pat Nicholson Max Planck Institute for Informatics full version of paper by TJ and KM available at arxiv preliminary version presented at
More informationOverview: The OpenMP Programming Model
Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP
More informationCache-Oblivious Algorithms A Unified Approach to Hierarchical Memory Algorithms
Cache-Oblivious Algorithms A Unified Approach to Hierarchical Memory Algorithms Aarhus University Cache-Oblivious Current Trends Algorithms in Algorithms, - A Unified Complexity Approach to Theory, Hierarchical
More information