Parallelization on Multi-Core CPUs

Size: px
Start display at page:

Download "Parallelization on Multi-Core CPUs"

Transcription

1 1 / 30

2 Amdahl s Law suppose we parallelize an algorithm using n cores and p is the proportion of the task that can be parallelized (1 p cannot be parallelized) the speedup of the algorithm is assuming infinite parallelism, the speedup is 1 (1 p)+ p n 1 (1 p) for example, if 90% of the work is parallelized, the maximum speedup is only 10 one should make sure that eery phase of one s algorithm that depends on the input data size is parallelized 2 / 30

3 Parallelization Constructs and Libraries low-leel: C++ threads, pthreads (threads, mutexes, barriers, condition ariables) parallel patterns: parallel reduce, parallel for, fork/join parallelism parallel frameworks: TBB, OpenMP, Cilk Plus 3 / 30

4 Intel Thread Building Blocks Open Source library for parallelism and concurrency fairly nice for prototyping manages a pool of worker threads implements work stealing proides high-leel abstractions enables nested parallelism large systems (e.g., database systems) will hae their own framework 4 / 30

5 Thread-Local Storage in C++ ariables can be annotated as thread_local (each thread has its own copy) howeer, sometimes it would be conenient to access the thread-local state of other threads tbb::enumerable_thread_specific allows this 5 / 30

6 Parallel Reduce tbb :: parallel_ reduce ( tbb :: blocked_range < uint64_t >(0, n), // range 0ull, // identity [&]( const tbb :: blocked_range < uint64_t >& r, uint64_ t init ) { // accumulate for ( uint64_t i=r. begin (); i!=r. end (); i ++) init += array [i]; return init ; }, [] ( uint64_t x, uint64_t y) { return x+y; }); // combine tbb :: blocked_ range ( Value begin, Value end, size_ type grainsize =1); 6 / 30

7 Parallel For tbb :: parallel_for ( tbb :: blocked_range < uint64_t >(0, [&]( const tbb :: blocked_range < uint64_t >& r) { for ( uint64_t i=r. begin (); i!=r. end (); i ++) array [i] *= 2; }); n), 7 / 30

8 Partitioners parallel_for and parallel_reduce split the gien range to enable parallel execution there are multiple builtin partitioners: static_partitioner splits work equally among threads up-front (no dynamic work stealing) simple_partitioner splits the range as much as possible (e.g., until grainsize is reached) auto_partitioner heuristic similar to simple_partitioner, but tries to aoid creating too many ranges (default) 8 / 30

9 Fork/Join Parallelism sometimes the amount of work to parallelize is not known upfront fork/join allows one to perform work on other threads ( fork ), and then to wait until these tasks are finished ( join ) often recursie parallelism structure 9 / 30

10 Naie Merge Sort with Fork/Join (TBB) const ptrdiff_ t limit = 1024; template < class Iter > oid merge_ sort ( Iter first, Iter last ) { if ( last - first > limit ) { Iter middle = first + ( last - first ) / 2; tbb :: task_ group g; // alternatie : tbb :: parallel_ inoke g. run ([&]{ merge_sort ( first, middle ); } ); merge_sort ( middle, last ); g. wait (); std :: inplace_merge ( first, middle, last ); } else { merge_sort_serial ( first, last ); } } 10 / 30

11 Analysis What is the speedup for sorting n elements with infinite cores? serial execution: log 2 (n) n rough upper bound (using Amdahl s law): the final merge is serial: n n lower bound for fraction of serial part log 2 (n) n = 1 log 2 (n) 1 using Amdahl s law the maximum speedup is 1 = log 2 (n) log 2 (n) for example, if n = 2 20 the upper bound is log 2 (n) = 20 tighter upper bound: parallel execution: log 2 (n) 1 n i=0 2 = n + n i 2 + n 4 + < 2n for example, if n = 2 20 the upper bound is 20n 2n = 10 (both analyses assume that each leel recursion leel takes the same amount of time) 11 / 30

12 Speedup, n = speedup threads 12 / 30

13 Speedup with 10 Threads speedup data size [log scale] 13 / 30

14 Parallelization Oerhead, n = time [s] 0.10 threads limit [log scale] 14 / 30

15 Parallel Merge (1) template < typename It > oid parallelmerge ( It begin1, It end1, It begin2, It end2, It out ) { tbb :: parallel_ for ( ParallelMergeRange <It >( begin1, end1, begin2, end2, out ), [&]( ParallelMergeRange <It >& r) { std :: merge (r. begin1, r.end1, r. begin2, r.end2, r. out ); }, tbb :: simple_ partitioner ()); } template < typename It > struct ParallelMergeRange { // TBB range concept It begin1, end1, begin2, end2, out ; bool empty () const { return ( end1 - begin1 ) + ( end2 - begin2 )==0; } bool is_ diisible () const { return std :: min (end1 - begin1, end2 - begin2 ) > limit ; } ParallelMergeRange ( It begin1_, It end1_, It begin2_, It end2_, It out_ ) : begin1 ( begin1_ ), end1 ( end1_ ), begin2 ( begin2_ ), end2 ( end2_ ), out ( out_ ) {} 15 / 30

16 Parallel Merge (2) // Splitting constructor ( splits r into two subranges ) ParallelMergeRange ( ParallelMergeRange & r, tbb :: split ) { if ((r.end1 -r. begin1 ) < (r.end2 -r. begin2 )) { // first range should be the larger one std :: swap (r. begin1, r. begin2 ); std :: swap (r.end1, r. end2 ); } It m1 = r. begin1 + (r.end1 -r. begin1 )/2; It m2 = std :: lower_bound (r. begin2, r.end2, *m1 ); begin1 = m1; begin2 = m2; end1 = r. end1 ; end2 = r. end2 ; out = r. out + (m1 -r. begin1 ) + (m2 -r. begin2 ); r. end1 = m1; r. end2 = m2; } }; // struct ParallelMergeRange 16 / 30

17 Parallel Out-Of-Place Merge Sort template < class It > oid parallelmergesort ( It first, It last, It out, bool inplace = false ) { if (( last - first ) < limit ) { merge_sort_serial ( first, last ); if (! inplace ) std :: moe ( first, last, out ); } else { It mid = first + ( last - first )/2; It outmid = out + ( mid - first ); It outlast = out + ( last - first ); tbb :: parallel_ inoke ( [&]() { parallelmergesort ( first, mid, out,! inplace ); }, [&]() { parallelmergesort (mid, last, outmid,! inplace ); }); if ( inplace ) parallelmerge ( out, outmid, outmid, outlast, first ); else parallelmerge ( first, mid, mid, last, out ); } } 17 / 30

18 Scalability, n = M 75M items/s 50M 25M method in-place out-of-place threads 18 / 30

19 HyPer s Parallel Merge Sort 1. diide input data statically, each thread sorts its fraction 2. determine separators, compute output positions (prefix sums) 3. merge into output array in-place sort sort sort global 1/3 global 2/3 local 1/3 Compute global separators from the local separators local 2/3 merge merge merge 19 / 30

20 Pitfalls in Parallel Code non-scalable algorithm re-think algorithm load imbalance break work into smaller tasks, dynamically schedule these between threads task oerhead: managing tasks takes more time than the actual work set a minimum per-thread tasks size (not too small, not to large) 20 / 30

21 Volcano-Style Parallelism plan-drien approach: optimizer statically determines at query compile time how many threads should run instantiates one query operator plan for each thread connects these with exchange operators, which encapsulate parallelism and manage threads Elegant model which is used by many systems Xchg(3:1) r r r r XchgHashSplit(3:3) R R 1 R 2 R 3 21 / 30

22 Volcano-Style Parallelism (2) + operators are largely obliious to parallelism static work partitioning can cause load imbalances degree of parallelism cannot easily be changed mid-query oerhead: thread oersubscription causes context switching hash re-partitioning often does not pay off exchange operators create additional copies of the tuples 22 / 30

23 Morsel-Drien Query Execution (1) break input into constant-sized work units ( morsels ) dispatcher assigns morsels to worker threads # worker threads = # hardware threads operators are designed for parallel execution Z a Z b Result A B 16 8 A 27 B 10 C C y store store HT(T) B C 8 33 x 10 y 5 23 z u Dispatcher probe(8) probe(10) HT(S) A B morsel morsel probe(16) R A Z 16 a 7 c probe(27) 10 i 27 b 18 e 5 j 7 d 5 f 23 / 30

24 Pipeplines each pipeline is parallelized indiidually using all threads BB BA T S R 24 / 30

25 Pipeplines each pipeline is parallelized indiidually using all threads BB T S BA R Build HT(T) Pipe 1 Pipe 1 Pipe 1 Scan T Scan T Scan T 24 / 30

26 Pipeplines each pipeline is parallelized indiidually using all threads BB T BA Build HT(T) Build HT(S) Pipe 2 Pipe 2 Pipe 2 Scan S S R Scan S Scan S 24 / 30

27 Pipeplines each pipeline is parallelized indiidually using all threads Probe HT(T) BB Probe HT(T) Probe HT(T) Probe HT(S) Probe HT(S) T BA Build HT(T) Probe HT(T) Build HT(S) Probe HT(S) Probe HT(S) Pipe 3 Pipe 3 Scan R Pipe 3 Scan R Pipe 3 Scan R S R Scan R 24 / 30

28 Parallel Hash Table Construction Phase 2: scan NUMA-local storage area and insert pointers into HT Phase 1: process T morsel-wise and store NUMA-locally next morsel Storage area of red core global Hash Table T morsel Storage area of green core Storage area of blue core scan Insert the pointer into HT 25 / 30

29 Hash Tagging unused bits in pointers act as a cheap bloom filter hashtable bit tag for early filtering d 48 bit pointer e f 26 / 30

30 Aggregation/Group By parallel aggregation is one of the most difficult relational operators main challenge: behaes ery differently depending on whether there are few or many distinct keys morsel morsel next red morsel K V group group K 8 13 ht V ht K V spill when ht becomes full Partition 0 (12,7) (8,3) (41,4) (13,7) Partition 3 Partition 0 (8,9) (4,30) (13,14) (33,5) Partition 3 group group group group K HT V HT K V Result ptn 0 Result ptn 1 Phase 1: local pre-aggregation Phase 2: aggregate partition-wise 27 / 30

31 Window Functions: Semantics select location, time, ag(alue) oer (partition by location order by time range between interal 2 day preceding and interal 1 day following) from measurement order by frame partition by 28 / 30

32 Window Functions: Semantics hash partitioning (thread-local) thread 1 thread 2 combine hash groups sort/ealuation inter-partition parallelism 3.2. intra-partition parallelism 29 / 30

33 References Structured Parallel Programming: Patterns for Efficient Computation, McCool and Robison and Reinders, Morgan Kaufmann, 2012 Morsel-Drien Parallelism: A NUMA-Aware Query Ealuation Framework for the Many-Core Age, Leis and Boncz and Kemper and Neumann, SIGMOD 2014 Encapsulation of Parallelism in the Volcano Query Processing System, Graefe, SIGMOD / 30

Morsel- Drive Parallelism: A NUMA- Aware Query Evaluation Framework for the Many- Core Age. Presented by Dennis Grishin

Morsel- Drive Parallelism: A NUMA- Aware Query Evaluation Framework for the Many- Core Age. Presented by Dennis Grishin Morsel- Drive Parallelism: A NUMA- Aware Query Evaluation Framework for the Many- Core Age Presented by Dennis Grishin What is the problem? Efficient computation requires distribution of processing between

More information

Intel Thread Building Blocks, Part II

Intel Thread Building Blocks, Part II Intel Thread Building Blocks, Part II SPD course 2013-14 Massimo Coppola 25/03, 16/05/2014 1 TBB Recap Portable environment Based on C++11 standard compilers Extensive use of templates No vectorization

More information

Parallel Programming Principle and Practice. Lecture 7 Threads programming with TBB. Jin, Hai

Parallel Programming Principle and Practice. Lecture 7 Threads programming with TBB. Jin, Hai Parallel Programming Principle and Practice Lecture 7 Threads programming with TBB Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Outline Intel Threading

More information

Intel Thread Building Blocks

Intel Thread Building Blocks Intel Thread Building Blocks SPD course 2015-16 Massimo Coppola 08/04/2015 1 Thread Building Blocks : History A library to simplify writing thread-parallel programs and debugging them Originated circa

More information

Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age

Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age Viktor Leis Peter Boncz Alfons Kemper Thomas Neumann Technische Universität München {leis,kemper,neumann}@in.tum.de

More information

Multi-threaded Queries. Intra-Query Parallelism in LLVM

Multi-threaded Queries. Intra-Query Parallelism in LLVM Multi-threaded Queries Intra-Query Parallelism in LLVM Multithreaded Queries Intra-Query Parallelism in LLVM Yang Liu Tianqi Wu Hao Li Interpreted vs Compiled (LLVM) Interpreted vs Compiled (LLVM) Interpreted

More information

Intel Thread Building Blocks

Intel Thread Building Blocks Intel Thread Building Blocks SPD course 2017-18 Massimo Coppola 23/03/2018 1 Thread Building Blocks : History A library to simplify writing thread-parallel programs and debugging them Originated circa

More information

Structured Parallel Programming Patterns for Efficient Computation

Structured Parallel Programming Patterns for Efficient Computation Structured Parallel Programming Patterns for Efficient Computation Michael McCool Arch D. Robison James Reinders ELSEVIER AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO

More information

Structured Parallel Programming

Structured Parallel Programming Structured Parallel Programming Patterns for Efficient Computation Michael McCool Arch D. Robison James Reinders ELSEVIER AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO

More information

Table of Contents. Cilk

Table of Contents. Cilk Table of Contents 212 Introduction to Parallelism Introduction to Programming Models Shared Memory Programming Message Passing Programming Shared Memory Models Cilk TBB HPF Chapel Fortress Stapl PGAS Languages

More information

Efficiently Introduce Threading using Intel TBB

Efficiently Introduce Threading using Intel TBB Introduction This guide will illustrate how to efficiently introduce threading using Intel Threading Building Blocks (Intel TBB), part of Intel Parallel Studio XE. It is a widely used, award-winning C++

More information

Tasks and Threads. What? When? Tasks and Threads. Use OpenMP Threading Building Blocks (TBB) Intel Math Kernel Library (MKL)

Tasks and Threads. What? When? Tasks and Threads. Use OpenMP Threading Building Blocks (TBB) Intel Math Kernel Library (MKL) CGT 581I - Parallel Graphics and Simulation Knights Landing Tasks and Threads Bedrich Benes, Ph.D. Professor Department of Computer Graphics Purdue University Tasks and Threads Use OpenMP Threading Building

More information

A Primer on Scheduling Fork-Join Parallelism with Work Stealing

A Primer on Scheduling Fork-Join Parallelism with Work Stealing Doc. No.: N3872 Date: 2014-01-15 Reply to: Arch Robison A Primer on Scheduling Fork-Join Parallelism with Work Stealing This paper is a primer, not a proposal, on some issues related to implementing fork-join

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

Transactional Memory for C/C++ IBM XL C/C++ for Blue Gene/Q, V12.0 (technology preview)

Transactional Memory for C/C++ IBM XL C/C++ for Blue Gene/Q, V12.0 (technology preview) Transactional Memory for C/C++ IBM XL C/C++ for Blue Gene/Q, V12.0 (technology preiew) ii Transactional Memory for C/C++ Contents Chapter 1. Transactional memory.... 1 Chapter 2. Compiler option reference..

More information

Optimize an Existing Program by Introducing Parallelism

Optimize an Existing Program by Introducing Parallelism Optimize an Existing Program by Introducing Parallelism 1 Introduction This guide will help you add parallelism to your application using Intel Parallel Studio. You will get hands-on experience with our

More information

MULTI-THREADED QUERIES

MULTI-THREADED QUERIES 15-721 Project 3 Final Presentation MULTI-THREADED QUERIES Wendong Li (wendongl) Lu Zhang (lzhang3) Rui Wang (ruiw1) Project Objective Intra-operator parallelism Use multiple threads in a single executor

More information

The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.

The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism. Cilk Plus The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.) Developed originally by Cilk Arts, an MIT spinoff,

More information

University of Waterloo Midterm Examination Sample Solution

University of Waterloo Midterm Examination Sample Solution 1. (4 total marks) University of Waterloo Midterm Examination Sample Solution Winter, 2012 Suppose that a relational database contains the following large relation: Track(ReleaseID, TrackNum, Title, Length,

More information

Main-Memory Databases 1 / 25

Main-Memory Databases 1 / 25 1 / 25 Motivation Hardware trends Huge main memory capacity with complex access characteristics (Caches, NUMA) Many-core CPUs SIMD support in CPUs New CPU features (HTM) Also: Graphic cards, FPGAs, low

More information

Shared memory parallel computing. Intel Threading Building Blocks

Shared memory parallel computing. Intel Threading Building Blocks Shared memory parallel computing Intel Threading Building Blocks Introduction & history Threading Building Blocks (TBB) cross platform C++ template lib for task-based shared memory parallel programming

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Intel Thread Building Blocks, Part II

Intel Thread Building Blocks, Part II Intel Thread Building Blocks, Part II SPD course 2014-15 Massimo Coppola 5/05/2015 1 TBB Recap Portable environment Based on C++11 standard compilers Extensive use of templates No vectorization support

More information

Intel Threading Building Blocks (TBB)

Intel Threading Building Blocks (TBB) Intel Threading Building Blocks (TBB) SDSC Summer Institute 2012 Pietro Cicotti Computational Scientist Gordon Applications Team Performance Modeling and Characterization Lab Parallelism and Decomposition

More information

On the cost of managing data flow dependencies

On the cost of managing data flow dependencies On the cost of managing data flow dependencies - program scheduled by work stealing - Thierry Gautier, INRIA, EPI MOAIS, Grenoble France Workshop INRIA/UIUC/NCSA Outline Context - introduction of work

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Next-Generation Parallel Query

Next-Generation Parallel Query Next-Generation Parallel Query Robert Haas & Rafia Sabih 2013 EDB All rights reserved. 1 Overview v10 Improvements TPC-H Results TPC-H Analysis Thoughts for the Future 2017 EDB All rights reserved. 2 Parallel

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Effective Performance Measurement and Analysis of Multithreaded Applications

Effective Performance Measurement and Analysis of Multithreaded Applications Effective Performance Measurement and Analysis of Multithreaded Applications Nathan Tallent John Mellor-Crummey Rice University CSCaDS hpctoolkit.org Wanted: Multicore Programming Models Simple well-defined

More information

Performance Optimization Part 1: Work Distribution and Scheduling

Performance Optimization Part 1: Work Distribution and Scheduling Lecture 5: Performance Optimization Part 1: Work Distribution and Scheduling Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2017 Programming for high performance Optimizing the

More information

Guillimin HPC Users Meeting January 13, 2017

Guillimin HPC Users Meeting January 13, 2017 Guillimin HPC Users Meeting January 13, 2017 guillimin@calculquebec.ca McGill University / Calcul Québec / Compute Canada Montréal, QC Canada Please be kind to your fellow user meeting attendees Limit

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch Advanced Databases Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch www.mniazi.ir Parallel Join The join operation requires pairs of tuples to be

More information

Introduction to Concurrency

Introduction to Concurrency Introduction to Concurrency CSE 333 Summer 2018 Instructor: Hal Perkins Teaching Assistants: Renshu Gu William Kim Soumya Vasisht Administriia Last exercise due Monday Concurrency using pthreads hw4 due

More information

. Shared Memory Programming in OpenMP and Intel TBB. Kenjiro Taura. University of Tokyo

. Shared Memory Programming in OpenMP and Intel TBB. Kenjiro Taura. University of Tokyo .. Shared Memory Programming in OpenMP and Intel TBB Kenjiro Taura University of Tokyo 1 / 62 Today s topics. 1 What is shared memory programming?. 2 OpenMP OpenMP overview parallel and for pragma Data

More information

Intel(R) Threading Building Blocks

Intel(R) Threading Building Blocks Getting Started Guide Intel Threading Building Blocks is a runtime-based parallel programming model for C++ code that uses threads. It consists of a template-based runtime library to help you harness the

More information

Moore s Law. Multicore Programming. Vendor Solution. Power Density. Parallelism and Performance MIT Lecture 11 1.

Moore s Law. Multicore Programming. Vendor Solution. Power Density. Parallelism and Performance MIT Lecture 11 1. Moore s Law 1000000 Intel CPU Introductions 6.172 Performance Engineering of Software Systems Lecture 11 Multicore Programming Charles E. Leiserson 100000 10000 1000 100 10 Clock Speed (MHz) Transistors

More information

Operator Implementation Wrap-Up Query Optimization

Operator Implementation Wrap-Up Query Optimization Operator Implementation Wrap-Up Query Optimization 1 Last time: Nested loop join algorithms: TNLJ PNLJ BNLJ INLJ Sort Merge Join Hash Join 2 General Join Conditions Equalities over several attributes (e.g.,

More information

Query Processing Models

Query Processing Models Query Processing Models Holger Pirk Holger Pirk Query Processing Models 1 / 43 Purpose of this lecture By the end, you should Understand the principles of the different Query Processing Models Be able

More information

CMSC424: Database Design. Instructor: Amol Deshpande

CMSC424: Database Design. Instructor: Amol Deshpande CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons

More information

Task-based Data Parallel Programming

Task-based Data Parallel Programming Task-based Data Parallel Programming Asaf Yaffe Developer Products Division May, 2009 Agenda Overview Data Parallel Algorithms Tasks and Scheduling Synchronization and Concurrent Containers Summary 2 1

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

OPERATING SYSTEM. Chapter 4: Threads

OPERATING SYSTEM. Chapter 4: Threads OPERATING SYSTEM Chapter 4: Threads Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples Objectives To

More information

CS201 - Introduction to Programming Glossary By

CS201 - Introduction to Programming Glossary By CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with

More information

Divide and Conquer Algorithms

Divide and Conquer Algorithms Divide and Conquer Algorithms T. M. Murali February 19, 2009 Divide and Conquer Break up a problem into several parts. Solve each part recursively. Solve base cases by brute force. Efficiently combine

More information

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming F 'C 3 R'"'C,_,. HO!.-IJJ () An Introduction to Parallel Programming Peter S. Pacheco University of San Francisco ELSEVIER AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO

More information

Compsci 590.3: Introduction to Parallel Computing

Compsci 590.3: Introduction to Parallel Computing Compsci 590.3: Introduction to Parallel Computing Alvin R. Lebeck Slides based on this from the University of Oregon Admin Logistics Homework #3 Use script Project Proposals Document: see web site» Due

More information

Performance Optimization Part 1: Work Distribution and Scheduling

Performance Optimization Part 1: Work Distribution and Scheduling ( how to be l33t ) Lecture 6: Performance Optimization Part 1: Work Distribution and Scheduling Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Kaleo Way Down We Go

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

Optimizing Digital Audio Cross-Point Matrix for Desktop Processors Using Parallel Processing and SIMD Technology

Optimizing Digital Audio Cross-Point Matrix for Desktop Processors Using Parallel Processing and SIMD Technology Optimizing Digital Audio Cross-Point Matrix for Desktop Processors Using Parallel Processing and SIMD Technology JIRI SCHIMME Department of Telecommunications EEC Brno University of Technology Purkynova

More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced GLADE: A Scalable Framework for Efficient Analytics Florin Rusu University of California, Merced Motivation and Objective Large scale data processing Map-Reduce is standard technique Targeted to distributed

More information

QuickSort

QuickSort QuickSort 7 4 9 6 2 2 4 6 7 9 4 2 2 4 7 9 7 9 2 2 9 9 1 QuickSort QuickSort on an input sequence S with n elements consists of three steps: n n n Divide: partition S into two sequences S 1 and S 2 of about

More information

Algorithm for siftdown(int currentposition) while true (infinite loop) do if the currentposition has NO children then return

Algorithm for siftdown(int currentposition) while true (infinite loop) do if the currentposition has NO children then return 0. How would we write the BinaryHeap siftdown function recursively? [0] 6 [1] [] 15 10 Name: template class BinaryHeap { private: int maxsize; int numitems; T * heap;... [3] [4] [5] [6] 114 0

More information

Introduction to Concurrency

Introduction to Concurrency Introduction to Concurrency CSE 333 Autumn 2018 Instructor: Hal Perkins Teaching Assistants: Tarkan Al-Kazily Renshu Gu Trais McGaha Harshita Neti Thai Pham Forrest Timour Soumya Vasisht Yifan Xu Administriia

More information

Query Processing. Solutions to Practice Exercises Query:

Query Processing. Solutions to Practice Exercises Query: C H A P T E R 1 3 Query Processing Solutions to Practice Exercises 13.1 Query: Π T.branch name ((Π branch name, assets (ρ T (branch))) T.assets>S.assets (Π assets (σ (branch city = Brooklyn )(ρ S (branch)))))

More information

Chapter 4: Threads. Chapter 4: Threads

Chapter 4: Threads. Chapter 4: Threads Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 20: Parallel Databases. Introduction

Chapter 20: Parallel Databases. Introduction Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

CSCI Trees. Mark Redekopp David Kempe

CSCI Trees. Mark Redekopp David Kempe CSCI 104 2-3 Trees Mark Redekopp David Kempe Trees & Maps/Sets C++ STL "maps" and "sets" use binary search trees internally to store their keys (and values) that can grow or contract as needed This allows

More information

4.8 Summary. Practice Exercises

4.8 Summary. Practice Exercises Practice Exercises 191 structures of the parent process. A new task is also created when the clone() system call is made. However, rather than copying all data structures, the new task points to the data

More information

Query Optimization. Chapter 13

Query Optimization. Chapter 13 Query Optimization Chapter 13 What we want to cover today Overview of query optimization Generating equivalent expressions Cost estimation 432/832 2 Chapter 13 Query Optimization OVERVIEW 432/832 3 Query

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin

More information

Databases IIB: DBMS-Implementation Exercise Sheet 13

Databases IIB: DBMS-Implementation Exercise Sheet 13 Prof. Dr. Stefan Brass January 27, 2017 Institut für Informatik MLU Halle-Wittenberg Databases IIB: DBMS-Implementation Exercise Sheet 13 As requested by the students, the repetition questions a) will

More information

Lecture 6: Divide-and-Conquer

Lecture 6: Divide-and-Conquer Lecture 6: Divide-and-Conquer COSC242: Algorithms and Data Structures Brendan McCane Department of Computer Science, University of Otago Types of Algorithms In COSC242, we will be looking at 3 general

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Chapter 4: Multithreaded Programming

Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading

More information

COEN244: Class & function templates

COEN244: Class & function templates COEN244: Class & function templates Aishy Amer Electrical & Computer Engineering Templates Function Templates Class Templates Outline Templates and inheritance Introduction to C++ Standard Template Library

More information

CS 240A: Shared Memory & Multicore Programming with Cilk++

CS 240A: Shared Memory & Multicore Programming with Cilk++ CS 240A: Shared Memory & Multicore rogramming with Cilk++ Multicore and NUMA architectures Multithreaded rogramming Cilk++ as a concurrency platform Work and Span Thanks to Charles E. Leiserson for some

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Chapter 12 Outline Overview of Object Database Concepts Object-Relational Features Object Database Extensions to SQL ODMG Object Model and the Object Definition Language ODL Object Database Conceptual

More information

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Tao Chen, Shreesha Srinath Christopher Batten, G. Edward Suh Computer Systems Laboratory School of Electrical

More information

Shared Memory Programming. Parallel Programming Overview

Shared Memory Programming. Parallel Programming Overview Shared Memory Programming Arvind Krishnamurthy Fall 2004 Parallel Programming Overview Basic parallel programming problems: 1. Creating parallelism & managing parallelism Scheduling to guarantee parallelism

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

The Cut and Thrust of CUDA

The Cut and Thrust of CUDA The Cut and Thrust of CUDA Luke Hodkinson Center for Astrophysics and Supercomputing Swinburne University of Technology Melbourne, Hawthorn 32000, Australia May 16, 2013 Luke Hodkinson The Cut and Thrust

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

C++ Smart Pointers. CSE 333 Autumn 2018

C++ Smart Pointers. CSE 333 Autumn 2018 C++ Smart Pointers CSE 333 Autumn 2018 Instructor: Hal Perkins Teaching Assistants: Tarkan Al-Kazily Renshu Gu Trais McGaha Harshita Neti Thai Pham Forrest Timour Soumya Vasisht Yifan Xu Administriia New

More information

Concepts in. Programming. The Multicore- Software Challenge. MIT Professional Education 6.02s Lecture 1 June 8 9, 2009

Concepts in. Programming. The Multicore- Software Challenge. MIT Professional Education 6.02s Lecture 1 June 8 9, 2009 Concepts in Multicore Programming The Multicore- Software Challenge MIT Professional Education 6.02s Lecture 1 June 8 9, 2009 2009 Charles E. Leiserson 1 Cilk, Cilk++, and Cilkscreen, are trademarks of

More information

IBM. Systems management Logical partitions. System i. Version 6 Release 1

IBM. Systems management Logical partitions. System i. Version 6 Release 1 IBM System i Systems management Logical partitions Version 6 Release 1 IBM System i Systems management Logical partitions Version 6 Release 1 Note Before using this information and the product it supports,

More information

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information

More information

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution CSE 544 Principles of Database Management Systems Magdalena Balazinska Fall 2007 Lecture 7 - Query execution References Generalized Search Trees for Database Systems. J. M. Hellerstein, J. F. Naughton

More information

Evaluation of Relational Operations: Other Techniques

Evaluation of Relational Operations: Other Techniques Evaluation of Relational Operations: Other Techniques [R&G] Chapter 14, Part B CS4320 1 Using an Index for Selections Cost depends on #qualifying tuples, and clustering. Cost of finding qualifying data

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques 376a. Database Design Dept. of Computer Science Vassar College http://www.cs.vassar.edu/~cs376 Class 16 Query optimization What happens Database is given a query Query is scanned - scanner creates a list

More information

Team SimpleMind. Ismail Oukid (TU Dresden), Ingo Müller (KIT), Iraklis Psaroudakis (EPFL) ACM SIGMOD 2015 Programming SIGMOD 2015 (June 2)

Team SimpleMind. Ismail Oukid (TU Dresden), Ingo Müller (KIT), Iraklis Psaroudakis (EPFL) ACM SIGMOD 2015 Programming SIGMOD 2015 (June 2) Team SimpleMind Ismail Oukid (TU Dresden), Ingo Müller (KIT), Iraklis Psaroudakis (EPFL) ACM SIGMOD 2015 Programming Contest @ SIGMOD 2015 (June 2) Public Agenda Programming Contest Overview Transaction

More information

Chapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS.

Chapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS. Chapter 18 Strategies for Query Processing We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS. 1 1. Translating SQL Queries into Relational Algebra and Other Operators - SQL is

More information

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger Parallel Programming Concepts Parallel Algorithms Peter Tröger Sources: Ian Foster. Designing and Building Parallel Programs. Addison-Wesley. 1995. Mattson, Timothy G.; S, Beverly A.; ers,; Massingill,

More information

Efficient Work Stealing for Fine-Grained Parallelism

Efficient Work Stealing for Fine-Grained Parallelism Efficient Work Stealing for Fine-Grained Parallelism Karl-Filip Faxén Swedish Institute of Computer Science November 26, 2009 Task parallel fib in Wool TASK 1( int, fib, int, n ) { if( n

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

INFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome!

INFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome! INFOGR Computer Graphics J. Bikker - April-July 2015 - Lecture 11: Acceleration Welcome! Today s Agenda: High-speed Ray Tracing Acceleration Structures The Bounding Volume Hierarchy BVH Construction BVH

More information

Robinhood: Towards Efficient Work-stealing in Virtualized Environments

Robinhood: Towards Efficient Work-stealing in Virtualized Environments 1 : Towards Efficient Work-stealing in Virtualized Enironments Yaqiong Peng, Song Wu, Member, IEEE, Hai Jin, Senior Member, IEEE Abstract Work-stealing, as a common user-leel task scheduler for managing

More information

SFDV3006 Concurrent Programming

SFDV3006 Concurrent Programming SFDV3006 Concurrent Programming Lecture 6 Concurrent Architecture Concurrent Architectures Software architectures identify software components and their interaction Architectures are process structures

More information

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH 1 INTRODUCTION In centralized database: Data is located in one place (one server) All DBMS functionalities are done by that server

More information