Parallelization on Multi-Core CPUs
Amdahl's Law
- suppose we parallelize an algorithm using n cores and p is the proportion of the task that can be parallelized (1 - p cannot be parallelized)
- the speedup of the algorithm is 1 / ((1 - p) + p/n)
- assuming infinite parallelism, the speedup is 1 / (1 - p)
- for example, if 90% of the work is parallelized, the maximum speedup is only 10
- one should make sure that every phase of one's algorithm that depends on the input data size is parallelized
Parallelization Constructs and Libraries
- low-level: C++ threads, pthreads (threads, mutexes, barriers, condition variables)
- parallel patterns: parallel reduce, parallel for, fork/join parallelism
- parallel frameworks: TBB, OpenMP, Cilk Plus
Intel Thread Building Blocks
- open-source library for parallelism and concurrency
- fairly nice for prototyping
- manages a pool of worker threads, implements work stealing
- provides high-level abstractions, enables nested parallelism
- large systems (e.g., database systems) will have their own framework
Thread-Local Storage in C++
- variables can be annotated as thread_local (each thread has its own copy)
- however, sometimes it would be convenient to access the thread-local state of other threads
- tbb::enumerable_thread_specific allows this
Parallel Reduce

    tbb::parallel_reduce(
        tbb::blocked_range<uint64_t>(0, n),  // range
        0ull,                                // identity
        [&](const tbb::blocked_range<uint64_t>& r, uint64_t init) {  // accumulate
            for (uint64_t i = r.begin(); i != r.end(); i++)
                init += array[i];
            return init;
        },
        [](uint64_t x, uint64_t y) { return x + y; });  // combine

    tbb::blocked_range(Value begin, Value end, size_type grainsize = 1);
Parallel For

    tbb::parallel_for(
        tbb::blocked_range<uint64_t>(0, n),
        [&](const tbb::blocked_range<uint64_t>& r) {
            for (uint64_t i = r.begin(); i != r.end(); i++)
                array[i] *= 2;
        });
Partitioners
- parallel_for and parallel_reduce split the given range to enable parallel execution
- there are multiple built-in partitioners:
  - static_partitioner splits work equally among threads up-front (no dynamic work stealing)
  - simple_partitioner splits the range as much as possible (e.g., until grainsize is reached)
  - auto_partitioner heuristic similar to simple_partitioner, but tries to avoid creating too many ranges (default)
Fork/Join Parallelism
- sometimes the amount of work to parallelize is not known upfront
- fork/join allows one to perform work on other threads (fork), and then to wait until these tasks are finished (join)
- often a recursive parallelism structure
Naive Merge Sort with Fork/Join (TBB)

    const ptrdiff_t limit = 1024;

    template <class Iter>
    void merge_sort(Iter first, Iter last) {
        if (last - first > limit) {
            Iter middle = first + (last - first) / 2;
            tbb::task_group g;  // alternative: tbb::parallel_invoke
            g.run([&]{ merge_sort(first, middle); });
            merge_sort(middle, last);
            g.wait();
            std::inplace_merge(first, middle, last);
        } else {
            merge_sort_serial(first, last);
        }
    }
Analysis
- what is the speedup for sorting n elements with infinite cores?
- serial execution: log2(n) * n
- rough upper bound (using Amdahl's law): the final merge is serial, so n / (log2(n) * n) = 1/log2(n) is a lower bound for the fraction of the serial part; by Amdahl's law the maximum speedup is 1 / (1/log2(n)) = log2(n)
  - for example, if n = 2^20 the upper bound is log2(n) = 20
- tighter upper bound: the parallel execution takes sum over i from 0 to log2(n)-1 of n/2^i = n + n/2 + n/4 + ... < 2n
  - for example, if n = 2^20 the upper bound is log2(n)*n / 2n = 10
- (both analyses assume that each recursion level takes the same amount of time)
Speedup
[plot: speedup vs. number of threads]
Speedup with 10 Threads
[plot: speedup vs. data size, log scale]
Parallelization Overhead
[plot: time in seconds vs. the cutoff constant limit, log scale]
Parallel Merge (1)

    template <typename It>
    void parallelmerge(It begin1, It end1, It begin2, It end2, It out) {
        tbb::parallel_for(
            ParallelMergeRange<It>(begin1, end1, begin2, end2, out),
            [&](ParallelMergeRange<It>& r) {
                std::merge(r.begin1, r.end1, r.begin2, r.end2, r.out);
            },
            tbb::simple_partitioner());
    }

    template <typename It>
    struct ParallelMergeRange {  // TBB range concept
        It begin1, end1, begin2, end2, out;
        bool empty() const { return (end1 - begin1) + (end2 - begin2) == 0; }
        bool is_divisible() const {
            return std::min(end1 - begin1, end2 - begin2) > limit;
        }
        ParallelMergeRange(It begin1_, It end1_, It begin2_, It end2_, It out_)
            : begin1(begin1_), end1(end1_), begin2(begin2_),
              end2(end2_), out(out_) {}
Parallel Merge (2)

        // Splitting constructor (splits r into two subranges)
        ParallelMergeRange(ParallelMergeRange& r, tbb::split) {
            if ((r.end1 - r.begin1) < (r.end2 - r.begin2)) {
                // first range should be the larger one
                std::swap(r.begin1, r.begin2);
                std::swap(r.end1, r.end2);
            }
            It m1 = r.begin1 + (r.end1 - r.begin1) / 2;
            It m2 = std::lower_bound(r.begin2, r.end2, *m1);
            begin1 = m1;
            begin2 = m2;
            end1 = r.end1;
            end2 = r.end2;
            out = r.out + (m1 - r.begin1) + (m2 - r.begin2);
            r.end1 = m1;
            r.end2 = m2;
        }
    };  // struct ParallelMergeRange
Parallel Out-Of-Place Merge Sort

    template <class It>
    void parallelmergesort(It first, It last, It out, bool inplace = false) {
        if ((last - first) < limit) {
            merge_sort_serial(first, last);
            if (!inplace)
                std::move(first, last, out);
        } else {
            It mid = first + (last - first) / 2;
            It outmid = out + (mid - first);
            It outlast = out + (last - first);
            tbb::parallel_invoke(
                [&]() { parallelmergesort(first, mid, out, !inplace); },
                [&]() { parallelmergesort(mid, last, outmid, !inplace); });
            if (inplace)
                parallelmerge(out, outmid, outmid, outlast, first);
            else
                parallelmerge(first, mid, mid, last, out);
        }
    }
Scalability
[plot: throughput in items/s (25M-100M) vs. number of threads, for the in-place and out-of-place methods]
HyPer's Parallel Merge Sort
1. divide the input data statically, each thread sorts its fraction
2. determine separators, compute output positions (prefix sums)
3. merge into the output array
[diagram: each thread sorts its run in place; global separators (global 1/3, global 2/3) are computed from the local separators; each thread then merges one range into the output]
Pitfalls in Parallel Code
- non-scalable algorithm: re-think the algorithm
- load imbalance: break work into smaller tasks, dynamically schedule these between threads
- task overhead: managing tasks takes more time than the actual work; set a minimum per-thread task size (not too small, not too large)
Volcano-Style Parallelism
- plan-driven approach: the optimizer statically determines at query compile time how many threads should run
- instantiates one query operator plan for each thread
- connects these with exchange operators, which encapsulate parallelism and manage threads
- elegant model which is used by many systems
[plan diagram: an Xchg(3:1) operator above an XchgHashSplit(3:3) operator over the partitions R1, R2, R3]
Volcano-Style Parallelism (2)
+ operators are largely oblivious to parallelism
- static work partitioning can cause load imbalances
- degree of parallelism cannot easily be changed mid-query
- overhead:
  - thread oversubscription causes context switching
  - hash re-partitioning often does not pay off
  - exchange operators create additional copies of the tuples
Morsel-Driven Query Execution (1)
- break the input into constant-sized work units ("morsels")
- a dispatcher assigns morsels to worker threads
- # worker threads = # hardware threads
- operators are designed for parallel execution
[diagram: the dispatcher hands morsels of R to worker threads, which probe HT(S) and HT(T) and store the result tuples]
Pipelines
- each pipeline is parallelized individually using all threads
[diagram, built up over four slides: pipeline 1 scans T and builds HT(T); pipeline 2 scans S and builds HT(S); pipeline 3 scans R, probes HT(S), then probes HT(T)]
Parallel Hash Table Construction
- phase 1: process T morsel-wise and store the tuples NUMA-locally
- phase 2: scan the NUMA-local storage areas and insert pointers into the global hash table
[diagram: each core has its own storage area; pointers into these areas are inserted into the global hash table]
Hash Tagging
- unused bits in pointers act as a cheap Bloom filter
[diagram: a hash table entry consists of a 16-bit tag for early filtering and a 48-bit pointer to the chain of entries]
Aggregation/Group By
- parallel aggregation is one of the most difficult relational operators
- main challenge: it behaves very differently depending on whether there are few or many distinct keys
- phase 1: local pre-aggregation (each thread aggregates its morsels into a thread-local hash table; spill into partitions when the table becomes full)
- phase 2: aggregate partition-wise
[diagram: threads pre-aggregate morsels into local hash tables and spill to partitions 0..3; each thread then aggregates one partition into the result]
Window Functions: Semantics

    select location, time,
           avg(value) over (partition by location
                            order by time
                            range between interval 2 day preceding
                                      and interval 1 day following)
    from measurement

[diagram: the frame, order by, and partition by clauses illustrated on the measurement data]
Window Functions: Parallelism
1. hash partitioning (thread-local)
2. combine hash groups
3. sort/evaluation
- uses both inter-partition and intra-partition parallelism
[diagram: threads 1 and 2 hash-partition their input locally, the hash groups are combined, then sorted and evaluated]
References
- Structured Parallel Programming: Patterns for Efficient Computation, McCool, Robison, and Reinders, Morgan Kaufmann, 2012
- Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age, Leis, Boncz, Kemper, and Neumann, SIGMOD 2014
- Encapsulation of Parallelism in the Volcano Query Processing System, Graefe, SIGMOD 1990
More informationEvaluation of Relational Operations: Other Techniques
Evaluation of Relational Operations: Other Techniques [R&G] Chapter 14, Part B CS4320 1 Using an Index for Selections Cost depends on #qualifying tuples, and clustering. Cost of finding qualifying data
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationWhat happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques
376a. Database Design Dept. of Computer Science Vassar College http://www.cs.vassar.edu/~cs376 Class 16 Query optimization What happens Database is given a query Query is scanned - scanner creates a list
More informationTeam SimpleMind. Ismail Oukid (TU Dresden), Ingo Müller (KIT), Iraklis Psaroudakis (EPFL) ACM SIGMOD 2015 Programming SIGMOD 2015 (June 2)
Team SimpleMind Ismail Oukid (TU Dresden), Ingo Müller (KIT), Iraklis Psaroudakis (EPFL) ACM SIGMOD 2015 Programming Contest @ SIGMOD 2015 (June 2) Public Agenda Programming Contest Overview Transaction
More informationChapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS.
Chapter 18 Strategies for Query Processing We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS. 1 1. Translating SQL Queries into Relational Algebra and Other Operators - SQL is
More informationParallel Programming Concepts. Parallel Algorithms. Peter Tröger
Parallel Programming Concepts Parallel Algorithms Peter Tröger Sources: Ian Foster. Designing and Building Parallel Programs. Addison-Wesley. 1995. Mattson, Timothy G.; S, Beverly A.; ers,; Massingill,
More informationEfficient Work Stealing for Fine-Grained Parallelism
Efficient Work Stealing for Fine-Grained Parallelism Karl-Filip Faxén Swedish Institute of Computer Science November 26, 2009 Task parallel fib in Wool TASK 1( int, fib, int, n ) { if( n
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationQuery Processing & Optimization
Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction
More informationINFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome!
INFOGR Computer Graphics J. Bikker - April-July 2015 - Lecture 11: Acceleration Welcome! Today s Agenda: High-speed Ray Tracing Acceleration Structures The Bounding Volume Hierarchy BVH Construction BVH
More informationRobinhood: Towards Efficient Work-stealing in Virtualized Environments
1 : Towards Efficient Work-stealing in Virtualized Enironments Yaqiong Peng, Song Wu, Member, IEEE, Hai Jin, Senior Member, IEEE Abstract Work-stealing, as a common user-leel task scheduler for managing
More informationSFDV3006 Concurrent Programming
SFDV3006 Concurrent Programming Lecture 6 Concurrent Architecture Concurrent Architectures Software architectures identify software components and their interaction Architectures are process structures
More informationPARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH
PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH 1 INTRODUCTION In centralized database: Data is located in one place (one server) All DBMS functionalities are done by that server
More information