DATABASE CRACKING: Fancy Scan, not Poor Man s Sort! Don. Holger Pirk Eleni Petraki Strato Idreos

Size: px
Start display at page:

Download "DATABASE CRACKING: Fancy Scan, not Poor Man s Sort! Don. Holger Pirk Eleni Petraki Strato Idreos"

Transcription

1 DATABASE CRACKING: Fancy Scan, not Poor Man s Sort! Hardware Folks Cracking Folks Don Holger Pirk Eleni Petraki Strato Idreos Stefan Manegold Martin Kersten

2 EVALUATING RANGE PREDICATES

3 COMPLEXITY ON PAPER Scanning: O(n) Sorting: O(n log(n)) Cracking: O(n) Essentially a single Quicksort-Step

4 COSTS IN REALITY Implement microbenchmarks 1 Billion uniform random integer values Pivot in the middle of the range Workstation machine (16 GB RAM, 4 Sandy Bridge Cores)

5 COSTS IN REALITY Wallclock time in s Parallel Scanning Cracking Parallel Sorting

6 SO: WHAT S GOING ON?

7 CACHE MISSES? 1.5B 1.4B 1.2B 1.0B 800M 600M 400M 200M L1I Misses L1D Misses L2 Misses L3 Misses NOPE! 0.0 Scanning Cracking Sorting

8 CPU COSTS Micro-ops Issued? No Yes Allocation Stall? Micro-op Ever Retire? No Yes No Yes Frontend Bound Backend Bound Bad Speculation " " # Retiring! Cache Miss Stalls Other Stalls

9 CPU COSTS Data Stalls Retiring 1.0 Bad Speculation Pipeline Frontend Pipeline Backend Scanning Cracking Sorting

10 CPU COSTS Data Stalls Retiring Bad Speculation Pipeline Frontend Pipeline Backend 14 %!!! Scanning Cracking Sorting

11 CPU COSTS Data Stalls Retiring 1.0 Bad Speculation Pipeline Frontend Pipeline Backend Lots of Potential Scanning Cracking Sorting

12 WHAT CAN WE DO ABOUT IT?

13 INCREASING CPU EFFICIENCY

14 PREDICATION for(i=0; i<size; i++)! if(input[i] < pivot) {! output[outi] = input[i];! outi++! } for(i=0; i<size; i++)! {! output[outi] = input[i];! outi += (input[i] < pivot);! }

15 PREDICATION Turns control dependencies into data dependencies Eliminates Branch Mispredictions Causes unconditional (potentially unnecessary) I/O (limited to caches) Works only for out-of-place algorithms

16 PREDICATED CRACKING

17 PREDICATED CRACKING pivot 5 active backup

18 PREDICATED CRACKING pivot active backup

19 PREDICATED CRACKING pivot cmp active backup 5? State Before Iteration

20 PREDICATED CRACKING pivot cmp active backup > Evaluate Predicat & Write

21 PREDICATED CRACKING pivot cmp active backup = 1- = Advance Cursor

22 PREDICATED CRACKING pivot cmp active backup * + * Read Next Element

23 PREDICATED CRACKING pivot cmp backup active

24 PREDICATED CRACKING Predication for in-place algorithms No branching No branch mispredictions Somewhat intricate Lots of copying stuff around (integer granularity inefficient) Bulk-copying would be more efficient

25 VECTORIZED CRACKING

26 VECTORIZED CRACKING Turns in-place cracking into out-of-place cracking Copies Vector-sized chunks and cracks them into the array Makes vanilla-predication possible Uses SIMD-copying for vector copying Challenge: ensure that values aren't accidentally" overwritten

27 VECTORIZED CRACKING copy partition copy partition

28 RESULTS

29 RESULTS Data Stalls Retiring 1.0 Bad Speculation Pipeline Frontend Pipeline Backend Vectorized Predicated Original

30 RESULTS: WORKSTATION Wallclock time in s Scan Vectorized Predicated (Register) Predicated (Cache) Original

31 RESULTS: SERVER Wallclock time in s Not there yet! 0.0 Scan Vectorized Predicated (Register) Predicated (Cache) Original

32 PARALLELIZATION

33 PARALLELIZATION Obvious Solution: Partitioning

34 CRACK & MERGE x1 y1x2 y2x3 y3x4 y4 Partition

35 CRACK & MERGE x1 y1x2 y2x3 y3x4 y4 Merge

36 REFINED CRACK & MERGE x1 x2 x3 x4 y4 y3 y2 y1 Partition

37 REFINED CRACK & MERGE x1 x2 x3 x4 y4 y3 y2 y1 Smaller Merge

38 RESULTS: WORKSTATION 1,6 1,2 Seconds 0,8 0,4 0 Scan RVPCrack RPCrack PVCrack PCrack Vectorized

39 RESULTS: SERVER 3,00 2,25 Seconds 1,50 0,75 0,00 Scan RVPCrack RPCrack PVCrack PCrack Vectorized

40 IMPACT OF SELECTIVITY: WORKSTATION Wallclock time in s Vectorized Partition & Merge Vectorized Partition & Merge Refined Partition & Merge Vectorized Refined Partition & Merge Scanning Qualifying Tuples/Pivot

41 IMPACT OF SELECTIVITY: SERVER 2.6 Wallclock time in s Vectorized Partition & Merge Vectorized Partition & Merge Refined Partition & Merge Vectorized Refined Partition & Merge Scanning Qualifying Tuples/Pivot

42 CONCLUSIONS

43

Accelerating Foreign-Key Joins using Asymmetric Memory Channels

Accelerating Foreign-Key Joins using Asymmetric Memory Channels Accelerating Foreign-Key Joins using Asymmetric Memory Channels Holger Pirk Stefan Manegold Martin Kersten holger@cwi.nl manegold@cwi.nl mk@cwi.nl Why? Trivia: Joins are important But: Many Joins are (Indexed)

More information

class 10 fast scans 2.0 prof. Stratos Idreos

class 10 fast scans 2.0 prof. Stratos Idreos class 10 fast scans 2.0 prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ always want to minimize data movement - computation & utilize all resources! registers on chip cache on board

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query

More information

Register Allocation. Stanford University CS243 Winter 2006 Wei Li 1

Register Allocation. Stanford University CS243 Winter 2006 Wei Li 1 Register Allocation Wei Li 1 Register Allocation Introduction Problem Formulation Algorithm 2 Register Allocation Goal Allocation of variables (pseudo-registers) in a procedure to hardware registers Directly

More information

IS 709/809: Computational Methods in IS Research. Algorithm Analysis (Sorting)

IS 709/809: Computational Methods in IS Research. Algorithm Analysis (Sorting) IS 709/809: Computational Methods in IS Research Algorithm Analysis (Sorting) Nirmalya Roy Department of Information Systems University of Maryland Baltimore County www.umbc.edu Sorting Problem Given an

More information

Holistic Indexing in Main-memory Column-stores

Holistic Indexing in Main-memory Column-stores Holistic Indexing in Main-memory Column-stores Eleni Petraki CWI Amsterdam petraki@cwi.nl Stratos Idreos Harvard University stratos@seas.harvard.edu Stefan Manegold CWI Amsterdam manegold@cwi.nl ABSTRACT

More information

Sorting Algorithms. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Sorting Algorithms. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University Sorting Algorithms CptS 223 Advanced Data Structures Larry Holder School of Electrical Engineering and Computer Science Washington State University 1 QuickSort Divide-and-conquer approach to sorting Like

More information

Query Processing Models

Query Processing Models Query Processing Models Holger Pirk Holger Pirk Query Processing Models 1 / 43 Purpose of this lecture By the end, you should Understand the principles of the different Query Processing Models Be able

More information

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack

More information

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops: Relational Query Optimization R & G Chapter 13 Review Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops: simple, exploits extra memory

More information

Information Coding / Computer Graphics, ISY, LiTH

Information Coding / Computer Graphics, ISY, LiTH Sorting on GPUs Revisiting some algorithms from lecture 6: Some not-so-good sorting approaches Bitonic sort QuickSort Concurrent kernels and recursion Adapt to parallel algorithms Many sorting algorithms

More information

Memory Management. Goals of Memory Management. Mechanism. Policies

Memory Management. Goals of Memory Management. Mechanism. Policies Memory Management Design, Spring 2011 Department of Computer Science Rutgers Sakai: 01:198:416 Sp11 (https://sakai.rutgers.edu) Memory Management Goals of Memory Management Convenient abstraction for programming

More information

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known

More information

Lecture 6: Static ILP

Lecture 6: Static ILP Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G) Assignment 2 posted; due in a week 1 Loop Dependences If a loop only has dependences within

More information

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering Portland State University ECE 587/687 Memory Ordering Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2018 Handling Memory Operations Review pipeline for out of order, superscalar processors

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 9

ECE 571 Advanced Microprocessor-Based Design Lecture 9 ECE 571 Advanced Microprocessor-Based Design Lecture 9 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 30 September 2014 Announcements Next homework coming soon 1 Bulldozer Paper

More information

Intel released new technology call P6P

Intel released new technology call P6P P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Software and Tools for HPE s The Machine Project

Software and Tools for HPE s The Machine Project Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Data Structures and Algorithms

Data Structures and Algorithms Data Structures and Algorithms Autumn 2018-2019 Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Quicksort Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Quicksort Quicksort

More information

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering dministration CS 1/13 Introduction to Compilers and Translators ndrew Myers Cornell University P due in 1 week Optional reading: Muchnick 17 Lecture 30: Instruction scheduling 1 pril 00 1 Impact of instruction

More information

Performance analysis with Periscope

Performance analysis with Periscope Performance analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität petkovve@in.tum.de March 2010 Outline Motivation Periscope (PSC) Periscope performance analysis

More information

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering Portland State University ECE 587/687 Memory Ordering Copyright by Alaa Alameldeen and Haitham Akkary 2012 Handling Memory Operations Review pipeline for out of order, superscalar processors To maximize

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016 Datenbanksysteme II: Modern Hardware Stefan Sprenger November 23, 2016 Content of this Lecture Introduction to Modern Hardware CPUs, Cache Hierarchy Branch Prediction SIMD NUMA Cache-Sensitive Skip List

More information

Anastasia Ailamaki. Performance and energy analysis using transactional workloads

Anastasia Ailamaki. Performance and energy analysis using transactional workloads Performance and energy analysis using transactional workloads Anastasia Ailamaki EPFL and RAW Labs SA students: Danica Porobic, Utku Sirin, and Pinar Tozun Online Transaction Processing $2B+ industry Characteristics:

More information

Parallel Patterns Ezio Bartocci

Parallel Patterns Ezio Bartocci TECHNISCHE UNIVERSITÄT WIEN Fakultät für Informatik Cyber-Physical Systems Group Parallel Patterns Ezio Bartocci Parallel Patterns Think at a higher level than individual CUDA kernels Specify what to compute,

More information

Code Optimization & Performance. CS528 Serial Code Optimization. Great Reality There s more to performance than asymptotic complexity

Code Optimization & Performance. CS528 Serial Code Optimization. Great Reality There s more to performance than asymptotic complexity CS528 Serial Code Optimization Dept of CSE, IIT Guwahati 1 Code Optimization & Performance Machine independent opt Code motion Reduction in strength Common subexpression Elimination Tuning: Identifying

More information

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

PS2 out today. Lab 2 out today. Lab 1 due today - how was it? 6.830 Lecture 7 9/25/2017 PS2 out today. Lab 2 out today. Lab 1 due today - how was it? Project Teams Due Wednesday Those of you who don't have groups -- send us email, or hand in a sheet with just your

More information

Algorithms and Data Structures. Marcin Sydow. Introduction. QuickSort. Sorting 2. Partition. Limit. CountSort. RadixSort. Summary

Algorithms and Data Structures. Marcin Sydow. Introduction. QuickSort. Sorting 2. Partition. Limit. CountSort. RadixSort. Summary Sorting 2 Topics covered by this lecture: Stability of Sorting Quick Sort Is it possible to sort faster than with Θ(n log(n)) complexity? Countsort Stability A sorting algorithm is stable if it preserves

More information

Performance Issues and Query Optimization in Monet

Performance Issues and Query Optimization in Monet Performance Issues and Query Optimization in Monet Stefan Manegold Stefan.Manegold@cwi.nl 1 Contents Modern Computer Architecture: CPU & Memory system Consequences for DBMS - Data structures: vertical

More information

Question And Answer.

Question And Answer. Q.1 What is the number of swaps required to sort n elements using selection sort, in the worst case? A. &#920(n) B. &#920(n log n) C. &#920(n2) D. &#920(n2 log n) ANSWER : Option A &#920(n) Note that we

More information

Photo David Wright STEVEN R. BAGLEY PIPELINES AND ILP

Photo David Wright   STEVEN R. BAGLEY PIPELINES AND ILP Photo David Wright https://www.flickr.com/photos/dhwright/3312563248 STEVEN R. BAGLEY PIPELINES AND ILP INTRODUCTION Been considering what makes the CPU run at a particular speed Spent the last two weeks

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

External Memory Algorithms and Data Structures. Winter 2004/2005

External Memory Algorithms and Data Structures. Winter 2004/2005 External Memory Algorithms and Data Structures Winter 2004/2005 Riko Jacob Peter Widmayer Assignments: Yoshio Okamoto EMADS 04/ 05: Course Description Page 1 External Memory Algorithms and Data Structures

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following

More information

Chapter 5. Quicksort. Copyright Oliver Serang, 2018 University of Montana Department of Computer Science

Chapter 5. Quicksort. Copyright Oliver Serang, 2018 University of Montana Department of Computer Science Chapter 5 Quicsort Copyright Oliver Serang, 08 University of Montana Department of Computer Science Quicsort is famous because of its ability to sort in-place. I.e., it directly modifies the contents of

More information

Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation

Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation Large-Scale Data & Systems Group Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation Georgios Theodorakis, Alexandros Koliousis, Peter Pietzuch, Holger Pirk Large-Scale Data & Systems (LSDS)

More information

Performance Tuning the OpenEdge Database in The Modern World

Performance Tuning the OpenEdge Database in The Modern World Performance Tuning the OpenEdge Database in The Modern World Gus Björklund, Progress Mike Furgal, Bravepoint Performance tuning is not only about software configuration and turning knobs Situation: Your

More information

Analysis of parallel suffix tree construction

Analysis of parallel suffix tree construction 168 Analysis of parallel suffix tree construction Malvika Singh 1 1 (Computer Science, Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India. Email: malvikasingh2k@gmail.com)

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 9: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,

More information

Overview of Sorting Algorithms

Overview of Sorting Algorithms Unit 7 Sorting s Simple Sorting algorithms Quicksort Improving Quicksort Overview of Sorting s Given a collection of items we want to arrange them in an increasing or decreasing order. You probably have

More information

CSC 273 Data Structures

CSC 273 Data Structures CSC 273 Data Structures Lecture 6 - Faster Sorting Methods Merge Sort Divides an array into halves Sorts the two halves, Then merges them into one sorted array. The algorithm for merge sort is usually

More information

Sorting. Dr. Baldassano Yu s Elite Education

Sorting. Dr. Baldassano Yu s Elite Education Sorting Dr. Baldassano Yu s Elite Education Last week recap Algorithm: procedure for computing something Data structure: system for keeping track for information optimized for certain actions Good algorithms

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Assembly Language Programming

Assembly Language Programming Assembly Language Programming Ľudmila Jánošíková Department of Mathematical Methods and Operations Research Faculty of Management Science and Informatics University of Žilina tel.: 421 41 513 4200 Ludmila.Janosikova@fri.uniza.sk

More information

class 5 column stores 2.0 prof. Stratos Idreos

class 5 column stores 2.0 prof. Stratos Idreos class 5 column stores 2.0 prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ worth thinking about what just happened? where is my data? email, cloud, social media, can we design systems

More information

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false. CS 2410 Mid term (fall 2015) Name: Question 1 (10 points) Indicate which of the following statements is true and which is false. (1) SMT architectures reduces the thread context switch time by saving in

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Key question: how do we pick a good pivot (and what makes a good pivot in the first place)?

Key question: how do we pick a good pivot (and what makes a good pivot in the first place)? More on sorting Mergesort (v2) Quicksort Mergesort in place in action 53 2 44 85 11 67 7 39 14 53 87 11 50 67 2 14 44 53 80 85 87 14 87 80 50 29 72 95 2 44 80 85 7 29 39 72 95 Boxes with same color are

More information

Parallelism. Execution Cycle. Dual Bus Simple CPU. Pipelining COMP375 1

Parallelism. Execution Cycle. Dual Bus Simple CPU. Pipelining COMP375 1 Pipelining COMP375 Computer Architecture and dorganization Parallelism The most common method of making computers faster is to increase parallelism. There are many levels of parallelism Macro Multiple

More information

HY425 Lecture 09: Software to exploit ILP

HY425 Lecture 09: Software to exploit ILP HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit

More information

CS330. Query Processing

CS330. Query Processing CS330 Query Processing 1 Overview of Query Evaluation Plan: Tree of R.A. ops, with choice of alg for each op. Each operator typically implemented using a `pull interface: when an operator is `pulled for

More information

HY425 Lecture 09: Software to exploit ILP

HY425 Lecture 09: Software to exploit ILP HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques

More information

Question 7.11 Show how heapsort processes the input:

Question 7.11 Show how heapsort processes the input: Question 7.11 Show how heapsort processes the input: 142, 543, 123, 65, 453, 879, 572, 434, 111, 242, 811, 102. Solution. Step 1 Build the heap. 1.1 Place all the data into a complete binary tree in the

More information

Lecture: Static ILP. Topics: predication, speculation (Sections C.5, 3.2)

Lecture: Static ILP. Topics: predication, speculation (Sections C.5, 3.2) Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2) 1 Scheduled and Unrolled Loop Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) ADD.D F4, F0, F2 ADD.D F8, F6,

More information

Spring Prof. Hyesoon Kim

Spring Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp

More information

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

CMU Introduction to Computer Architecture, Spring Midterm Exam 2. Date: Wed., 4/17. Legibility & Name (5 Points): Problem 1 (90 Points):

CMU Introduction to Computer Architecture, Spring Midterm Exam 2. Date: Wed., 4/17. Legibility & Name (5 Points): Problem 1 (90 Points): Name: CMU 18-447 Introduction to Computer Architecture, Spring 2013 Midterm Exam 2 Instructions: Date: Wed., 4/17 Legibility & Name (5 Points): Problem 1 (90 Points): Problem 2 (35 Points): Problem 3 (35

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following

More information

Overview of Query Evaluation. Overview of Query Evaluation

Overview of Query Evaluation. Overview of Query Evaluation Overview of Query Evaluation Chapter 12 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Overview of Query Evaluation v Plan: Tree of R.A. ops, with choice of alg for each op. Each operator

More information

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University Multithreaded Architectures and The Sort Benchmark Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University About our Sort Benchmark Based on the benchmark proposed in A measure

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy Competitors using generic parts Performance benefits to be had for custom design Original PlayStation: no vector processing or floating point support Geometry issues Photorealism at the core of design

More information

Optimize Data Structures and Memory Access Patterns to Improve Data Locality

Optimize Data Structures and Memory Access Patterns to Improve Data Locality Optimize Data Structures and Memory Access Patterns to Improve Data Locality Abstract Cache is one of the most important resources

More information

Numerical Simulation on the GPU

Numerical Simulation on the GPU Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

Reading for this lecture (Goodrich and Tamassia):

Reading for this lecture (Goodrich and Tamassia): COMP26120: Algorithms and Imperative Programming Basic sorting algorithms Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2017 18 Reading for this lecture (Goodrich and Tamassia): Secs. 8.1,

More information

First Swedish Workshop on Multi-Core Computing MCC 2008 Ronneby: On Sorting and Load Balancing on Graphics Processors

First Swedish Workshop on Multi-Core Computing MCC 2008 Ronneby: On Sorting and Load Balancing on Graphics Processors First Swedish Workshop on Multi-Core Computing MCC 2008 Ronneby: On Sorting and Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Distributed Computing Systems Chalmers University

More information

CS240 Fall Mike Lam, Professor. Quick Sort

CS240 Fall Mike Lam, Professor. Quick Sort ??!!!!! CS240 Fall 2015 Mike Lam, Professor Quick Sort Merge Sort Merge sort Sort sublists (divide & conquer) Merge sorted sublists (combine) All the "hard work" is done after recursing Hard to do "in-place"

More information

Cartoon parallel architectures; CPUs and GPUs

Cartoon parallel architectures; CPUs and GPUs Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD

More information

Evaluation of relational operations

Evaluation of relational operations Evaluation of relational operations Iztok Savnik, FAMNIT Slides & Textbook Textbook: Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, 3 rd ed., 2007. Slides: From Cow Book

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

class 17 updates prof. Stratos Idreos

class 17 updates prof. Stratos Idreos class 17 updates prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ early/late tuple reconstruction, tuple-at-a-time, vectorized or bulk processing, intermediates format, pushing selects

More information

L14 Quicksort and Performance Optimization

L14 Quicksort and Performance Optimization L14 Quicksort and Performance Optimization Alice E. Fischer Fall 2018 Alice E. Fischer L4 Quicksort... 1/12 Fall 2018 1 / 12 Outline 1 The Quicksort Strategy 2 Diagrams 3 Code Alice E. Fischer L4 Quicksort...

More information

Overview of Implementing Relational Operators and Query Evaluation

Overview of Implementing Relational Operators and Query Evaluation Overview of Implementing Relational Operators and Query Evaluation Chapter 12 Motivation: Evaluating Queries The same query can be evaluated in different ways. The evaluation strategy (plan) can make orders

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS

LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS Department of Computer Science University of Babylon LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS By Faculty of Science for Women( SCIW), University of Babylon, Iraq Samaher@uobabylon.edu.iq

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

Overview of Query Evaluation. Chapter 12

Overview of Query Evaluation. Chapter 12 Overview of Query Evaluation Chapter 12 1 Outline Query Optimization Overview Algorithm for Relational Operations 2 Overview of Query Evaluation DBMS keeps descriptive data in system catalogs. SQL queries

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Midterm Exam Prof. Martin Thursday, March 15th, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 3 Parallel Prefix, Pack, and Sorting

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 3 Parallel Prefix, Pack, and Sorting A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 3 Parallel Prefix, Pack, and Sorting Steve Wolfman, based on work by Dan Grossman LICENSE: This file is licensed under a Creative

More information

Part XVII. Staircase Join Tree-Aware Relational (X)Query Processing. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 440

Part XVII. Staircase Join Tree-Aware Relational (X)Query Processing. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 440 Part XVII Staircase Join Tree-Aware Relational (X)Query Processing Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 440 Outline of this part 1 XPath Accelerator Tree aware relational

More information

Main Memory and the CPU Cache

Main Memory and the CPU Cache Main Memory and the CPU Cache CPU cache Unrolled linked lists B Trees Our model of main memory and the cost of CPU operations has been intentionally simplistic The major focus has been on determining

More information

Faster Sorting Methods

Faster Sorting Methods Faster Sorting Methods Chapter 9 Contents Merge Sort Merging Arrays Recursive Merge Sort The Efficiency of Merge Sort Iterative Merge Sort Merge Sort in the Java Class Library Contents Quick Sort The Efficiency

More information

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages Overview of Query Processing Query Parser Query Processor Evaluation of Relational Operations Query Rewriter Query Optimizer Query Executor Yanlei Diao UMass Amherst Lock Manager Access Methods (Buffer

More information

EXTERNAL SORTING. CS 564- Spring ACKs: Dan Suciu, Jignesh Patel, AnHai Doan

EXTERNAL SORTING. CS 564- Spring ACKs: Dan Suciu, Jignesh Patel, AnHai Doan EXTERNAL SORTING CS 564- Spring 2018 ACKs: Dan Suciu, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? I/O aware algorithms for sorting External merge a primitive for sorting External merge-sort basic

More information

Computer Memory. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Computer Memory. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1 Computer Memory Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1 Warm Up public int sum1(int n, int m, int[][] table) { int output = 0; for (int i = 0; i < n; i++) { for (int j = 0; j

More information