Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot



Near-Memory Processing (NMP)
Emerging technology. Stacked memory: a logic die with a stack of DRAM dies makes near-memory processing practical.
Why NMP?
Less data movement → less energy consumption.
Leverage DRAM's massive internal bandwidth & parallelism → high performance.
Exploit NMP to accelerate key algorithms.

Join Operation
A fundamental operation in database systems and the main contributor to execution time in analytic DBMSs: find the matching keys in two tables.
Ongoing debate over two main algorithms:
Hash-based: current best for CPU execution; cache-optimized.
Sort-based: higher computational complexity, but more regular memory access patterns.
This talk revisits sort vs. hash for near-memory execution.

Near-Memory Join
Memory access patterns are key for maximizing NMP efficiency: sequential access patterns best exploit DRAM characteristics.
The number of accesses is only part of the story: more sequential accesses beat fewer random accesses.
Sort join trumps hash join. Sequential access pattern + wide NMP sort logic → high efficiency.
Sort is ~2x better than hash in performance & energy efficiency.

Outline
Overview
Near-memory processing (NMP)
Join operator
Evaluation
Conclusion

Data Movement and Energy
Energy per operation [Dally, SC'14 panel]: 64-bit DP op: 20 pJ; 256-bit buses: 26 pJ; 256 bits over 20 mm: 256 pJ; efficient off-chip link: 500 pJ.
Avoid unnecessary data movement through NMP.

Near-Memory Processing (NMP)
Emerging technology: 3D-stacked memory, a logic die in a stack of DRAM dies.
Through-Silicon Vias (TSVs): low energy consumption.
Separate vertical partitions ("vaults") provide a high level of parallelism.
E.g., Micron HMC, AMD HBM.
Computation is performed next to memory.

NMP Realities
Stacked memory has limited capacity per chip (~8 GB); high capacity requires multiple chips, so large datasets require chip-to-chip communication.
NMP does not eliminate all data movement!

NMP: Key Aspects
Chip-to-chip accesses consume more energy per bit, at least ~2x more than intra-vault accesses: minimize chip-to-chip accesses for efficiency.
DRAM implies a wide interface and destructive accesses, so random accesses are costly: a random 8 B request still activates a full 256 B DRAM row.
NMP algorithms must consider access pattern & locality.
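The cost of random accesses against a wide DRAM row can be illustrated with back-of-the-envelope arithmetic (a sketch using the 256 B row and 8 B request sizes quoted on the slide; `row_utilization` is a hypothetical helper, not from the paper):

```python
# Sketch: why random accesses are costly with a wide DRAM interface.
# A destructive DRAM read activates an entire row even when the request
# needs only a few bytes of it.

ROW_BYTES = 256      # DRAM row size (per slide)
REQUEST_BYTES = 8    # typical random-access request size (per slide)

def row_utilization(request_bytes, row_bytes=ROW_BYTES):
    """Fraction of an activated row actually consumed by one request."""
    return min(request_bytes, row_bytes) / row_bytes

# A random 8 B access uses ~3% of the activated 256 B row...
print(f"random access: {row_utilization(REQUEST_BYTES):.1%} of row used")
# ...while a sequential streaming pattern can consume the whole row.
print(f"sequential access: {row_utilization(ROW_BYTES):.1%} of row used")
```

By this measure a random access wastes roughly 97% of the row activation energy, which is why sequential, locality-aware access patterns matter so much for NMP.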

Outline
Overview
Near-memory processing (NMP)
Join operator
Evaluation
Conclusion

What is a Join?
Iterates over a pair of tables and finds the matching keys.
Q: SELECT ... FROM R, S WHERE R.Key = S.Key
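The semantics of the query above can be shown with a minimal example (the table contents are made up for illustration; tables are represented as lists of (key, payload) tuples):

```python
# Minimal equi-join: SELECT ... FROM R, S WHERE R.Key = S.Key
R = [(1, "r1"), (3, "r3"), (5, "r5")]
S = [(3, "s3a"), (3, "s3b"), (4, "s4"), (5, "s5")]

# Naive nested-loop formulation: emit every pair whose keys match.
result = [(rk, rv, sv) for (rk, rv) in R for (sk, sv) in S if rk == sk]
print(result)  # [(3, 'r3', 's3a'), (3, 'r3', 's3b'), (5, 'r5', 's5')]
```

The hash-based and sort-based algorithms that follow compute exactly this result, but avoid the O(|R|·|S|) nested-loop scan.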

Hash vs. Sort Join
Hash-based algorithms build and probe a hash table. Algorithm: Radix-Hash Join [Manegold et al., 2002].
✓ Lower computational complexity: O(n)
✗ Random memory accesses
Sort-based algorithms sort and merge the two tables. Algorithm: Partitioned Massively Parallel Sort-Merge (P-MPSM) [Albutiu et al., 2012].
✗ Higher computational complexity: O(n log n)
✓ Sequential memory accesses
The trade-off: computational complexity vs. memory access patterns.

NMP: Data Distribution
Data is randomly distributed across memory chips (the data cannot fit in one chip), and within each chip, randomly distributed across vaults.

Hash Join
1. Partitioning phase: partition the two tables based on the keys.
CPU-centric: exploit locality in caches. NMP: high locality in a vault; # of partitions = # of vaults.

Near-Memory Hash Join
1. Partitioning phase: random access patterns and low access locality for both R and S.
Costly random access patterns and low access locality.

Hash Join
2. Build phase: build a hash table on R, giving nearly constant-time look-ups in the next phase.

Near-Memory Hash Join
2. Build phase: high locality, but a random access pattern when building R's hash tables.
Costly random access pattern.

Hash Join
3. Probe phase: probe the hash tables with the other table; scan S and look up its keys in R's hash tables.

Near-Memory Hash Join
3. Probe phase: high locality, but a random access pattern into R's hash tables.
Costly random access pattern.

Hash Join: Summary

Phase            | Hash                     | Sort (O(n log n))
1. Partitioning  | random (local or remote) | random (local or remote)
2. Build / Sort  | random (local or remote) | sequential (local)
3. Probe / Merge | random (local or remote) | sequential (remote)
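The three hash-join phases can be sketched in a few lines. This is a simplified, single-threaded sketch, not the paper's NMP implementation: real radix joins use multi-pass partitioning and cache-sized (or vault-sized) hash tables, and `radix_hash_join` and `radix_bits` are hypothetical names for illustration.

```python
from collections import defaultdict

def radix_hash_join(R, S, radix_bits=4):
    """Simplified radix-hash join over lists of (key, payload) tuples."""
    fanout = 1 << radix_bits
    mask = fanout - 1  # partition by the low-order bits of the key

    # 1. Partitioning phase: scatter both tables into partitions by key
    #    radix. (This scatter is the random, low-locality access pattern.)
    R_parts = [[] for _ in range(fanout)]
    S_parts = [[] for _ in range(fanout)]
    for k, v in R:
        R_parts[k & mask].append((k, v))
    for k, v in S:
        S_parts[k & mask].append((k, v))

    result = []
    for p in range(fanout):
        # 2. Build phase: hash table over the R partition.
        table = defaultdict(list)
        for k, v in R_parts[p]:
            table[k].append(v)
        # 3. Probe phase: scan the matching S partition and look up each key.
        for k, sv in S_parts[p]:
            for rv in table[k]:
                result.append((k, rv, sv))
    return result

print(radix_hash_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")]))
# [(2, 'b', 'x')]
```

Partitioning confines each build/probe pair to one partition, which is what lets the NMP variant keep those phases inside a single vault.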

Sort Join
1. Partitioning phase: partition table R based on the keys; this helps reduce merge-join time.

Near-Memory Sort Join
1. Partitioning phase: random access pattern and low locality, but only for R.
Costly random access pattern and low locality.

Sort Join
2. Sort phase: sort both tables, allowing a linear-time, sequential merge-join.

Near-Memory Sort Join
2. Sort phase: sequential access pattern and high locality.

Sort Join
3. Merge phase: merge-join R with S, accessing data sequentially.

Near-Memory Sort Join
3. Merge phase: sequential access pattern, streaming each chunk of R and S.
Low locality, but only for one table (R).

Hash vs. Sort Join: Summary

Phase            | Hash                     | Sort
1. Partitioning  | random (local or remote) | random (local or remote)
2. Build / Sort  | random (local or remote) | sequential (local)
3. Probe / Merge | random (local or remote) | sequential (remote)
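For comparison, the sort and merge phases can be sketched the same way. This is a simplified, single-threaded sketch rather than the paper's P-MPSM (which partitions R and sorts per vault in parallel); `sort_merge_join` is a hypothetical name, and Python's built-in sort stands in for the wide NMP sort logic.

```python
def sort_merge_join(R, S):
    """Simplified sort-merge join over lists of (key, payload) tuples."""
    # 2. Sort phase: sort both inputs by key.
    R = sorted(R)
    S = sorted(S)

    # 3. Merge phase: a single sequential pass over both sorted tables,
    #    which is the streaming access pattern the slides highlight.
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            k = R[i][0]
            # Emit the cross product of the equal-key runs on both sides.
            i0 = i
            while i < len(R) and R[i][0] == k:
                i += 1
            j0 = j
            while j < len(S) and S[j][0] == k:
                j += 1
            for ri in range(i0, i):
                for sj in range(j0, j):
                    result.append((k, R[ri][1], S[sj][1]))
    return result

print(sort_merge_join([(3, "r3"), (1, "r1")], [(2, "s2"), (3, "s3a"), (3, "s3b")]))
# [(3, 'r3', 's3a'), (3, 'r3', 's3b')]
```

The O(n log n) cost sits in the sort phase; the merge itself is linear and strictly sequential, which is why the table above marks only the sort join's later phases as sequential.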

Outline
Overview
Near-memory processing (NMP)
Join operator
Evaluation
Conclusion

Methodology
First-order performance and energy model:

Feature    | Configuration
CMP        | 22 nm, 16 cores
Core       | OoO, 3-wide, 2.5 GHz, 512-bit SIMD, 64KB L1-I/D, 64B block
LLC        | 4MB, 16-way
HMC        | 4 cubes, ring topology; 8GB per cube; 32 vaults per cube; 4 links per cube
Join logic | 22 nm logic die, 256B SIMD, 2D mesh NoC

Performance and Energy
[Charts: normalized runtime and normalized energy consumption for sort and hash joins, CPU vs. NMP, with |S| = 16|R|]
NMP energy efficiency: 5.9-10.1x; performance: 1.9-5.1x.

NMP: Hash vs. Sort
[Charts: normalized runtime for sort and hash with |S| = |R| and |S| = 16|R|; normalized energy broken into data movement, DRAM, and computation energy with |S| = 16|R|]
Sort join is more efficient when |S| > |R|.

Conclusion
NMP improves both performance & energy efficiency: it exploits internal DRAM bandwidth and parallelism and reduces data movement.
NMP algorithms must consider memory access patterns: sequential accesses best leverage DRAM characteristics, and intra-chip accesses minimize data movement.
Locality + sequential access patterns → sort join is more efficient for NMP.
Hash join is still best for CPU execution.

Thanks! Questions?