Large-Scale GPU programming

Tim Kaldewey
Research Staff Member, Database Technologies, IBM Almaden Research Center, tkaldew@us.ibm.com
Assistant Adjunct Professor, Computer and Information Science Dept., University of Pennsylvania, tim@kaldewey.com

Agenda
- Scale-up vs. scale-out
- GPU & scale-out (MPI)
- Introduction to MapReduce
- MapReduce & GPU
  - Building a GPU MapReduce framework (UC Davis)
  - Hadoop Pipes (Tokyo Institute of Technology)
  - Hadoop Streaming (UIUC)

Scale-up vs. Scale-out
[two figure-only slides contrasting scale-up and scale-out architectures]

How to leverage GPUs in scale-out architectures
- On foot: manage your own network connectivity, communication, data placement & movement, load balancing, fault tolerance, ...
- MPI (Message Passing Interface): focuses on communication (RDMA support); you still have to manage data handling, load balancing, fault tolerance, ...
- GPUDirect (NVIDIA & Mellanox): the InfiniBand card and the GPU share a DMA/RDMA region; v2 uses UVA to make remote memory access transparent

MapReduce: Programming Model for Large-Volume Data Processing
- Specialized for a frequent use case: aggregation queries
  - Map every input object to a set of key/value pairs
  - Reduce (aggregate) all mapped values for the same key into one result for that key
- Use this structure as an explicit API for cluster computing
  - Submit jobs as (input data, map function, reduce function)
  - Implementation details for a given architecture can be hidden
  - No need to manually implement parallelization, scheduling, fault tolerance, ...
  - Dynamic resource allocation by the MR framework without code changes

Example: Word Frequency

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

More Examples
- Grep
  - Input: (many) text files
  - Map: emit a line if it matches the pattern
  - Reduce: concatenate intermediate results
- URL Access Frequency
  - Input: web server logs
  - Map: emit <URL, 1>
  - Reduce: add the values for each URL
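
For concreteness, a minimal single-process sketch of the URL-access-frequency job in C++ (no framework involved; the assumption that the URL is the first whitespace-separated field of each log line is illustrative):

// Minimal single-process sketch of the URL-access-frequency pattern.
// Hypothetical input format: one request per line, URL in the first
// whitespace-separated field -- a real log parser would be more involved.
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    std::map<std::string, long> counts;  // "shuffle": group by key
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream fields(line);
        std::string url;
        if (fields >> url)
            ++counts[url];               // map: emit <URL,1>; reduce: sum
    }
    for (const auto& kv : counts)
        std::cout << kv.first << '\t' << kv.second << '\n';
}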

MapReduce Environment
- Commodity PC machines: commodity CPU, network, storage
- Distributed file system (replicated)
- Map/Reduce is a framework (library)
- First described by Google: Dean, Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004; implemented in C++

Google Implementation
[figure-only slide]

Fault Tolerance
- Master data, for each task:
  - State: complete / idle / in-progress
  - Assigned worker (for non-idle tasks)
  - For each map task: size and location of the intermediate results destined for each reduce job
- Periodic ping from the master to each worker
  - The reply includes status information on pending tasks
  - No reply: reschedule all of that worker's incomplete tasks
  - Also reschedule the worker's completed map tasks, since their results are now inaccessible; completed reduce tasks remain available in the distributed file system
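
As a hedged illustration of the master's bookkeeping, a possible per-map-task record in C++ (all names are assumptions, not taken from the paper):

// Sketch of the per-task state the master could keep (illustrative).
#include <cstddef>
#include <string>
#include <vector>

enum class TaskState { Idle, InProgress, Complete };

struct IntermediateFile {       // one per reduce partition
    std::string location;       // worker-local path or URL
    std::size_t size_bytes;
};

struct MapTaskInfo {
    TaskState state = TaskState::Idle;
    int assigned_worker = -1;   // valid only when non-idle
    // size and location of intermediate results, one entry per reduce task
    std::vector<IntermediateFile> outputs;
};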

Fault Tolerance cont.
- Slow machines can delay the whole job
  - Causes: faulty hard disks, faulty configuration
  - They answer pings, but do not complete their tasks
- Near the end of a job, process in-progress tasks redundantly
  - The remaining in-progress tasks are likely to be stragglers
  - The redundant execution will overtake them; if a task is inherently slow, the original execution finishes first
  - Typical result: 2x faster overall execution time
- What happens if the master crashes?
  - Re-execute the entire job; unlikely, as there is only one master per job
  - Checkpointing makes it possible to reclaim partial results

MapReduce Scheduling
- Locality
  - Input locality: try to execute map jobs where their data already resides in the DFS
  - Output locality: the user may specify a reduce key-partitioning function, which co-locates related output in the same reducers
- Combiner (see the sketch below)
  - For commutative and associative reducer functions
  - Pre-aggregation: aggregate in several stages; reduces network traffic
  - Combiner task code is typically the same as the reduce code, except that its output is fed to other reduce/combine functions, not to output files
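
Word count's summing reducer is commutative and associative, so the same logic can serve as a combiner. A minimal C++ sketch of the idea, not tied to any framework's API:

// Combiner idea: pre-aggregate map output locally before it crosses the
// network. Plain C++ sketch, not any framework's actual interface.
#include <map>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, long>;

// Same logic as the reducer: sum the counts per key.
std::vector<KV> combine(const std::vector<KV>& map_output) {
    std::map<std::string, long> partial;
    for (const auto& kv : map_output)
        partial[kv.first] += kv.second;
    return {partial.begin(), partial.end()};  // one <word, localSum> per key
}
// A mapper emitting N pairs over W distinct words now ships W pairs
// instead of N -- the reducers then sum the per-node partial sums.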

MapReduce Variants & Related Products
- Hadoop (Yahoo, Apache): open-source Java implementation of Map/Reduce
- Sawzall (Google): domain-specific language (DSL) for the Map stage
- Pig Latin (Yahoo): DSL for large-volume processing jobs; can be compiled into Map/Reduce jobs
- Hive (Facebook): data warehouse with an SQL-like interface
- Amazon EC2 (Elastic Compute Cloud): rent machines by the hour, can be preloaded with Hadoop server images; nodes equipped with 2 Tesla M2050 GPUs available for $2.60/hr...

MapReduce & GPU
- MapReduce frameworks promise to sweeten the bitter taste of distributed programming by handling system tasks: communication, data placement & movement, load balancing, fault tolerance, ...
- The de-facto standard (Hadoop) is implemented in Java, but can handle binaries

MapReduce & GPU
- Mars: framework for a single CPU/GPU
- GPMR: C++ MR implementation specifically for GPU clusters
- Extending Hadoop MR to GPUs: Tokyo Institute of Technology, MITHRA

GPMR
- A standalone MR library implemented in C/C++; allows native use of CUDA
- Starting with a naïve implementation (see the sketch below):
  - Copy input data to the GPU
  - Map, sort, and reduce kernel calls
  - Copy data back
- What happens if the amount of data > GPU memory? How to scale across nodes? Partitioning is required.
- Optimizations:
  - Relax the 1-thread-to-1-work-item constraint
  - Use combiners for all map jobs of a node
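
A minimal sketch of that naïve pipeline using CUDA with Thrust (illustrative only, not GPMR's actual API); the example job counts occurrences of each input value:

// Naïve GPU MapReduce pipeline: copy in -> map -> sort by key ->
// reduce by key -> copy back. (Sketch only; not GPMR's API.)
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    const int input[] = {3, 1, 3, 2, 1, 3};
    thrust::device_vector<int> keys(input, input + 6);  // copy input to GPU
    thrust::device_vector<int> vals(6, 1);              // map: emit <x, 1>

    // shuffle: group identical keys together
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // reduce: sum the values of each run of equal keys
    thrust::device_vector<int> out_keys(6), out_vals(6);
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_vals.begin());

    // copy results back and print
    for (auto k = out_keys.begin(), v = out_vals.begin(); k != ends.first; ++k, ++v)
        std::printf("%d: %d\n", static_cast<int>(*k), static_cast<int>(*v));
}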

GPMR Phases
- Scheduler = master
- GPU map = (CPU map) minus sort
- GPU partial reduce = combiner
- Partition defaults to round-robin
- Bin = data transfer; the sort is potentially executed on a different node
- GPU reduce = CPU reduce

MR vs. GPMR
Confused? Many new stages, and all stages are open to the programmer... Wasn't MapReduce designed to be simple?

GPMR: Evaluation
- 5 common benchmarks for MR clusters: matrix multiply, sparse integer occurrence, word occurrence, K-Means clustering, linear regression
- Demonstrates scalability
- Comparison with existing MR libraries: Mars (single GPU) and Phoenix (multi-core CPU)

GPMR Results: Matrix Multiply
- C = A × B; use blocking to partition the matrices: C_block(1,1) = A_block(1,1) × B_block(1,1) + A_block(1,2) × B_block(2,1) + ...
- Each mapper processes the multiplication of individual block pairs
- The reducer sums all sub-matrix products (see the sketch below)
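
A hedged C++ sketch of this blocked decomposition (block size and names are assumptions, not GPMR code): each map task multiplies one pair of blocks, and the reduce task sums all partial products that share an output-block key.

// Blocked matrix-multiply decomposition as MapReduce (illustrative).
// map: emit <(i,j), A_block(i,k) * B_block(k,j)>; reduce: sum per (i,j).
#include <array>
#include <map>
#include <utility>

constexpr int BS = 2;  // block edge length (assumed)
using Block = std::array<std::array<double, BS>, BS>;
using BlockKey = std::pair<int, int>;  // output block coordinate (i,j)

// map: one block-pair multiplication
Block multiply(const Block& a, const Block& b) {
    Block c{};
    for (int r = 0; r < BS; ++r)
        for (int col = 0; col < BS; ++col)
            for (int k = 0; k < BS; ++k)
                c[r][col] += a[r][k] * b[k][col];
    return c;
}

// reduce: element-wise sum of all partial products for one (i,j)
void accumulate(std::map<BlockKey, Block>& out, BlockKey key, const Block& p) {
    Block& acc = out[key];  // operator[] value-initializes to zeros
    for (int r = 0; r < BS; ++r)
        for (int col = 0; col < BS; ++col)
            acc[r][col] += p[r][col];
}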

GPMR Results: Word Occurrence
- String processing is inefficient and a poor match for the GPU
- The dictionary is relatively small: 43k words
- Use integer hashing
- Use combiners to accumulate results, implemented using atomics (see the sketch below)
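
A sketch of the "hash to integers, combine with atomics" idea as a CUDA histogram kernel (illustrative; these are not GPMR's actual kernels):

// Each thread hashes one word id into a small histogram bucket with an
// atomic add -- the on-GPU equivalent of a combiner accumulating counts.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void count_kernel(const int* word_ids, int n, int* counts, int buckets) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&counts[word_ids[i] % buckets], 1);  // combine via atomics
}

int main() {
    const int h_ids[] = {7, 3, 7, 7, 3, 1};  // word ids (already hashed)
    const int n = 6, buckets = 8;
    int *d_ids, *d_counts;
    cudaMalloc(&d_ids, n * sizeof(int));
    cudaMalloc(&d_counts, buckets * sizeof(int));
    cudaMemcpy(d_ids, h_ids, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, buckets * sizeof(int));

    count_kernel<<<1, 256>>>(d_ids, n, d_counts, buckets);

    int h_counts[buckets];
    cudaMemcpy(h_counts, d_counts, buckets * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < buckets; ++b)
        if (h_counts[b]) std::printf("bucket %d: %d\n", b, h_counts[b]);
    cudaFree(d_ids);
    cudaFree(d_counts);
}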

GPMR Results: Comparison
- Comparison with Mars running on a single GPU
- Comparison with Phoenix running on 4 CPU cores
- Missing: a comparison with a real (distributed) MR framework

Extending Hadoop MR
Hadoop is implemented in Java. Options to interface with CUDA:
- Hadoop Streaming: Unix streams with a line-oriented or binary interface; works for any binary, but requires input parsing (see the mapper sketch below)
- Hadoop Pipes: Unix sockets; a C++ Hadoop library provides key/value abstractions
- JNI: allows invoking external apps and libraries written in C, assembly, ...; requires platform-specific libraries; access to Java data structures through JNI is slow
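
For illustration, a minimal Hadoop Streaming word-count mapper in C++. The streaming contract itself is real (input lines arrive on stdin, tab-separated key/value pairs go to stdout); a GPU mapper would batch the parsed records and launch kernels instead of counting directly:

// Minimal Hadoop Streaming mapper: the framework pipes input splits to
// stdin and reads "key\tvalue" lines from stdout. The parsing here is
// exactly the input-parsing overhead the slide mentions.
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string line, word;
    while (std::getline(std::cin, line)) {
        std::istringstream words(line);
        while (words >> word)
            std::cout << word << "\t1\n";  // emit <word, 1>
    }
}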

Tokyo Institute of Technology: Hadoop Pipes
- Hybrid approach: CPU and GPU mappers
- Scheduling is not affected; input/output formats are unchanged
- Transparent as to which hardware a job runs on

K-Means
1) k initial "means" (e.g. k=3) are randomly selected
2) k clusters are formed by assigning each point to its nearest mean
3) The centroid of each cluster becomes the new mean
4) Steps 2 and 3 are repeated until convergence, i.e. no more changes in assignments

K-Means on MapReduce
- Map phase
  - Load the cluster centers from a file
  - For each input key/value pair, iterate over the cluster centers
  - Measure the distances and save the nearest center, i.e. the one with the lowest distance to the vector (see the sketch below)
  - Write the cluster center with its vector to the filesystem
- Reduce phase (we get the associated vectors for each center)
  - Iterate over the value vectors and calculate the average vector; this is the new center, save it to a file
  - Check convergence between the cluster center stored in the key object and the new center
  - Run this until nothing is updated anymore
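
A hedged C++ sketch of the map phase's nearest-center computation (types and names are illustrative assumptions):

// Assign one input vector to its nearest cluster center. The mapper
// would emit <nearestCenterIndex, point>; the reducer then averages
// the points per center to produce the new centers.
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;

double squared_distance(const Vec& a, const Vec& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

std::size_t nearest_center(const Vec& point, const std::vector<Vec>& centers) {
    std::size_t best = 0;
    double best_d = std::numeric_limits<double>::max();
    for (std::size_t c = 0; c < centers.size(); ++c) {
        double d = squared_distance(point, centers[c]);
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}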

Tokyo Institute of Technology: K-Means Results
20 GB of input data (2-dimensional points), K=128; each node comprises 16 CPU cores, 32 GB RAM, and 2 GPUs

MITHRA: Options Pricing
- Multiple data Independent Tasks on a Heterogeneous Resource Architecture
- Uses the CUDA SDK version of Black-Scholes option pricing
- Hadoop Streaming to distribute and execute the CUDA binary; the CPUs stay idle
- Combiners to aggregate local results

Existing Approaches: Issues
- GPMR
  - Stand-alone library: no fault tolerance, no DFS, no communication handling (mapper → reducer)
  - Not a complete framework
- Tokyo Institute of Technology
  - Single application: K-Means
  - Small problem set (20 GB); does not scale beyond 16 nodes
  - Cost of Hadoop Streaming
- MITHRA
  - Single application: options pricing
  - Scalability evaluated on only 2 nodes

References
- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI '04.
- J. Stuart and J. Owens. Multi-GPU MapReduce on GPU Clusters. IPDPS '11.
- R. Farivar, A. Verma, E. Chan, R. Campbell. MITHRA: Multiple Data Independent Tasks on a Heterogeneous Resource Architecture. CLUSTER '09.
- K. Shirahata, H. Sato, S. Matsuoka. Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters. CloudCom '10.