PARALLEL ID3
Jeremy Dominijanni
CSE633, Dr. Russ Miller

ID3 and the Sequential Case

ID3
- Decision tree classifier that works on k-ary categorical data
- The goal of ID3 is to maximize information gain at each split, which is the same as maximizing the reduction in entropy from before to after splitting on a feature
- Produces a tree which represents each class as a disjunction of conjunctions (of features); one can think of it as producing multiple DNF expressions, one for each class
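
For concreteness, here is a minimal C++11 sketch of the split criterion described above; the function and variable names are illustrative, not taken from the original implementation:

```cpp
// Minimal sketch of the ID3 split criterion for integer-coded
// categorical features and labels (illustrative, not the original code).
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Shannon entropy (base 2) of a multiset of class labels.
double entropy(const std::vector<int>& labels) {
    std::map<int, std::size_t> counts;
    for (int y : labels) ++counts[y];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / labels.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting on one categorical feature:
// entropy before the split minus the size-weighted entropy after it.
double information_gain(const std::vector<int>& feature_values,
                        const std::vector<int>& labels) {
    std::map<int, std::vector<int> > branches;
    for (std::size_t i = 0; i < labels.size(); ++i)
        branches[feature_values[i]].push_back(labels[i]);
    double after = 0.0;
    for (const auto& kv : branches) {
        double weight = static_cast<double>(kv.second.size()) / labels.size();
        after += weight * entropy(kv.second);
    }
    return entropy(labels) - after;
}
```

ID3 greedily picks the feature with the largest `information_gain` at each node and recurses on each branch.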

Sequential Performance Analysis (Worst-Case)
- Given $D$-dimensional data with each dimension taking $k$ categories
- Number of nodes: $N = \frac{k^{D+1} - 1}{k - 1}$
- Number of observations: $M = k^D$
- Since $M$ and $N$ are both of the same order, we can say performance is $O(NM) = O(N^2)$
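
The node count follows from summing the levels of a complete $k$-ary tree with one level per feature:

```latex
N \;=\; \sum_{d=0}^{D} k^{d} \;=\; \frac{k^{D+1} - 1}{k - 1},
\qquad
M \;=\; k^{D}
\;\;\Longrightarrow\;\;
N = \Theta(M), \quad O(NM) = O(N^{2}).
```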

Parallelizing ID3

Parallelization of ID3
- Divide and conquer is the obvious approach, but it has a major flaw: if the data is unbalanced, a large amount of CPU time is wasted
- Solution: add a master processor and send out individual tasks to free processors whenever it is possible to do so
- Overhead is larger than divide and conquer, since one processor is dedicated to delegation, but tree imbalance no longer affects speedup
- Implemented using Intel MPI and C++11

[Figure: example trees contrasting the Divide & Conquer approach (four steps) with the Master & Workers approach (three steps)]

Master-Worker ID3 Algorithm

Master:
- Compute the topology of the perfect binary tree (maximum-size scenario)
- For each uncomputed node: if it has no unresolved dependencies and there is a free worker, send its feature restrictions to that worker; otherwise stop scanning
- Wait for a response from a worker; on response, add the new information to the model, remove the node from the working list, and, if it is a leaf, remove its descendants from the uncomputed list
- Repeat until completion

Worker:
- Wait for a feature vector from the master
- Remove invalid observations from the list of all observations
- Compute the correct ID3 decision
- Send the decision to the master
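
A hedged sketch of this dispatch loop over MPI follows. Payloads are simplified to a single int, the tag names are hypothetical, and the model-update and ID3 steps are elided as comments; this is not the original implementation:

```cpp
#include <mpi.h>
#include <queue>

const int TAG_TASK = 1, TAG_RESULT = 2, TAG_STOP = 3;

void master(int world_size) {
    std::queue<int> ready;         // node ids with all dependencies resolved
    std::queue<int> free_workers;  // ranks currently idle
    for (int r = 1; r < world_size; ++r) free_workers.push(r);
    int outstanding = 0;
    // ... populate `ready` from the precomputed tree topology ...
    while (!ready.empty() || outstanding > 0) {
        // Dispatch while both a ready node and a free worker exist.
        while (!ready.empty() && !free_workers.empty()) {
            int node = ready.front(); ready.pop();
            int rank = free_workers.front(); free_workers.pop();
            MPI_Send(&node, 1, MPI_INT, rank, TAG_TASK, MPI_COMM_WORLD);
            ++outstanding;
        }
        // Block for one result, then update the model and ready queue.
        int decision;
        MPI_Status st;
        MPI_Recv(&decision, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                 MPI_COMM_WORLD, &st);
        --outstanding;
        free_workers.push(st.MPI_SOURCE);
        // ... add `decision` to the model; if it is a leaf, prune its
        //     descendants, else push newly unblocked nodes onto `ready` ...
    }
    for (int r = 1; r < world_size; ++r)
        MPI_Send(NULL, 0, MPI_INT, r, TAG_STOP, MPI_COMM_WORLD);
}

void worker() {
    while (true) {
        int node;
        MPI_Status st;
        MPI_Recv(&node, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP) break;
        int decision = 0;
        // ... filter observations by the node's feature restrictions
        //     and compute the ID3 split decision ...
        MPI_Send(&decision, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
}
```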

Worst-Case Performance Analysis of Master-Worker ID3
- Pre-processing work done by master: $O(N \log_k N)$
- Time to send tasks to workers: $O(\log P) \to O(1)$
- Time to process data from workers: $O(\log_k^2 N)$
- Reduce the set of observations: $O(DM) \to O(N)$
- Maximum time to compute an ID3 decision: $O(DM) \to O(N)$
- Overall theoretical time complexity: $O\!\left(N \log_k N + \frac{N^2}{P}\right)$
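
Combining the bullets above: the master pays the pre-processing cost once, and the roughly $N$ node tasks of $O(N)$ work each are spread across $P$ workers, which yields the stated bound:

```latex
T(N, P) \;=\; \underbrace{O(N \log_k N)}_{\text{master pre-processing}}
\;+\; \underbrace{N \cdot \frac{O(N)}{P}}_{N \text{ node tasks over } P \text{ workers}}
\;=\; O\!\left(N \log_k N + \frac{N^{2}}{P}\right)
```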

Synthetic Benchmarks
- Tested on UB CCR systems with Intel Xeon E5645 processors
- Tested using 1, 2, 4, 8, and 11 workers plus one master
- 15 binary datasets with $2^2$ through $2^{16}$ observations
- Full datasets give each observation its own class in order to force a perfect binary tree (the worst-case scenario)
- Fractional datasets have the value of one feature determine the class (if the first feature is 0, the class is 0; otherwise each value has a unique class)
- Reported times exclude I/O and the initial transfer of data to all processors; sequential overhead and all other MPI operations are included
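
A hypothetical generator for the "full" worst-case dataset is sketched below: all $2^D$ binary feature vectors, each labeled with its own class so that ID3 must grow a perfect binary tree. This is an illustration of the construction, not the original benchmark code:

```cpp
#include <vector>

struct Observation {
    std::vector<int> features; // D binary features
    int label;                 // unique per observation => worst case
};

std::vector<Observation> full_dataset(int D) {
    std::vector<Observation> data;
    const long M = 1L << D; // 2^D observations
    for (long i = 0; i < M; ++i) {
        Observation obs;
        obs.features.resize(D);
        for (int d = 0; d < D; ++d)
            obs.features[d] = (i >> d) & 1; // d-th bit of i
        obs.label = static_cast<int>(i);    // every observation its own class
        data.push_back(obs);
    }
    return data;
}
```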

Benchmark Results

Sequential Performance
- Predicted performance of $O(N^2)$
- Performance measured using two processors (one master and one worker)
- Fit curve using scipy.optimize.curve_fit
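
The slide does not give the fitted functional form; presumably it is the quadratic suggested by the worst-case analysis, something like:

```latex
t(N) \;\approx\; a N^{2} + b
```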

Parallel Performance on Different Tree Constructions
- The master-worker architecture should have runtime proportional to the number of nodes
- Performance curves show that, at every measured point, runtime on the full tree is twice that on the half tree
- On smaller test sets the effect is less pronounced due to the larger impact of sequential initialization

Parallel Speedup
- Optimal speedup would be equal to the number of worker processors
- $O(1)$ overhead when sending a task and $O(\log_k^2 N)$ when receiving a result
- Inevitable queueing delays when multiple processors try to return results simultaneously
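
Since one rank only coordinates, the relevant ideal is speedup equal to the worker count $W$ rather than the total rank count; stated minimally:

```latex
S(W) \;=\; \frac{T_{1}}{T_{W}} \;\le\; W,
\qquad P = W + 1 \text{ MPI ranks in total (one master, } W \text{ workers)}
```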

Overhead Overtaking
- Even in cases where there are more tasks than processes, overhead can mean that fewer processors yield higher performance

Conclusions

Expectations Met and Unmet
- Sequential performance was as predicted
- Parallel speedup takes a large hit from overhead, in part from having a single coordinator and frequent communication needs
- The improvement over simple divide and conquer on unbalanced datasets is linear in the reduction in nodes, as expected

Future Improvements
- Reduce the $O(\log_k^2 N)$ receiving overhead if at all possible
- Cluster multiple tree nodes into a single processor task, instead of assigning one node at a time, to reduce communication overhead and duplicate processing
- Dynamically create the tree topology to reduce the $O(N \log_k N)$ initialization overhead on the master processor