Data Processing on Large Clusters. By: Stephen Cardina

Introduction
MapReduce is a programming model for processing large data sets in parallel on a cluster. Google described it in a 2004 paper, having built it to reduce the complexity of the data processing behind its search engine. A classic example is taking a set of web pages, finding every word used, and counting how many times each word appears. In 2014 Google announced that it had largely stopped using MapReduce because better alternatives had come along. The purpose of this presentation is to go through some of those alternatives to get an idea of which works best in which situation.
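
To make the word-count idea concrete, here is a minimal sketch in plain Python. It assumes everything runs on one machine instead of a cluster, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration; they are not part of Google's MapReduce or any library.

```python
from collections import defaultdict


def map_phase(page_text):
    """Map step: emit a (word, 1) pair for every word on one page."""
    return [(word.lower(), 1) for word in page_text.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: add up all the counts emitted for one word."""
    return word, sum(counts)


def word_count(pages):
    """Run map over every page, group by word, then reduce each group."""
    pairs = [pair for page in pages for pair in map_phase(page)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical input: three tiny "web pages".
pages = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(pages)
print(counts)
```

On a real cluster the map calls would run on different machines and the grouping step would move data across the network, but the shape of the computation stays the same.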

MapReduce
Pros: Excellent for one-time or simple jobs.
Cons: It has essentially been phased out. Google announced in 2014 that it had largely stopped using it, and Apache Mahout stopped accepting new MapReduce-based algorithms in April 2014. It is limited for machine learning. For work that runs constantly or is very complex, there are better alternatives.

Apache Mahout
An ongoing project of the Apache Software Foundation, a nonprofit organization. The first version was released in February 2012. It moved away from MapReduce, which contributed to MapReduce being phased out. It is developed by volunteers. It is currently being used by Google.

Apache Mahout
Pros: Very good for machine learning, such as product recommendations for a site. New versions arrive at a semi-regular rate; the latest was released in April 2017.
Cons: It is not the best at scaling.

Apache Spark
An open-source cluster computing framework. It was originally developed at the AMPLab at the University of California, Berkeley. It was donated to the Apache Software Foundation in 2013, which has maintained it since. It has become a top-level Apache project and one of the most active, exceeding 1,000 contributors in 2015. It is also developed by volunteers. Used by Amazon and Groupon.

Apache Spark
Pros: New versions arrive at a decent pace; the latest release was in October 2017. It writes as little as possible to disk, which lets it finish tasks faster. It also works well for machine learning. It is generally considered better than Apache Mahout. It is very well known, so it is easier to learn than the options that follow.
Cons: It is not the best at scaling.

H2O
Open-source software for big data analysis. It was first released in 2011 by H2O.ai. It focuses solely on machine learning algorithms instead of providing a whole framework. It can be integrated into Apache Spark. Capital One and eBay currently use H2O.

H2O
Pros: Scales very well. Can handle a lot of data at once.
Cons: It is not the most well known, so odds are you will have to learn it on your own.

XGBoost
An open-source software library. It began as a research project by Tianqi Chen in 2014. It gained widespread attention in the machine learning community through the Higgs Machine Learning Challenge that same year. It was later made able to integrate with Apache Spark and Apache Hadoop.

XGBoost
Pros: One of the best when it comes to scalability. It can handle the most data of the options here.
Cons: It has not been around as long as the others, so it is harder to find someone to learn from and you will likely have to learn it on your own.

How to figure out what's the best
For the purpose of this presentation we won't be comparing MapReduce and Apache Mahout with the other options. They are fine to work with for relatively simple jobs, but they aren't the best for large projects and can't handle as much as the other three options. So we'll compare Apache Spark, H2O and XGBoost based on scalability and accuracy.

The First Test
We will be using random forest for our first test. A random forest trains a set number of decision trees, 500 in this case, each on a random sample of the data points, and combines their predictions into a single result. We will test this with 10,000, 100,000, 1,000,000 and 10,000,000 data points. N equals 1 million for the following charts.
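
For readers who have not met random forests before, here is a rough sketch in plain Python. It is not the benchmark code and not how Spark, H2O or XGBoost implement it. It assumes a toy two-class data set, uses one-split trees ("stumps") instead of full trees, and uses 25 trees instead of 500 so it runs quickly. All names (train_stump, train_forest, predict) are made up for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def train_forest(rows, labels, n_trees, rng):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    forest = []
    n = len(rows)
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw rows with replacement
        feature = rng.randrange(len(rows[0]))          # random feature for this tree
        forest.append(
            train_stump([rows[i] for i in sample], [labels[i] for i in sample], feature)
        )
    return forest


def predict(forest, row):
    """Every tree votes; the most common label wins."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Toy data (an assumption): class 0 clusters low, class 1 clusters high.
rng = random.Random(42)
rows = [[rng.uniform(0.05, 0.35), rng.uniform(0.05, 0.35)] for _ in range(50)]
rows += [[rng.uniform(0.65, 0.95), rng.uniform(0.65, 0.95)] for _ in range(50)]
labels = [0] * 50 + [1] * 50

forest = train_forest(rows, labels, n_trees=25, rng=rng)
accuracy = sum(predict(forest, r) == y for r, y in zip(rows, labels)) / len(rows)
print("training accuracy:", accuracy)
```

The two sources of randomness, sampling rows with replacement and picking a random feature per tree, are what keep the trees from all making the same mistakes.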

The First Test Results
These are the results from the test. Spark crashed before it finished the run with all 10 million data points.

The Second Test
We will be using gradient boosted trees in our second test, and this time we will run it twice. It is a lot like random forest, but the trees are built one after another, each correcting the errors of the ones before it, and the max depth limits how much any single tree can sway the result. Test A uses 1,000 trees with a max depth of 16. Test B uses 300 trees with a max depth of 6.
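
Here is a small sketch of gradient boosting in plain Python, where each new tree is fit to the errors left by the trees before it. It is not the benchmark code. It assumes a toy one-dimensional regression problem, squared-error loss, and a learning rate of 0.1, none of which come from the tests above. The tree counts and depths are scaled far down from Test A and Test B so it runs in a moment, and all function names are made up for illustration.

```python
def fit_tree(xs, residuals, depth, max_depth):
    """Fit a small regression tree; each leaf predicts the mean residual it holds."""
    mean = sum(residuals) / len(residuals)
    if depth == max_depth or len(set(xs)) < 2:
        return mean
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold)
    threshold = best[1]
    left = [(x, r) for x, r in zip(xs, residuals) if x <= threshold]
    right = [(x, r) for x, r in zip(xs, residuals) if x > threshold]
    return (
        threshold,
        fit_tree([x for x, _ in left], [r for _, r in left], depth + 1, max_depth),
        fit_tree([x for x, _ in right], [r for _, r in right], depth + 1, max_depth),
    )


def tree_predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def boost(xs, ys, n_trees, max_depth, learning_rate=0.1):
    """Start from the mean, then add trees one at a time, each fit to the current errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_tree(xs, residuals, 0, max_depth)
        trees.append(tree)
        preds = [p + learning_rate * tree_predict(tree, x) for p, x in zip(preds, xs)]
    return base, learning_rate, trees


def predict(model, x):
    base, learning_rate, trees = model
    return base + learning_rate * sum(tree_predict(tree, x) for tree in trees)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (an assumption): y = x squared on 40 evenly spaced points.
xs = [i / 20 for i in range(40)]
ys = [x * x for x in xs]

for n_trees in (0, 1, 50):
    model = boost(xs, ys, n_trees=n_trees, max_depth=2)
    print(n_trees, "trees -> training MSE", mse(model, xs, ys))
```

The number of trees and the max depth are the same two knobs that Test A and Test B vary: more trees means more rounds of correction, and a larger depth lets each tree make a bigger correction.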

The Second Test Results

In Conclusion
MapReduce and Apache Mahout are only good for small, one-time projects. H2O and XGBoost are considered the two leading options at the moment, so they are the best to work with if you know how. XGBoost is generally the fastest and requires the least RAM of the options compared. If you don't feel confident about figuring them out yourself, it's best to use Apache Spark, as it's better known and thus easier to learn.