Data Processing on Large Clusters By: Stephen Cardina
Introduction MapReduce is used on clusters to extract the specific data you are looking for. It was created by Google back in 2004 to reduce the complexity of processing data for its search engine; the canonical example is taking all of the words used on a set of web pages and counting how many times each word appears. In 2014 Google stopped using MapReduce as better alternatives came along. The purpose of this presentation is to go through some of those alternatives to get an idea of what works best in a given situation.
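The word-counting example above is the classic illustration of the two MapReduce phases. A toy, single-machine sketch in Python (the real system distributes these phases across a cluster):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word on every page.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pages = ["the cat sat", "the cat ran"]
print(reduce_phase(map_phase(pages)))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

On a cluster, the framework shuffles all pairs with the same word to the same reducer, which is what lets the counting scale across machines.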
MapReduce Pros: Excellent for one-time or simple jobs. Cons: It has essentially been phased out since April 2014, when Apache Mahout dropped its MapReduce support, and its own support has been winding down since. It's limited for machine learning. When a job runs constantly or is very complex, there are better alternatives.
Apache Mahout Is an ongoing project of the Apache Software Foundation, a nonprofit organization. An early version was released in February 2012. Its move away from MapReduce led to MapReduce being phased out. It is developed by volunteers. Is currently being used by Google.
Apache Mahout Pros: Very good for machine learning, such as product recommendations on a site. Gets new versions at a semi-regular rate; the last release was in April 2017. Cons: Doesn't scale the best.
Apache Spark Is an open-source cluster computing framework. It was originally developed at the University of California, Berkeley's AMPLab. It was donated to the Apache Software Foundation in 2013, which has maintained it since. It has become one of Apache's top-level projects and one of its most active, exceeding 1,000 contributors in 2015. Also developed by volunteers. Used by Amazon and Groupon.
Apache Spark Pros: Gets new versions at a decent pace; the last release was in October 2017. Writes as little as possible to disk, which lets it finish tasks faster. Also works well for machine learning. Generally considered better than Apache Mahout. Very well known, so it is one of the easier options to learn compared to the later ones. Cons: Doesn't scale the best.
H2O Is open-source software for big data analysis. It was first released in 2011 by H2O.ai. It focuses solely on machine learning algorithms rather than providing a whole framework. Can be integrated into Apache Spark. Capital One and eBay currently use H2O.
H2O Pros: Scales very well. Can handle a lot of data at once. Cons: Not the most well known, so odds are you will have to learn it on your own.
XGBoost Is an open-source software library. It began as a research project by Tianqi Chen in 2014. It won the 2014 Higgs Boson Machine Learning Challenge and gained widespread attention in the machine learning community. It was later made able to integrate with Apache Spark and Apache Hadoop.
XGBoost Pros: One of the best when it comes to scalability. Can handle the most data of the options here. Cons: Hasn't been around as long as the others, making it harder to learn from someone else, so you have to learn it on your own.
How to figure out what's best For the purposes of this presentation we won't be comparing MapReduce and Apache Mahout with the other options. The reason is that they aren't suited to large projects; they are fine for relatively simple work, but they can't handle as much as the other three. So we'll compare Apache Spark, H2O, and XGBoost on scalability and accuracy.
The First Test We will use random forest for our first test. A random forest takes a set number of trees, 500 in this case, gives each tree a sample of the data points, and produces a fitted curve based on what it received. We will test with 10,000; 100,000; 1,000,000; and 10,000,000 data points. N will equal 1 million for the following charts.
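A minimal sketch of this benchmark, using scikit-learn's random forest as a stand-in (Spark, H2O, and XGBoost each expose their own random forest APIs, but the shape of the experiment is the same); point counts are scaled down here, and the synthetic sine-curve data is an assumption:

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def time_random_forest(n_points, n_trees=500, seed=0):
    """Fit a random forest of n_trees trees to n_points noisy samples
    of a curve, and return the training time in seconds."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3.0, 3.0, size=(n_points, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=n_points)
    model = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1)
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

# Scaled-down sizes; the test itself runs up to 10,000,000 points.
for n in (1_000, 10_000):
    print(f"{n:>6} points: {time_random_forest(n):.2f}s")
```

Plotting training time against the number of points is what reveals how each framework scales, and whether it survives the largest size at all.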
The First Test Results These are the results from the test; Spark crashed before finishing all 10 million points.
The Second Test We will use Gradient Boosted Trees in our second test, run twice with different settings. It's a lot like random forest, but this time no single tree is allowed to sway the curve as much, as represented here by the maximum depth. Test A will use 1,000 trees and a max depth of 16. Test B will use 300 trees and a max depth of 6.
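The two configurations can be sketched as follows, again with scikit-learn's gradient boosting as a stand-in for the benchmarked frameworks; the synthetic data and sample size are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_gbt(n_trees, max_depth, X, y):
    """Fit gradient boosted trees with the given tree count and depth cap."""
    model = GradientBoostingRegressor(n_estimators=n_trees, max_depth=max_depth)
    model.fit(X, y)
    return model

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(1_000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=1_000)

model_a = fit_gbt(1_000, 16, X, y)  # Test A: many deep trees
model_b = fit_gbt(300, 6, X, y)     # Test B: fewer, shallower trees
print(round(model_a.score(X, y), 3), round(model_b.score(X, y), 3))
```

Unlike a random forest, the trees here are built sequentially, each one correcting the errors of the ones before it, which is why tree count and depth dominate the runtime.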
The Second Test Results (charts not reproduced)
In Conclusion MapReduce and Apache Mahout are only good for small, one-time projects. H2O and XGBoost are considered the two leading options at the moment, so they are the best to work with if you know how. XGBoost is generally the fastest and requires the least RAM of the options here. If you don't feel confident figuring them out yourself, it's best to use Apache Spark, as it's better known and thus easier to learn.