"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute erickj4@rpi.edu @olyerickson
Director of Operations, The Rensselaer IDEA Deputy Director, Rensselaer Web Science Research Center at the Tetherless World Constellation, RPI
Bridgewater, NH (12 Sep 2016)
Today...
1. What is "Big Data"?
2. Why is Big Data such a Big Deal?
3. How do we meet the challenges of Big Data?
4. What tools do we use to work with Big Data?
5. What is different about (really) Big Data Analytics?
6. What can you do to get into Big Data "today"?
7. How to get a job in Big Data?
8. To learn more...
What is Big Data?
Typically you hear about size... and the data really IS big! But some say it's more about a way of thinking about the data.
Usually we talk about the "Four V's":
- Volume: handling the scale of the data
- Velocity: analyzing streaming data
- Variety: managing data in wildly different forms... and formats
- Veracity: uncertainty of the data (some people leave this out!)
"The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value... things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value." [1]
[1] Viktor Mayer-Schönberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013)
Why is Big Data such a Big Deal? It breaks everything...
- Volume: new storage architectures (physical and virtual)
- Velocity: new computational models (parallel, highly distributed)
- Variety: new approaches to extracting information, meaning, and knowledge from any "data"
- Veracity: modeling to handle validity, errors, and missingness at a Very Large Scale
[Images: data centers at NSA (UT), Google (GA), Apple (NC)]
How do we meet the challenges of Big Data?
- Massively parallel hardware (>100K CPUs)
- Highly distributed computational models (esp. MapReduce -> Hadoop)
- Highly distributed file systems (esp. the Hadoop Distributed File System)
- New database models (e.g. NoSQL -> "Not only SQL")
[Images: Google in 1996, 1998, and 2016]
How it all began...
- 2003, 2004: Google publishes key papers (the Google File System and MapReduce)
- 2006: Hadoop emerges as an open source project under Apache
- 2007: Yahoo runs a 1000-machine cluster with Hadoop
- 2011: Yahoo reaches 42K Hadoop nodes, 100K CPUs, and hundreds of petabytes of storage
http://hadoop.apache.org/
What tools do we use to work with Big Data?
- Infrastructure...
- Analytics...
- Applications...
- Commercial vs. open source
What is different about (really) Big Data Analytics?
- Usually, we're trying to do conventional analytics: small problems writ huge (e.g. indexing TBs of data very quickly)
- Distributed computational model: MapReduce/Hadoop
- (Highly) distributed file system: Hadoop Distributed File System (HDFS)
Quick-and-dirty: MapReduce
- Split a problem into many sub-problems
- Assign the sub-problems to many agents ("Map")
- Collect the results ("Reduce")
Jonas Widriksson, "Raspberry PI Hadoop Cluster" (Oct 2014)
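The three steps above can be sketched in a few lines of single-process Python, using the classic word-count example. This is only a toy to show the shape of the model; real MapReduce runs the map and reduce tasks in parallel across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (word, 1) pair for each word in each document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group the intermediate values by key (word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each group, here by summing the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

The key point: because each map call and each reduce call is independent, the framework can scatter them across thousands of nodes and merge the results.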
Quick-and-dirty: Hadoop Distributed File System (HDFS)
- Designed to run on low-cost hardware; highly fault tolerant
- Files are split into blocks that are replicated to DataNodes
- By default, blocks are 64 MB in size and replicated to 3 nodes in the cluster
Jonas Widriksson, "Raspberry PI Hadoop Cluster" (Oct 2014)
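A toy sketch of the block-and-replica idea, with a made-up list of DataNode names and a tiny block size standing in for the 64 MB default. Real HDFS placement is rack-aware and far more sophisticated; this only illustrates "split, then replicate to 3 distinct nodes":

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks (HDFS default: 64 MB)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, replication: int = 3):
    """Assign each block to `replication` distinct DataNodes, round-robin.
    (Toy policy only; HDFS considers racks, load, and locality.)"""
    n = len(datanodes)
    return {
        b: [datanodes[(b + r) % n] for r in range(replication)]
        for b in range(num_blocks)
    }

# Toy example: a 5-byte "block size" stands in for 64 MB
blocks = split_into_blocks(b"hello world, hadoop!", 5)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes)
print(len(blocks))  # 4 blocks
print(placement)    # each block lives on 3 distinct nodes
```

With every block on 3 nodes, losing any single machine loses no data, which is why HDFS tolerates cheap, failure-prone hardware.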
What can you do to get into Big Data "today"?
- Everything you really need to manage and analyze Big Data is open source
- R and Python (with NumPy and SciPy) are top data science languages and fantastic entry points to data analytics
- Real Big Data requires hands-on experience with Hadoop and HDFS...
- "Big Data Analytics" is more than "R on steroids"... so find some machines and start playing!
RStudio | RHadoop: Integrating R and Hadoop!
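As a first taste of the analytics entry point, even the Python standard library can compute basic descriptive statistics; NumPy and SciPy (or R) do the same operations, and far more, at array scale. A minimal sketch with made-up sample values:

```python
import statistics

# A tiny made-up sample; real workflows apply the same operations
# to millions of values with NumPy/SciPy or R.
values = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7]

print("mean:  ", statistics.mean(values))
print("median:", statistics.median(values))
print("stdev: ", statistics.stdev(values))
```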
Kaggle Competitions: Challenge yourself
How to get a job in Big Data?
- Data Warehousing: familiarity with large data stores and new database models
  - Relational DBs: MySQL, SQL Server, et al., ad infinitum...
  - NoSQL: HDFS, HBase, CouchDB, MongoDB, ...
- Data Analysis: machine learning, statistical analysis (R, MATLAB, Python); MapReduce -> Hadoop; data visualization
- Data Transformation: ETL[1], scripting (Linux shells, Python, etc.)
- Data Collection: extracting data from existing databases via Web APIs, et al.; crawling and scraping the Web
[1] Extract, Transform, Load
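The Data Transformation row can be made concrete with a toy ETL pipeline in pure Python. The CSV source, field names, and SQLite target table here are all invented for the example; real pipelines extract from live databases or APIs and load into a warehouse.

```python
import csv
import io
import sqlite3

# Extract: read raw records (an in-memory CSV stands in for a real source)
raw = """name,visits
alice,10
bob,
carol,7
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: convert types and drop records with missing values
clean = [(r["name"], int(r["visits"])) for r in rows if r["visits"]]

# Load: insert into a target store (an in-memory SQLite table here)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (name TEXT, count INTEGER)")
db.executemany("INSERT INTO visits VALUES (?, ?)", clean)
total = db.execute("SELECT SUM(count) FROM visits").fetchone()[0]
print(total)  # 17
```

Every ETL job, however large, has this same extract/transform/load shape; the tooling (shell scripts, Python, dedicated ETL platforms) just changes with scale.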
To learn more...
- Follow the links in these slides...
- Download, install, and learn R and RStudio
- Play with the Hadoop framework
- Read Cringely's series, Thinking About Big Data (Parts 1-3)
- Listen to this week's TED Radio Hour, Big Data Revolution