CS 378 Big Data Programming

Size: px

Start display at page:

Download "CS 378 Big Data Programming"

Alannah Jordan
5 years ago
Views:

1 CS 378 Big Data Programming Fall 2015 Lecture 1 Introduc?on

2 Class Logis?cs Class meets MW, 9:30 AM 11:00 AM Office Hours GDC MTh 11:00 12:00 AM By appointment Web page: cs.utexas.edu/~dfranke/courses/2015fall/cs378- BDP.htm TA: Vivek Pradhan Office hours: TBD

3 Course Content Programming in Hadoop (map- reduce) and Spark Use Elas?cMapReduce (EMR) on Amazon Web Services (AWS) You can us a local install of Hadoop if you want Local install of Spark I will inves?gate the cloud based services from DataBricks and AWS

4 Textbooks MapReduce Design Paderns Main content for Hadoop assignments Hadoop The Defini?ve Guide 4 RD Edi?on Recommended for your understanding, not required Learning Spark Main content for Spark assignments All textbooks are available as ebooks from O Reilly

5 Lectures PDF of lecture notes accessible via syllabus For your note taking, review, or whatever These notes are my outline for each class

6 Assignments Assignments will be programming assignments All work can be done using Java Scala, Python are op?ons for Spark IDE for developing code recommended Eclipse, IntelliJ IDE (community edi?on) are free Use maven to build uber JAR to upload to the cloud I ll provide the pom.xml file used by maven

7 Assignments I ll review a solu?on in class on the due date Work submided aker the start of class considered late 25% penalty for late submission Can be submided un?l the next assignment is due Aker that deadline, no credit is given Will consider these in determining final grade I encourage you to keep pace with the assignments Most assignments will build on previous work

8 Ques?ons?

9 Managing Big Data What can we do when the data gets big? Too big for the CPU memory of any single machine Larger than the disk storage of a single machine Recent data point: Facebook has ~800 petabyte data cluster (Hadoop) 1 petabyte = bytes Big data is spread across a network of machines

10 Managing Big Data Need to bring distributed storage and distributed processing to bear to handle big data Issues: Distribu?ng computa?on across many machines Maximizing performance Minimize I/O to disk, minimize transfers across the network Combining the results of distributed computa?on Recovering from failures

11 Managing Big Data We ll look at two popular tools/systems One well established Hadoop One up and coming Spark Basic concepts of each How they address the aforemen?oned issues How to solve various problems with these systems

12 Managing Big Data When wri?ng a program with these tools You don t know the size of the data You don t know the extent of the parallelism Both try to collocate the computa?on with the data Parallelize the I/O Make the I/O local (versus across network) Data is oken unstructured (vs. rela?onal model)

13 Big Data vs. Rela?onal RDBMS normaliza?on Goal is to remove redundancy and retain/insure integrity Big data apps want reads to be local Send the code to the data, as it much smaller (Jim Gray) Normaliza?on makes read non- local Processing examines one input record at a?me Minimal state in programs it s in the data

14 Big Data Tools This all sounds great. What are the issues? Coordina?ng the distributed computa?on Handling par?al failures Combining the results of distributed computa?on Tools offer a programming model that abstracts Disk read and write Paralleliza?on (computa?on and I/O) Combining data (keys and values)

15 MapReduce Design Paderns Summariza?on Filtering Data Organiza?on Par??oning/binning, sor?ng, shuffle Joins Merging data sets Meta- paderns Op?mizing map- reduce chains (data pipelines)

16 Resources for Hadoop Hadoop: The Defini/ve Guide, 3rd Edi/on, by Tom White O Reilly Media Print ISBN: ISBN 10: Ebook ISBN: ISBN 10: MapReduce Design Pa=erns, by Donald Miner and Adam Shook O Reilly Media Print ISBN: ISBN 10: Ebook ISBN: ISBN 10: hdp://hadoop.apache.org/ Several vendors provide Hadoop distribu?ons Amazon Web Services Elas?cMapReduce

17 Resources for Spark Learning Spark, by Holden Karau, Andy Konwinsky, Patrick Wendell, Matei Zaharia O Reilly Media Print ISBN: ISBN 10: Ebook ISBN: ISBN 10: hdp://spark.apache.org/ Can download a version that runs on your local machine Cloud services Spark on AWS DataBricks offers a cloud service Others will join the party

CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

CSC 261/461 Database Systems Lecture 24 Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 Announcements Term Paper due on April 20 April 23 Project 1 Milestone 4 is out Due on 05/03 But I would