CS 378 Big Data Programming
|
|
- Alannah Jordan
- 5 years ago
- Views:
Transcription
1 CS 378 Big Data Programming Fall 2015 Lecture 1 Introduc?on
2 Class Logis?cs Class meets MW, 9:30 AM 11:00 AM Office Hours GDC MTh 11:00 12:00 AM By appointment Web page: cs.utexas.edu/~dfranke/courses/2015fall/cs378- BDP.htm TA: Vivek Pradhan Office hours: TBD
3 Course Content Programming in Hadoop (map- reduce) and Spark Use Elas?cMapReduce (EMR) on Amazon Web Services (AWS) You can us a local install of Hadoop if you want Local install of Spark I will inves?gate the cloud based services from DataBricks and AWS
4 Textbooks MapReduce Design Paderns Main content for Hadoop assignments Hadoop The Defini?ve Guide 4 RD Edi?on Recommended for your understanding, not required Learning Spark Main content for Spark assignments All textbooks are available as ebooks from O Reilly
5 Lectures PDF of lecture notes accessible via syllabus For your note taking, review, or whatever These notes are my outline for each class
6 Assignments Assignments will be programming assignments All work can be done using Java Scala, Python are op?ons for Spark IDE for developing code recommended Eclipse, IntelliJ IDE (community edi?on) are free Use maven to build uber JAR to upload to the cloud I ll provide the pom.xml file used by maven
7 Assignments I ll review a solu?on in class on the due date Work submided aker the start of class considered late 25% penalty for late submission Can be submided un?l the next assignment is due Aker that deadline, no credit is given Will consider these in determining final grade I encourage you to keep pace with the assignments Most assignments will build on previous work
8 Ques?ons?
9 Managing Big Data What can we do when the data gets big? Too big for the CPU memory of any single machine Larger than the disk storage of a single machine Recent data point: Facebook has ~800 petabyte data cluster (Hadoop) 1 petabyte = bytes Big data is spread across a network of machines
10 Managing Big Data Need to bring distributed storage and distributed processing to bear to handle big data Issues: Distribu?ng computa?on across many machines Maximizing performance Minimize I/O to disk, minimize transfers across the network Combining the results of distributed computa?on Recovering from failures
11 Managing Big Data We ll look at two popular tools/systems One well established Hadoop One up and coming Spark Basic concepts of each How they address the aforemen?oned issues How to solve various problems with these systems
12 Managing Big Data When wri?ng a program with these tools You don t know the size of the data You don t know the extent of the parallelism Both try to collocate the computa?on with the data Parallelize the I/O Make the I/O local (versus across network) Data is oken unstructured (vs. rela?onal model)
13 Big Data vs. Rela?onal RDBMS normaliza?on Goal is to remove redundancy and retain/insure integrity Big data apps want reads to be local Send the code to the data, as it much smaller (Jim Gray) Normaliza?on makes read non- local Processing examines one input record at a?me Minimal state in programs it s in the data
14 Big Data Tools This all sounds great. What are the issues? Coordina?ng the distributed computa?on Handling par?al failures Combining the results of distributed computa?on Tools offer a programming model that abstracts Disk read and write Paralleliza?on (computa?on and I/O) Combining data (keys and values)
15 MapReduce Design Paderns Summariza?on Filtering Data Organiza?on Par??oning/binning, sor?ng, shuffle Joins Merging data sets Meta- paderns Op?mizing map- reduce chains (data pipelines)
16 Resources for Hadoop Hadoop: The Defini/ve Guide, 3rd Edi/on, by Tom White O Reilly Media Print ISBN: ISBN 10: Ebook ISBN: ISBN 10: MapReduce Design Pa=erns, by Donald Miner and Adam Shook O Reilly Media Print ISBN: ISBN 10: Ebook ISBN: ISBN 10: hdp://hadoop.apache.org/ Several vendors provide Hadoop distribu?ons Amazon Web Services Elas?cMapReduce
17 Resources for Spark Learning Spark, by Holden Karau, Andy Konwinsky, Patrick Wendell, Matei Zaharia O Reilly Media Print ISBN: ISBN 10: Ebook ISBN: ISBN 10: hdp://spark.apache.org/ Can download a version that runs on your local machine Cloud services Spark on AWS DataBricks offers a cloud service Others will join the party
CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101
CSC 261/461 Database Systems Lecture 24 Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 Announcements Term Paper due on April 20 April 23 Project 1 Milestone 4 is out Due on 05/03 But I would
More informationCS 378 Big Data Programming
CS 378 Big Data Programming Lecture 5 Summariza9on Pa:erns CS 378 Fall 2017 Big Data Programming 1 Review Assignment 2 Ques9ons? mrunit How do you test map() or reduce() calls that produce mul9ple outputs?
More informationAnalytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation
Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable
More informationCMSC433 - Programming Language Technologies and Paradigms. Introduction
CMSC433 - Programming Language Technologies and Paradigms Introduction Course Goal To help you become a better programmer Introduce advanced programming technologies Deconstruct relevant programming problems
More informationCOSC-589 Web Search and Sense-making Information Retrieval In the Big Data Era. Spring Instructor: Grace Hui Yang
COSC-589 Web Search and Sense-making Information Retrieval In the Big Data Era Spring 2016 Instructor: Grace Hui Yang The Web provides abundant information which allows us to live more conveniently and
More informationDr. Chuck Cartledge. 4 Feb. 2015
CS-495/595 Hadoop (part 1) Lecture #3 Dr. Chuck Cartledge 4 Feb. 2015 1/23 Table of contents I 1 Miscellanea 2 Assignment 3 The Book 4 Chapter 1 5 Chapter 2 7 Break 8 Assignment #2 9 Conclusion 10 References
More informationCSC 261/461 Database Systems. Fall 2017 MW 12:30 pm 1:45 pm CSB 601
CSC 261/461 Database Systems Fall 2017 MW 12:30 pm 1:45 pm CSB 601 Agenda Administrative aspects Brief overview of the course Introduction to databases and SQL ADMINISTRATIVE ASPECTS Teaching Staff Instructor:
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationCS 378 Big Data Programming
CS 378 Big Data Programming Lecture 11 more on Data Organiza:on Pa;erns CS 378 - Fall 2016 Big Data Programming 1 Assignment 5 - Review Define an Avro object for user session One user session for each
More informationDr. Chuck Cartledge. 18 Feb. 2015
CS-495/595 Pig Lecture #6 Dr. Chuck Cartledge 18 Feb. 2015 1/18 Table of contents I 1 Miscellanea 2 The Book 3 Chapter 11 4 Conclusion 5 References 2/18 Corrections and additions since last lecture. Completed
More informationArticle from. CompAct. April 2018 Issue 57
rticle from Compct pril 2018 Issue 57 Introduction to Distributed Computing By Jason ltieri he rise of big data has required changes to the way that data is processed. Distributed computing is one approach
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationCSC 261/461 Database Systems Lecture 26. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101
CSC 261/461 Database Systems Lecture 26 Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 Announcements Poster Presentation on May 03 (During our usual lecture time) Mandatory for all Graduate
More informationOffice Hours Time: Monday/Wednesday 12:45PM-1:45PM (SB237D) More Information: Office Hours Time: Monday/Thursday 3PM-4PM (SB002)
Professor: Ioan Raicu Office Hours Time: Monday/Wednesday 12:45PM-1:45PM (SB237D) More Information: http://www.cs.iit.edu/~iraicu/ http://datasys.cs.iit.edu/ TAs (cs553-f14@datasys.cs.iit.edu): Ke Wang
More informationDealing with Data Especially Big Data
Dealing with Data Especially Big Data INFO-GB-2346.01 Fall 2017 Professor Norman White nwhite@stern.nyu.edu normwhite@twitter Teaching Assistant: Frenil Sanghavi fps241@stern.nyu.edu Administrative Assistant:
More informationHadoop, Yarn and Beyond
Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets
More informationCS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey If an RDD is read once, transformed will Spark
More informationCS / Cloud Computing. Recitation 11 November 5 th and Nov 8 th, 2013
CS15-319 / 15-619 Cloud Computing Recitation 11 November 5 th and Nov 8 th, 2013 Announcements Encounter a general bug: Post on Piazza Encounter a grading bug: Post Privately on Piazza Don t ask if my
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationCS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014
CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions
More informationJosh Bloch Charlie Garrod Darya Melicher
Principles of So3ware Construc9on: Objects, Design, and Concurrency Part 1: Introduc9on Course overview and introduc9on to so3ware design Josh Bloch Charlie Garrod Darya Melicher 1 So3ware is everywhere
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationCS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Return type for collect()? Can
More informationMap Reduce & Hadoop Recommended Text:
Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange
More information2/4/2019 Week 3- A Sangmi Lee Pallickara
Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1
More informationCentral Washington University Department of Computer Science Course Syllabus
Central Washington University Department of Computer Science Course Syllabus CS 110: Programming Fundamentals I December 27, 2015 1 Course Information Course Information Lecture: Mo,Tu,We: 10:00AM - 10:50AM,
More informationChapter 4: Apache Spark
Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,
More informationCS 378 Big Data Programming. Lecture 4 Summariza9on Pa:erns
CS 378 Big Data Programming Lecture 4 Summariza9on Pa:erns Review Assignment 1 Ques9ons? Using maven Using AWS Simple Debugging Counters controller syslog Custom counters context.getcounter(group, counter).increment(1l);
More informationCS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Frequently asked questions from the previous class survey 48-bit bookending in Bzip2: does the number have to be special? Spark seems to have too many
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More information/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016
15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationThe Evolution of a Data Project
The Evolution of a Data Project The Evolution of a Data Project Python script The Evolution of a Data Project Python script SQL on live DB The Evolution of a Data Project Python script SQL on live DB SQL
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationLarge- Scale Sor,ng: Breaking World Records. Mike Conley CSE 124 Guest Lecture 12 March 2015
Large- Scale Sor,ng: Breaking World Records Mike Conley CSE 124 Guest Lecture 12 March 2015 Sor,ng Given an array of items, put them in order 5 2 8 0 2 5 4 9 0 1 0 0 0 0 0 0 1 2 2 4 5 5 8 9 Many algorithms
More informationBig Data Analytics with Apache Spark. Nastaran Fatemi
Big Data Analytics with Apache Spark Nastaran Fatemi Apache Spark Throughout this part of the course we will use the Apache Spark framework for distributed data-parallel programming. Spark implements a
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationDatabricks, an Introduction
Databricks, an Introduction Chuck Connell, Insight Digital Innovation Insight Presentation Speaker Bio Senior Data Architect at Insight Digital Innovation Focus on Azure big data services HDInsight/Hadoop,
More informationBig Data Analytics. C. Distributed Computing Environments / C.2. Resilient Distributed Datasets: Apache Spark. Lars Schmidt-Thieme
Big Data Analytics C. Distributed Computing Environments / C.2. Resilient Distributed Datasets: Apache Spark Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationHBase Design Patterns By Sujee Maniyam, Mark Kerzner READ ONLINE
HBase Design Patterns By Sujee Maniyam, Mark Kerzner READ ONLINE with HBase In Detail With the increasing use of NoSQL in general and HBase in particular Speaker: Francis Liu (Yahoo!) HBase's introduction
More informationDatabase Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu
Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationMicrosoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud
Microsoft Azure Databricks for data engineering Building production data pipelines with Apache Spark in the cloud Azure Databricks As companies continue to set their sights on making data-driven decisions
More informationBig Data Analytics Overview with Hadoop and Spark
Big Data Analytics Overview with Hadoop and Spark Harrison Carranza, MSIS Marist College, USA, Harrison.Carranza2@marist.edu Mentor: Aparicio Carranza, PhD New York City College of Technology - CUNY, USA,
More informationCloud, Big Data & Linear Algebra
Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes
More informationThe Evolution of Big Data Platforms and Data Science
IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering
More informationh7ps://bit.ly/citustutorial
Before We Start Setup a Citus Cloud account for the exercises: h7ps://bit.ly/citustutorial Designing a Mul
More informationKey aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling
Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment
More informationSubmitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay
Submitted to: Dr. Sunnie Chung Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunny Chung Presented by: Sonal Deshmukh Jay Upadhyay What is Apache Survey shows huge popularity spike for Apache
More informationMCT620 Distributed Systems Module Handbook
MCT620 Distributed Systems Module Handbook Master of Science in Software Engineering & Database Technologies (MScSED) Diploma in Software Engineering Table of Contents 1 Module Details 2 1.1 Module Description
More informationData! CS 133: Databases. Goals for Today. So, what is a database? What is a database anyway? From the textbook:
CS 133: Databases Fall 2018 Lec 01 09/04 Introduction & Relational Model Data! Need systems to Data is everywhere Banking, airline reservations manage the data Social media, clicking anything on the internet
More informationDr. Chuck Cartledge. 4 Mar. 2015
CS-495/595 Pig (part 2) Lecture #8 Dr. Chuck Cartledge 4 Mar. 2015 1/23 Table of contents I 1 Miscellanea 2 The Book 3 Chapter 11 4 Assignment #3. 5 Project 6 Conclusion 7 References 2/23 Corrections and
More informationBig Data Analytics with Apache Spark (Part 2) Nastaran Fatemi
Big Data Analytics with Apache Spark (Part 2) Nastaran Fatemi What we ve seen so far Spark s Programming Model We saw that, at a glance, Spark looks like Scala collections However, internally, Spark behaves
More informationSpark Overview. Professor Sasu Tarkoma.
Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationHadoop Map Reduce 10/17/2018 1
Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018
More informationLambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
More informationSan Jose State University College of Science Department of Computer Science CS185C, NoSQL Database Systems, Section 1, Spring 2018
San Jose State University College of Science Department of Computer Science CS185C, NoSQL Database Systems, Section 1, Spring 2018 Course and Contact Information Instructor: Suneuy Kim Office Location:
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationMapReduce, Hadoop and Spark. Bompotas Agorakis
MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)
More informationBig data systems 12/8/17
Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores
More informationBest Practices and Performance Tuning on Amazon Elastic MapReduce
Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 2016, Amazon Web Services, Inc. or
More informationSimple Step-by-Step Install Guide for Stand-Alone Apache Spark on Windows 10 Jonathan Haynes Last Updated September 14, 2018
Simple Step-by-Step Install Guide for Stand-Alone Apache Spark on Windows 10 Jonathan Haynes Last Updated September 14, 2018 Overview Why use Spark? Spark is an in-memory computing engine and a set of
More informationDistributed Systems Intro and Course Overview
Distributed Systems Intro and Course Overview COS 418: Distributed Systems Lecture 1 Wyatt Lloyd Distributed Systems, What? 1) Multiple computers 2) Connected by a network 3) Doing something together Distributed
More informationBig Data Analysis Using Hadoop and MapReduce
Big Data Analysis Using Hadoop and MapReduce Harrison Carranza, MSIS Marist College, Harrison.Carranza2@marist.edu Mentor: Aparicio Carranza, PhD New York City College of Technology - CUNY, USA, acarranza@citytech.cuny.edu
More informationIntroduction to Hadoop. Owen O Malley Yahoo!, Grid Team
Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since
More informationCSE 336. Introduction to Programming. for Electronic Commerce. Why You Need CSE336
CSE 336 Introduction to Programming for Electronic Commerce Why You Need CSE336 Concepts like bits and bytes, domain names, ISPs, IPAs, RPCs, P2P protocols, infinite loops, and cloud computing are strictly
More informationWelcome! COMS 4118 Opera3ng Systems I Spring 2018
Welcome! COMS 4118 Opera3ng Systems I Spring 2018 Teaching staff 5 Teaching Assistants (TAs) John Hui jzh2106@columbia.edu (Head TA) JiaYan Hu jh3541@columbia.edu Mert Ussakli mu2228@columbia.edu Kundan
More informationLecture 09. Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ. CS-E4610 Modern Database Systems
CS-E4610 Modern Database Systems 05.01.2018-05.04.2018 Lecture 09 Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ PHD STUDENT I N COMPUTER SCIENCE, ELT E UNIVERSITY VISITING R ESEA RCHER,
More informationProject Design. Version May, Computer Science Department, Texas Christian University
Project Design Version 4.0 2 May, 2016 2015-2016 Computer Science Department, Texas Christian University Revision Signatures By signing the following document, the team member is acknowledging that he
More informationCS/ENGRD 2110 Object-Oriented Programming and Data Structures Spring 2012 Thorsten Joachims
CS/ENGRD 2110 Object-Oriented Programming and Data Structures Spring 2012 Thorsten Joachims Lecture 1: Overview http://courses.cs.cornell.edu/cs2110 1 Course Staff Instructor Thorsten Joachims (tj@cs.cornell.edu)
More informationAn exceedingly high-level overview of ambient noise processing with Spark and Hadoop
IRIS: USArray Short Course in Bloomington, Indian Special focus: Oklahoma Wavefields An exceedingly high-level overview of ambient noise processing with Spark and Hadoop Presented by Rob Mellors but based
More informationSan José State University Computer Science Department CS49J, Section 3, Programming in Java, Fall 2015
Course and Contact Information San José State University Computer Science Department CS49J, Section 3, Programming in Java, Fall 2015 Instructor: Aikaterini Potika Office Location: MacQuarrie Hall 215
More informationGeneralizing Map- Reduce
Generalizing Map- Reduce 1 Example: A Map- Reduce Graph map reduce map... reduce reduce map 2 Map- reduce is not a solu;on to every problem, not even every problem that profitably can use many compute
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationProblem Set 0. General Instructions
CS246: Mining Massive Datasets Winter 2014 Problem Set 0 Due 9:30am January 14, 2014 General Instructions This homework is to be completed individually (no collaboration is allowed). Also, you are not
More information9/8/2018. Prerequisites. Grading. People & Contact Information. Textbooks. Course Info. CS430/630 Database Management Systems Fall 2018
CS430/630 Database Management Systems Fall 2018 People & Contact Information Instructor: Prof. Betty O Neil Email: eoneil AT cs DOT umb DOT edu (preferred contact) Web: http://www.cs.umb.edu/~eoneil Office:
More informationSan Jose State University College of Science Department of Computer Science CS185C, Introduction to NoSQL databases, Spring 2017
San Jose State University College of Science Department of Computer Science CS185C, Introduction to NoSQL databases, Spring 2017 Course and Contact Information Instructor: Dr. Kim Office Location: MacQuarrie
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationCS 245: Principles of Data-Intensive Systems. Instructor: Matei Zaharia cs245.stanford.edu
CS 245: Principles of Data-Intensive Systems Instructor: Matei Zaharia cs245.stanford.edu Outline Why study data-intensive systems? Course logistics Key issues and themes A bit of history CS 245 2 My Background
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationToday s Lecture. CS 61C: Great Ideas in Computer Architecture (Machine Structures) Map Reduce
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Map Reduce 8/29/12 Instructors Krste Asanovic, Randy H. Katz hgp://inst.eecs.berkeley.edu/~cs61c/fa12 Fall 2012 - - Lecture #3 1 Today
More informationObject-Oriented Programming for Managers
95-807 Object-Oriented Programming for Managers 12 units Prerequisites: 95-815 Programming Basics is required for students with little or no prior programming coursework or experience. (http://www.andrew.cmu.edu/course/95-815/)
More informationDistributed Systems CS6421
Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool
More informationDocument Databases: MongoDB
NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/~svoboda/courses/171-ndbi040/ Lecture 9 Document Databases: MongoDB Marn Svoboda svoboda@ksi.mff.cuni.cz 28. 11. 2017 Charles University
More informationCS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University
CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University Indexing Process Indexes Indexes are data structures designed to make search faster Text search
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #1: Course Introduction U Kang Seoul National University U Kang 1 In This Lecture Motivation to study data mining Administrative information for this course U Kang 2
More informationCS Spark. Slides from Matei Zaharia and Databricks
CS 5450 Spark Slides from Matei Zaharia and Databricks Goals uextend the MapReduce model to better support two common classes of analytics apps Iterative algorithms (machine learning, graphs) Interactive
More informationDistributed Systems. CS422/522 Lecture17 17 November 2014
Distributed Systems CS422/522 Lecture17 17 November 2014 Lecture Outline Introduction Hadoop Chord What s a distributed system? What s a distributed system? A distributed system is a collection of loosely
More informationBig Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.
Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop
More informationThe age of Big Data Big Data for Oracle Database Professionals
The age of Big Data Big Data for Oracle Database Professionals Oracle OpenWorld 2017 #OOW17 SessionID: SUN5698 Tom S. Reddy tom.reddy@datareddy.com About the Speaker COLLABORATE & OpenWorld Speaker IOUG
More informationCERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)
CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program
More informationIntro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect
Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing
More informationCSC 172 Data Structures and Algorithms. Fall 2017 TuTh 3:25 pm 4:40 pm Aug 30- Dec 22 Hoyt Auditorium
CSC 172 Data Structures and Algorithms Fall 2017 TuTh 3:25 pm 4:40 pm Aug 30- Dec 22 Hoyt Auditorium Agenda Administrative aspects Brief overview of the course Hello world in Java CSC 172, Fall 2017, UR
More informationSyllabus INFO-GB Design and Development of Web and Mobile Applications (Especially for Start Ups)
Syllabus INFO-GB-3322 Design and Development of Web and Mobile Applications (Especially for Start Ups) Fall 2015 Stern School of Business Norman White, KMEC 8-88 Email: nwhite@stern.nyu.edu Phone: 212-998
More informationThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,
More informationDeep Learning Frameworks with Spark and GPUs
Deep Learning Frameworks with Spark and GPUs Abstract Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel,
More informationDr. Chuck Cartledge. 18 Mar. 2015
CS-495/595 Hive Lecture #9 Dr. Chuck Cartledge 18 Mar. 2015 1/25 Table of contents I 1 Miscellanea 2 Assignment #3 3 The Book 4 Chapter 12 6 Project 7 Conclusion 8 References 5 Break 2/25 Corrections and
More information