Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018


Cloud Computing 2
CSCI 4850/5850 High-Performance Computing, Spring 2018
Tae-Hyuk (Ted) Ahn
Department of Computer Science, Program of Bioinformatics and Computational Biology
Saint Louis University

Learning Objectives
You will learn about big data and about using cloud computing to analyze big data with fast turnaround time.

Getting Started: Did you complete these? https://aws.amazon.com/getting-started/
Today:
Launch a Linux Virtual Machine
Launch a WordPress Website
Store and Retrieve a File

What is Data Science?
Data Science aims to derive knowledge from big data, efficiently and intelligently.
Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government.
http://www.oreilly.com/data/free/what-is-datascience.csp

Data Science
Domain Expertise: to define the problem space
Mathematics: for theoretical structure and problem solving
Computer Science: to provide the environment where data is manipulated

Data Explosion! Every minute:
Google receives over 4 million search queries.
Facebook users share nearly 2.5 million pieces of content.
Twitter users tweet nearly 300,000 times.
Instagram users post nearly 220,000 new photos.
YouTube users upload 72 hours of new video content.
Apple users download nearly 50,000 apps.
Email users send over 200 million messages.
Amazon generates over $80,000 in online sales.

What's Big Data? There is no single definition; here is one from Wikipedia:
"Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

Big Data 3V's: Volume, Variety, Velocity

Volume (Scale)
Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB (35 / 0.8 ≈ 44).
Data volume is increasing exponentially as more data is collected and generated.

Variety (Complexity)
Relational data (tables/transactions/legacy data)
Text data (web)
Semi-structured data (XML)
Graph data: social networks, Semantic Web (RDF)
Streaming data: you can only scan the data once
A single application can generate/collect many types of data
Big public data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together.

Velocity (Speed)
Data is being generated fast and needs to be processed fast: online data analytics.
Late decisions mean missed opportunities.
Examples:
E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you.
Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires immediate reaction.

Harnessing Big Data
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (data warehousing)
RTAP: Real-Time Analytics Processing (big data architecture & technology)
http://slideplayer.com/slide/3550756/

Big Data Analytics
Big data is more real-time in nature than traditional data warehouse (DW) applications.
Traditional DW architectures (e.g., Exadata, Teradata) are not well-suited for big data apps.
Shared-nothing, massively parallel, scale-out architectures are well-suited for big data apps.

Open Source Big Data Technologies
https://sranka.wordpress.com/2014/01/29/big-data-technologies/

Big Data Technology Stacks
https://blogs.informatica.com/2017/04/05/big-data-moving-from-technology-to-business-valuedelivery/#fbid=ukwmdsw95gv

Cloud Computing
IT resources provided as a service: compute, storage, databases, queues.
Clouds leverage economies of scale of commodity hardware: cheap storage, high-bandwidth networks, and multicore processors.
Geographically distributed data centers.
Offerings from Microsoft, Amazon, Google, and others.

Cloud Computing (diagram from Wikipedia: Cloud Computing)

Benefits
Cost & management: economies of scale, out-sourced resource management
Reduced time to deployment: ease of assembly, works out of the box
Scaling: on-demand provisioning, co-located data and compute
Reliability: massive, redundant, shared resources
Sustainability: hardware not owned

Typical Large-Data Problem
Iterate over a large number of records
Extract something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
The problem: diverse input formats (data diversity & heterogeneity), large scale (terabytes, petabytes), and parallelization.

How do we leverage a number of cheap, off-the-shelf computers?

Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
What is the common theme of all of these problems?

Common Theme?
Parallelization problems arise from:
Communication between workers (e.g., to exchange state)
Access to shared resources (e.g., data)
Thus, we need a synchronization mechanism.

Apache Hadoop
A scalable, fault-tolerant distributed system for big data storage and processing; a virtual big data machine.
Borrowed concepts/ideas from Google; open source under the Apache license.
Core Hadoop has two main systems:
Hadoop MapReduce: distributed big data processing infrastructure (abstraction/paradigm, fault tolerance, scheduling, execution)
HDFS (Hadoop Distributed File System): fault-tolerant, high-bandwidth, high-availability distributed storage

Hadoop Distributed File System
Files are split into 128 MB blocks
Blocks are replicated across several datanodes (usually 3)
The namenode stores metadata (file names, block locations, etc.)
Optimized for large files and sequential reads
Files are append-only
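As a back-of-the-envelope sketch of the storage model above, the following Python snippet computes how many blocks a file occupies and how much raw cluster capacity it consumes, assuming the default 128 MB block size and replication factor 3 (block size and replication are configurable in real HDFS deployments):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB
REPLICATION = 3                 # default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (num_blocks, total_bytes_stored) for a file of the given size."""
    # A file is split into fixed-size blocks; the last block may be partial.
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # Each block is stored on REPLICATION datanodes, tripling the raw footprint.
    return num_blocks, file_size_bytes * REPLICATION

# A 1 GB file splits into 8 blocks and consumes 3 GB of raw cluster storage.
blocks, stored = hdfs_footprint(1024 * 1024 * 1024)
print(blocks, stored)
```

This illustrates why HDFS is "optimized for large files": a namenode tracks metadata per block, so many tiny files inflate metadata while wasting no disk (partial blocks only occupy their actual size).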

Typical Large-Data Problem
Map: iterate over a large number of records and extract something of interest from each
Shuffle and sort intermediate results
Reduce: aggregate intermediate results and generate final output
Key idea: provide a functional abstraction for these two operations.

MapReduce
Programmers specify two functions:
map(k, v) → [(k', v')]
reduce(k', [v']) → [(k', v'')]
All values with the same key are sent to the same reducer.
The execution framework handles everything else.
Example: Word Count
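The word count example can be sketched in plain Python: `map_fn` and `reduce_fn` are the two user-supplied functions, and `run_mapreduce` stands in for the shuffle/sort that the framework performs between the phases (a single-process sketch, not a distributed implementation):

```python
from collections import defaultdict

def map_fn(_, line):
    # map(k, v) -> [(k', v')]: emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce(k', [v']) -> [(k', v'')]: sum all counts for one word
    return [(word, sum(counts))]

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle and sort: group every intermediate value by its key, as the
    # execution framework would between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):          # keys arrive at reducers in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(run_mapreduce(lines, map_fn, reduce_fn))
```

Because every value for a given word reaches the same reducer, summing the 1s yields the global count, e.g. ("the", 2) here.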

Word Count Execution (diagram)

Amazon Elastic MapReduce
https://aws.amazon.com/elasticmapreduce/

Apache Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
Apache Spark is an open-source cluster computing framework for real-time processing. It is one of the most successful projects in the Apache Software Foundation and has clearly evolved into the market leader for big data processing.

Motivation
Current popular programming models for clusters transform data flowing from stable storage to stable storage; e.g., MapReduce: input → map → reduce → output.

Motivation
Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
Iterative algorithms (many in machine learning)
Interactive data mining tools (R, Excel, Python)
Spark makes working sets a first-class concept to efficiently support these apps.

Why Spark when Hadoop is already there?
https://acadgild.com/blog/hadoop-vs-spark-best-big-data-frameworks/

Spark Goal
Provide distributed memory abstractions for clusters to support apps with working sets.
Retain the attractive properties of MapReduce:
Fault tolerance (for crashes & stragglers)
Data locality
Scalability
Solution: augment the data flow model with resilient distributed datasets (RDDs).
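The working-set idea can be sketched with a toy, single-machine stand-in for an RDD (hypothetical `MiniRDD` class, not Spark's API): transformations like map and filter are lazy, and cache() keeps a computed working set in memory so later passes skip recomputation. Real RDDs are additionally partitioned across the cluster, cached lazily on first action, and rebuilt from their lineage after failures.

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: lazy transformations plus cache()."""

    def __init__(self, compute):
        self._compute = compute   # thunk that recomputes this dataset's lineage
        self._cached = None

    @staticmethod
    def from_list(items):
        return MiniRDD(lambda: list(items))

    def map(self, f):
        # Lazy: nothing runs until collect() is called on the result.
        return MiniRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        # Keep the working set in memory (real Spark caches on first action).
        self._cached = self._materialize()
        return self

    def _materialize(self):
        return self._cached if self._cached is not None else self._compute()

    def collect(self):
        return self._materialize()

# Build a working set once, cache it, then reuse it across iterations.
evens = MiniRDD.from_list(range(10)).filter(lambda x: x % 2 == 0).cache()
print(evens.map(lambda x: x * x).collect())   # first pass over the cached set
print(evens.map(lambda x: x + 1).collect())   # reuses the cached working set
```

Without cache(), each collect() would rerun the filter over the source data, which is exactly the repeated stable-storage round trip that acyclic data flow forces on iterative algorithms.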

Spark
Speed: run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of use: write applications quickly in Java, Scala, Python, or R.

Spark Features: Polyglot
Spark provides high-level APIs in Java, Scala, Python, and R, and Spark code can be written in any of these four languages. It also provides shells in Scala and Python: the Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark from the installation directory.

Spark Features: Multiple Formats
Spark supports multiple data sources such as Parquet, JSON, Hive, and Cassandra, in addition to the usual formats such as text files, CSV, and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than simple pipes that convert data and pull it into Spark.

Spark Features: Real-Time Computation
Spark's computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability: the Spark team has documented users running production clusters with thousands of nodes, and Spark supports several computational models.

Spark Features: Hadoop Integration
Apache Spark provides smooth compatibility with Hadoop, a boon for all the big data engineers who started their careers with Hadoop. Spark is a potential replacement for Hadoop's MapReduce functions, and it can also run on top of an existing Hadoop cluster, using YARN for resource scheduling.

Spark Features: Machine Learning
Spark's MLlib is the machine learning component, which is handy for big data processing. It eliminates the need to use multiple tools: one for processing and another for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.

Apache Spark on Amazon
Install Apache Spark on your desktop or cluster: https://spark.apache.org/docs/latest/
Spark on Amazon EC2 (scripts that let you launch a cluster on EC2): https://github.com/amplab/spark-ec2
Amazon EMR: https://docs.aws.amazon.com/elasticmapreduce/latest/releaseguide/emr-spark.html