Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University
Learning Objectives You will learn about big data and cloud computing for analyzing big data with fast turnaround time. 2
Getting Started: Did you complete these? https://aws.amazon.com/getting-started/ Today: Launch a Linux Virtual Machine Launch a WordPress Website Store and Retrieve a File 3
What is Data Science? Data Science aims to derive knowledge from big data, efficiently and intelligently Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government http://www.oreilly.com/data/free/what-is-datascience.csp 4
Data Science Domain Expertise to define the problem space Mathematics for theoretical structure and problem solving Computer Science to provide the environment where data is manipulated 5
Data Explosion! Every minute: Google receives over 4 million search queries Facebook users share nearly 2.5 million pieces of content. Twitter users tweet nearly 300,000 times. Instagram users post nearly 220,000 new photos. YouTube users upload 72 hours of new video content. Apple users download nearly 50,000 apps. Email users send over 200 million messages. Amazon generates over $80,000 in online sales. 6
What's Big Data? No single definition; here is one from Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions." 7
Big Data: The 3 V's 8
Volume (Scale) Data Volume: 44x increase from 2009 to 2020 From 0.8 zettabytes to 35 ZB Exponential increase in collected/generated data 9
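The "44x" figure on this slide follows directly from the two volumes quoted; a quick sanity check in plain Python (numbers taken from the slide, geometric growth assumed):

```python
# Data volume figures quoted on the slide (in zettabytes)
volume_2009 = 0.8
volume_2020 = 35.0

# Overall growth factor across the 2009-2020 span
growth = volume_2020 / volume_2009
print(round(growth, 2))  # 43.75, i.e. roughly the "44x" on the slide

# Implied compound annual growth rate over the 11 years
cagr = growth ** (1 / 11) - 1
print(f"{cagr:.1%}")  # about 41% per year
```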
Variety (Complexity) Relational Data (Tables/Transactions/Legacy Data) Text Data (Web) Semi-structured Data (XML) Graph Data: Social Networks, Semantic Web (RDF) Streaming Data (you can only scan the data once) A single application can be generating/collecting many types of data Big Public Data (online, weather, finance, etc.) To extract knowledge, all these types of data need to be linked together 10
Velocity (Speed) Data is being generated fast and needs to be processed fast Online Data Analytics Late decisions → missed opportunities Examples E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurement requires an immediate reaction 11
Harnessing Big Data OLTP: Online Transaction Processing (DBMSs) OLAP: Online Analytical Processing (Data Warehousing) RTAP: Real-Time Analytics Processing (Big Data Architecture & technology) http://slideplayer.com/slide/3550756/ 12
Big Data Analytics Big data is more real-time in nature than traditional DW applications Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps 13
Open Source Big Data Technologies https://sranka.wordpress.com/2014/01/29/big-data-technologies/ 14
Big Data Technology Stacks https://blogs.informatica.com/2017/04/05/big-data-moving-from-technology-to-business-valuedelivery/#fbid=ukwmdsw95gv 15
Cloud Computing IT resources provided as a service Compute, storage, databases, queues Clouds leverage economies of scale of commodity hardware Cheap storage, high-bandwidth networks & multicore processors Geographically distributed data centers Offerings from Microsoft, Amazon, Google, and others 16
Cloud Computing wikipedia: Cloud Computing 17
Benefits Cost & management Economies of scale, out-sourced resource management Reduced Time to deployment Ease of assembly, works out of the box Scaling On demand provisioning, co-locate data and compute Reliability Massive, redundant, shared resources Sustainability Hardware not owned 18
Typical Large-Data Problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate results Aggregate intermediate results Generate final output The problem: Diverse input format (data diversity & heterogeneity) Large Scale: Terabytes, Petabytes Parallelization 19
How to leverage a number of cheap off-the-shelf computers? 20
Parallelization Challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? What is the common theme of all of these problems? 21
Common Theme? Parallelization problems arise from: Communication between workers (e.g., to exchange state) Access to shared resources (e.g., data) Thus, we need a synchronization mechanism 22
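As a tiny illustration of the synchronization point above, here is a sketch (plain Python threads on a hypothetical word-count workload, not a cluster framework) where workers share partial results and must coordinate access to a shared resource:

```python
import threading
from collections import Counter

# Hypothetical workload: each worker counts words in its own chunk,
# then merges its partial result into a shared Counter.
documents = [
    "big data big clusters",
    "cloud data centers",
    "big cloud",
]

totals = Counter()          # shared resource
lock = threading.Lock()     # the synchronization mechanism

def worker(chunk: str) -> None:
    partial = Counter(chunk.split())   # local work: no coordination needed
    with lock:                         # coordination only when merging
        totals.update(partial)

threads = [threading.Thread(target=worker, args=(d,)) for d in documents]
for t in threads:
    t.start()
for t in threads:
    t.join()                           # "how do we know all workers finished?"

print(totals["big"])  # 3
```

The lock answers the "access to shared resources" problem and `join()` answers the "have all workers finished?" problem; frameworks like Hadoop exist precisely so programmers don't write this plumbing by hand.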
Apache Hadoop Scalable fault-tolerant distributed system for Big Data: Data Storage Data Processing A virtual Big Data machine Borrowed concepts/ideas from Google; Open source under the Apache license Core Hadoop has two main systems: Hadoop/MapReduce: distributed big data processing infrastructure (abstract/paradigm, fault-tolerant, schedule, execution) HDFS (Hadoop Distributed File System): fault-tolerant, high-bandwidth, high availability distributed storage 23
Hadoop Distributed File System Files split into 128MB blocks Blocks replicated across several datanodes (usually 3) Namenode stores metadata (file names, locations, etc) Optimized for large files, sequential reads Files are append-only 24
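To make the numbers on this slide concrete, a small sketch (plain Python; the 128 MB block size and replication factor 3 come from the slide) of how one file maps onto HDFS blocks and raw cluster storage:

```python
import math

BLOCK_SIZE_MB = 128   # HDFS block size (per the slide)
REPLICATION = 3       # replication factor (per the slide)

def hdfs_footprint(file_size_mb: float):
    """Return (block count, total raw storage in MB) for one file.

    The last block holds only the remaining bytes, so raw storage
    is the file size times the replication factor, not blocks * 128.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage = file_size_mb * REPLICATION
    return blocks, raw_storage

# A 1 GB file: 8 blocks, 3 GB of raw cluster storage
print(hdfs_footprint(1024))  # (8, 3072)
```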
Typical Large-Data Problem Map Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate results Aggregate intermediate results Generate final output Reduce Key idea: provide a functional abstraction for these two operations 25
MapReduce Programmers specify two functions: map (k, v) → [(k', v')] reduce (k', [v']) → [(k', v'')] All values with the same key are sent to the same reducer The execution framework handles everything else Example: Word Count 26
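The two functions above can be sketched in plain Python. This is a single-process simulation of the map → shuffle → reduce pipeline for word count, not the Hadoop API itself:

```python
from collections import defaultdict
from typing import Iterable, Tuple

# map: (docid, text) -> [(word, 1)]
def map_fn(key: str, value: str) -> Iterable[Tuple[str, int]]:
    for word in value.split():
        yield (word, 1)

# reduce: (word, [counts]) -> (word, total)
def reduce_fn(key: str, values: list) -> Tuple[str, int]:
    return (key, sum(values))

def mapreduce(records):
    # Shuffle and sort: group all intermediate values by key,
    # so each reducer sees every value emitted for its key.
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [("d1", "the quick fox"), ("d2", "the lazy dog"), ("d3", "the fox")]
print(mapreduce(docs))  # {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In a real cluster the framework runs `map_fn` and `reduce_fn` on many machines and performs the shuffle over the network; the programmer writes only the two functions.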
Word Count Execution 27
Amazon Elastic Map Reduce https://aws.amazon.com/elasticmapreduce/ 28
Apache Spark In-Memory Cluster Computing for Iterative and Interactive Applications Apache Spark is an open-source cluster computing framework for real-time processing. It is one of the most successful projects in the Apache Software Foundation. Spark has emerged as a market leader for Big Data processing. 29
Motivation Current popular programming models for clusters transform data flowing from stable storage to stable storage E.g., MapReduce: Input → Map → Reduce → Output 30
Motivation Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data: Iterative algorithms (many in machine learning) Interactive data mining tools (R, Excel, Python) Spark makes working sets a first-class concept to efficiently support these apps 31
Why Spark when Hadoop is already there? https://acadgild.com/blog/hadoop-vs-spark-best-big-data-frameworks/ 32
Spark Goal Provide distributed memory abstractions for clusters to support apps with working sets Retain the attractive properties of MapReduce: Fault tolerance (for crashes & stragglers) Data locality Scalability Solution: augment data flow model with resilient distributed datasets (RDDs) 33
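The benefit of keeping a working set in memory can be illustrated without Spark itself. Below is a toy sketch (plain Python; the read counter and squared-sum workload are illustrative only) contrasting the acyclic-dataflow style, which goes back to stable storage on every iteration, with an RDD-style cached dataset:

```python
reads_from_storage = 0

def load_dataset():
    """Stand-in for reading a dataset from stable storage (e.g., HDFS)."""
    global reads_from_storage
    reads_from_storage += 1
    return list(range(10))

ITERATIONS = 5

# Acyclic-dataflow style: every iteration re-reads stable storage.
for _ in range(ITERATIONS):
    data = load_dataset()
    _ = sum(x * x for x in data)        # per-iteration computation
dataflow_reads = reads_from_storage      # 5 storage reads

# RDD-style: load once, keep the working set in memory, iterate on it.
reads_from_storage = 0
cached = load_dataset()                  # analogous to rdd.cache()
for _ in range(ITERATIONS):
    _ = sum(x * x for x in cached)
rdd_reads = reads_from_storage           # 1 storage read

print(dataflow_reads, rdd_reads)  # 5 1
```

Iterative algorithms such as those in machine learning repeat the loop body many times, which is why amortizing the load to a single read (and recomputing lost partitions from lineage on failure) is the core RDD idea.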
Spark Speed Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Ease of Use Write applications quickly in Java, Scala, Python, R. 34
Spark Features: Polyglot Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark from the installed directory. 35
Spark Features: Multiple Formats Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra apart from the usual formats such as text files, CSV and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark. 36
Spark Features: Real Time Computation Spark's computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability; the Spark team has documented users running production clusters with thousands of nodes, and Spark supports several computational models. 37
Spark Features: Hadoop Integration Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it can run on top of an existing Hadoop cluster, using YARN for resource scheduling. 38
Spark Features: Machine Learning Spark's MLlib is the machine learning component, which is handy when it comes to big data processing. It eliminates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use. 39
Apache Spark on Amazon Install Apache Spark on your desktop or cluster https://spark.apache.org/docs/latest/ Spark on Amazon EC2: scripts that let you launch a cluster on EC2 https://github.com/amplab/spark-ec2 Amazon EMR https://docs.aws.amazon.com/elasticmapreduce/latest/releaseguide/emr-spark.html 40