An introduction to Big Data. Presentation by Devesh Sharma, Zubair Asghar & Andreas Aalsaunet

About the presenters Andreas Oven Aalsaunet: Bachelor's Degree in Computer Engineering from HiOA (2011-2014); Software Developer Associate at Accenture (2014-2015); Master's Degree in Programming and Networks at UiO (2015-2017); Master's Thesis with Simula Research Laboratory: Big Data Analytics for Photovoltaic Power Generation. Interests: Big Data, machine learning, computer games, guitars and adding as many pictures to presentations as I can fit in.

About the presenters Devesh Sharma: Bachelor's Degree in Computer Science Engineering; Master's Degree in Programming and Networks at UiO; Master's Thesis with Simula Research Laboratory: Big Data with High Performance Computing.

About the presenters Zubair Asghar: Bachelor's Degree in Information Technology (2002-2006); worked at Ericsson for 6 years; Master's Degree in Programming and Networks at UiO (2015-2017); currently working as a developer with DHIS2; Master's Thesis with DHIS2: Disease Surveillance Program. Passion for technology.

Articles This presentation is based on the following articles: [1] Abadi D., Franklin M. J., Gehrke J., et al. The Beckman Report on Database Research. Communications of the ACM, 59(2):92-99, 2016. doi:10.1145/2845915. [2] Dean J., Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. In: 6th Symposium on Operating Systems Design and Implementation (OSDI), USENIX Association, 2004.

What is Big Data, and where did it come from? Some definitions: Wikipedia: "data sets so large or complex that traditional data processing applications are inadequate". McKinsey: "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze". Oxford English Dictionary: "data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges".

What is Big Data, and where did it come from? Arose due to the confluence of three major trends: 1. Cheaper to generate lots of data, due to: a. inexpensive storage (e.g. the increase in storage capacity of commodity computers), b. smart devices (e.g. smartphones), c. social software (e.g. Facebook, Twitter, multiplayer games), and d. embedded software, sensors and the Internet of Things (IoT).

What is Big Data, and where did it come from? (cont.) 2. Cheaper to process large amounts of data, due to advances in: a. multi-core CPUs, b. solid-state storage (e.g. SSDs), c. inexpensive cloud computing (e.g. SaaS, PaaS and IaaS), and d. open source software. 3. Democratization of data: "the process of generating, processing, and consuming data is no longer just for database professionals" [1]; e.g. more people have access to data through dashboards and similar tools.

What is Big Data, and where did it come from? Due to these trends, an unprecedented volume of data needs to be "captured, stored, queried, processed, and turned into [actionable] knowledge" [1]. Big Data thus produces many challenges along the whole data analysis pipeline, most of them related to the four V's: Volume - handling the storage and organization of massive amounts of data; Velocity - capturing and storing data in (near) real time; Variety - handling unstructured data with a variety of sources, formats, encodings, contexts etc.; Value - turning the data into actionable knowledge (i.e. making the data useful).

Big Data comes with Big Challenges Big Data also brings "enormous challenges, whose solutions will require massive disruptions to the design, implementation, and deployment of data management solutions" [1]. The Beckman Report identifies a few overarching challenges with Big Data: scalable big/fast data infrastructures; coping with diversity in data management; end-to-end processing of data; and the roles of people in the data life cycle.

Scalable big/fast data infrastructures Parallel and distributed processing: We have seen success in scaling out data processing on large numbers of unreliable commodity machines using constrained programming models such as MapReduce (more on that later!). The open source platform Hadoop, with its MapReduce programming model, soon followed. Query processing and optimization: To fully utilize large clusters of many-core processors, more powerful, cost-aware and set-oriented query optimizers are needed. Query processors will need to integrate data sampling, data mining, and machine learning into their flows.

Scalable big/fast data infrastructures New hardware: In addition to clusters of general-purpose multicore processors, more specialized processors should be considered, and researchers should continue to explore ways of leveraging them for processing very large data sets. Examples include graphics processing units (GPUs), field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Parallel and distributed query-processing algorithms will also need to be adapted to such heterogeneous hardware environments.

Scalable big/fast data infrastructures High-speed data streams: High-speed data sources with low information density will need to be processed online, so that only the relevant parts are stored. Late-bound schemas: For data that is used only once, or never, it makes little sense to pay the substantial price of storing and indexing it in the database; this requires query engines that can run efficiently over raw files with late-bound schemas. Cost-efficient storage: SSDs are expensive per gigabyte but cheap per I/O operation.

Scalable big/fast data infrastructures Metrics and benchmarks: Scalability should not only be measured in petabytes of data and queries per second, but also in: total cost of ownership (e.g. management and energy costs); end-to-end processing speed (time from raw data to eventual insights/value/actionable knowledge); brittleness (i.e. the ability to continue despite failures or exceptions, such as parsing errors); and usability (e.g. for entry-level users).

Coping with diversity in data management No one-size-fits-all: A variety of data is managed by different software (programming interfaces, query processors and analysis tools). Multiple classes of systems are expected to emerge, each addressing a particular need such as data deduplication or online stream processing. Cross-platform integration: Due to this diversity in systems, platforms need to be integrated for combining and analysing data. Such integration should hide the heterogeneity of data formats and access languages, and optimize the performance of data access from, and data flow in, big data systems.

Coping with diversity in data management (cont.) Programming models: More Big Data analysis languages are needed in addition to existing ones such as Julia, SAS, SQL, Hive, Pig, R, Python, domain-specific languages, MapReduce, etc. Data processing workflows: To handle data diversity, platforms are needed that can span (bridge) both raw data (unprocessed files) and cooked data (e.g. tables, matrices, or graphs). Big Data systems should also become easier to operate, e.g. through cluster resource managers such as Hadoop 2.0's YARN.

End-to-end processing of data Few tools can go from raw data all the way to extracted knowledge without significant human intervention at each step. Data-to-knowledge pipeline: Data tools should exploit human feedback in every step of the analytical pipeline, and must be usable by subject-matter experts, not just by IT professionals. Tool diversity: Multiple tools are needed to solve the steps of the raw-data-to-knowledge pipeline, and these must be seamlessly integrated and easy to use for both lay and expert users.

End-to-end processing of data (cont.) Open source: Few of these tools are open source; most are expensive. As a result, existing tools cannot easily benefit from ongoing contributions by the data integration research community. Knowledge bases (KBs): Domain knowledge that is created, shared and used to understand data is captured in KBs, which are used to improve the accuracy of the raw-data-to-knowledge pipeline; knowledge centers provide the tools to query, share and use KBs for data analysis. Goal: more work is needed on tools for building, maintaining, querying, and sharing domain-specific KBs.

Roles of people in the data life cycle An individual can play multiple roles in the data life cycle. People's roles can be classified into four general categories: producers, curators, consumers and community members. Data producers: Anyone can generate data, from mobiles, social platforms, applications and wearable devices. A key challenge for the database community is to develop algorithms that guide people to produce and share the most useful data, while maintaining the desired level of data privacy. Data curators: Data is no longer curated only by a DBA or an IT department; a variety of people are empowered to curate data, for example through crowdsourcing. The challenge is to obtain high-quality datasets from a process based on imperfect human curators; part of the solution is building platforms, and extending relevant applications, to incorporate such curation.

Roles of people in the data life cycle (cont.) Data consumers: People want to use messier data in more complex ways, but ordinary users may not know how to formulate a query. This requires new query interfaces: we need multimodal interfaces that combine visualization, querying, and navigation. Online communities: People want to create, share and manage data with other community members who interact virtually. The challenge is to build tools that help communities produce usable data, as well as to share and mine it.

Community Challenges Database education: Much of what is taught today assumes old technology, where memory was small relative to database size and servers used single-core processors. Knowledge of key-value stores, data stream processors and MapReduce frameworks should be taught alongside SQL and traditional DBMSs. Data science: Data scientists can transform large volumes of data into actionable knowledge. They need skills not only in data management, but also in business intelligence, computer systems, mathematics, statistics, machine learning, and optimization.

Introduction to MapReduce A bit of history first: MapReduce was created at Google in the early 2000s to simplify the process of writing applications that utilize large clusters of commodity computers. It does this by handling the error-prone tasks of parallelization, fault tolerance, data distribution and load balancing inside a library. "A typical MapReduce computation processes many terabytes of data on thousands of machines" [2] (Dean J. and Ghemawat S., 2004).

Introduction to MapReduce What exactly is MapReduce then? In short, it is a programming model with a restrictive functional style: there is no mutability, which simplifies parallelization a great deal, since mutable shared state can easily lead to problems such as starvation, race conditions and deadlocks. The MapReduce run-time system takes care of: the details of partitioning the input data; scheduling the program's execution across a set of machines; handling machine failures by re-execution; and managing the required inter-machine communication. The user specifies a map and a reduce function (hence "MapReduce"), in addition to optional functions (like a combiner) and the number of partitions.

Introduction to MapReduce More on mapping and reducing: The map and reduce functions operate on sets of key/value pairs. The mapper takes the input and generates a set of intermediate key/value pairs. The reducer merges all intermediate values associated with the same intermediate key.
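
Conceptually, the two user-supplied functions have the following shapes. This is a Python-flavoured illustration of the paper's types, not the library's actual C++ interface, and the concrete key/value types here are just placeholders:

```python
# Conceptual shapes of the user-supplied functions (illustration only):
#
#   map:    (k1, v1)        -> list of (k2, v2) intermediate pairs
#   reduce: (k2, list(v2))  -> list of v2 (typically zero or one value)
from typing import Iterable, Iterator, Tuple

def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Takes one input record and emits intermediate key/value pairs."""
    raise NotImplementedError  # supplied by the user

def reduce_fn(key: str, values: Iterable[int]) -> Iterator[int]:
    """Merges all intermediate values associated with one intermediate key."""
    raise NotImplementedError  # supplied by the user
```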

Introduction to MapReduce Let's take a simple example: say we have a text and we want to count the occurrences of each individual word. The original input (1), "Hello class and hello world", is first split by line (2). Each split is then sent as a single key/value pair to a separate mapper, on a separate node, where it is mapped (3) to several key/value pairs; in our simple example the key is the word itself and the value is set to 1. We shuffle and sort (4) the key/value pairs and send them to the reducers. Each reducer (5) receives a different set of intermediate pairs and merges the values that share a key. The different partitions are finally merged into an output file (6). [Diagram: the pipeline 1. Input, 2. Split, 3. Map at node 1 / node 2, 4. Shuffle/Sort, 5. Reduce, 6. Output, with the final output table: and 1, class 1, hello 2, world 1.]
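
A minimal sketch of this word-count example in Python (illustrative only; the paper's version is written in C++ against Google's MapReduce library):

```python
# Word count expressed as map and reduce functions
# (a sketch; not the actual Google or Hadoop API).
from typing import Iterable, Iterator, Tuple

def word_count_map(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """key: an identifier for the split (unused), value: the split's text.
    Emits (word, 1) for every word in the text."""
    for word in value.split():
        yield (word.lower(), 1)

def word_count_reduce(key: str, values: Iterable[int]) -> Iterator[int]:
    """key: a word, values: all the 1s emitted for that word.
    Emits the total count for the word."""
    yield sum(values)
```

For the input "Hello class and hello world", the mappers emit (hello, 1), (class, 1), (and, 1), (hello, 1), (world, 1); after the shuffle/sort phase a reducer receives, for example, (hello, [1, 1]) and emits hello 2, matching the output table above.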

Implementation Different implementations are possible, depending on the environment. The environment described in the paper (in wide use at Google) consists of large clusters of commodity PCs connected with switched Ethernet. Execution overview: The user specifies the input/output files, the number of reduce tasks (R) and the program code for map() and reduce(), and submits the job to the scheduling system. One machine becomes the master; the others become workers. Input files are automatically split into M pieces (16-64 MB). The master assigns map tasks to idle workers. Each worker reads its input split and performs the map operation, stores the output on its local disk partitioned into R regions by the partitioning function, and informs the master of the locations. The master finds idle reduce workers and gives them the map output locations. Each reduce worker remotely reads those files, performs the reduce task and stores its output on GFS. When the whole process is complete and the output has been generated, the master wakes up the user program.
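
The flow above can be illustrated with a toy, single-process simulation (split, map, partition into R regions, shuffle/sort, reduce). This is only a sketch of the data movement, assuming the word-count functions sketched earlier; there is no real master, no workers and no GFS here:

```python
# A toy, sequential simulation of the MapReduce execution flow.
# Only the data movement is illustrated; fault tolerance, scheduling
# and distribution across machines are all omitted.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=2):
    # 1-2. "Split" the input: here, each element of `inputs` is one split.
    # 3. Map phase: each split produces intermediate (key, value) pairs,
    #    partitioned into R regions with the default hash(key) mod R rule.
    regions = [defaultdict(list) for _ in range(R)]
    for split_id, split in enumerate(inputs):
        for key, value in map_fn(str(split_id), split):
            regions[hash(key) % R][key].append(value)

    # 4-5. Shuffle/sort and reduce phase: each region corresponds to one
    #      reduce task, which merges all values sharing a key.
    output = {}
    for region in regions:
        for key in sorted(region):
            for result in reduce_fn(key, region[key]):
                output[key] = result
    return output  # 6. In the real system: R output files on GFS.

# Example (using the word_count_map / word_count_reduce sketches above):
# run_mapreduce(["Hello class", "and hello world"],
#               word_count_map, word_count_reduce)
# -> {'and': 1, 'class': 1, 'hello': 2, 'world': 1}
```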

Fault tolerance: The MapReduce library is designed to handle machine failures gracefully. Worker failure: The master pings every worker periodically; if there is no reply, the worker is marked as failed. Any map task completed or in progress on that worker (its output lives on the worker's local disk), as well as any in-progress reduce task, is reset and rescheduled on another worker. Completed reduce tasks are not re-executed, because their output is stored on the global file system (GFS). If a worker executing a map task fails, all workers executing reduce tasks are informed of the new output location. Master failure: The master writes periodic checkpoints of its data structures, so if the master task dies a new copy can be started from the last checkpoint. If the master fails outright (which rarely happens), the MapReduce computation fails. Locality: To save network bandwidth, GFS divides input data into 64 MB blocks and stores several copies of each block (typically three) on different machines, and the master tries to schedule map tasks on or near a machine holding a replica of their input.
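
As a rough illustration of the worker-failure rules (all names here are hypothetical; the real master also tracks map output locations and notifies the reduce workers), the master's bookkeeping could look like this:

```python
# Hypothetical sketch of how the master reacts when a worker stops answering
# pings. Completed and in-progress map tasks on the failed worker are reset
# (their output lives on that worker's local disk); completed reduce tasks are
# not re-executed (their output is already on the global file system).

def handle_worker_failure(failed_worker, tasks, idle_queue):
    """tasks: task records, each a dict with 'type' ('map' or 'reduce'),
    'state' ('idle', 'in_progress' or 'completed') and 'worker'."""
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        must_reset = task["type"] == "map" or task["state"] == "in_progress"
        if must_reset:
            task["state"] = "idle"
            task["worker"] = None
            idle_queue.append(task)  # will be rescheduled on another worker
```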

Backup tasks mechanism: A straggler is a machine that takes an unusually long time to complete one of the last few map or reduce tasks in a MapReduce operation. Stragglers arise for many reasons, e.g. a machine with a bad disk (slow reads), or other scheduled tasks waiting in the queue and competing for resources. To reduce this problem, when a MapReduce operation is close to completion the master schedules backup executions of the remaining in-progress tasks; a task is marked as completed whenever either the primary or the backup execution completes.

Refinements These are a few useful extensions that can be used together with the map and reduce functions. Partitioning function: A partitioning function partitions the key/value pairs of the intermediate map output, typically by hashing the key, and the user can supply their own. The total number of partitions equals the number of reduce tasks (R) for the job; the default partitioning function is hash(key) mod R. Combiner function: Used for partial merging of the map output; the combiner is executed on each machine that performs a map task. Often the same code implements both the combiner and the reduce function; the difference is how the MapReduce library handles their output: combiner output is written to an intermediate file, while reduce output is written to the final output file.
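
A sketch of these two refinements for the word-count example (again illustrative Python, not the library's API): the default hash(key) mod R partitioner, a user-defined partitioner in the spirit of the paper's example of partitioning URLs by hostname, and a combiner that pre-aggregates counts on the map side so that fewer pairs cross the network:

```python
# Illustrative sketches of the partitioning and combiner refinements.
from collections import Counter
from typing import Iterable, Iterator, Tuple

def default_partition(key: str, R: int) -> int:
    """Default partitioning: hash(key) mod R decides which reduce task
    (and hence which output file) the key goes to."""
    return hash(key) % R

def hostname_partition(key: str, R: int) -> int:
    """User-defined partitioner for URL keys: partition on the hostname so
    that all URLs from the same host end up in the same output file."""
    host = key.split("/")[2] if "://" in key else key
    return hash(host) % R

def word_count_combiner(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Partial merging on the map worker: (word, 1) pairs for the same word
    are summed locally before being written to the intermediate file."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    yield from counts.items()
```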

Input and output types: The MapReduce library supports reading input data and writing output data in several different formats, and users can add support for a new input type by implementing a simple reader interface. Skipping bad records: Bugs in user code or in a third-party library sometimes cause the map or reduce function to crash on certain records; when it is not feasible to fix the bug, the library can skip those records.
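
The skipping idea can be sketched as follows (a hypothetical helper, not the library's actual mechanism, which uses signal handlers and reports the offending record's sequence number to the master so it is skipped on re-execution):

```python
# Sketch: run the map function over records and skip records that crash it.
import logging

def run_map_skipping_bad_records(map_fn, records, max_skips=10):
    skipped = 0
    for key, value in records:
        try:
            yield from map_fn(key, value)
        except Exception as exc:  # bug in user code or a third-party library
            skipped += 1
            logging.warning("skipping bad record %r: %s", key, exc)
            if skipped > max_skips:
                raise  # too many failures: surface the bug instead of hiding it
```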

Performance Factors affecting performance: input/output (hard disk), computation, temporary storage (optional), and network bandwidth.

Performance Traditional approach vs. Big Data approach: In the traditional approach the data is fetched to the code, so large amounts of data are transferred across the network. In the Big Data approach the code is fetched to the data, so only the code and the results are transferred, and the data is processed on the local disk where it resides. [Diagram: a master node and a cluster, contrasting "data fetched" with "code fetched" for the data/processing nodes.]

Performance Performance analysis based on experimental computations, running MapReduce for grep and sort.

Performance Effect of backup tasks and machine failures.

Performance Cases of improved performance: the Google search engine's indexing system was rewritten using MapReduce, resulting in smaller and simpler code, separation of unrelated concerns, and automatic ("self-healing") handling of failures.

MapReduce conclusions: It is easy to use and simplifies large-scale computations that fit this model, allowing the user to focus on the problem without worrying about the details of parallelization, and it runs on clusters of commodity hardware. Any questions?