Stream Processing Platforms: Storm, Spark, ... Batch Processing Platforms: MapReduce, SparkSQL, BigQuery, Hive, Cypher, ...

Data Ingestion: ETL, Distcp, Kafka, OpenRefine
Query & Exploration: SQL, Search, Cypher
Stream Processing Platforms: Storm, Spark, ...
Batch Processing Platforms: MapReduce, SparkSQL, BigQuery, Hive, Cypher, ...
Data Definition: SQL DDL, Avro, Protobuf, CSV
Storage Systems: HDFS, RDBMS, Column Stores, Graph Databases
Data Serving: BI, Cubes, RDBMS, Key-value Stores, Tableau, ...
Compute Platforms: Distributed Commodity, Clustered High-Performance, Single Node

Analytics solutions start with data ingestion. Data integration challenges: volume (many similar integrations), variety (many different integrations), velocity (batch vs. real-time), or all of the above.

Maslow's hierarchy of needs*: a hierarchy of effective analytics.
Predictive needs: Prediction, Clustering, Classification
Understanding needs: Real-time, Streaming, Visualization, Query, OLAP
Basic needs: Aggregation, Join, Filtering, Indexing; Data Quality, Structure, Data Ingest; Data, Persistence, Architecture, ETL
* A theory in psychology proposed by Abraham Maslow in 1943. Needs lower down in the hierarchy must be satisfied before individuals can attend to needs higher up.

Data sources: Business Systems, Web Logs, Email Logs, Data Management Systems, Transaction Logs, Data Storage Systems.

The same sources (Business Systems, Web Logs, Email Logs, Data Management Systems, Transaction Logs, Data Storage Systems) bring variety: different structures, different manifest data types, different literals for the same data type, different keys.

Multiple consumers (Consumer 1, Consumer 2) read from these sources (Business Systems, Web Logs, Email Logs, Data Management Systems, Transaction Logs, Data Storage Systems). Failure modes: failures when producers push data, failures storing received data, failures while transferring data over the network.

Full dumps: All data is provided at once and replaces all previous data.
Incremental: Increments are provided in any order; the data interval in an increment is provided as metadata. On an hourly, daily, or weekly cadence.
Append: Always appended at the end of the dataset. Order is assumed to be correct.
Stream: Stream of incoming data as individual rows or in small batches. On a second, minute, or hourly cadence.

ETL: Prepare data before loading so that the target system can spend its cycles on reporting, query, etc. Requires the transforms to know which reporting and queries to enable.

ELT: Made possible by more powerful target systems. Provides more flexibility at later stages than ETL.

Kafka, Kinesis, S4, Storm, Samza:
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Amazon Kinesis is a fully managed, cloud-based service for real-time data processing over large, distributed data streams. It can continuously capture and store terabytes of data per hour from hundreds of thousands of sources.
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
Apache Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
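Since Kafka keeps coming up as the ingestion buffer, here is a minimal sketch of publishing and consuming one message with the kafka-python client; the broker address, topic name, and consumer group are assumptions for illustration, not part of the slides.

    # Minimal Kafka produce/consume sketch using kafka-python (assumed client).
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")   # assumed broker
    producer.send("web-logs", key=b"host-42", value=b'{"path": "/index", "status": 200}')
    producer.flush()   # block until buffered messages reach the broker

    consumer = KafkaConsumer(
        "web-logs",
        bootstrap_servers="localhost:9092",
        group_id="quality-checker",     # consumers in one group share the partitions
        auto_offset_reset="earliest",   # replay the log from the start if no offset is stored
    )
    for message in consumer:            # blocks and streams messages as they arrive
        print(message.partition, message.offset, message.value)

Producers and consumers only agree on the topic name and message format, which is what lets them run at different speeds and fail independently.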

How Kafka addresses the ingestion challenges:
Scalability: allows many producers and consumers; partitions are the unit of scale.
Schema Variety: does not really solve this.
Network Bottlenecks: can handle some variability without losing messages.
Consumer Bottlenecks: Kafka acts as a buffer allowing producers and consumers to work at different speeds.
Bursts: handles buffering of messages between producers and consumers.
Reliability, Fault Tolerance: allows reading of messages if a consumer fails.

Data is produced in bursts. Consumers can only consume at a certain rate. Dropping data can be a problem.

Misspellings: {Montgromery street}
Outliers: {1, 2, 4, 2, 4, 123, 3, 4}
Incorrect values: {invalid zip, neg number}
Missing values: {6, 7, 4, 3, , 4, 5, 6}
Incorrect values: {94025, -345, 96066, }
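Checks for several of these issue types can be automated. Below is a small Python/pandas sketch (not from the slides) that flags missing values and crude outliers in one numeric column using an interquartile-range rule; the sample values are made up.

    import pandas as pd

    values = pd.Series([6, 7, 4, 3, None, 4, 5, 6, 123])   # hypothetical column with a gap and an outlier
    missing = values.isna()                                 # missing-value check
    q1, q3 = values.quantile(0.25), values.quantile(0.75)   # IQR fences for a crude outlier check
    iqr = q3 - q1
    outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    print(pd.DataFrame({"value": values, "missing": missing, "outlier": outliers}))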

It depends!
Entry Quality: Is the record entered correctly?
Process Quality: Was the record correct throughout the system?
Identity Quality: Are similar entities resolved to be the same?
Integration Quality: Is the data sufficiently integrated to answer relevant questions?
Usage Quality: Is the data being used correctly?
Age Quality: Is the data new enough to be trusted?

Things we check in the organization: fixes may be non-technical.
Things we check in the architecture: fixes can be costly!
Things we check across many data sets: fixes may need extra intelligence.
Things we check in single record sets and data streams: fixes can be automatic and independent.

Observation: it's too expensive to clean all the data in every way, so how do we decide what to clean?
We need a framework that helps to: determine what issues might occur in the data, weight the criticality of the issues, and profile the data to score quality.
The framework allows us to approach quality as an ever-increasing standard and to prioritize data cleaning activities.

Definitional, Incorrect, Too-Soon-to-Tell. Constants, Definition Mismatches, Filler Containing Data, Inconsistent Cases, Inconsistent Data Types, Inconsistent Null Rules, Invalid Keys, Invalid Values, Miscellaneous, Missing Values, Orphans, Out of Range, Pattern Exceptions, Potential Constants, Potential Defaults, Potential Duplicates, Potential Invalids, Potential Redundant Values, Potential Unused Fields, Rule Exceptions, Unused Fields.

Issues should be weighted in the context of our system: invalid or missing data is most problematic; "potential" issues can be weighted lower.
Weight Factor  Issue Type
1              Constants
2              Definition Mismatches
2              Filler Containing Data
1              Inconsistent Cases
2              Inconsistent Data Types
3              Inconsistent Null Rules
5              Invalid Keys
5              Invalid Values

Weight  Issue Type               Discovered  Possible
1       Constants                1           59
2       Definition Mismatches    4           59
2       Filler Containing Data   1           59
1       Inconsistent Cases       3           59
2       Inconsistent Data Types  15          59
3       Inconsistent Null Rules  6           59
5       Invalid Keys             1           3
5       Invalid Values           5           59
Any issue has a possible maximum; this can be the cardinality of rows, keys, etc.
Quality Score: Raw Score 91.3%, Weighted Score 90.4%.
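One plausible reading of these scores (the slide does not spell out the formula) is score = 1 - issues discovered / issues possible, with the weighted variant scaling both counts by the weight factor. The Python check below reproduces the 91.3% and 90.4% figures under that assumption.

    # Raw and weighted quality scores from the table above.
    issues = [
        # (weight, issue type, discovered, possible)
        (1, "Constants", 1, 59),
        (2, "Definition Mismatches", 4, 59),
        (2, "Filler Containing Data", 1, 59),
        (1, "Inconsistent Cases", 3, 59),
        (2, "Inconsistent Data Types", 15, 59),
        (3, "Inconsistent Null Rules", 6, 59),
        (5, "Invalid Keys", 1, 3),
        (5, "Invalid Values", 5, 59),
    ]
    raw = 1 - sum(d for _, _, d, _ in issues) / sum(p for _, _, _, p in issues)
    weighted = 1 - sum(w * d for w, _, d, _ in issues) / sum(w * p for w, _, _, p in issues)
    print(f"Raw score: {raw:.1%}, Weighted score: {weighted:.1%}")   # 91.3% and 90.4%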

Nuclear Option (Listwise Removal): if a record has a missing value, ignore the entire record in analyses.
Softer Option (Pairwise Removal): if a record has a missing value, ignore the record iff the missing value is in the analyzed set.
Infer Substitute Values:
Statistical (Interpolation): predict new data points in the range of a set (simple linear, splines, kriging).
Model-based: impute missing values based on a model (EM algorithm, etc.).
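A small pandas sketch of these options on a hypothetical two-column frame; the column names and values are made up, and the column-mean fill is only a crude stand-in for real model-based imputation.

    import pandas as pd

    df = pd.DataFrame({"age": [34, None, 41, 29], "income": [52_000, 61_000, None, 48_000]})

    listwise = df.dropna()                    # nuclear option: drop any record with a missing value
    pairwise = df.dropna(subset=["income"])   # pairwise: drop only when the analyzed column is missing
    interpolated = df.interpolate()           # statistical: simple linear interpolation per column
    imputed = df.fillna(df.mean())            # crude stand-in for model-based imputation (column means)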

What is it? The task of finding records in a data set that refer to the same entity across different data sources.
Problems: attributes/records not matching 100%; how to apply it to different/changing data sets; missing values; data quality, errors, etc.; semantic relations.

Normalize data. Link records within a table. Link records across tables.

Deterministic: rules + record-level match.
Fuzzy: match similar values and records and attempt to determine if they are a match, no match, or possible match.

Deterministic: rules + attribute-level match
if a.ssn == b.ssn => match
if (missing(a.ssn) or missing(b.ssn)) =>
    if ((a.dob == b.dob) and (a.zip == b.zip) and (a.sex == b.sex)) => match
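A runnable sketch of the deterministic rule above; the dictionary-based records and the is_missing helper are assumptions made for illustration.

    def is_missing(value):
        return value is None or value == ""

    def deterministic_match(a, b):
        # Rule 1: identical, present SSNs are an immediate match.
        if not is_missing(a["ssn"]) and not is_missing(b["ssn"]) and a["ssn"] == b["ssn"]:
            return True
        # Rule 2: if either SSN is missing, fall back to DOB + ZIP + sex.
        if is_missing(a["ssn"]) or is_missing(b["ssn"]):
            return a["dob"] == b["dob"] and a["zip"] == b["zip"] and a["sex"] == b["sex"]
        return False

    a = {"ssn": None, "dob": "1976-01-14", "zip": "55101", "sex": "F"}
    b = {"ssn": "605782877", "dob": "1976-01-14", "zip": "55101", "sex": "F"}
    print(deterministic_match(a, b))   # True via the fallback rule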

Fuzzy: similarity of attributes + distance of records + match thresholds + learning; outcomes are match, non-match, or possible match.
Example records (SSN, Name, DOB, Place of Birth):
605782877, Nilsson, 1/14/76, Sainte-Paul
<missing>, Nilson, 1/14/76, Minnesota
<missing>, Nilsson, 1/13/76, Saint-Paul
<missing>, Smith, 8/8/68, Barstow
762009827, Smyth, <m>/8/68, Barstow
720384793, Carlson, 8/8/68, Barstow
if (a.ssn == b.ssn) true, else compare the Name, DOB, and POB attributes of a and b.

Recall vs. precision. In practice: fuzzy matching + rules. Rules can be generated by using machine learning.

Levenshtein Distance (Edit Distance) Jaro Distance Jaro-Winkler Distance

Minimum number of single-character edits: insert, delete, substitute.
Stokcholm -> Stockholm {Distance = 2}
sanfransicso -> san francisco {Distance = 4}
sanfransicso -> San Francisco {Distance = 6}
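A standard dynamic-programming implementation of this edit distance (not taken from the slides) reproduces the three distances quoted above.

    def levenshtein(s, t):
        # prev[j] holds the distance between the processed prefix of s and t[:j]
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, start=1):
            curr = [i]
            for j, ct in enumerate(t, start=1):
                cost = 0 if cs == ct else 1
                curr.append(min(prev[j] + 1,          # delete from s
                                curr[j - 1] + 1,      # insert into s
                                prev[j - 1] + cost))  # substitute (or keep)
            prev = curr
        return prev[-1]

    print(levenshtein("Stokcholm", "Stockholm"))         # 2
    print(levenshtein("sanfransicso", "san francisco"))  # 4
    print(levenshtein("sanfransicso", "San Francisco"))  # 6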

Hamming Distance, Damerau-Levenshtein Distance, Longest Common Substring Distance, Cosine Distance, Jaccard Distance, Jaro Distance, N-grams.

Assume you want to link records across sources that do not share a common record key. These records may have data quality issues. How would you create such linkage using fuzzy and deterministic methods?

Clean each single source. Is there a common identifier attribute?
Yes: join on that attribute.
No: select a set of comparable attributes; calculate similarity and identify similar records, marking them; then join on the mark attribute.

Defining similarity is application-specific. Assume you have no common, unique key: what can you do to identify similar records in different (or the same) sources?
Example: cosine similarity for numeric/continuous attributes, edit distance for strings, Jaccard for binary or categorical values.

Table t1 has attributes a1..a5 and table t5 has attributes b1..b6. We map columns: a2<->b5, a3<->b3, a5<->b6. Compare record 1 from table 1 (t1.1) with record 100 from table 5 (t5.100): are they representing the same entity? Are the attributes continuous, categorical, string, etc.?

For continuous attributes we can calculate cosine distance

String: calculate an {edit} distance. Categorical: bit vectors (0/1) for different/same

Score = cosine distance of continuous attributes + normalized score of string similarities + weighted score of categorical values:
score = w1*(cd(cont(r1, r5))) + w2*(ed(r1.a_n, r5.b_m)) + w3*(compare(r1.a_k, r5.a_l))
where cd is the cosine distance and ed is the edit distance. Note: this is just an illustration; how to model similarity is application-specific. You define score ranges for same, possibly same, and not same using thresholds.
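Continuing the illustration, a Python sketch of such a combined score; the weights, thresholds, attribute names, and the use of difflib's SequenceMatcher as a stand-in for a normalized edit distance are all assumptions, not from the slides.

    import math
    from difflib import SequenceMatcher

    def cosine_similarity(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def string_similarity(s, t):
        return SequenceMatcher(None, s, t).ratio()   # stand-in for a normalized edit distance

    def categorical_similarity(a, b):
        return 1.0 if a == b else 0.0

    def record_score(r1, r5, w=(0.4, 0.4, 0.2)):     # made-up weights w1, w2, w3
        return (w[0] * cosine_similarity(r1["continuous"], r5["continuous"])
                + w[1] * string_similarity(r1["name"], r5["name"])
                + w[2] * categorical_similarity(r1["state"], r5["state"]))

    def classify(score, same=0.85, possibly=0.6):    # made-up thresholds
        if score >= same:
            return "same"
        return "possibly same" if score >= possibly else "not same"

    r1 = {"continuous": [34.0, 52000.0], "name": "Nilsson", "state": "MN"}
    r5 = {"continuous": [34.0, 51500.0], "name": "Nilson", "state": "MN"}
    print(classify(record_score(r1, r5)))            # "same" for these nearly identical records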

Entity matching can be very computationally intensive. How do we scale it? Understand the basic scaling techniques and their pros and cons.

Comparing a source of size N against a source of size M has polynomial complexity: N*M record comparisons, expensive comparison operations, and possibly multiple passes over the data.

Precision and recall. Efficiency: processing time, MB of processed data / cost of infrastructure.

Efficiency improvements:
Reduce # of comparisons: blocking, canopy, sliding window, clustering.
Reduce comparison complexity: filtering (decision tree), feature selection.

Sliding Window: create a key from relevant data, sort on the key, and compare records within a sliding window.
Pros: fewer comparisons. Cons: sensitivity to key selection, missed matches.
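A sketch of the sliding-window idea; the window size and the composite sort key (name prefix plus ZIP prefix) are assumptions for illustration.

    def sliding_window_pairs(records, key, w=5):
        ordered = sorted(records, key=key)           # sort once on the derived key
        for i, rec in enumerate(ordered):
            for other in ordered[i + 1 : i + w]:     # only neighbors inside the window
                yield rec, other                     # candidate pair for the expensive matcher

    people = [{"name": "Nilsson", "zip": "55101"},
              {"name": "Nilson", "zip": "55101"},
              {"name": "Smith", "zip": "93117"}]
    for a, b in sliding_window_pairs(people, key=lambda r: r["name"][:3] + r["zip"][:3], w=2):
        print(a["name"], "<->", b["name"])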

Blocking: define a blocking key, create blocks, and compare only within blocks.
Pros: fewer comparisons. Cons: missed matches.
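A sketch of blocking with ZIP code as the (assumed) blocking key; records in different blocks are never compared, which is where both the savings and the missed matches come from.

    from collections import defaultdict
    from itertools import combinations

    def blocked_pairs(records, block_key):
        blocks = defaultdict(list)
        for rec in records:
            blocks[block_key(rec)].append(rec)       # build the blocks
        for block in blocks.values():
            yield from combinations(block, 2)        # compare only within a block

    people = [{"name": "Nilsson", "zip": "55101"},
              {"name": "Nilson", "zip": "55101"},
              {"name": "Smith", "zip": "93117"}]
    for a, b in blocked_pairs(people, block_key=lambda r: r["zip"]):
        print(a["name"], "<->", b["name"])           # only the two 55101 records are compared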

Clustering: similarity is treated as transitive (a->b->c), using a union-find algorithm.
Pros: fewer comparisons. Cons: potentially complex computation.
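A minimal union-find sketch showing how accepted matches are merged transitively into clusters; the string keys are placeholders for record identifiers.

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:                 # path halving keeps the trees shallow
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    union("a", "b")                           # a matches b
    union("b", "c")                           # b matches c
    print(find("a") == find("c"))             # True: a, b, c end up in one cluster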

Canopy Clustering: records can belong to multiple canopies; cluster into canopies using a cheap algorithm.
Pros: fewer comparisons. Cons: sensitivity to key selection, missed matches.
1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new 'canopy'.
3. For each point left in the set, assign it to the new canopy if the distance is less than the loose distance.
4. If the distance of the point is additionally less than the tight distance, remove it from the original set.
5. Repeat from step 2 until there are no more data points in the set to cluster.
6. Clustered canopies are sub-clustered using an expensive but accurate algorithm.
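A sketch of steps 1-5 with a cheap one-dimensional distance; the loose and tight thresholds and the sample points are assumptions for illustration.

    def canopies(points, cheap_dist, t_loose, t_tight):
        remaining = list(points)
        result = []
        while remaining:
            center = remaining.pop(0)                 # step 2: start a new canopy
            canopy = [center]
            still_remaining = []
            for p in remaining:
                d = cheap_dist(center, p)
                if d < t_loose:                       # step 3: loosely close, join the canopy
                    canopy.append(p)
                if d >= t_tight:                      # step 4: only tightly close points leave the pool
                    still_remaining.append(p)
            remaining = still_remaining
            result.append(canopy)
        return result

    points = [1.0, 1.2, 1.9, 8.0, 8.3, 15.0]
    print(canopies(points, cheap_dist=lambda a, b: abs(a - b), t_loose=2.0, t_tight=0.5))
    # a point such as 1.9 can land in more than one canopy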

Decision Tree

Speed up: algorithm. Scale up: bigger servers. Scale out: Hadoop.