Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Data Ingestion: ETL, Distcp, Kafka, OpenRefine
Query & Exploration: SQL, Search, Cypher, ...
Stream Processing Platforms: Storm, Spark, ...
Batch Processing Platforms: MapReduce, SparkSQL, BigQuery, Hive, Cypher, ...
Data Definition: SQL DDL, Avro, Protobuf, CSV
Storage Systems: HDFS, RDBMS, Column Stores, Graph Databases
Data Serving: BI, Cubes, RDBMS, Key-value Stores, Tableau, ...
Compute Platforms: Distributed Commodity, Clustered High-Performance, Single Node

Analytics solutions start with data ingestion. The problem can be one of volume (many similar integrations), variety (many different integrations), or velocity (batch vs. real-time), or all of the above.

Maslow's pyramid of needs, recast as a pyramid of effective analytics:
Predictive needs: Prediction, Clustering, Classification
Understanding needs: Real-time, Streaming, Visualization, Query, OLAP
Basic needs: Aggregation, Join, Filtering, Indexing; Data Quality, Structure; Data Ingest, Data Persistence, Architecture, ETL

Data sources: Business Systems, Web Logs, Email Logs, Transaction Logs; feeding Data Management Systems and Data Storage Systems.

The same sources (Business Systems, Web Logs, Email Logs, Transaction Logs) feeding Data Management Systems and Data Storage Systems bring variety challenges: different structures, different manifest data types, different literals for the same data type, different keys.

With multiple producers (Business Systems, Web Logs, Email Logs, Transaction Logs) feeding Data Management and Data Storage Systems, and multiple consumers (Consumer 1, Consumer 2), reliability challenges appear: failure when producers push data, failure storing received data, network failures while transferring.

Full dumps: All data is provided at once and replaces all previous data.
Incremental: Increments are provided in any order; the data interval in an increment is provided as metadata. On an hourly, daily, or weekly cadence.
Append: Always appended at the end of the dataset. Order is assumed to be correct.
Change stream: Stream of incoming data as individual rows or in small batches. On a sub-second, minute, or hourly cadence.

ETL: Prepare data before loading so that the target system can spend its cycles on reporting, querying, etc. Requires the transforms to know which reporting and queries to enable.

ELT: Made possible by more powerful target systems. Provides more flexibility at later stages than ETL.

Kafka, Kinesis, S4, Storm, Samza.
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Amazon Kinesis is a fully managed, cloud-based service for real-time data processing over large, distributed data streams. Amazon Kinesis can continuously capture and store terabytes of data per hour from hundreds of thousands of sources.
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
Apache Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
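Since the slides describe Kafka as a partitioned, replicated commit log sitting between producers and consumers, a minimal sketch of that producer/consumer pattern using the kafka-python client is shown below. The broker address, the topic name "web-logs", and the JSON payload are illustrative assumptions, not values from the slides.

```python
# Minimal Kafka producer/consumer sketch (kafka-python). Broker address,
# topic name and payload are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("web-logs", {"user": "u42", "url": "/home", "ts": 1700000000})
producer.flush()  # block until buffered messages are delivered to the broker

consumer = KafkaConsumer(
    "web-logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",                 # start from the beginning of the log
    consumer_timeout_ms=10000,                    # stop iterating after 10 s of silence
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```

Because the log is partitioned and retained, consumers can read at their own pace (or re-read after a failure), which is exactly the buffering behaviour described on the next slide.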

Scalability: allows many producers and consumers; partitions are the unit of scale.
Schema variety: does not really solve this.
Network bottlenecks: can handle some variability without losing messages.
Consumer bottlenecks: Kafka acts as a buffer allowing producers and consumers to work at different speeds.
Bursts: handles buffering of messages between producers and consumers.
Reliability, fault tolerance: allows reading of messages if a consumer fails.

Data is produced in bursts. Consumers can only consume at a certain rate. Dropping data can be a problem.

Misspellings: {Montgromery street}
Outliers: {1,2,4,2,4,123,3,4}
Missing values: {6,7,4,3,,4,5,6}
Incorrect values: {invalid zip, negative number}, e.g. {94025, -345, 96066, }
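A minimal sketch of how such single-column issues could be flagged automatically, using only the Python standard library. The z-score threshold and the US 5-digit zip-code rule are illustrative assumptions, not rules from the slides.

```python
# Simple automatic checks for missing values, outliers and invalid values.
import re
import statistics

def find_missing(values):
    # Indices of empty or None entries.
    return [i for i, v in enumerate(values) if v is None or v == ""]

def find_outliers(values, z_threshold=2.5):
    # Flag numeric values more than z_threshold standard deviations from the mean.
    nums = [v for v in values if isinstance(v, (int, float))]
    mean, stdev = statistics.mean(nums), statistics.pstdev(nums)
    return [v for v in nums if stdev > 0 and abs(v - mean) / stdev > z_threshold]

def find_invalid_zips(values):
    # Accept only 5-digit US zip codes; everything else is flagged.
    return [v for v in values if not re.fullmatch(r"\d{5}", str(v))]

print(find_outliers([1, 2, 4, 2, 4, 123, 3, 4]))       # -> [123]
print(find_missing([6, 7, 4, 3, None, 4, 5, 6]))       # -> [4]
print(find_invalid_zips(["94025", "-345", "96066"]))   # -> ['-345']
```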

It depends!
Entry Quality: Is the record entered correctly?
Process Quality: Was the record correct throughout the system?
Identity Quality: Are similar entities resolved to be the same?
Integration Quality: Is the data sufficiently integrated to answer relevant questions?
Usage Quality: Is the data being used correctly?
Age Quality: Is the data new enough to be trusted?

Things we check in the organization: fixes may be non-technical.
Things we check in the architecture: fixes can be costly!
Things we check across many data sets: fixes may need extra intelligence.
Things we check in single record sets and data streams: fixes can be automatic and independent.

It's too expensive to clean all the data in every way, so how do we decide what to clean? We need a framework: determine what issues might occur in the data, weight the criticality of the issues, and profile the data to score quality. This allows quality to be an ever-increasing standard and lets us prioritize data cleaning activities.

Issue classifications: Definitional, Incorrect, Too-Soon-to-Tell.
Issue types: Constants, Definition Mismatches, Filler Containing Data, Inconsistent Cases, Inconsistent Data Types, Inconsistent Null Rules, Invalid Keys, Invalid Values, Miscellaneous, Missing Values, Orphans, Out of Range, Pattern Exceptions, Potential Constants, Potential Defaults, Potential Duplicates, Potential Invalids, Potential Redundant Values, Potential Unused Fields, Rule Exceptions, Unused Fields.

Issues should be weighted in the context of our system: invalid or missing data is most problematic; potential issues can be weighted lower.

Weight Factor  Issue Type
1              Constants
2              Definition Mismatches
2              Filler Containing Data
1              Inconsistent Cases
2              Inconsistent Data Types
3              Inconsistent Null Rules
5              Invalid Keys
5              Invalid Values

Any issue has a possible maximum; this can be the cardinality of rows, keys, constants, etc.

Weight  Issue Type               Discovered  Possible
1       Constants                1           59
2       Definition Mismatches    4           59
2       Filler Containing Data   1           59
1       Inconsistent Cases       3           59
2       Inconsistent Data Types  15          59
3       Inconsistent Null Rules  6           59
5       Invalid Keys             1           3
5       Invalid Values           1           59

Raw Score: 89.7%    Weighted Score: 76.2%

                           Priority 5  Priority 4  Priority 3  Priority 2  Priority 1
Total                      8           35          30          21          16
Average for Each Priority  35.03%      31.19%      10.17%      8.90%       9.04%
Weighted Average           175.1%      124.7%      30.5%       17.8%       9.0%
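One plausible way to turn the weights, discovered counts and possible maxima into raw and weighted scores is sketched below. The exact formula behind the 89.7% / 76.2% figures is not shown in the slides, so the combination used here is an illustrative assumption.

```python
# Sketch of the scoring framework: each issue type has a weight, a number of
# discovered occurrences, and a possible maximum. The scoring formula below
# is an assumption, not the one used to produce the slide's numbers.
issues = [
    # (weight, issue type, discovered, possible)
    (1, "Constants",               1, 59),
    (2, "Definition Mismatches",   4, 59),
    (2, "Filler Containing Data",  1, 59),
    (1, "Inconsistent Cases",      3, 59),
    (2, "Inconsistent Data Types", 15, 59),
    (3, "Inconsistent Null Rules", 6, 59),
    (5, "Invalid Keys",            1, 3),
    (5, "Invalid Values",          1, 59),
]

# Raw score: fraction of checks that found no issue, ignoring weights.
raw = 1 - sum(d for _, _, d, p in issues) / sum(p for _, _, _, p in issues)

# Weighted score: the same fraction, but each issue type counts `weight` times.
weighted = 1 - (sum(w * d for w, _, d, p in issues)
                / sum(w * p for w, _, _, p in issues))

print(f"raw score      = {raw:.1%}")
print(f"weighted score = {weighted:.1%}")
```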

Nuclear option: listwise removal.
Softer option: pairwise removal.
Understand the data and infer substitute values. How to infer:
Statistical (interpolation): predict new data points within the range of a discrete set (simple linear, splines, kriging).
Model-based: impute missing values based on a model (EM algorithm, etc.).
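A short sketch of two of these options using pandas (assumed available): listwise removal with dropna(), and simple linear interpolation to infer a substitute for the missing entry in the earlier example {6,7,4,3,,4,5,6}.

```python
# Listwise removal vs. simple linear interpolation of a missing value.
import pandas as pd

s = pd.Series([6, 7, 4, 3, None, 4, 5, 6])

listwise = s.dropna()            # nuclear option: drop entries with missing values
interpolated = s.interpolate()   # statistical option: simple linear interpolation

print(interpolated.tolist())     # the gap is filled with (3 + 4) / 2 = 3.5
```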

What is it? The task of finding records in a data set that refer to the same entity across different data sources.
Problems: attributes/records not matching 100%; how to apply it to different or changing data sets; missing values; data quality, errors, etc.; semantic relations.

Normalize data. Link records within a table. Link records across tables.

Deterministic: rules + record-level match.
Fuzzy: match similar values and records and attempt to determine whether they are a match, no match, or a possible match.

Deterministic: rules + attribute-level match.
if a.ssn == b.ssn => match
if (missing(a.ssn) or missing(b.ssn)) =>
    if ((a.dob == b.dob) and (a.zip == b.zip) and (a.sex == b.sex)) => match
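A runnable Python version of these deterministic rules, as a sketch. The dict-based record layout, the example values, and the choice to treat two present-but-different SSNs as a non-match are assumptions.

```python
# Deterministic record match: SSN equality when available, otherwise
# fall back to dob + zip + sex. Missing values are represented as None.
def deterministic_match(a, b):
    if a["ssn"] is not None and b["ssn"] is not None:
        return a["ssn"] == b["ssn"]
    # SSN missing on either side: use the attribute-level fallback rule.
    return (a["dob"] == b["dob"]
            and a["zip"] == b["zip"]
            and a["sex"] == b["sex"])

a = {"ssn": None,        "dob": "8/8/68", "zip": "92311", "sex": "F"}
b = {"ssn": "762009827", "dob": "8/8/68", "zip": "92311", "sex": "F"}
print(deterministic_match(a, b))   # True: SSN is missing, so dob/zip/sex decide
```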

Fuzzy: similarity of attributes + distance of record + match thresholds + learning.

Example records:
SSN        Name     DOB       Place of Birth
605782877  Nilsson  1/14/76   Sainte-Paul
<missing>  Smith    8/8/68    Barstow
762009827  Smyth    <m>/8/68  Barstow
720384793  Carlson  8/8/68    Barstow

if (a.ssn == b.ssn) then match, else classify as match, non-match, or possible match. For example, compare:
     Name     DOB      POB
a:   Nilson   1/14/76  Minnesota   (SSN <missing>)
b:   Nilsson  1/13/76  Saint-Paul  (SSN <missing>)

Recall vs. precision. In practice, fuzzy matching is combined with rules, and rules can be generated using machine learning.

Levenshtein Distance (Edit Distance) Jaro Distance Jaro-Winkler Distance

Levenshtein (edit) distance: the minimum number of single-character edits (insert, delete, substitute).
Stokcholm -> Stockholm {Distance = 2}
sanfransicso -> san francisco {Distance = 4}
sanfransicso -> San Francisco {Distance = 6}
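A standard dynamic-programming implementation of Levenshtein distance, for reference:

```python
# Levenshtein (edit) distance: insertions, deletions and substitutions
# each count as one edit. Computed row by row to keep memory at O(len(t)).
def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Stokcholm", "Stockholm"))  # 2: the transposed k/c costs two edits
```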


Hamming Distance, Damerau-Levenshtein Distance, Longest Common Substring Distance, Cosine Distance, Jaccard Distance, Jaro Distance, N-grams.

Assume you want to link records across sources that do not share a common record key. These records may have data quality issues. How would you create such a linkage using fuzzy and deterministic methods?

Clean each single source. Common identifier attribute?
Yes: join on that attribute.
No: select a set of comparable attributes, calculate similarity, identify similar records and mark them, then join on the mark attribute.

Defining similarity is application specific. Assume you have no common, unique key: what can you do to identify similar records in different (or the same) sources? Example: cosine similarity for numeric/continuous attributes, edit distance for strings, Jaccard for binary or categorical values.

Table 1 has attributes a1, a2, a3, a4, a5; table 5 has attributes b1, b2, b3, b4, b5, b6. We map columns a2<->b5, a3<->b3, a5<->b6, then compare record 1 from table 1 (r1.1) with record 100 from table 5 (r5.100). Are they representing the same entity? Are the attributes continuous, categorical, string, etc.?

For continuous attributes we can calculate cosine distance

String: calculate an edit distance. Categorical: 0 or 1 for different or same.

Score = cosine distance of continuous attributes + normalized score of string similarities + weighted score of categorical values.

score = w1 * cd(cont(r1, r5)) + w2 * ed(r1.a_n, r5.b_y) + w3 * compare(r1.a_k, r5.a_l)

where cd is the cosine distance and ed is an edit distance. P.S. this is just an illustration; how to model similarity is application specific. You define score ranges for same, possibly same, not same.
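A sketch of such a combined score in Python, in which each component is converted to a similarity in [0, 1] before weighting. The field names, weights, and the use of difflib's SequenceMatcher as a stand-in string similarity are assumptions; as the slide says, how to model similarity is application specific.

```python
# Combined record similarity: cosine similarity for continuous attributes,
# a ratio-based string similarity, and exact (0/1) match for categoricals.
import math
from difflib import SequenceMatcher

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def record_score(r1, r5, w1=0.4, w2=0.4, w3=0.2):
    cont = cosine_similarity(r1["continuous"], r5["continuous"])
    name = SequenceMatcher(None, r1["name"], r5["name"]).ratio()
    cat = 1.0 if r1["category"] == r5["category"] else 0.0
    return w1 * cont + w2 * name + w3 * cat

r1 = {"continuous": [34.0, 180.5], "name": "Nilsson", "category": "SE"}
r5 = {"continuous": [34.0, 181.0], "name": "Nilson",  "category": "SE"}
print(record_score(r1, r5))   # compare against application-specific thresholds
                              # for same / possibly same / not same
```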

Entity matching can be very computationally intensive. How do we scale it? Understand the basic scaling techniques and their pros and cons.

Matching a data set of size N against one of size M has polynomial complexity: N*M record comparisons, expensive comparison operations, and possibly multiple passes over the data.

Precision and recall. Efficiency: processing time, MB of processed data per cost of infrastructure.

Efficiency improvements: reduce the number of comparisons (blocking, canopy, sliding window, clustering) and reduce comparison complexity (filtering: decision tree, feature selection).

Sliding Window: create a key from the relevant data, sort on the key, and compare records within a sliding window. Pros: fewer comparisons. Cons: sensitivity to key selection, missed matches.
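A sketch of the sliding-window (sorted-neighborhood) idea; the key construction, the window size, and the sample records are illustrative choices.

```python
# Sorted-neighborhood / sliding window: sort on a derived key and compare
# each record only with its near neighbors in the sorted order.
def sliding_window_pairs(records, key, window=2):
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            yield rec, other                 # candidate pair for the expensive matcher

records = [{"name": "Nilsson", "zip": "55401"},
           {"name": "Carlson", "zip": "92311"},
           {"name": "Nilson",  "zip": "55401"}]
for a, b in sliding_window_pairs(records, key=lambda r: (r["zip"], r["name"][:3])):
    print(a["name"], "<->", b["name"])       # fewer pairs than all-vs-all
```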

Blocking: define a blocking key, create blocks, and compare within blocks. Pros: fewer comparisons. Cons: missed matches.
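A sketch of blocking on a single key; using the zip code as the blocking key and the sample records are illustrative choices.

```python
# Blocking: group records by a blocking key and compare only within a block.
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, blocking_key):
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)    # candidate pairs within one block

records = [{"name": "Nilsson", "zip": "55401"},
           {"name": "Nilson",  "zip": "55401"},
           {"name": "Carlson", "zip": "92311"}]
for a, b in blocked_pairs(records, blocking_key=lambda r: r["zip"]):
    print(a["name"], "<->", b["name"])       # only the two 55401 records are compared
```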

Clustering: treat similarity as transitive (a->b->c) and use a union-find algorithm. Pros: fewer comparisons. Cons: potentially complex computation.
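A sketch of turning pairwise matches into clusters with union-find, exploiting the transitivity assumption (a~b and b~c puts a, b, c in one cluster). Record ids and matches are illustrative.

```python
# Union-find clustering of pairwise matches.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def cluster(ids, matches):
    parent = {i: i for i in ids}
    for a, b in matches:
        parent[find(parent, a)] = find(parent, b)   # union the two clusters
    groups = {}
    for i in ids:
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())

print(cluster(["a", "b", "c", "d"], [("a", "b"), ("b", "c")]))
# -> [['a', 'b', 'c'], ['d']]
```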

Canopy Clustering: records can belong to multiple canopies; cluster into canopies using a cheap algorithm. Pros: fewer comparisons. Cons: sensitivity to key selection, missed matches.
1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new 'canopy'.
3. For each point left in the set, assign it to the new canopy if its distance is less than the loose distance.
4. If the distance of the point is additionally less than the tight distance, remove it from the original set.
5. Repeat from step 2 until there are no more data points in the set to cluster.
6. Canopies are then sub-clustered using an expensive but accurate algorithm.
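A sketch of steps 1-5 of the canopy procedure, using absolute numeric difference as the cheap distance (an illustrative assumption); the final, expensive sub-clustering step is left out.

```python
# Canopy clustering with a cheap distance; points may land in several canopies.
def canopies(points, cheap_distance, loose, tight):
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)           # step 2: start a new canopy
        canopy = [center]
        still_remaining = []
        for p in remaining:
            d = cheap_distance(center, p)
            if d < loose:                   # step 3: assign to this canopy
                canopy.append(p)
            if d >= tight:                  # step 4: points within the tight
                still_remaining.append(p)   #         distance leave the set
        remaining = still_remaining
        result.append(canopy)
    return result

print(canopies([1, 2, 3, 10, 11, 20], lambda a, b: abs(a - b), loose=5, tight=2))
# -> [[1, 2, 3], [3], [10, 11], [20]]  (note 3 appears in two canopies)
```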

Decision Tree

Speed up: algorithm improvements. Scale-up: bigger servers. Scale-out: Hadoop.