Lazy Big Data Integration

Prof. Dr. Andreas Thor
Hochschule für Telekommunikation Leipzig (HfTL)
Martin-Luther-Universität Halle-Wittenberg
16.12.2016

Agenda
- Data Integration
  - Data analytics for domain-specific questions
  - Use cases: Bibliometrics & Life Sciences
- Big Data Integration
  - Techniques for efficient big data management
  - Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
- Lazy Big Data Integration
  - (Efficient and) effective goal-oriented data integration
  - Integrated analytical approach for big data analytics

Use Case: Bibliometrics
- Does the peer review process actually work? Does it select the best papers?
- Data from the reviewing process (e.g., EasyChair)
  - Bibliographic information (title, authors, ...) of submitted papers
  - Review score(s) incl. editorial decision
- Data from bibliographic data sources (e.g., Google Scholar)
  - Accepted papers and rejected papers that are published elsewhere
  - Number of citations
- Determine covariance between review score(s) and #citations
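The last step of this use case, the covariance between review scores and citation counts, can be sketched as follows. This is only an illustration: the function name `sample_covariance` and the four score/citation values are made up for the example and do not come from the talk.

```python
def sample_covariance(xs, ys):
    """Sample covariance (n - 1 in the denominator) of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical values for illustration only (not data from the talk).
review_scores = [1.0, 2.0, 3.0, 4.0]
citation_counts = [2, 4, 6, 8]

cov = sample_covariance(review_scores, citation_counts)
print(cov)
```

A positive value indicates that higher review scores tend to go along with more citations; the integration steps in the rest of the deck are what make the two input columns available for the same set of papers.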

Data Integration
- Combining data residing at different sources and providing the user with a unified view of this data
- Added value by linking & merging data
  - Queries that can only be answered using multiple sources
- Schema Matching: finding mappings of corresponding attributes
- Entity Matching: finding equivalent data objects

Data Quality
- Can/should we use Google Scholar citations for ranking papers, researchers, institutions, etc.?

                Google Scholar            Web of Science
  Coverage      Huge                      Limited
  Data quality  Medium (fully automatic)  High (manually curated)
  Costs         Free                      Expensive

- Convergent validity of citation analyses?
  - Comparison of analysis results for source overlap

Use Case: Life Sciences
- Gene Annotation Graph
  - Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms
  - Links form a graph that captures meaningful biological knowledge
- Sense making of graph is important
  - Prediction of new annotations -> hypotheses for wet lab experiments
- Question posed on the figure: Is this annotation likely to be added in the future?

Graph Summarization + Link Prediction
- Graph summary = Signature + Corrections
- Signature: graph pattern / structure
  - Super nodes = partitioning of nodes
  - Super edges = edges between super nodes = all edges between nodes of super nodes
- Corrections: edges e between individual nodes
  - Additions: e ∈ G but e ∉ signature
  - Deletions: e ∉ G but e ∈ signature
- [Figure: graph between the PO terms PO_9006, PO_20030, PO_37, PO_20038 and the genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1, shown next to its summary]
- Prediction score p(PO_20030, CIB5): 0.96
  - High prediction score because it is the only missing piece to a perfect 4x6 pattern
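The split into signature and corrections can be illustrated with a short sketch. It assumes, as the slide states, that the edge (PO_20030, CIB5) is the only one missing from a complete 4x6 pattern; the function names `expand_signature` and `corrections` and the super node names `terms` and `genes` are invented for this example. The sketch only derives the corrections; it does not attempt to reproduce the prediction score of 0.96, whose formula is not given in the slides.

```python
from itertools import product


def expand_signature(super_nodes, super_edges):
    """All node-level edges implied by the super edges of a signature."""
    return {
        (u, v)
        for left, right in super_edges
        for u, v in product(super_nodes[left], super_nodes[right])
    }


def corrections(edges, super_nodes, super_edges):
    """Additions: edges in G but not in the signature.
    Deletions: edges in the signature but not in G."""
    expanded = expand_signature(super_nodes, super_edges)
    additions = set(edges) - expanded
    deletions = expanded - set(edges)
    return additions, deletions


PO_TERMS = ["PO_9006", "PO_20030", "PO_37", "PO_20038"]
GENES = ["HY5", "PHOT1", "CIB5", "CRY2", "COP1", "CRY1"]

# Assumption taken from the slide text: every term-gene edge exists except one.
edges = set(product(PO_TERMS, GENES)) - {("PO_20030", "CIB5")}

# Hypothetical signature: one super node per side, connected by one super edge.
super_nodes = {"terms": PO_TERMS, "genes": GENES}
super_edges = [("terms", "genes")]

additions, deletions = corrections(edges, super_nodes, super_edges)
print(additions, deletions)
```

In this setting the single deletion is exactly the edge the slide singles out as a likely future annotation, which is why deletions are natural candidates for link prediction.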

(Big) Data Analytics Pipeline
- Use cases: Bibliometrics, Life Sciences
- Stages: Acquisition -> Extraction/Cleaning -> Schema & Entity Matching -> Fusion/Aggregation -> Analytics/Visualization

(Big) Data Analytics Pipeline
- Use cases: Bibliometrics, Life Sciences
- Stages: Acquisition -> Extraction/Cleaning -> Schema & Entity Matching -> Fusion/Aggregation -> Analytics/Visualization
- Techniques shown along the pipeline: Query Strategies, Product Code Extraction, Load Balancing, NoSQL Sharding
- Dedoop (Deduplication with Hadoop)

How to speed up entity matching?
- Entity matching is expensive (due to pair-wise comparisons)
- Blocking to reduce search space
  - Group similar entities within blocks based on blocking key
  - Restrict matching to entities from the same block
- Parallelization
  - Split match computation in sub-tasks to be executed in parallel
  - Exploitation of cloud infrastructures and frameworks like MapReduce
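A minimal sketch of how blocking shrinks the search space, using the 14 entities and blocking keys (w, x, y, z) of the deck's running example. The function name `candidate_pairs` is an assumption made for this illustration, and the sketch runs in a single process rather than on MapReduce.

```python
from collections import defaultdict
from itertools import combinations

# Entities and blocking keys of the running example in the slides.
BLOCKING_KEYS = {
    "A": "w", "B": "w", "C": "x", "D": "x", "E": "x", "F": "z", "G": "z",
    "H": "w", "I": "w", "K": "y", "L": "y", "M": "z", "N": "z", "O": "z",
}


def candidate_pairs(keys_by_entity):
    """Group entities by blocking key and pair up only entities of the same block."""
    blocks = defaultdict(list)
    for entity, key in keys_by_entity.items():
        blocks[key].append(entity)
    return [pair for members in blocks.values() for pair in combinations(members, 2)]


all_pairs = list(combinations(BLOCKING_KEYS, 2))  # no blocking: every pair
blocked_pairs = candidate_pairs(BLOCKING_KEYS)    # blocking: same-key pairs only

print(len(all_pairs), len(blocked_pairs))
```

Without blocking all 91 pairs would be compared; with blocking only the 20 pairs inside the four blocks remain.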

Blocking + MapReduce: Naïve
- Partitioning: BlockKey mod #ReduceTasks
- Data skew leads to unbalanced workload

Input split (entity with its blocking key):
  Π1: A(w) B(w) C(x) D(x) E(x) F(z) G(z)
  Π2: H(w) I(w) K(y) L(y) M(z) N(z) O(z)

Map: Blocking (key = blocking key, value = entity)
  map 1: (w,A) (w,B) (x,C) (x,D) (x,E) (z,F) (z,G)
  map 2: (w,H) (w,I) (y,K) (y,L) (z,M) (z,N) (z,O)

Reduce: Matching
  reduce 1, keys w and z, entities A B H I F G M N O
    Pairs: A-B, A-H, A-I, B-H, B-I, H-I, F-G, F-M, F-N, F-O, G-M, G-N, G-O, M-N, M-O, N-O
  reduce 2, key y, entities K L
    Pairs: K-L
  reduce 3, key x, entities C D E
    Pairs: C-D, C-E, D-E
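The imbalance of the naive scheme can be quantified with a small sketch. The key-to-reducer mapping is copied from the slide's figure (w and z go to reduce 1, y to reduce 2, x to reduce 3) instead of being computed with an actual hash function; the names `REDUCER_OF_KEY` and `reducer_load` are assumptions for this illustration.

```python
from collections import Counter

BLOCKING_KEYS = {
    "A": "w", "B": "w", "C": "x", "D": "x", "E": "x", "F": "z", "G": "z",
    "H": "w", "I": "w", "K": "y", "L": "y", "M": "z", "N": "z", "O": "z",
}

# Assignment as drawn on the slide (stands in for BlockKey mod #ReduceTasks).
REDUCER_OF_KEY = {"w": 1, "z": 1, "y": 2, "x": 3}


def pairs(n):
    """Number of comparisons needed to match n entities of one block."""
    return n * (n - 1) // 2


def reducer_load(keys_by_entity, reducer_of_key):
    """Comparisons per reduce task when whole blocks are routed by their key."""
    block_sizes = Counter(keys_by_entity.values())
    load = Counter()
    for key, size in block_sizes.items():
        load[reducer_of_key[key]] += pairs(size)
    return dict(load)


load = reducer_load(BLOCKING_KEYS, REDUCER_OF_KEY)
print(load)
```

Reduce task 1 ends up with 16 of the 20 comparisons while the other two tasks handle 1 and 3, so the job's runtime is dominated by a single task.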

Load Balancing for MR-based EM
- Analysis to identify blocking key distribution
- Global load balancing algorithms assigning similar number of pairs to reduce tasks

Input partitions:
  Partition     Π1             Π2
  Entity        A B C D E F G  H I K L M N O
  Blocking key  w w x x x z z  w w y y z z z

Workflow:
  Input Entities
  -> MR Job 1: Analysis (map: Blocking; reduce: Count Occurrences)
  -> Block Distribution Matrix
  -> MR Job 2: Matching (map: Load Balancing; reduce: Similarity Computation)
  -> Match Partitions M1, M2, M3 -> Match Result

Block Distribution Matrix (#E = entities, #P = pairs):
  Block  Key  Π1  Π2  #E  #P
  Φ1     w    2   2   4   6
  Φ2     y    0   2   2   1
  Φ3     x    3   0   3   3
  Φ4     z    2   3   5   10
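The analysis job can be imitated in a few lines. The sketch below counts, per blocking key and input partition, how many entities occur, and derives #E and #P from those counts. It runs in a single process as a stand-in for the first MapReduce job; the names `block_distribution_matrix` and `summarize` are assumptions for this illustration.

```python
# Input partitions of the running example: entity -> blocking key.
PARTITIONS = {
    1: {"A": "w", "B": "w", "C": "x", "D": "x", "E": "x", "F": "z", "G": "z"},
    2: {"H": "w", "I": "w", "K": "y", "L": "y", "M": "z", "N": "z", "O": "z"},
}


def pairs(n):
    """Number of comparisons needed to match n entities of one block."""
    return n * (n - 1) // 2


def block_distribution_matrix(partitions):
    """Count entities per (blocking key, partition)."""
    bdm = {}
    for partition_id, entities in partitions.items():
        for key in entities.values():
            row = bdm.setdefault(key, {p: 0 for p in partitions})
            row[partition_id] += 1
    return bdm


def summarize(bdm):
    """Per block: (number of entities, number of pairs)."""
    return {key: (sum(row.values()), pairs(sum(row.values()))) for key, row in bdm.items()}


bdm = block_distribution_matrix(PARTITIONS)
summary = summarize(bdm)
total_pairs = sum(p for _, p in summary.values())
print(bdm, summary, total_pairs)
```

The result reproduces the matrix on the slide, with 20 pairs overall; this total is what the load balancing step tries to spread evenly across the reduce tasks.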

BlockSplit
- Large blocks are split into m sub-blocks according to the m input partitions
  - A block is large if #P_Block > #P_Overall / #Reducer
- Two types of match tasks
  - Single (small blocks and sub-blocks)
  - Two sub-blocks
- Greedy load balancing
  - Sort match tasks by number of pairs in descending order
  - Assign match task to reducer with lowest number of pairs
- Example: r=3 reduce tasks, split Φ4 in m=2 sub-blocks

Blocks:
  Block  Key  Π1  Π2  #E  #P
  Φ1     w    2   2   4   6
  Φ2     y    0   2   2   1
  Φ3     x    3   0   3   3
  Φ4     z    2   3   5   10

Φ4's match tasks: Φ4.1, Φ4.2, and Φ4.1×2

Match tasks:
  Task     #P  Reducer
  Φ1       6   1
  Φ4.1×2   6   2
  Φ3       3   3
  Φ4.2     3   3
  Φ2       1   1
  Φ4.1     1   2
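The BlockSplit example can be traced with a short sketch that takes the per-partition entity counts of each block, creates the match tasks, and assigns them greedily. The function names and the task naming scheme (`"4.1x2"` with an ASCII `x` for the task between sub-blocks 1 and 2 of block 4) are assumptions for this illustration; ties between equally loaded reducers are broken by the lowest reducer index, which is one possible choice.

```python
from itertools import combinations


def pairs(n):
    """Number of comparisons needed to match n entities of one block."""
    return n * (n - 1) // 2


def block_split_tasks(bdm, num_reducers):
    """Create match tasks; blocks above the average reducer load are split."""
    total_pairs = sum(pairs(sum(row)) for row in bdm.values())
    threshold = total_pairs / num_reducers
    tasks = []
    for block, row in bdm.items():
        block_pairs = pairs(sum(row))
        if block_pairs <= threshold:
            tasks.append((block, block_pairs))  # small block: one single task
            continue
        subs = [(i, n) for i, n in enumerate(row, start=1) if n > 0]
        for i, n in subs:  # one task per sub-block
            tasks.append((f"{block}.{i}", pairs(n)))
        for (i, n), (j, m) in combinations(subs, 2):  # one task per sub-block pair
            tasks.append((f"{block}.{i}x{j}", n * m))
    return tasks


def greedy_assign(tasks, num_reducers):
    """Largest task first, always to the reducer with the fewest pairs so far."""
    load = [0] * num_reducers
    assignment = {}
    for name, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        target = min(range(num_reducers), key=lambda r: load[r])
        assignment[name] = target + 1  # reducers are numbered from 1
        load[target] += cost
    return assignment, load


# Entities per input partition (Π1, Π2) for blocks Φ1..Φ4 of the example.
BDM = {"1": [2, 2], "2": [0, 2], "3": [3, 0], "4": [2, 3]}

tasks = block_split_tasks(BDM, num_reducers=3)
assignment, load = greedy_assign(tasks, num_reducers=3)
print(assignment, load)
```

With r=3 the threshold is 20/3, so only Φ4 (10 pairs) is split. The resulting assignment matches the slide's table, and the reducers end up with 7, 7, and 6 pairs instead of 16, 1, and 3 under naive partitioning.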

BlockSplit: MR-flow
- MapReduce techniques
  - MapKey = ReducerIndex + MatchTask
  - Partitioning by reducer index
  - Replicate entities of sub-blocks

Input:
  Π1: A(w) B(w) C(x) D(x) E(x) F(z) G(z)
  Π2: H(w) I(w) K(y) L(y) M(z) N(z) O(z)

Map output (key = reducer.block.split; entities replicated into a sub-block pair task carry their partition index):
  map 1: (1.1.*, A) (1.1.*, B) (3.3.*, C) (3.3.*, D) (3.3.*, E)
         (2.4.1, F) (2.4.1×2, F_1) (2.4.1, G) (2.4.1×2, G_1)
  map 2: (1.1.*, H) (1.1.*, I) (1.2.*, K) (1.2.*, L)
         (2.4.1×2, M_2) (3.4.2, M) (2.4.1×2, N_2) (3.4.2, N) (2.4.1×2, O_2) (3.4.2, O)

Reduce input and resulting pairs:
  reduce 1: 1.1.* -> A B H I; 1.2.* -> K L
    Pairs: A-B, A-H, A-I, B-H, B-I, H-I, K-L
  reduce 2: 2.4.1×2 -> F G (from Π1) and M N O (from Π2); 2.4.1 -> F G
    Pairs: F-M, F-N, F-O, G-M, G-N, G-O, F-G
  reduce 3: 3.3.* -> C D E; 3.4.2 -> M N O
    Pairs: C-D, C-E, D-E, M-N, M-O, N-O
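The map-side key construction can be sketched as follows. The reducer assignment is hard-coded from the slide's example, the composite key is written as a string `reducer.block.split` with an ASCII `x` for the sub-block pair task, and the names `ASSIGNMENT`, `SPLIT_BLOCKS`, and `map_entity` are assumptions for this illustration rather than the authors' implementation.

```python
# Match task -> reducer index, as in the slide's example (r=3, block 4 split).
ASSIGNMENT = {"1": 1, "2": 1, "3": 3, "4.1": 2, "4.2": 3, "4.1x2": 2}

# Split blocks and the input partitions that hold their sub-blocks.
SPLIT_BLOCKS = {"4": [1, 2]}


def map_entity(entity, block, partition):
    """Emit (key, value) pairs for one entity; the key starts with the reducer index."""
    if block not in SPLIT_BLOCKS:
        # Unsplit block: a single task handles the whole block.
        return [(f"{ASSIGNMENT[block]}.{block}.*", entity)]
    own_task = f"{block}.{partition}"
    output = [(f"{ASSIGNMENT[own_task]}.{own_task}", entity)]
    for other in SPLIT_BLOCKS[block]:
        if other == partition:
            continue
        i, j = sorted((partition, other))
        pair_task = f"{block}.{i}x{j}"
        # Replicated entity, annotated with its partition so the reducer
        # compares only entities that come from different sub-blocks.
        output.append((f"{ASSIGNMENT[pair_task]}.{pair_task}", (entity, partition)))
    return output


print(map_entity("A", "1", 1))
print(map_entity("F", "4", 1))
print(map_entity("M", "4", 2))
```

Entities of the split block Φ4 are emitted twice: once for the task inside their own sub-block and once for the task that compares the two sub-blocks, which is the replication the slide refers to.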

Evaluation: Robustness + Scalability
- Robust against data skew (range shown: uniform distribution to all entities in a single block)
- Scalable

(Big) Data Analytics Pipeline
- Stages: Acquisition -> Extraction/Cleaning -> Schema & Entity Matching -> Fusion/Aggregation -> Analytics/Visualization
- Stage groups labelled on the slide: Integration, Interpretation

Citation Analysis Pipeline
- For a given set of BibTeX entries
  - Find matching Google Scholar entries
  - Determine aggregated citation counts
- Analytical questions for a researcher
  - Complete publication list + #citations
  - Top-5 publications
  - H-Index, average number of citations
- Analytical questions for comparing
  - Institutions
  - Research fields
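Once citation counts have been aggregated per publication, the analytical questions for a researcher reduce to simple computations. The sketch below assumes a plain mapping from publication to citation count; the publication names and counts are made up for illustration, and the function names are not from the talk.

```python
def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def top_k(citations_by_publication, k=5):
    """The k most cited publications, most cited first."""
    ranked = sorted(citations_by_publication.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical aggregated citation counts (not data from the talk).
citations = {"P1": 10, "P2": 8, "P3": 5, "P4": 4, "P5": 3, "P6": 0}

h = h_index(citations.values())
average = sum(citations.values()) / len(citations)
top5 = top_k(citations, k=5)
print(h, average, top5)
```

All three results depend on the quality of the preceding matching and aggregation steps: a missed match lowers a citation count, and a false match inflates it.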

Lazy Machine: Effectiveness
- Do the right thing! Do only things that are needed!
- Prioritization / filtering of data objects to be processed
  - Example: Top-5 publications of a researcher
  - Entity matching for highly cited Google Scholar entries
- Cut off data that does not, or is not likely to, contribute to the analytical result (anymore)
- Pipeline stages
  - Acquisition: query strategies
  - Extraction: on-demand
  - Matching: relevant entities only
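The top-5 example can be read as an early-termination strategy: process Google Scholar entries in descending citation order and stop matching as soon as five matches are found. The sketch below is one possible interpretation, not the talk's implementation. It makes a strong simplifying assumption, namely that each publication has exactly one Scholar entry, so no citation counts need to be aggregated across duplicates. The entries, the `is_match` stand-in for an expensive matcher, and the function name `lazy_top_k` are all hypothetical.

```python
def lazy_top_k(scholar_entries, is_match, k=5):
    """Match entries in descending citation order; stop after k matches.

    Assumes one Scholar entry per publication (no aggregation of duplicates).
    Returns the matched entries and the number of match checks performed.
    """
    top, checks = [], 0
    for entry in sorted(scholar_entries, key=lambda e: e[1], reverse=True):
        checks += 1
        if is_match(entry):
            top.append(entry)
            if len(top) == k:
                break  # remaining entries cannot change the top-k result
    return top, checks


# Hypothetical entries: (title, citations, truly belongs to the researcher).
ENTRIES = [
    ("p1", 120, True), ("q1", 90, False), ("p2", 80, True), ("p3", 75, True),
    ("q2", 60, False), ("p4", 40, True), ("p5", 33, True), ("p6", 10, True),
    ("q3", 5, False), ("p7", 2, True),
]

# Stand-in for an expensive entity matcher; here it just reads the flag.
top, checks = lazy_top_k(ENTRIES, is_match=lambda entry: entry[2], k=5)
print([title for title, _, _ in top], checks)
```

In this toy run only 7 of the 10 entries are passed to the matcher. If several Scholar entries can refer to the same publication, the cutoff would need a bound on how much the unprocessed entries could still add to an aggregated count.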

Lazy User: Data Quality
- Automatic data integration does not give 100% data quality
  - Data acquisition might miss relevant data
  - Matching is imperfect (precision, recall)
- Pipeline & integrated result should effectively point the user to the weak points
- Examples
  - What (non-)matching pairs have the most effect on the analytical result?
  - Outlier detection
  - What pipeline stage caused the effect?

Lazy Big Data Integration
- Integrated approach for both
  - Data integration workflow
  - Analytical query
- Current work based on Gradoop (Graph Analytics on Hadoop)
  - Graph model + operators for analytical pipelines
  - Efficient execution in distributed environment
- Next steps
  - Operators for complex analytical queries / statistics (e.g., h-index)
  - Data provenance model for measuring the impact of data objects on specific results

Summary
- Data Integration
  - Data analytics for domain-specific questions
  - Use cases: Bibliometrics & Life Sciences
- Big Data Integration
  - Techniques for efficient big data management
  - Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
- Lazy Big Data Integration
  - (Efficient and) effective goal-oriented data integration
  - Integrated analytical approach for big data analytics