Lazy Big Data Integration - PDF Free Download

Lazy Big Integration Prof. Dr. Andreas Thor Hochschule für Telekommunikation Leipzig (HfTL) Martin-Luther-Universität Halle-Wittenberg 16.12.2016

Agenda Integration analytics for domain-specific questions Use cases: Bibliometrics & Life Sciences Big Integration Techniques for efficient big data management Exploiting cloud infrastructures (MapReduce, NoSQL data stores) Lazy Big Integration (Efficient and) effective goal-oriented data integration Integrated analytical approach for big data analytics 2

Use Case: Bibliometrics Does the peer review process actually work? Does it select the best papers? from reviewing process (e.g., Easy Chair) Bibliographic information (title, authors, ) of submitted papers Review score(s) incl. editorial decision from bibliographic data sources (e.g., Google Scholar) Accepted papers and rejected papers that are published elsewhere Number of citations Determine covariance between review score(s) and #citations 3

Integration Combining data residing at different sources and providing the user with a unified view of this data 4 Added value by linking & merging data Queries that can only be answered using multiple sources Schema Matching Finding mappings of corresponding attributes Entity Matching Finding equivalent data objects

Quality Can/should we use Google Scholar citations for ranking Papers Researchers Institutions etc. Google Scholar Web of Science Coverage Huge Limited quality Medium (fully automatic) High (manually curated) Costs Free Expensive Convergent validity of citation analyses? Comparison of analysis results for source overlap 5

Use Case: Life Sciences Gene Annotation Graph Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms Links form a graph that captures meaningful biological knowledge Sense making of graph is important Prediction of new annotations hypothesis for wet lab experiments 6 Is this annotation likely to be added in the future?

Graph Summarization + Link Prediction Graph summary = Signature + Corrections Signature: graph pattern / structure Super nodes = partitioning of nodes Super edges = edges between super nodes = all edges between nodes of super nodes PO_9006 PO_20030 HY5 PHOT1 CIB5 Corrections: edges e between individual nodes Additions: e G but e signature Deletions: e G but e signature PO_37 PO_20038 = CRY2 COP1 CRY1 p(po_20030, CIB5) 0.96 7 High prediction score because it is the only missing piece to a perfect 4x6 pattern PO_9006 PO_20030 PO_37 PO_20038 HY5 PHOT1 CIB5 CRY2 COP1 CRY1

(Big) Analytics Pipeline Bibliometrics Life Sciences Acquisition Extraction/ Cleaning Schema & Entity Matching Fusion/ Aggregation Analytics / Visualization 8

(Big) Analytics Pipeline Bibliometrics Life Sciences Acquisition Extraction/ Cleaning Schema & Entity Matching Fusion/ Aggregation Analytics / Visualization Query Strategies Product Code Extraction Load Balancing NoSQL Sharding 10 Dedoop (Deduplication with Hadoop)

How to speed up entity matching? Entity matching is expensive (due to pair-wise comparisons) Blocking to reduce search space Group similar entities within blocks based on blocking key Restrict matching to entities from the same block Parallelization Split match computation in sub-tasks to be executed in parallel Exploitation of cloud infrastructures and frameworks like MapReduce 11

Input Split Partitioning: BlockKey mod #Reduce Tasks Blocking + MapReduce: Naïve skew leads to unbalanced workload S A w B w Cx D x E x F z G z H w Iw K y L y M z N z O z 12 Π 1 A w B w Cx D x E x F z G z Π 2 H w Iw K y L y M z N z O z Map: Blocking map 1 map 2 Key Value w A w B x C x D x E z F z G Key Value w H w I y K y L z M z N z O Key Value w A w B w H w I z F z G z M z N z O Key Value y K y L Key Value x C x D x E Reduce: Matching reduce 1 reduce 2 reduce 3 Pairs A-B, A-H, A-I, B-H, B-I, H-I F-G, F-M, F-N, F-O, G-M, G-N, G-O, M-N, M-O, N-O Pairs K-L Pairs C-D, C-E, D-E

Blocks Load Balancing for MR-based EM Partition Π 1 Π 2 Entity A B C D E F G H I K L M N O Blocking Key w w x x x z z w w y y z z z Analysis to identify blocking key distribution Global load balancing algorithms assigning similar number of pairs to reduce tasks 13 Input Entities MR Job1: Analysis Blocking Count Occurrences Π 1 map 1 reduce 1 Input Partitions Π 2 map 2 reduce 2 Π 1 Π 2 map 1 map 2 Load Balancing z Φ 4 2 3 5 10 Block Distribution Matrix MR Job2: Matching reduce 1 reduce 2 reduce 3 Similarity Computation Partition Match Partitions M 1 M 2 M 3 Overall Π 1 Π 2 #E #P w Φ 1 2 2 4 6 y Φ 2 0 2 2 1 x Φ 3 3 0 3 3 Match Result

Match Tasks Blocks BlockSplit Large blocks split into m sub-blocks according to m input partitions large if #P Block > #P Overall / #Reducer Two types of match tasks Single (small blocks and sub-blocks) Two sub-blocks Greedy load balancing Sort match tasks by number of pairs in descending order Assign match task to reducer with lowest number of pairs Example r=3 reduce tasks, split Φ 4 in m=2 sub-blocks Partition Overall Π 1 Π 2 #E #P w Φ 1 2 2 4 6 y Φ 2 0 2 2 1 x Φ 3 3 0 3 3 z Φ 4 2 3 5 10 #P Reducer Φ 1 6 1 Φ 4.1 2 6 2 Φ 3 3 3 Φ 4.2 3 3 Φ 2 1 1 Φ 4.1 1 2 Φ 4 s match tasks: Φ 4.1, Φ 4.2, and Φ 4.1 2 14

Partitioning by Reducer-Index BlockSplit: MR-flow MapReduce Techniques MapKey = ReducerIndex + MatchTask Replicate entities of sub-blocks 15 Π 1 A w B w Cx D x E x F z G z Π 2 H w Iw K y L y M z N z O z map 1 map 2 Map Key Value 1.1.* A 1.1.* B 3.3.* C 3.3.* D 3.3.* E 2.4.1 F 2.4.1 2 F 1 2.4.1 G 2.4.1 2 G 1 Key Value 1.1.* H 1.1.* I 1.2.* K 1.2.* L 2.4.1 2 M 2 3.4.2 M 2.4.1 2 N 2 3.4.2 N 2.4.1 2 O 2 3.4.2 O Reduce Key Value 1.1.* A 1.1.* B 1.1.* H 1.1.* I 1.2.* K 1.2.* L Key Value 2.4.1 2 F 0 2.4.1 2 G 0 2.4.1 2 M 1 2.4.1 2 N 1 2.4.1 2 O 1 2.4.1 F 2.4.1 G Key Value 3.3.* C 3.3.* D 3.3.* E 3.4.2 M 3.4.2 N 3.4.2 O reduce 1 reduce 2 reduce 3 Pairs A-B, A-H, A-I, B-H, B-I, H-I K-L Pairs F-M, F-N, F-O, G-M, G-N, G-O F-G Pairs C-D, C-E, D-E M-N, M-O, N-O

Evaluation: Robustness + Scalability Robust against data skew Uniform distribution All entities in a single block Scalable 16

(Big) Analytics Pipeline Acquisition Extraction/ Cleaning Schema & Entity Matching Fusion/ Aggregation Analytics / Visualization Integration Interpretation 18

Citation Analysis Pipeline For a given set of Bibtex entries Find matching Google Scholar entries Determine aggregated citation counts Analytical questions for a researcher Complete publication list + #citations Top-5 publications H-Index, Average Number of citations Analytical questions for comparing Institutions Research fields 19

Lazy Machine : Effectiveness Do the right thing! Do only things that are needed! Priorization / filtering of data objects to be processed Example: Top-5 publications of a researcher Entity Matching for highly cited Google Scholar entries Cutoff data that does not contribute to the analytical result (anymore) does not is not likely to Pipeline stages Akquisition: query strategies Extraction: on-demand Matching: relevant entities only 20

Lazy User : Quality Automatic data integration does not give 100% data quality acquisition might miss relevant data Matching is imperfect (precision, recall) Pipeline & integrated result should effectively point the user to the weak points Examples What (non-)matching pairs have the most effect on the analytical result? Outlier detection What pipeline stage caused the effect? 21

Lazy Big Integration Integrated approach for both integration workflow Analytical query Current work based on Gradoop (Graph Analytics on Hadoop) Graph model + operators for analytical pipelines Efficient execution in distributed environment Next steps Operators for complex analytical queries / statistics (e.g., h-index) provenance model for measuring the impact of data objects to specific results 22

Summary Integration analytics for domain-specific questions Use cases: Bibliometrics & Life Sciences Big Integration Techniques for efficient big data management Exploiting cloud infrastructures (MapReduce, NoSQL data stores) Lazy Big Integration (Efficient and) effective goal-oriented data integration Integrated analytical approach for big data analytics 23