RDFPath. Path Query Processing on Large RDF Graphs with MapReduce. 29 May 2011

Similar documents
Sempala. Interactive SPARQL Query Processing on Hadoop

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Introduction to Hadoop and MapReduce

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Computing: MapReduce Jin, Hai

A brief history on Hadoop

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Hadoop/MapReduce Computing Paradigm

Introduction to Data Management CSE 344

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Clustering Lecture 8: MapReduce

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES

Optimal Query Processing in Semantic Web using Cloud Computing

Database Systems CSE 414

The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012):

Databases 2 (VU) ( / )

MapReduce Design Patterns

Introduction to Data Management CSE 344

Map-Reduce Applications: Counting, Graph Shortest Paths

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web

Map Reduce & Hadoop Recommended Text:

Importing and Exporting Data Between Hadoop and MySQL

HaLoop Efficient Iterative Data Processing on Large Clusters

Improving the MapReduce Big Data Processing Framework

MapReduce. Stony Brook University CSE545, Fall 2016

1. Introduction to MapReduce

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

Data Analysis Using MapReduce in Hadoop Environment

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Hadoop Map Reduce 10/17/2018 1

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Big Data Hadoop Stack

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Embedded Technosolutions

Database System Architectures Parallel DBs, MapReduce, ColumnStores

The MapReduce Framework

Big Data with Hadoop Ecosystem

BigData and MapReduce with Hadoop

MapReduce-style data processing

Certified Big Data and Hadoop Course Curriculum

Shark: Hive (SQL) on Spark

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

Developing MapReduce Programs

Tutorial Outline. Map/Reduce. Data Center as a computer [Patterson, cacm 2008] Acknowledgements

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

Chapter 5. The MapReduce Programming Model and Implementation

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

A Research on RDF Analytical Query Optimization using MapReduce with SPARQL

Map Reduce. Yerevan.

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

50 Must Read Hadoop Interview Questions & Answers

A Comparison of MapReduce Join Algorithms for RDF

Welcome to the New Era of Cloud Computing

Pivoting Entity-Attribute-Value data Using MapReduce for Bulk Extraction

Apache Hive for Oracle DBAs. Luís Marques

MapReduce, Hadoop and Spark. Bompotas Agorakis

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Gearing Hadoop towards HPC sytems

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Map Reduce Group Meeting

Anurag Sharma (IIT Bombay) 1 / 13

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Pig A language for data processing in Hadoop

Big Data Analytics. Rasoul Karimi

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Jaql. Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata. IBM Almaden Research Center

BigData and Map Reduce VITMAC03

CS 61C: Great Ideas in Computer Architecture. MapReduce

HADOOP FRAMEWORK FOR BIG DATA

Massive Online Analysis - Storm,Spark

Graph Analytics in the Big Data Era

Distributed computing: index building and use

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

High Performance Computing on MapReduce Programming Framework

Introduction to MapReduce

Processing 11 billions events a day with Spark. Alexander Krasheninnikov

Distributed computing: index building and use

Data Partitioning and MapReduce

Understanding NoSQL Database Implementations

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

MapReduce for Data Warehouses 1/25

Large-Scale Duplicate Detection

Introduction to MapReduce (cont.)

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

Advanced Database Technologies NoSQL: Not only SQL

Transcription:

29 May 2011 RDFPath Path Query Processing on Large RDF Graphs with MapReduce 1 st Workshop on High-Performance Computing for the Semantic Web (HPCSW 2011) Martin Przyjaciel-Zablocki Alexander Schätzle Thomas Hornung Georg Lausen University of Freiburg Databases & Information Systems

Overview 1. Motivation 2. MapReduce 4. Experimental Results 5. Summary 0. Overview RDFPath: Path Query Processing on Large RDF Graphs 2

Large RDF Graphs Semantic Web Amount of Semantic Data increases steadily RDF is W3C standard for representing Semantic Data Social Networks > 500 million active users in Facebook Interacting with >900 million objects (pages, groups, events) Social Graphs can also be represented in RDF How to explore, analyze such large RDF Graphs? Our Approach: Distributed analysis of large RDF Graphs by mapping RDFPath to MapReduce 1. Motivation Large RDF Graphs RDFPath: Path Query Processing on Large RDF Graphs 3

MapReduce MapReduce Popularized by Google & widely used Automatic parallelization of computations Fixed two-stage process: Map Reduce Map Distributed File System Clusters of commodity hardware Fault tolerance by replication Very large files / write-once, read-many pattern Apache Hadoop Well-known Open-Source implementation Used by Yahoo, Facebook, Amazon, IBM, Last.fm, 2. MapReduce RDFPath: Path Query Processing on Large RDF Graphs 4

MapReduce (2) Steps of a MapReduce execution Map phase Shuffle & Sort Reduce phase split 0 split 1 split 2 split 3 split 4 split 5 Map Map Map Reduce Reduce output 0 output 1 Input (DFS) Intermediate Results (Local) Output (DFS) Signatures Map: (in_key, in_value) list (out_key, intermediate_value) Reduce: (out_key, list (intermediate_value)) list (out_value) 2. MapReduce RDFPath: Path Query Processing on Large RDF Graphs 5

Path queries on RDF Graphs RDFPath: Path Query Processing on Large RDF Graphs 6

RDFPath Idea Navigational queries on RDF Graphs Declarative path specification inspired by XPath Particularly designed with regard to MapReduce Every location step is mapped to one MapReduce job Extendibility Features Fixed-length paths, shortest paths Aggregate functions, degree of a node Adjacent nodes and edges Path Language RDFPath: Path Query Processing on Large RDF Graphs 7

Example (1) Allen :: knows [country=equals("de")] > age Results Allen (knows) Jacob [country="de"] (age) 42 Allen (knows) Sarah [country=""] Allen (knows) Chris [country=""] 26 Sarah Chris Allen DE Jacob 42 Emily knows country age Example Queries RDFPath: Path Query Processing on Large RDF Graphs 8

Example (2) examples features Allen :: knows(*3) Results Allen (knows) Jacob Allen (knows) Jacob (knows) Emily Allen (knows) Chris Allen (knows) Sarah 26 Sarah Chris Allen DE Jacob 42 Emily knows country age Example Queries RDFPath: Path Query Processing on Large RDF Graphs 9

Implementation System Architecture RDFPath Query RDF Graph Query Processor Triple Loader Query Engine Hadoop Cluster HDFS MapReduce Framework Storing RDF Graphs RDF Graph is loaded in advance Vertical Partitioning by predicate Sequence Files stored in HDFS contain PathObjects Implementation RDFPath: Path Query Processing on Large RDF Graphs 10

Implementation (2) Query Processing Query Query Parser Instruction Sequence Instruction Interpreter MapReduce Plan Query Processor Instruction 1... Instruction n Query Engine Assignment 1 Job 1 Config 1... Assignment n Job n Config n MapReduce Framework HDFS Hadoop Cluster Results Implementation RDFPath: Path Query Processing on Large RDF Graphs 11

Implementation (3) mapping Mapping to MapReduce Jobs Allen :: knows(*2) > country [equals("de")] intermediate paths Allen (knows) Jacob (knows) Emily Allen (knows) Jacob Allen (knows) Sarah Join country Emily Jacob DE Sarah Map task: Tagging intermediate paths and country partition for join Applying filter condition Reduce task: Perform join and store resulting paths back to HDFS Implementation RDFPath: Path Query Processing on Large RDF Graphs 12

4. Experimental Results Properties of RDFPath 4. Evaluation RDFPath: Path Query Processing on Large RDF Graphs 13

Experiments Environment Setup Cluster of 10 machines (Dual Core CPU, 4GB RAM, 1GB HD) Cloudera s Distribution for Hadoop 3 Beta 1 Default configuration with 9 reducers (one per HD) Real-World Last.fm dataset 225 million RDF triples Different input-sizes instead of varying the number of machines 4. Evaluation Setup RDFPath: Path Query Processing on Large RDF Graphs 14

Time (mm:ss) Storage (GB) Query 1 The Disco Boys - I Surrender :: tracksimilar(*4) > trackalbum 6:00 5:00 50 40 SHUFFLE LOCAL 4:00 30 HDFS 3:00 20 2:00 10 1:00 0 50 100 150 200 0 0 50 100 150 200 RDF Triples (million) RDF Triples (million) Linear scaling behavior for execution times & amount of data Execution times mainly influenced by number of intermediate results stored locally as well as transferred data 4. Evaluation Query 1 RDFPath: Path Query Processing on Large RDF Graphs 15

% of reachable people Time (mm:ss) queries Query 3 Chris :: knows(*x), 1 X 10 100 80 60 40 20 Triples (million) 50 100 200 7:00 6:00 5:00 4:00 3:00 2:00 1:00 Triples (million) 50 100 200 0 1 2 3 4 5 6 7 8 9 10 0:00 1 2 3 4 5 6 7 8 9 10 Path length Path length 98% of all reachable persons can be reached by at most 7 edges Corresponds to the well-known six degrees of separation paradigm Linear scaling behavior, mainly determined by number of joins 4. Evaluation Query 3 RDFPath: Path Query Processing on Large RDF Graphs 16

% of reachable people Time (mm:ss) queries Query 3 Chris :: knows(*x), 1 X 10 100 80 60 40 20 Triples (million) 50 100 200 7:00 6:00 5:00 4:00 3:00 2:00 1:00 Triples (million) 50 100 200 0 1 2 3 4 5 6 7 8 9 10 0:00 1 2 3 4 5 6 7 8 9 10 Path length Path length 98% of all reachable persons can be reached by at most 7 edges Corresponds to the well-known six degrees of separation paradigm Linear scaling behavior, mainly determined by number of joins 4. Evaluation Query 3 RDFPath: Path Query Processing on Large RDF Graphs 17

Summary Amount of Semantic Web data is growing constantly! RDFPath Intuitive syntax for path queries Expression of interesting graph issues such as Friend of a Friend queries Investigating small world properties like six degrees of separation Designed with MapReduce in mind Execution Direct mapping to MapReduce Scaling properties of MapReduce enable analysis of large RDF Graphs Evaluation Based on real-world data from Last.fm Shows linear scaling behavior in the size of the graph Confirms the effectiveness of our approach 5. Summary RDFPath: Path Query Processing on Large RDF Graphs 18

Thanks for your attention. Thanks RDFPath: Path Query Processing on Large RDF Graphs 19

Backup Slides a) Last.fm b) Examples cont. c) Evaluation cont. d) RDFPath Features e) Mapping f) Comparison g) Reduce-Side-Join Thanks RDFPath: Path Query Processing on Large RDF Graphs 20

26,49 22,20 19,41 5,91 5,85 5,84 4,00 3,44 3,31 1,70 0,42 0,12 0,12 0,12 0,12 0,12 in % a) Last.fm knows Realname Gender Playcount User Age topfans Country listenedto+ toptracks topartists Duration Track Playcount tracksimilar artistsimilar Album Artist 4. Evaluation Last.fm RDFPath: Path Query Processing on Large RDF Graphs 21

b) Example (3) Allen :: knows > knows [country=prefix("c")] > age [min(18)][max(80)].avg() Result 34 26 Sarah Allen Jacob Emily Chris DE 42 knows country age Example Queries RDFPath: Path Query Processing on Large RDF Graphs 22

b) Example (4) * :: knows [equals("sarah")] Results Allen (knows) Sarah Chris (knows) Sarah Jacob (knows) Emily Allen (knows) Jacob Allen (knows) Chris 26 Sarah Chris Allen DE Jacob 42 Emily knows country age Example Queries RDFPath: Path Query Processing on Large RDF Graphs 23

b) Example (5) * :: knows > knows [equals("sarah")] Results Allen (knows) Chris (knows) Sarah Allen (knows) Jacob (knows) Emily 26 Sarah Chris Allen DE Jacob 42 Emily knows country age Example Queries RDFPath: Path Query Processing on Large RDF Graphs 24

b) Example (6) * :: * > * [equals("sarah")] Results Allen (knows) Chris (knows) Sarah Allen (knows) Jacob (knows) Emily... 26 Sarah Chris Allen DE Jacob 42 Emily knows country age Example Queries RDFPath: Path Query Processing on Large RDF Graphs 25

b) Example (7) Allen :: *.count() Result 3 26 Sarah Allen Jacob Emily Chris DE 42 knows country age Example Queries RDFPath: Path Query Processing on Large RDF Graphs 26

b) Example (Last.fm) Michael_Jackson :: artisttracks [trackalbum = equals(michael_jackson_-_thriller)] > tracksimilar [trackduration = min(50000)] > tracktopfans [usercountry = equals(de)]. Results Michael_Jackson (artisttracks) Michael_Jackson_-_Beat_It (tracksimilar) Michael_Jackson_-_Billie_Jean (tracktopfans) Mark Michael_Jackson (artisttracks) Michael_Jackson_-_Someone_in_the_Dark (tracksimilar) Rihanna_-_Russian_Roulette (tracktopfans) Megan Example Last.fm RDFPath: Path Query Processing on Large RDF Graphs 27

Time (mm:ss) Storage (GB) c) Query 2 (Backup) Michael_Jackson :: artisttracks [ trackalbum = equals( Michael_Jackson_-_Thriller ) ] > tracksimilar [ trackduration = min(50000) ] > tracktopfans [ country = equals( DE ) ] 3:25 20 SHUFFLE 3:15 3:05 15 10 LOCAL HDFS 2:55 5 2:45 0 50 100 150 200 0 0 50 100 150 200 RDF Triples (million) RDF Triples (million) 4. Evaluation Query 2 RDFPath: Path Query Processing on Large RDF Graphs 28

d) RDFPath Features by Example Startnode Chris :: knows * :: knows Location Step Chris :: knows > knows > age Chris :: knows (2) > age Chris :: * Filters: equals(), prefix(), suffix(), min(), max() Chris :: knows > age [min(18)] [max(67)] (6) Chris :: * > * [equals('peter')] (7) Chris :: knows [age = min(30)] [country = prefix('d')] > name Bounded Search Chris :: knows [country = equals('de')] (*3) Bounded Shortest Path Chris :: knows (*3).distance('Peter') Aggregation Functions: count(), sum(), avg(), min() and max() Chris :: *.count() Chris :: knows > age.avg() Comparison RDFPath: Path Query Processing on Large RDF Graphs 29

e) Mapping to MapReduce Jobs Composition Queries in RDFPath are composed of location steps One location step is mapped to a join in MapReduce Map tasks: Tagging for RSJs Applying filter conditions Reduce tasks: Performing RSJs Aggregating values Computing Shortest paths Checking for Cycles Implementation RDFPath: Path Query Processing on Large RDF Graphs 30

f) RDF Query Languages Comparison RDFPath: Path Query Processing on Large RDF Graphs 31

f) RDF Query Languages (2) Adjacent nodes: All nodes adjacent to a startnode Adjacent edges: All predicates of statements involving a given startnode Degree of a node: Number of predicates involving a startnode Path: Find paths between a fixed startnode and an endnote of arbitrary length Fixed-length paths: Find all paths of length 2 between startnote and endnote Distance between two resources: (length of shortest path without a max length!) Diameter of a graph: Diameter of a given RDF Graph. Comparison RDFPath: Path Query Processing on Large RDF Graphs 32

f) Comparison to SPARQL 1.0 Fixed-length paths SELECT?tmp1,?tmp2,?tmp3,?tmp4,?tmp5 WHERE { Allen knows?tmp1.?tmp1 knows?tmp2.?tmp2 knows?tmp3.?tmp3 knows?tmp4.?tmp5 knows?tmp5 } Allen :: knows(5) SPARQL RDFPath Comparison RDFPath: Path Query Processing on Large RDF Graphs 33

f) Comparison to SPARQL 1.0 Filters SELECT?tmp1,?tmp2,?tmp3,?tmp4 WHERE { Allen knows?tmp1.?tmp1 age?age1. Filter (?age1 >= 20)?tmp1 knows?tmp2.?tmp2 country DE. } Allen :: knows [age=min(20)] > knows [country=equals( DE )] SPARQL RDFPath Comparison RDFPath: Path Query Processing on Large RDF Graphs 34

f) Comparison to SPARQL 1.0 Drawbacks of SPARQL Limited navigational capabilities No Shortest Path Queries No Aggregate functions (count, min, max, ) No Abbreviations for following same edges Advantages of SPARQL Expressive general purpose query language for RDF Intuitive Syntax, related to SQL SPARQL 1.1 adds support for some path queries Comparison RDFPath: Path Query Processing on Large RDF Graphs 35

g) Reduce-Side Join Example: * :: knows(*2) > knows intermediate paths Peter (knows) Frank (knows) Chris Peter (knows) Simon (knows) Johan Klaus (knows) Simon (knows) Johan Frank (knows) Chris Peter (knows) Klaus Reduce-Side Join knows Chris Peter Johan Frank Johan Lukas Frank Chris Reduce-Side-Join RDFPath: Path Query Processing on Large RDF Graphs 36

g) Reduce-Side Join (2) Mapper Input Mapper Output intermediate paths Key Value Peter (knows) Frank (knows) Chris Peter (knows) Simon (knows) Johan Klaus (knows) Simon (knows) Johan Frank (knows) Chris Peter (knows) Klaus (Chris, 1) (Johan, 1) (Johan, 1) (Chris, 1) (Klaus, 1) Peter (knows) Frank (knows) Chris Peter (knows) Simon (knows) Johan Klaus (knows) Simon (knows) Johan Frank (knows) Chris Peter (knows) Klaus knows Chris Peter Johan Frank Johan Lukas Frank Chris (Chris, 0) (Johan, 0) (Johan, 0) (Frank, 0) Peter Frank Lukas Chris Reduce-Side-Join RDFPath: Path Query Processing on Large RDF Graphs 37

g) Reduce-Side Join (3) Sorting phase: (1) Partition according to the first keypair % #reducer (2) Sort within a partition according to the whole keypair Consequences A reducer gets all values with the same first keypair The values within a partition contain at first all new nodes and thereafter all intermediate paths Reduce-Side-Join RDFPath: Path Query Processing on Large RDF Graphs 38

g) Reduce-Side Join (4) Reducer Input Reducer Output Chris Peter Manu Peter (knows) Frank (knows) Chris Frank (knows) Chris Peter (knows) Frank (knows) Chris (knows) Peter Peter (knows) Frank (knows) Chris (knows) Manu Frank (knows) Chris (knows) Peter Frank (knows) Chris (knows) Manu Johan Frank Lukas Peter (knows) Simon (knows) Johan Klaus (knows) Simon (knows) Johan Peter (knows) Simon (knows) Johan (knows) Frank Peter (knows) Simon (knows) Johan (knows) Lukas Klaus (knows) Simon (knows) Johan (knows) Frank Klaus (knows) Simon (knows) Johan (knows) Lukas Reduce-Side-Join RDFPath: Path Query Processing on Large RDF Graphs 39