On Statistical Characteristics of Real-life Knowledge Graphs


1 On Statistical Characteristics of Real-life Knowledge Graphs
Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou
Institute for Data Science and Engineering, East China Normal University, Shanghai, China

2 Outline
- Introduction & Motivation
- Statistical Characteristics
- Data Description
- Empirical Studies
- Conclusion

3 What is a knowledge graph?
Essential elements of a knowledge graph: entities and the relationships among them.
- Entity: a person, location, organization, concept, etc.
- Relationship: a semantic relation between two entities.
- Fact: a triple of an entity, a relation and an entity.
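As a minimal sketch of the entity / relationship / fact definitions above, a knowledge graph can be held as a list of triples. The `Fact` type and the toy facts are hypothetical and chosen only for illustration; they are not taken from the datasets in this talk.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One fact: (subject entity, relation, object entity)."""
    subject: str
    relation: str
    obj: str


# Hypothetical toy facts, for illustration only.
facts = [
    Fact("Shanghai", "locatedIn", "China"),
    Fact("East China Normal University", "locatedIn", "Shanghai"),
    Fact("East China Normal University", "isA", "organization"),
]

# Entities are everything that appears as a subject or an object.
entities = {f.subject for f in facts} | {f.obj for f in facts}
# Relation types are the distinct semantic labels on the edges.
relations = {f.relation for f in facts}

print(len(entities), "entities,", len(relations), "relation types,", len(facts), "facts")
```

Each fact is a labeled, directed edge, which is why the graph metrics on the following slides apply.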


4 Famous knowledge graphs
Knowledge graphs can serve as the backbone of Web-scale applications such as search engines, question answering and text understanding.
- Academic: WordNet, YAGO, DBpedia, Probase, Freebase, Linked Open Data, etc.
- Industry: Google Knowledge Graph/Vault, Microsoft Satori, Facebook Graph Search, Baidu Zhixin, IBM Watson

5 Knowledge graph management
How can a large-scale knowledge graph be managed effectively and efficiently?
- MySQL, Oracle, Neo4j, triple stores, etc.
A knowledge graph is different from a social network:
- More semantic labels on both entities and relations
- Topic or domain sensitive
- Contains various kinds of knowledge
- Hard to define a unified schema

6 Benchmarking a knowledge graph
- A benchmark for knowledge graph management is required.
- Understanding real-life knowledge graph data is the first step, and it informs the design of a synthetic data generator.
- For comparison, we also need to analyze the data distributions of social networks.

7 Evaluating the graphs
- We evaluate 4 kinds of real-life knowledge graphs and 2 synthetic social networks via 13 statistical metrics and 4 distributions.
- We conduct a series of in-depth analyses of the evaluation results.

8 Outline
- Introduction & Motivation
- Statistical Characteristics
- Data Description
- Empirical Studies
- Conclusion

9 Large-scale graph properties
Previous work on analyzing the structural properties of large-scale graphs:
- [Broder et al. Comput. Netw. 2000] studied the Web structure as a graph via a series of metrics, e.g. degree, diameter and components.
- [Kumar et al. KDD 2006] studied the structural properties of dynamic social networks, e.g. degree and hops.
- [Boccaletti et al. Phys. Rep. 2006] surveyed studies of the structure and dynamics of complex networks.

10 Thirteen statistical metrics

11 Four kinds of distributions
- Distribution of degrees
  - In-degree and out-degree
  - Power-law distribution
- Distribution of hops
  - Reflects the connectivity cost inside a graph
- Distribution of connected components
  - Strongly and weakly connected components
  - Reflects the connectivity of a graph
- Distribution of clustering coefficients
  - Measures the nodes' tendency to cluster together
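The first of these distributions can be sketched directly from a directed edge list. The function name and the toy edges below are assumptions made for illustration; the slides do not say which tooling was used.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in_dist, out_dist).

    Each result maps a degree value to the number of nodes having that degree.
    """
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    # Nodes that never appear as a target (or source) count as degree 0.
    in_dist = Counter(in_deg[n] for n in nodes)
    out_dist = Counter(out_deg[n] for n in nodes)
    return in_dist, out_dist


# Hypothetical toy graph: edges point from subject to object.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_dist, out_dist = degree_distributions(edges)
print("in-degree distribution: ", dict(in_dist))
print("out-degree distribution:", dict(out_dist))
```

Plotting degree against node count on log-log axes is the usual way to check such a distribution for power-law behavior.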

12 Outline
- Introduction & Motivation
- Statistical Characteristics
- Data Description
- Empirical Studies
- Conclusion

13 Four knowledge graphs
WordNet
- A lexical network for the English language
- Synonym sets as nodes and semantic relations as edges
- 98,000 entities, 154,000 relationships

14 Four knowledge graphs
YAGO2
- A huge semantic knowledge graph based on WordNet, Wikipedia and GeoNames
- 10+ million entities, 120+ million facts
- YAGO2 is separated into three sub-graphs:
  - YagoTax: taxonomy tree of YAGO2
  - YagoFact: facts in YAGO2
  - YagoWiki: hyperlink relations in YAGO2 based on Wikipedia

15 Four knowledge graphs
DBpedia
- A multi-language knowledge base extracted from Wikipedia info-boxes
- English version of DBpedia: 4.58 million things and 2,795 different kinds of properties
Enterprise Knowledge Graph (EKG)
- Describes an enterprise ontology in Chinese
- Domain-specific knowledge graph
- 9,450 entities and 12,100 relations

16 Two social networks
SNRand
- 0.2 million randomly selected users
- 5 million follow relations between users
SNRank
- The 0.2 million most active users
- 36+ million follow relations between users
The raw data was collected from Sina Weibo, a well-known social media platform in China.

17 Outline
- Introduction & Motivation
- Statistical Characteristics
- Data Description
- Empirical Studies
- Conclusion

18 Empirical studies
- Analysis of statistical metrics
  - Comparison between different parts within a knowledge graph, taking YAGO2 as a case study
  - Comparison between different knowledge graphs
  - Comparison between knowledge graphs and social networks
- Analysis of distributions
  - Six distributions
- Analysis of semantic label relatedness

19 Analysis of statistical metrics

20 Analysis of distributions
Node degree distributions
- All the in-degree distributions exhibit a power law, except for some initial segments that deviate from it.
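One rough way to check a degree distribution for power-law behavior is a straight-line fit on log-log axes. This is only a sketch: the slides do not state how the fits were made, least squares on log-log data is a simple estimator that is swapped in here, and `loglog_slope` and its `min_degree` cutoff are hypothetical names.

```python
import math


def loglog_slope(distribution, min_degree=1):
    """Least-squares slope of log(count) against log(degree).

    Degrees below min_degree are skipped, which allows an initial
    segment that deviates from the power law to be left out of the fit.
    """
    points = [
        (math.log(k), math.log(c))
        for k, c in distribution.items()
        if k >= min_degree and k > 0 and c > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic counts that follow count = 1000 * k^-2 exactly (not real data).
synthetic = {k: 1000.0 * k ** -2 for k in range(1, 50)}
slope = loglog_slope(synthetic)
print("estimated exponent:", round(-slope, 3))
```

A slope of about -2 on the synthetic data corresponds to a power-law exponent of about 2; real degree data is noisier, especially in the tail.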

21 Analysis of distributions
Hop distributions
- They are all S-shaped and can be fitted by a sigmoid-like function: f(x) = a / (1 + e^(b - c*x))
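The shape of this function is easy to see numerically. In the sketch below the parameter values are hypothetical and were picked only to show the curve; for c > 0 the function rises from near 0 toward a and passes through a/2 at x = b/c.

```python
import math


def hop_sigmoid(x, a, b, c):
    """f(x) = a / (1 + e^(b - c*x)), the sigmoid-like form from the slide."""
    return a / (1.0 + math.exp(b - c * x))


# Hypothetical parameters, not fitted to any dataset in the talk.
a, b, c = 1.0, 6.0, 1.5
curve = [round(hop_sigmoid(h, a, b, c), 3) for h in range(0, 11)]
print(curve)

# The midpoint of the S shape sits at x = b / c.
print("midpoint at x =", b / c, "value =", hop_sigmoid(b / c, a, b, c))
```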

22 Analysis of distributions
Hop distributions
- The maximum number of hops differs between different parts of a knowledge graph.

23 Analysis of distributions
Hop distributions
- The maximum number of hops also differs between knowledge graphs and social networks.
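A hop distribution and its maximum hop count can be sketched with a breadth-first search from every node. This exact all-pairs approach is an assumption for illustration and only suits small graphs; the slides do not say whether exact or approximate counting was used on the large datasets. The function names are hypothetical.

```python
from collections import Counter, deque


def hop_counts(edges):
    """Count ordered node pairs (u, v), u != v, by shortest directed path length."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, [])
    counts = Counter()
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    counts[dist[nxt]] += 1
                    queue.append(nxt)
    return counts


def cumulative_hop_plot(counts):
    """Number of pairs reachable within h hops, for h = 1 .. maximum hop."""
    total, plot = 0, []
    for h in range(1, max(counts, default=0) + 1):
        total += counts[h]
        plot.append((h, total))
    return plot


# Hypothetical toy graph: a directed chain a -> b -> c -> d.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
counts = hop_counts(edges)
plot = cumulative_hop_plot(counts)
print("pairs per hop:", dict(counts))
print("cumulative:", plot)
print("max hops:", max(counts))
```

The cumulative curve is the one that flattens out once every reachable pair has been counted.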


24 Analysis of distributions
Connected component distributions
- Both the strongly and weakly connected component distributions of knowledge graphs exhibit a power-law distribution in general, whereas each social network is almost entirely a single strongly connected component.
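The two component notions can be sketched as follows: union-find for weakly connected components (edge direction ignored) and Kosaraju's two-pass algorithm for strongly connected components. These are standard textbook techniques swapped in for illustration, not necessarily what the authors ran, and the toy graph is hypothetical.

```python
from collections import Counter


def weak_components(nodes, edges):
    """Weakly connected components via union-find, ignoring edge direction."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def strong_components(nodes, edges):
    """Strongly connected components via Kosaraju's algorithm (iterative DFS)."""
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    # Pass 1: record nodes in order of DFS finish time on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)  # all successors explored
                stack.pop()
    # Pass 2: sweep the reversed graph in decreasing finish time.
    comps, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        comp, stack = set(), [root]
        assigned.add(root)
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        comps.append(comp)
    return comps


def size_distribution(components):
    """Map component size to the number of components of that size."""
    return Counter(len(c) for c in components)


# Hypothetical toy graph: a 3-cycle, one dangling node, one isolated node.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
wcc = size_distribution(weak_components(nodes, edges))
scc = size_distribution(strong_components(nodes, edges))
print("weak:", dict(wcc), "strong:", dict(scc))
```

The size distributions returned here are what would be plotted on log-log axes to look for the power-law trend.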

25 Analysis of distributions
Clustering coefficient distributions
- Node degree in this experiment is the sum of in-degree and out-degree.

26 Analysis of distributions
Clustering coefficient distributions
- Although the points in the scatter diagram are dispersed, their overall trend follows a power-law distribution.

27 Analysis of distributions
Clustering coefficient distributions
- The ACCs (average clustering coefficients) of social networks are generally higher than those of knowledge graphs.
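The clustering measurements can be sketched as below. Node degree is taken as in-degree plus out-degree, as stated on the slides. Computing the local coefficient on the undirected version of the graph is an assumption made here, since the slides do not give the exact definition used for directed graphs. Function names and the toy graph are hypothetical.

```python
from collections import Counter


def clustering_by_node(edges):
    """Return {node: (total_degree, local_clustering_coefficient)}.

    total_degree is in-degree + out-degree. The coefficient is computed
    on the undirected version of the graph (an assumption of this sketch).
    """
    nbrs, degree = {}, Counter()
    for u, v in edges:
        degree[u] += 1  # out-degree contribution
        degree[v] += 1  # in-degree contribution
        if u != v:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    result = {}
    for node in degree:
        ns = nbrs.get(node, set())
        k = len(ns)
        if k < 2:
            result[node] = (degree[node], 0.0)
            continue
        # Count each linked pair of neighbors once.
        links = sum(1 for p in ns for q in ns if p < q and q in nbrs[p])
        result[node] = (degree[node], 2.0 * links / (k * (k - 1)))
    return result


def average_clustering(result):
    """Average clustering coefficient (ACC) over all nodes."""
    return sum(cc for _, cc in result.values()) / len(result)


# Hypothetical toy graph: a triangle a-b-c plus a pendant node d.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
per_node = clustering_by_node(edges)
acc = average_clustering(per_node)
print(per_node)
print("ACC:", round(acc, 4))
```

Plotting the (degree, coefficient) pairs as a scatter diagram gives the kind of view described on these slides.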

28 Analysis of label relatedness
- The semantic labels are indeed topic related.

29 Outline
- Introduction & Motivation
- Statistical Characteristics
- Data Description
- Empirical Studies
- Conclusion

30 Conclusions
- Different parts of a knowledge graph have different properties for certain statistical characteristics.
- Different knowledge graphs behave differently on several statistical characteristics, and their data distributions also differ.
- Knowledge graphs differ from social networks in many ways.

31 Discussions on benchmarking
- The data generator should generate synthetic knowledge graph data covering different aspects.
- The generator should take the semantic labels in knowledge graphs into consideration and preserve the statistical characteristics of the real-life data.
- The generator should generate not only static synthetic data of a knowledge graph, but also the different stages of the knowledge graph's development.

32 Thanks!
