An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization Pedro Ribeiro (DCC/FCUP & CRACS/INESC-TEC)
Part 1 Motivation and emergence of Network Science
Complexity "I think the next century will be the century of complexity" Stephen Hawking (Jan, 2000)
The Real World is Complex
- World Population: 7 billion
- Human Brain Neurons: 100 billion
- Internet Devices: 8 billion
Complex Systems Complex Networks Flights Map
Complex Networks are Ubiquitous
- Social: Facebook, Co-authorship
- Biological: Brain, Metabolism (proteins)
Nodes + Edges
Complex Networks are Ubiquitous
- Spatial: Power, Roads
- Software: Module Dependency
- Text: Semantic
Network Science
Behind many complex systems there is a network that defines the interactions between the components.
In order to understand the systems... we need to understand the networks!
Network Science
Network Science has been emerging in this century as a new discipline, with origins in graph theory and social network research.
Image: Adapted from (Barabasi, 2015)
Why now? Two main contributing factors:
1) The emergence of network maps
- Movie actor network: 1998
- World Wide Web: 1999
- Citation Network: 1998
- Metabolic Network: 2000
- PPI Network: 2001
Size matters!
- 436 nodes: 2003 (email exchange, Adamic-Adar, SocNets)
- 43,553 nodes: 2006 (email exchange, Kossinets-Watts, Science)
- 4.4 million nodes: 2005 (friendships, Liben-Nowell, PNAS)
- 800 million nodes: 2011 (Facebook, Backstrom et al.)
Why now? Two main contributing factors:
2) Universality of network characteristics
The architecture and topology of networks from different domains exhibit more similarities than one would expect.
E.g. power laws
Image: Adapted from (Newman, 2005)
Image: Adapted from Leskovec, 2015
Impact of Network Science Economic Impact
Impact of Network Science Network Biology/Network Medicine
Impact of Network Science Fighting Terrorism and Military
Impact of Network Science
Scientific Impact
- 1998: The Watts-Strogatz paper is the most cited Nature publication from 1998; highlighted by ISI as one of the ten most cited papers in physics in the decade after its publication.
- 1999: The Barabasi and Albert paper is the most cited Science paper in 1999; highlighted by ISI as one of the ten most cited papers in physics in the decade after its publication.
- 2001: Pastor-Satorras and Vespignani is one of the two most cited papers among the papers published in 2001 by Physical Review Letters.
- 2002: Girvan-Newman is the most cited paper in the 2002 Proceedings of the National Academy of Sciences.
Reviews
- The first review of network science, by Albert and Barabasi (2001), is the most cited paper published in Reviews of Modern Physics, the highest impact factor physics journal, published since 1929.
- The SIAM Review article by Newman on network science is the most cited paper of any SIAM journal.
- Network Biology, by Barabasi and Oltvai (2004), is the second most cited paper in the history of Nature Reviews Genetics, the top review journal in genetics.
Impact of Network Science Books
Impact of Network Science Books (General Audience) And even an award-winning documentary!
Impact of Network Science Example Real Application: Epidemics
Network Science Topics
Some possible tasks:
- General Patterns: Ex: scale-free, small-world
- Community Detection: What groups of nodes are related?
- Node Classification: Importance and function of a certain node?
- Network Comparison: What is the type of the network?
- Information Propagation: Epidemics? Robustness?
- Link Prediction: Future connections? Errors in graph constructions?
Part 2 A brief introduction to Graph Theory and network vocabulary
Graph Terminology
- Objects: nodes, vertices (N)
- Interactions: links, edges (E)
- System: network, graph G(N,E)
Graph Terminology
- Undirected: co-authorship networks, actor networks, facebook friendships
- Directed: www hyperlinks, phone calls, roads network
Graph Terminology
Edge Attributes. Examples:
- Weight (duration of a call, distance of a road, ...)
- Ranking (best friend, second best friend, ...)
- Type (friend, relative, co-worker, ...) [colored edges]
We can have a set of multiple attributes
Node Attributes. Examples:
- Type (nationality, sex, age, ...) [colored nodes]
We can have a set of multiple attributes
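As a minimal sketch of this vocabulary, a graph G(N,E) with node and edge attributes can be held in plain Python dictionaries. The node names, attribute values and the `adjacency` helper below are made up for illustration; a tool like Gephi stores the same information in its own data model.

```python
from collections import defaultdict

# Node set N with attributes (names and values are hypothetical).
nodes = {
    "ana": {"type": "person", "age": 31},
    "bob": {"type": "person", "age": 45},
    "eve": {"type": "person", "age": 27},
}

# Edge set E of a phone-call network: (source, target) -> attributes.
# "weight" plays the role of call duration, "type" the kind of relation.
calls = {
    ("ana", "bob"): {"weight": 120, "type": "friend"},
    ("bob", "eve"): {"weight": 30, "type": "co-worker"},
    ("eve", "ana"): {"weight": 75, "type": "relative"},
}

def adjacency(edges, directed):
    """Build an adjacency list; undirected edges are stored in both directions."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)
    return adj

directed_adj = adjacency(calls, directed=True)     # who called whom
undirected_adj = adjacency(calls, directed=False)  # who is in contact with whom
```

Reading the same edge list as directed or undirected changes the neighbourhood of a node: here "ana" has one out-neighbour in the directed view and two neighbours in the undirected one.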
Node Properties
From immediate connections:
- Outdegree: how many directed edges originate at a node
- Indegree: how many directed edges are incident on a node
- Degree (in or out): number of outgoing and incoming edges
Example: Outdegree=3, Indegree=2, Degree=5
Node Properties
Degree related metrics:
- Degree sequence: an ordered list of the (in, out) degree of each node
  In-degree sequence: [4, 2, 1, 1, 0]
  Out-degree sequence: [3, 2, 2, 1, 0]
  Degree sequence: [4, 3, 3, 3, 3]
- Degree distribution: a frequency count of the occurrences of each degree [usually plotted normalized as probabilities]
[Plots: In-degree Distribution, Out-degree Distribution, Degree Distribution]
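The degree metrics above can be computed with a few lines of standard-library Python. The edge list below is hypothetical: the slide's example graph is not available in the text, so the edges were chosen only so that the resulting sequences match the ones listed on the slide.

```python
from collections import Counter

# Hypothetical directed edges (source, target), picked to reproduce the slide's sequences.
edges = [("e", "a"), ("e", "b"), ("e", "c"), ("b", "a"),
         ("c", "a"), ("c", "d"), ("d", "a"), ("d", "b")]
nodes = sorted({n for edge in edges for n in edge})

out_degree = Counter(u for u, _ in edges)  # edges that originate at a node
in_degree = Counter(v for _, v in edges)   # edges that are incident on a node
degree = {n: in_degree[n] + out_degree[n] for n in nodes}

def sequence(deg):
    """Degree sequence: the degrees of all nodes in decreasing order."""
    return sorted((deg[n] for n in nodes), reverse=True)

def distribution(deg):
    """Degree distribution: fraction of nodes that have each degree value."""
    counts = Counter(deg[n] for n in nodes)
    return {k: counts[k] / len(nodes) for k in sorted(counts)}

in_seq = sequence(in_degree)
out_seq = sequence(out_degree)
deg_seq = sequence(degree)
deg_dist = distribution(degree)
```

Dividing the counts by the number of nodes is the probability normalization mentioned on the slide.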
Sparsity of Networks
Real Networks are usually very Sparse!

Network                 Dir/Undir    Nodes      Edges        Avg. Degree
Internet                Undirected   192,244    609,066      6.33
WWW                     Directed     325,729    1,479,134    4.60
Power Grid              Undirected   4,941      6,594        2.67
Mobile Phone Calls      Directed     36,595     91,826       2.51
Email                   Directed     57,194     103,731      1.81
Science Collaboration   Undirected   23,133     93,439       8.08
Actor Network           Undirected   702,388    29,397,908   83.71
Citation Network        Directed     449,673    4,689,479    10.43
E. Coli Metabolism      Directed     1,039      5,082        5.58
Protein Interactions    Undirected   2,018      2,930        2.90

A graph where every pair of nodes is connected is called a complete graph (or a clique)
Table: Adapted from (Barabasi, 2015)
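A short sketch of how the average degree column relates to the node and edge counts, and of how far these networks are from a complete graph. It assumes the usual conventions (2E/N for undirected networks, E/N for the average in- or out-degree of directed ones); the function names are illustrative. The Power Grid and Email rows of the table are used as inputs.

```python
def average_degree(n_nodes, n_edges, directed):
    """Average degree: 2E/N if undirected, E/N (average in- or out-degree) if directed."""
    if directed:
        return n_edges / n_nodes
    return 2 * n_edges / n_nodes

def density(n_nodes, n_edges, directed):
    """Fraction of the edges of a complete graph that are actually present."""
    max_edges = n_nodes * (n_nodes - 1)  # ordered pairs of distinct nodes
    if not directed:
        max_edges //= 2                  # unordered pairs
    return n_edges / max_edges

power_grid_avg = average_degree(4941, 6594, directed=False)
email_avg = average_degree(57194, 103731, directed=True)
power_grid_density = density(4941, 6594, directed=False)
```

For the Power Grid the density is well below one in a thousand, which is what "very sparse" means here: a complete graph on the same nodes would have a density of 1.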
Power Law in the Degree Sequence
Connectivity Not everything is connected
Connectivity
A strongly connected component is a maximal subset of nodes where each pair of nodes is reachable through a directed path
3 strongly connected components:
- {1, 2, 5}
- {3, 4, 8}
- {6, 7}
In a weakly connected component we can use the links in any direction
1 weakly connected component:
- {1, 2, 3, 4, 5, 6, 7, 8}
Connectivity If the largest component has a large fraction of the nodes we call it the giant component
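The component definitions can be sketched directly from reachability. The edge list below is an assumption (the figure's actual edges are not in the text); it was chosen so that it yields the three strongly connected components and the single weakly connected component listed on the slide. The quadratic "mutual reachability" approach is fine for small graphs, while larger ones would use a linear-time algorithm such as Tarjan's.

```python
from collections import defaultdict, deque

# Hypothetical directed edges reproducing the components on the slide.
edges = [(1, 2), (2, 5), (5, 1), (3, 4), (4, 8), (8, 3),
         (6, 7), (7, 6), (2, 3), (5, 6), (7, 8)]
nodes = sorted({n for edge in edges for n in edge})

def build_adj(edges, directed):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)  # ignore direction for weak connectivity
    return adj

def reachable(adj, start):
    """Breadth-first search: every node reachable from start (including itself)."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def strongly_connected_components(nodes, edges):
    adj = build_adj(edges, directed=True)
    reach = {n: reachable(adj, n) for n in nodes}
    components, assigned = [], set()
    for u in nodes:
        if u not in assigned:
            # u and v share a component when each can reach the other.
            comp = {v for v in reach[u] if u in reach[v]}
            components.append(comp)
            assigned |= comp
    return components

def weakly_connected_components(nodes, edges):
    adj = build_adj(edges, directed=False)
    components, assigned = [], set()
    for u in nodes:
        if u not in assigned:
            comp = reachable(adj, u)
            components.append(comp)
            assigned |= comp
    return components

scc = strongly_connected_components(nodes, edges)
wcc = weakly_connected_components(nodes, edges)
# Fraction of nodes in the largest weak component (a candidate giant component).
largest_fraction = len(max(wcc, key=len)) / len(nodes)
```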
Bipartite
A bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every edge connects a node in U to one in V.
Example:
- Actor Network: U = Actors, V = Movies
Image: Adapted from Leskovec, 2015
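A small sketch of the actor/movie example, assuming made-up cast lists. It builds the bipartite edge list (every edge joins an actor in U to a movie in V) and then a one-mode projection onto actors, in which two actors are linked if they share at least one movie. The projection step is a common way to obtain an actor network from such data and is an addition for illustration, not something stated on the slide.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cast lists: movie (set V) -> actors (set U).
cast = {
    "Movie A": ["Actor 1", "Actor 2"],
    "Movie B": ["Actor 2", "Actor 3"],
    "Movie C": ["Actor 1", "Actor 2", "Actor 4"],
}

# Bipartite edges: each one connects a node in U (actor) to a node in V (movie).
bipartite_edges = [(actor, movie) for movie, actors in cast.items() for actor in actors]

def project_onto_actors(cast):
    """Link two actors for every movie they share; the weight counts shared movies."""
    weights = defaultdict(int)
    for actors in cast.values():
        for a, b in combinations(sorted(actors), 2):
            weights[(a, b)] += 1
    return dict(weights)

actor_network = project_onto_actors(cast)
```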
Bipartite
Bipartite Human Disease Network
Paths
A path between two nodes is a sequence of adjacent nodes and their respective connecting edges
The distance between two nodes (in an unweighted network) is the number of edges in the shortest path between them
Example:
- Distance from A to D is 3
- Distance from A to E is 4
- Distance from E to F is 2
Diameter: maximum distance between any pair of nodes
Example: for the graph above, the diameter is 4
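In an unweighted network, distances can be computed with a breadth-first search. The graph below is an assumption, since the slide's figure is not reproduced in the text: it is one undirected graph on nodes A to F that is consistent with the distances and the diameter quoted above. The sketch assumes a connected graph.

```python
from collections import deque

# Hypothetical undirected graph consistent with the distances on the slide.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("D", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def distances_from(adj, source):
    """Breadth-first search: number of edges on a shortest path to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Maximum distance over all pairs of nodes (graph assumed connected)."""
    return max(max(distances_from(adj, s).values()) for s in adj)

dist_a = distances_from(adj, "A")
dist_e = distances_from(adj, "E")
graph_diameter = diameter(adj)
```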
Node Centrality
Centrality (how important is a node?)
- Betweenness: percentage of all shortest paths the node is part of
- Closeness: average distance to all other nodes
- Eigenvector: how important a node is depends on its neighbours
- PageRank: importance is related to in-links
Image: Mateo, 2015
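A minimal sketch of two of these measures, following the slide's wording: closeness as the average distance to all other nodes, and betweenness as the fraction of shortest paths between other pairs of nodes that pass through a node. The example graph and the function names are hypothetical, the graph is assumed connected and undirected, and the brute-force pair enumeration is only meant for small networks.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph: a path A-B-C-D-E with an extra leaf F attached to D.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("D", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def bfs_counts(adj, source):
    """Distances from source and the number of distinct shortest paths to each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # every shortest path to u extends to v
    return dist, sigma

info = {s: bfs_counts(adj, s) for s in adj}

def closeness(node):
    """Average distance to all other nodes (lower means more central)."""
    dist = info[node][0]
    return sum(dist.values()) / (len(dist) - 1)

def betweenness(node):
    """Average share of shortest paths between other pairs that go through node."""
    dist_v, sigma_v = info[node]
    others = [n for n in adj if n != node]
    total, pairs = 0.0, 0
    for s, t in combinations(others, 2):
        pairs += 1
        dist_s, sigma_s = info[s]
        # node lies on a shortest s-t path iff the two legs add up to d(s, t)
        if dist_s[node] + dist_v[t] == dist_s[t]:
            total += sigma_s[node] * sigma_v[t] / sigma_s[t]
    return total / pairs
```

In this example D sits between the leaves E and F and the rest of the graph, so it gets the highest betweenness, while the end node A has the largest average distance.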
Clustering
Clustering Coefficient (to what extent do the nodes cluster?)
- Node: C_i = (nr. of connections between neighbours) / (nr. of maximum possible connections)
- Global:
  i) Average C_i (Watts and Strogatz)
  ii) (nr. of triangles (cliques of size 3)) / (nr. of connected triplets of vertices)
Real World networks typically have high clustering coefficients
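The definitions above can be sketched as follows on a tiny hypothetical graph (a triangle with one extra pendant node). One assumption is worth flagging: for the global ratio, each triangle is counted once per centre node, i.e. three times, which is the usual convention for the triangles-to-triplets ratio (often called transitivity). Nodes with fewer than two neighbours are given a local coefficient of 0 here, which is also a convention.

```python
from itertools import combinations

# Hypothetical undirected graph: triangle a-b-c plus a pendant node d attached to c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def links_between_neighbours(adj, node):
    """Number of edges that connect two neighbours of node."""
    return sum(1 for u, v in combinations(adj[node], 2) if v in adj[u])

def local_clustering(adj, node):
    """C_i = connections between neighbours / maximum possible connections."""
    k = len(adj[node])
    if k < 2:
        return 0.0  # convention: undefined cases count as 0
    return links_between_neighbours(adj, node) / (k * (k - 1) / 2)

def average_clustering(adj):
    """Global measure i): the mean of the local coefficients."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

def transitivity(adj):
    """Global measure ii): closed triplets (3 per triangle) / connected triplets."""
    closed = sum(links_between_neighbours(adj, n) for n in adj)
    triplets = sum(len(adj[n]) * (len(adj[n]) - 1) // 2 for n in adj)
    return closed / triplets

avg_c = average_clustering(adj)
trans = transitivity(adj)
```

The two global measures generally differ: here the average of the local values is 7/12 while the triplet-based ratio is 3/5.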
Community Structure
Communities: groups of nodes that are densely connected between themselves
Several variations and algorithms:
- Girvan-Newman
- Modularity
- Hierarchical clustering
- ...
Image: Newman, 2012
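The slide only names modularity; as a hedged illustration, the sketch below evaluates the standard modularity score Q of a given partition of an undirected, unweighted graph, Q = sum over communities c of [ L_c/m - (d_c/(2m))^2 ], where m is the number of edges, L_c the number of edges inside c and d_c the total degree of its nodes. The graph (two triangles joined by one edge) and the partitions are made up. Finding the partition that maximizes Q is the hard part and is not shown.

```python
from collections import defaultdict

# Hypothetical graph: two triangles {1,2,3} and {4,5,6} joined by the edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]

def modularity(edges, community_of):
    """Modularity Q of a partition given as a node -> community label mapping."""
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both endpoints in community c
    degree_sum = defaultdict(int)  # d_c: sum of the degrees of the nodes in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

two_groups = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
one_group = {n: "all" for n in range(1, 7)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting along the bridge gives a clearly positive score, while putting every node in a single community gives Q = 0.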
Part 3 Network Visualization and Exploration
Why Visualization? "The greatest value of a picture is when it forces us to notice what we never expected to see"
Exploratory Data Analysis
Visualization alone is not enough
Part of a larger process to extract insight
Data process chain: Non-linear! Trial and Error!
Images: Ben Fry, 2004
Exploring a Network
1) See the network: draw using a certain layout, ...
2) Interact in real time: group, filter, compute metrics, ...
3) Build a visual language: size of nodes, thickness of edges, colors, ...
Exploring Graphs Today we are going to use Gephi Open-Source Network Analysis and Visualization Platform (written in Java)
Why Gephi?
- Because it has a large community
- Because it has a history and will continue to have one
  Started in 2008
  Maintained by a consortium (long-term vision)
- Because it is extensible with plugins
  Gephi marketplace
- Because I am familiar with it! :)
There are other options. The main concepts and ideas we will show can be used in any other visualization tool
Datasets for Today
Co-Authorships in Network Science
- http://www-personal.umich.edu/~mejn/netdata/netscience.zip
- Compiled by Mark Newman in May 2006
- Available in gml (Graph Modeling Language)
- 1,589 scientists, 2,742 collaborations
Flights Data
- http://openflights.org/data.html
- Compiled by the Open Flights website
- 3,440 airports, 67,663 routes from 531 airlines
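A rough sketch of turning a routes file into a directed edge list before loading it into a tool. It assumes the comma-separated layout commonly documented for the OpenFlights routes file (airline, airline ID, source airport, source airport ID, destination airport, destination airport ID, codeshare, stops, equipment); check the downloaded file before relying on these column positions. The sample rows are illustrative only and `read_routes` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Illustrative rows in the assumed layout; not guaranteed to match the real file.
sample = """\
2B,410,AER,2965,KZN,2990,,0,CR2
2B,410,ASF,2966,KZN,2990,,0,CR2
TP,4869,OPO,1636,LIS,1638,,0,320
TP,4869,LIS,1638,OPO,1636,,0,320
"""

def read_routes(handle):
    """Count routes per directed (source airport, destination airport) pair."""
    edges = Counter()
    for row in csv.reader(handle):
        if len(row) < 5:
            continue  # skip malformed or empty lines
        source, target = row[2], row[4]  # assumed column positions
        edges[(source, target)] += 1
    return edges

routes = read_routes(io.StringIO(sample))
airports = {airport for edge in routes for airport in edge}
```

With a real file, `io.StringIO(sample)` would be replaced by an opened file handle; the resulting counts can serve as edge weights.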
What to do?
- Load graph: opening a network vs importing data
- Draw using a layout: Force Directed, Geographical, Circular, ... (polishing the results)
- Compute metrics: centralities, degrees, distances, communities
- Filter: main operators, selecting, ranges, combining
- Ranking: color or size of the nodes and edges according to a metric
- Partition: coloring according to a partition
What to do? DEMO!