Approximating Betweenness Centrality


David A. Bader, Shiva Kintali, Kamesh Madduri, and Milena Mihail
College of Computing, Georgia Institute of Technology
{bader,kintali,kamesh,mihail}@cc.gatech.edu

Abstract. Betweenness is a centrality measure based on shortest paths, widely used in complex network analysis. It is computationally expensive to exactly determine betweenness; currently the fastest-known algorithm by Brandes requires $O(nm)$ time for unweighted graphs and $O(nm + n^2 \log n)$ time for weighted graphs, where $n$ is the number of vertices and $m$ is the number of edges in the network. These are also the worst-case time bounds for computing the betweenness score of a single vertex. In this paper, we present a novel approximation algorithm for computing betweenness centrality of a given vertex, for both weighted and unweighted graphs. Our approximation algorithm is based on an adaptive sampling technique that significantly reduces the number of single-source shortest path computations for vertices with high centrality. We conduct an extensive experimental study on real-world graph instances, and observe that our random sampling algorithm gives very good betweenness approximations for biological networks, road networks and web crawls.

1 Introduction

One of the fundamental problems in network analysis is to determine the importance (or the centrality) of a particular vertex (or an edge) in a network. Some of the well-known metrics for computing centrality are closeness [1], stress [2] and betweenness [3, 4]. Of these indices, betweenness has been extensively used in recent years for the analysis of social-interaction networks, as well as other large-scale complex networks. Some applications include lethality in biological networks [5-7], study of sexual networks and AIDS [8], identifying key actors in terrorist networks [9, 10], organizational behavior [11], and supply chain management processes [12]. Betweenness is also used as the primary routine in popular algorithms for clustering and community identification [13] in real-world networks. For instance, the Girvan-Newman [14] algorithm iteratively partitions a network by identifying edges with high betweenness scores, removing them and recomputing centrality scores.

Betweenness is a global centrality metric that is based on shortest-path enumeration. Consider a graph $G = (V, E)$, where $V$ is the set of vertices representing actors or nodes in the complex network, and $E$, the set of edges representing the relationships between the vertices. The number of vertices and edges are denoted by $n$ and $m$ respectively. The graphs can be directed or undirected.

We will assume that each edge $e \in E$ has a positive integer weight $w(e)$. For unweighted graphs, we use $w(e) = 1$. A path from vertex $s$ to $t$ is defined as a sequence of edges $\langle u_i, u_{i+1} \rangle$, $0 \le i < l$, where $u_0 = s$ and $u_l = t$. The length of a path is the sum of the weights of its edges. We use $d(s, t)$ to denote the distance between vertices $s$ and $t$ (the minimum length of any path connecting $s$ and $t$ in $G$). Let us denote the total number of shortest paths between vertices $s$ and $t$ by $\lambda_{st}$, and the number passing through vertex $v$ by $\lambda_{st}(v)$. Let $\delta_{st}(v)$ denote the fraction of shortest paths between $s$ and $t$ that pass through a particular vertex $v$, i.e., $\delta_{st}(v) = \lambda_{st}(v) / \lambda_{st}$. We call $\delta_{st}(v)$ the pair-dependency of $s, t$ on $v$. Betweenness centrality [3, 4] of a vertex $v$ is defined as

$$BC(v) = \sum_{s \neq v \neq t \in V} \delta_{st}(v)$$

Currently, the fastest known algorithm for exactly computing betweenness of all the vertices, designed by Brandes [15], requires $O(nm)$ time for unweighted graphs and $O(nm + n^2 \log n)$ time for weighted graphs, where $n$ is the number of vertices and $m$ is the number of edges. Thus, for large-scale graphs, exact centrality computation on current workstations is not practically viable. In prior work, we explored high performance computing techniques [16] that exploit the typical small-world graph topology to speed up exact centrality computation. We designed novel parallel algorithms to exactly compute various centrality metrics, optimized for real-world networks. We also demonstrated the capability to compute exact betweenness on several large-scale networks (vertices and edges in the order of millions) from the Internet and social interaction data; these networks are three orders of magnitude larger than instances that can be processed by current social network analysis packages.

Fast centrality estimation is thus an important problem, as a good approximation would be an acceptable alternative to exact scores.
Currently the fastest exact algorithms for shortest path enumeration-based metrics require $n$ shortest-path computations; however, it is possible to estimate centrality by extrapolating scores from a smaller number of path computations. Using a random sampling technique, Eppstein and Wang [17] show that the closeness centrality of all vertices in a weighted, undirected graph can be approximated with high probability in $O\left(\frac{\log n}{\epsilon^2}(n \log n + m)\right)$ time, and an additive error of at most $\epsilon \Delta_G$ ($\epsilon$ is a fixed constant, and $\Delta_G$ is the diameter of the graph). However, betweenness centrality scores are harder to estimate, and the quality of approximation is found to be dependent on the vertices from which the shortest path computations are initiated (in this paper, we will refer to them as the set of source vertices for the approximation algorithm). Recently, Brandes and Pich [18] presented centrality estimation heuristics, where they experimented with different strategies for selecting the source vertices. They observe that a random selection of source vertices is superior to deterministic strategies. In addition to exact parallel algorithms, we also discussed parallel techniques to compute approximate betweenness centrality in [16], using a random source selection strategy.
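The random-source idea behind the closeness result cited above can be illustrated with a short Python sketch. It is not the published algorithm of [17]: it assumes an unweighted, connected, undirected graph stored as an adjacency dict, substitutes BFS for a general shortest-path routine, samples sources without replacement, and takes the closeness of $u$ to be $(n-1)/\sum_w d(u, w)$. The function names `bfs_distances` and `estimate_closeness` are hypothetical.

```python
import random
from collections import deque


def bfs_distances(adj, s):
    """Hop distances from s to every reachable vertex (unweighted graph)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def estimate_closeness(adj, k, rng=None):
    """Estimate the closeness of every vertex from k randomly chosen sources.

    Assumes the graph is connected and undirected, so d(s, u) = d(u, s).
    """
    rng = rng or random.Random()
    vertices = list(adj)
    n = len(vertices)
    total = {u: 0 for u in vertices}
    for s in rng.sample(vertices, k):      # k distinct sources (an assumption)
        dist = bfs_distances(adj, s)
        for u in vertices:
            total[u] += dist[u]
    # Scale the sampled distance sum up to all n vertices, then invert it.
    return {
        u: (k * (n - 1)) / (n * total[u]) if total[u] > 0 else float("inf")
        for u in vertices
    }
```

With `k` equal to the number of vertices, every vertex is used as a source once and the estimate coincides with the exact closeness under the definition assumed here; smaller `k` trades accuracy for fewer traversals.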

While prior approaches approximate centrality scores of all vertices in the graph, there are no known algorithms to compute the centrality of a single vertex in time faster than computing the betweenness of all vertices. In this paper, we present a novel adaptive sampling-based algorithm for approximately computing betweenness centrality of a given vertex. Our primary result is as follows:

Theorem: For $0 < \epsilon < 0.5$, if the centrality of a vertex $v$ is $n^2/t$ for some constant $t \ge 1$, then with probability $\ge 1 - 2\epsilon$ its centrality can be estimated to within a factor of $1/\epsilon$ with $\epsilon t$ samples of source vertices.

The rest of this paper is organized as follows. We review the currently-known fastest sequential algorithm by Brandes in Section 2. We present our approximation algorithm based on adaptive sampling and its analysis in Section 3. Section 4 is an experimental study of our approximation technique on several real-world networks. We conclude with a summary of open problems in Section 5.

2 Exact computation of Betweenness Centrality

Brandes' algorithm [15] shows how to compute centrality scores of all the vertices in the graph in the same asymptotic time bounds as $n$ SSSP computations.

2.1 Brandes' Algorithm

Define the dependency of a source vertex $s \in V$ on a vertex $v \in V$ as $\delta_s(v) = \sum_{t \neq s \neq v \in V} \delta_{st}(v)$. Then the betweenness score of $v$ can be expressed as $BC(v) = \sum_{s \neq v \in V} \delta_s(v)$. Also, let $P_s(v)$ denote the set of predecessors of a vertex $v$ on shortest paths from $s$: $P_s(v) = \{u \in V : \langle u, v \rangle \in E,\ d(s, v) = d(s, u) + w(u, v)\}$. Brandes shows that the dependencies satisfy the following recursive relation, which is the most crucial step in the algorithm analysis.

Theorem 1. The dependency of $s \in V$ on any $v \in V$ obeys

$$\delta_s(v) = \sum_{w : v \in P_s(w)} \frac{\lambda_{sv}}{\lambda_{sw}} \left(1 + \delta_s(w)\right)$$

First, $n$ SSSP computations are done, one for each $s \in V$. The predecessor sets $P_s(v)$ are maintained during these computations. Next, for every $s \in V$, using the information from the shortest paths tree and predecessor sets along the paths, compute the dependencies $\delta_s(v)$ for all other $v \in V$.
To compute the centrality value of a vertex $v$, we finally compute the sum of all dependency values. The $O(n^2)$ space requirements can be reduced to $O(m + n)$ by maintaining a running centrality score. For additional details, we refer the reader to Brandes' paper [15].
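The two phases described above (one shortest-path traversal per source, then back-propagation of dependencies through the predecessor sets) can be sketched in Python for the unweighted case. This is a minimal illustration under stated assumptions, not the reference implementation of [15]: the graph is an adjacency dict, BFS replaces a general SSSP routine, and the names `single_source_dependencies` and `betweenness` are hypothetical. The sum runs over all sources $s \neq v$, so in an undirected graph each unordered pair is counted twice; halve the result if the other convention is wanted.

```python
from collections import deque


def single_source_dependencies(adj, s):
    """Return {v: delta_s(v)} for an unweighted graph {vertex: [neighbors]}."""
    dist = {s: 0}
    sigma = {v: 0 for v in adj}    # sigma[v] = number of shortest s-v paths
    sigma[s] = 1
    preds = {v: [] for v in adj}   # preds[v] = P_s(v)
    order = []                     # vertices in non-decreasing distance from s
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in dist:              # first time w is reached
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:     # u precedes w on a shortest path
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):              # farthest vertices first
        for u in preds[w]:
            # Recursive relation of Theorem 1.
            delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
    return delta


def betweenness(adj):
    """Exact betweenness of every vertex, keeping only a running score."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        delta = single_source_dependencies(adj, s)
        for v in adj:
            if v != s:                     # a source has no dependency on itself
                bc[v] += delta[v]
    return bc
```

Only the running totals in `bc` persist across sources, which mirrors the remark above about reducing the space requirement to $O(m + n)$.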

3 Adaptive-sampling based approximation

The adaptive sampling technique was introduced by Lipton and Naughton [19] for estimating the size of the transitive closure of a digraph. Prior to their work, algorithms for estimating transitive closure were based on randomly sampling source-vertices, solving the single-source reachability problem for the sampled vertices, and using this information to estimate the size of the transitive closure. The Lipton-Naughton algorithm introduces adaptive sampling of source-vertices, that is, the number of samples varies with the information obtained from each sample.

In this section, we give an adaptive sampling algorithm for computing betweenness of a given vertex $v$. It is a sampling algorithm in that it estimates the centrality by sampling a subset of vertices and performing SSSP computations from these vertices. It is termed adaptive, because the number of samples required varies with the information obtained from each sample.

The following lemma is easy to see and the proof is omitted.

Lemma 1. $BC(v)$ is zero iff its neighboring vertices induce a clique.

Let $a_i$ denote the dependency of the vertex $v_i$ on $v$, i.e., $a_i = \delta_{v_i}(v)$. Let $A = \sum_{i=1}^{n} a_i = BC(v)$. It is easy to verify that $0 \le a_i \le n - 2$ and $0 \le A \le (n - 1)(n - 2)/2$. The quantity we wish to estimate is $A$. Consider doing so with the following algorithm:

Algorithm 1. Repeatedly sample a vertex $v_i \in V$; perform SSSP (using BFS or Dijkstra's algorithm) from $v_i$ and maintain a running sum $S$ of the dependency scores $\delta_{v_i}(v)$. Sample until $S$ is greater than $cn$ for some constant $c \ge 2$. Let the total number of samples be $k$. The estimated betweenness centrality score of $v$ is given by $nS/k$.

Let $X_i$ be the random variable representing the dependency of a randomly sampled vertex on $v$. The probability of an event $x$ is denoted by $\Pr[x]$. We establish the following lemmas to analyze the above algorithm.

Lemma 2. Let $E[X_i]$ denote the expectation of $X_i$ and $Var[X_i]$ denote the variance of $X_i$. Then, $E[X_i] = A/n$, $E[X_i^2] \le A$, and $Var[X_i] \le A$.
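The paper states Lemma 2 without proof. One way to see the three bounds, assuming each sample is drawn uniformly at random from all $n$ vertices (so that $X_i$ takes the value $a_j$ with probability $1/n$) and using $0 \le a_j \le n - 2$, is the following short derivation:

```latex
\begin{align*}
E[X_i]   &= \frac{1}{n} \sum_{j=1}^{n} a_j = \frac{A}{n}, \\
E[X_i^2] &= \frac{1}{n} \sum_{j=1}^{n} a_j^2
          \le \frac{n-2}{n} \sum_{j=1}^{n} a_j
          = \frac{(n-2)\,A}{n} \le A, \\
Var[X_i] &= E[X_i^2] - \left(E[X_i]\right)^2 \le E[X_i^2] \le A .
\end{align*}
```

The uniform-sampling assumption is the natural reading of Algorithm 1 but is stated here explicitly rather than taken from the text.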
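The next lemma is useful in proving a lower bound on the expected number of samples made before stopping. The proof is presented in the Appendix.

Lemma 3. Let $k = \epsilon n^2/A$. Then

$$\Pr[X_1 + X_2 + \cdots + X_k \ge cn] \le \frac{\epsilon}{(c - \epsilon)^2}$$

Lemma 4. Let $k \ge \epsilon n^2/A$ and $d > 0$. Then

$$\Pr\left[\left|\frac{n}{k} \sum_{i=1}^{k} X_i - A\right| \ge dA\right] \le \frac{1}{\epsilon d^2}$$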

Theorem 2. Let $\tilde{A}$ be the estimate of $A$ in the above procedure and let $A > 0$. Then for $0 < \epsilon < 0.5$, with probability $\ge 1 - 2\epsilon$, Algorithm 1 estimates $A$ to within a factor of $1/\epsilon$.

Proof. There are two ways that the algorithm can fail: (i) it can stop too early to guarantee a good error bound, (ii) it can stop after enough samples but with a bad estimate.

First we claim that the procedure is unlikely to stop with $k \le \epsilon n^2/A$. We have that

$$\Pr[(\exists j)(j \le k) \wedge (X_1 + X_2 + \cdots + X_j \ge cn)] \le \Pr[X_1 + X_2 + \cdots + X_k \ge cn]$$

where $k = \epsilon n^2/A$, because the $X_i$ are non-negative, so the event on the left of the inequality implies the event on the right. But by Lemma 3, the right side of this inequality is at most $\epsilon/(c - \epsilon)^2$. Substituting $c = 2$ and noting that $0 < \epsilon < 0.5$, we get that this probability is less than $\epsilon$.

Next we turn to the accuracy of the estimate. If $k \ge \epsilon n^2/A$, by Lemma 4 the estimate

$$\tilde{A} = \frac{n}{k} \sum_{i=1}^{k} X_i$$

fails to be within $dA$ of $A$ with probability at most $1/(\epsilon d^2)$. Letting $d = 1/\epsilon$, this is just $\epsilon$. Putting the two ways of failure together, we get that the total probability of failure is less than $\epsilon + (1 - \epsilon)\epsilon$, which is less than $2\epsilon$.

Finally, note that if $A > 0$, there must be at least one $i$ such that $a_i > 0$, so the algorithm will terminate. The case when $A = 0$ (i.e., the centrality of $v$ is 0) can be detected using Lemma 1 (before running the algorithm).

An interesting aspect of our theorem is that the sampling is adaptive. Usually such sampling procedures perform a fixed number of samples. Here it is critical that the algorithm adapts its behavior. Substituting $A = n^2/t$ in our analysis we get the following theorem.

Theorem 3. For $0 < \epsilon < 0.5$, if the centrality of a vertex $v$ is $n^2/t$ for some constant $t \ge 1$, then with probability $\ge 1 - 2\epsilon$ its centrality can be estimated to within a factor of $1/\epsilon$ with $\epsilon t$ samples of source vertices.

Although our theoretical result is valid only for high centrality nodes, our experimental results (presented in the next section) show a similar behavior for all the vertices.
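Algorithm 1 is short enough to sketch directly in Python. The version below is illustrative only and rests on several assumptions that are not in the paper: the graph is unweighted and stored as an adjacency dict, sources are drawn uniformly with replacement, and a safety cap `max_samples` (defaulting to $n$) ends the loop when the running sum never exceeds $cn$, for example when $A = 0$. The names `dependency_on` and `adaptive_bc_estimate` are hypothetical.

```python
import random
from collections import deque


def dependency_on(adj, s, v):
    """Return delta_s(v), the dependency of source s on v (unweighted graph)."""
    if s == v:
        return 0.0
    dist = {s: 0}
    sigma = {u: 0 for u in adj}    # sigma[u] = number of shortest s-u paths
    sigma[s] = 1
    preds = {u: [] for u in adj}   # predecessors on shortest paths from s
    order = []
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = {u: 0.0 for u in adj}
    for w in reversed(order):      # farthest vertices first
        for u in preds[w]:
            delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
    return delta[v]


def adaptive_bc_estimate(adj, v, c=2.0, max_samples=None, rng=None):
    """Sample sources until the running sum S exceeds c*n; return (n*S/k, k)."""
    rng = rng or random.Random()
    vertices = list(adj)
    n = len(vertices)
    cap = n if max_samples is None else max_samples   # safety cap (assumption)
    if cap < 1:
        raise ValueError("max_samples must be at least 1")
    running_sum, k = 0.0, 0
    while running_sum <= c * n and k < cap:
        s = rng.choice(vertices)   # uniform over V, with replacement
        running_sum += dependency_on(adj, s, v)
        k += 1
    return n * running_sum / k, k


# Usage: a star with centre 0 and 20 leaves; every leaf has dependency 19 on 0.
star = {0: list(range(1, 21))}
star.update({leaf: [0] for leaf in range(1, 21)})
estimate, samples = adaptive_bc_estimate(star, 0, c=2.0, rng=random.Random(0))
print(estimate, samples)
```

Since $E[X_i] = A/n$, a rough heuristic (not a result from the paper) is that the loop stops after about $cn^2/A$ samples, so vertices with larger $A$ need fewer traversals, in line with Theorem 3. Passing `max_samples` also gives a way to impose a fixed cut-off on the number of samples.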
4 Experimental Study

We assess the quality of the sampling-based approximation algorithm on several real-world graph instances (see Table 1). We use the parallel centrality analysis toolkit SNAP [20] to compute exact betweenness scores. Since the execution time and speedup achieved by the approximation approach are directly proportional to the number of BFS/shortest path computations, we do not report performance results in this section. For a detailed discussion of exact centrality computation in parallel, and optimizations for small-world graphs, please refer to [16].

Network data

Label        Network                      n      m       Details                 Source
rand         random graph                 2000   7980    synthetic, undirected   [21]
pref-attach  preferential attachment      2000   7980    synthetic, undirected   [22]
bio-pin      human protein interactions   8503   32,191  undirected              [23]
crawl        web-crawl (stanford.edu)     9914   36,854  directed                [24]
cite         Lederberg citation network   8843   41,601  directed                [24, 25]
road         Rome, Italy road network     3353   4435    weighted, undirected    [26]

Table 1. Networks used in the experimental study

We experiment with two synthetic graph instances and four real networks in this study.

rand is an unweighted, undirected random network of 2000 vertices and 7980 edges, generated using the Erdős-Rényi graph model [27]. This synthetic graph has a low diameter, low clustering, and a Gaussian degree distribution.

pref-attach is a synthetic graph generated using the preferential attachment model proposed by Barabási and Albert [28]. This model generates graphs with heavy-tailed degree distributions and scale-free properties. Vertices are added one at a time, and for each of them, we create a fixed number of edges connecting to existing vertices, with probability proportional to their degree.

bio-pin is a biological network that represents interactions in the human proteome [29, 23]. This graph is undirected, unweighted and exhibits small-world characteristics.

crawl corresponds to the wb-cs-stanford network in the UF sparse matrix collection [24]. It is a directed graph, where vertices correspond to pages in the Stanford Computer Science domain, and edges represent links.

cite is a directed graph from the Pajek network collection [25]. It corresponds to papers by and citing J. Lederberg (1945-2002).
road is a weighted graph of 3353 vertices and 4435 edges that corresponds to a large portion of the road network of Rome, Italy from 1999 [26]. Vertices correspond to intersections between roads, and edges correspond to roads or road segments. Edge weights are physical distances in metres. Road networks have more structure and a higher diameter than the other networks considered in this study.

Methodology

Our goal in this study is to quantify the approximation quality, and so we primarily compare the approximation results to exact scores. We first compute exact centrality scores of all the networks in Table 1. In most data sets, we are interested in high-centrality vertices, as they are the critical entities and are used in further analysis. From the exact scores, we identify vertices whose centrality scores are an order of magnitude greater than the rest of the network. For these vertices, we study the trade-off between computation and approximation quality by varying the parameter $c$ in Algorithm 1. We also show that it is easy to estimate scores of low-centrality vertices. We chose small networks for ease of analysis and visualization, but the approximation algorithm can be effectively applied to large networks as well (see, for instance, the networks considered in [16]).

Experiments

Figure 1 plots the distribution of exact and approximate betweenness scores for the six different test instances. Note that the synthetic networks, rand and pref-attach, show significantly lower variation in exact centrality scores compared to the real instances. Also, there is a significant percentage of low-centrality vertices (scores less than, or close to, $n$) in cite, crawl and bio-pin. We apply Algorithm 1 to estimate betweenness centrality scores of all the vertices in the test instances. In order to visualize the data better, we plot a smoothed curve of the estimated betweenness centrality data that is superimposed with the exact centrality score scatter-plot. We set the parameter $c$ in Algorithm 1 to 5 for these experiments. In addition, we impose a cut-off of $n/20$ on the number of samples. Observe that in all the networks, the estimated centrality scores are very close to the exact ones, and we are guaranteed to cut down on the computation by a factor of nearly 20.

To further study the quality of approximation for high-centrality vertices, we select the top 1% of the vertices (about 30) ordered by exact centrality score in each network, and compute their estimated centrality scores using the adaptive-sampling algorithm.
Since the source vertices in the adaptive approach are chosen randomly, we repeat the experiment five times for each vertex and report the mean and variance in approximation error. Figure 2 plots the mean percentage approximation error in the computed scores for these high centrality vertices, when the value of $c$ (see Algorithm 1) is set to 5. The vertices are sorted by exact centrality score on the X-axis. The error bars in the charts indicate the variance in estimated score due to random runs, for each network. For the random graph instance, the average error is about 5%, while it is roughly 10% for the rest of the networks. Except for a few anomalous vertices, the error variance is within reasonable bounds in all the graph classes.

Figure 3 plots the percentage of BFS/SSSP computations required for approximating the centrality scores, when $c$ is set to 5. This algorithmic count is an indicator of the amount of work done by the approximation algorithm. The vertices are ordered again by their exact centrality scores from left to right, with the vertex with the least score to the left. A common trend we observe across all graph classes is that the percentage of source vertices decreases as the centrality score increases; this implies that the scores of high centrality vertices can be approximated with less work using the adaptive sampling approach. Also, this value is significantly lower for crawl, bio-pin and road compared to other graph classes.

Fig. 1. [Panels: (a) rand, (b) pref-attach, (c) bio-pin, (d) crawl, (e) cite, (f) road.] A scatter plot of exact betweenness scores of all the vertices (in sorted order), and a line plot of their estimated betweenness scores (the approximate betweenness scatter data is smoothed by a local smoothing technique using polynomial regression).

Fig. 2. [Panels: (a) rand, (b) pref-attach, (c) bio-pin, (d) crawl, (e) cite, (f) road.] Average estimated betweenness error percentage (in comparison to the exact centrality score) for multiple runs. The adaptive sampling parameter $c$ is set to 5 for all experiments and the error bars indicate the variance.

Fig. 3. [Panels: (a) rand, (b) pref-attach, (c) bio-pin, (d) crawl, (e) cite, (f) road.] The number of samples/SSSP computations as a fraction of $n$, the total number of vertices. This algorithmic count is an indicator of the amount of work done by the approximation algorithm. The adaptive sampling parameter $c$ is set to 5, and the error bars indicate the variance from 5 runs.

We can also vary the parameter $c$, which affects both the percentage of BFS/SSSP computations and the approximation quality. Table 2 summarizes the average performance on each graph instance, for different values of $c$. Taking only high-centrality vertices into consideration, we report the mean approximation error and the number of samples for each graph instance. As expected, we find that the error decreases as the parameter $c$ is increased, while the number of samples increases. Since the highest centrality value is around $10n$ for the citation network, a significant number of shortest path computations have to be done even for calculating scores with a reasonable accuracy. But for other graph instances, particularly the road network, web-crawl and the protein interaction network, $c = 5$ offers a good trade-off between computation and the approximation quality.

Network              rand     pref-attach  bio-pin  crawl    cite     road
c = 2   Avg. error   16.28%   29.39%       46.72%   33.69%   32.51%   22.58%
        Avg. k       11.31%   5.36%        1.30%    0.96%    17.00%   0.68%
c = 5   Avg. error   6.51%    10.28%       10.49%   10.31%   9.98%    8.79%
        Avg. k       27.37%   12.38%       3.20%    2.42%    43.85%   1.68%
c = 10  Avg. error   5.62%    6.13%        7.17%    7.04%    -        7.39%
        Avg. k       54.51%   24.66%       6.33%    4.89%    -        3.29%

Table 2. Observed average-case algorithmic counts, as the value of the sampling parameter $c$ is varied. The average error percentage is the deviation of the estimated score from the exact score, and the $k$ percentage indicates the number of samples/SSSP computations.

5 Conclusion and Open Problems

We presented a novel approximation algorithm for computing betweenness centrality of a given vertex, in both weighted and unweighted graphs. Our approximation algorithm is based on an adaptive sampling technique that significantly reduces the number of single-source shortest path computations for vertices with high centrality.
We conduct an extensive experimental study on real-world graph instances, and observe that the approximation algorithm performs well on web crawls, road networks and biological networks.

Approximating the centrality of all vertices in time less than $O(nm)$ for unweighted graphs and $O(nm + n^2 \log n)$ for weighted graphs is a challenging open problem. Designing a fully dynamic algorithm for computing betweenness is very useful.

Acknowledgments

The authors are grateful to the ARC (Algorithms and Randomness Center) of the College of Computing, Georgia Institute of Technology, for funding this project. This work was also supported in part by NSF Grants CAREER CCF-0611589, NSF DBI-0420513, ITR EF/BIO 03-31654, and DARPA Contract NBCH 30390004. Kamesh Madduri's work is supported in part by the NASA Graduate Student Researcher Program Fellowship (NASA NP-2005-07-375-HQ). We thank Richard Lipton for helpful discussions.

Appendix

Lemma 3. Let $k = \epsilon n^2/A$. Then

$$\Pr[X_1 + X_2 + \cdots + X_k \ge cn] \le \frac{\epsilon}{(c - \epsilon)^2}$$

Proof. We have

$$
\begin{aligned}
\Pr[X_1 + \cdots + X_k \ge cn]
&= \Pr\left[\left(X_1 - \frac{A}{n}\right) + \cdots + \left(X_k - \frac{A}{n}\right) \ge cn - \frac{kA}{n}\right] \\
&= \Pr\left[\left(X_1 - \frac{A}{n}\right) + \cdots + \left(X_k - \frac{A}{n}\right) \ge (c - \epsilon)n\right] \\
&\le \Pr\left[\left|\sum_i \left(X_i - \frac{A}{n}\right)\right| \ge (c - \epsilon)n\right] \\
&\le \frac{1}{(c - \epsilon)^2 n^2} Var\left[\sum_i X_i\right] \\
&= \frac{1}{(c - \epsilon)^2 n^2} \sum_i Var[X_i] \\
&\le \frac{1}{(c - \epsilon)^2 n^2}\, kA \\
&= \frac{\epsilon}{(c - \epsilon)^2}
\end{aligned}
$$

Note that we have used Chebyshev's inequality and union bounds in the above proof. We bound the error in the estimated value of $A$ with the following lemma.

Lemma 4. Let $k \ge \epsilon n^2/A$ and $d > 0$. Then

$$\Pr\left[\left|\frac{n}{k} \sum_{i=1}^{k} X_i - A\right| \ge dA\right] \le \frac{1}{\epsilon d^2}$$

Proof.

$$
\begin{aligned}
\Pr\left[\left|\frac{n}{k} \sum_{i=1}^{k} X_i - A\right| \ge t\right]
&= \Pr\left[\left|\sum_{i=1}^{k} X_i - \frac{kA}{n}\right| \ge \frac{kt}{n}\right] \\
&= \Pr\left[\left|\sum_{i=1}^{k} \left(X_i - \frac{A}{n}\right)\right| \ge \frac{kt}{n}\right] \\
&\le \frac{n^2}{k^2 t^2}\, k\, Var[X_i]
\end{aligned}
$$

Let $k = \lambda n^2/A$, where $\lambda \ge \epsilon$. Then the above probability is less than or equal to

$$\frac{n^2}{k^2 t^2}\, k\, Var[X_i] \le \frac{n^2 A}{k t^2}$$

which is just $\frac{A^2}{\lambda t^2}$. Setting $Ad = t$ gives

$$\frac{1}{\lambda d^2} \le \frac{1}{\epsilon d^2}$$

References

1. Sabidussi, G.: The centrality index of a graph. Psychometrika 31 (1966) 581-603
2. Shimbel, A.: Structural parameters of communication networks. Bulletin of Mathematical Biophysics 15 (1953) 501-507
3. Freeman, L.: A set of measures of centrality based on betweenness. Sociometry 40(1) (1977) 35-41
4. Anthonisse, J.: The rush in a directed graph. Report BN9/71, Stichting Mathematisch Centrum, Amsterdam, Netherlands (1971)
5. Jeong, H., Mason, S., Barabási, A.L., Oltvai, Z.: Lethality and centrality in protein networks. Nature 411 (2001) 41-42
6. Pinney, J., McConkey, G., Westhead, D.: Decomposition of biological networks using betweenness centrality. In: Proc. 9th Ann. Int'l Conf. on Research in Computational Molecular Biology (RECOMB 2005), Cambridge, MA (May 2005) Poster session.
7. del Sol, A., Fujihashi, H., O'Meara, P.: Topology of small-world networks of protein-protein complex structures. Bioinformatics 21(8) (2005) 1311-1315
8. Liljeros, F., Edling, C., Amaral, L., Stanley, H., Åberg, Y.: The web of human sexual contacts. Nature 411 (2001) 907-908
9. Krebs, V.: Mapping networks of terrorist cells. Connections 24(3) (2002) 43-52
10. Coffman, T., Greenblatt, S., Marcus, S.: Graph-based technologies for intelligence analysis. Communications of the ACM 47(3) (2004) 45-47
11. Buckley, N., van Alstyne, M.: Does email make white collar workers more productive? Technical report, University of Michigan (2004)
12. Cisic, D., Kesic, B., Jakomin, L.: Research of the power in the supply chain. International Trade, Economics Working Paper Archive EconWPA (April 2000)
13. Newman, M.: The structure and function of complex networks. SIAM Review 45(2) (2003) 167-256
14. Girvan, M., Newman, M.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences USA 99(12) (2002) 7821-7826
15. Brandes, U.: A faster algorithm for betweenness centrality. J. Mathematical Sociology 25(2) (2001) 163-177
16. Bader, D., Madduri, K.: Parallel algorithms for evaluating centrality indices in real-world networks. In: Proc. 35th Int'l Conf. on Parallel Processing (ICPP), Columbus, OH, IEEE Computer Society (August 2006)
17. Eppstein, D., Wang, J.: Fast approximation of centrality. In: Proc. 12th Ann. Symp. Discrete Algorithms (SODA-01), Washington, DC (2001) 228-229
18. Brandes, U., Pich, C.: Centrality estimation in large networks. To appear in Intl. Journal of Bifurcation and Chaos, Special Issue on Complex Networks' Structure and Dynamics (2007)
19. Lipton, R., Naughton, J.: Estimating the size of generalized transitive closures. In: VLDB. (1989) 165-171
20. Madduri, K., Bader, D.: Small-world Network Analysis in Parallel: a toolkit for centrality analysis. http://www.cc.gatech.edu/~kamesh (2007)
21. Madduri, K., Bader, D.: GTgraph: A suite of synthetic graph generators. http://www.cc.gatech.edu/~kamesh/gtgraph (2006)
22. Barabási, A.L.: Network databases. http://www.nd.edu/~networks/resources.htm (2007)
23. Bader, D., Madduri, K.: A graph-theoretic analysis of the human protein interaction network using multicore parallel algorithms. In: Proc. 6th Workshop on High Performance Computational Biology (HiCOMB 2007), Long Beach, CA (March 2007)
24. Davis, T.: University of Florida Sparse Matrix Collection. http://www.cise.ufl.edu/research/sparse/matrices (2007)
25. Batagelj, V., Mrvar, A.: PAJEK datasets. http://www.vlado.fmf.uni-lj.si/pub/networks/data/ (2006)
26. Demetrescu, C., Goldberg, A., Johnson, D.: 9th DIMACS implementation challenge: Shortest Paths. http://www.dis.uniroma1.it/~challenge9/ (2006)
27. Erdős, P., Rényi, A.: On random graphs I. Publicationes Mathematicae 6 (1959) 290-297
28. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439) (1999) 509-512
29. Peri, S., et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research 13 (2003) 2363-2371