E6885 Network Science Lecture 10: Analysis of Network Flow


E6885 Topics in Signal Processing -- Network Science
Lecture 10: Analysis of Network Flow
Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University
November 21st, 2011

Course Structure

Date       Topics Covered
9/12/11    Overview; Social, Information, and Cognitive Network Analysis
9/19/11    Network Representations and Characteristics
9/26/11    Network Partitioning, Clustering, and Use Case
10/3/11    Network Visualization, Sampling and Estimation
10/10/11   Network Modeling
10/17/11   Network Topology Inference
10/24/11   Dynamic Networks -- I
10/31/11   Dynamic Networks -- II
11/14/11   Final Project Proposal Presentation
11/21/11   Analysis of Network Flow
11/28/11   Graphical Models
12/5/11    Cognitive Networks and Economy Issues in Networks
12/12/11   Large-Scale Network Processing System
12/19/11   Final Project Presentation

Gravity Models

Gravity models are a class of models for describing aggregate levels of interaction among the people of different populations. Traditionally used in:
- Geography
- Economics
- Sociology
- Hydrology
- Analysis of computer network traffic

For instance:
New York <-> Los Angeles = 20,124,377 * 15,781,273 / (2462 miles)^2 = 52.4 million
El Paso (Texas) <-> Tucson (Arizona) = 703,127 * 790,755 / (263 miles)^2 = 8.0 million
El Paso (Texas) <-> Los Angeles = 21.0 million

Such scores can be used to predict migration and traffic flow.

Common Gravity Model

The general gravity model specifies that the traffic flows Z_ij be in the form of counts, with independent Poisson distributions and mean functions of the form:

E(Z_ij) = h_O(i) h_D(j) h_S(c_ij)

where h_O is a positive function of the origin i, h_D is a positive function of the destination j, and c_ij collects separation attributes (distance, cost, etc.). Some commonly used (standard) forms:

h_O(i) = (pi_{O,i})^alpha,  h_D(j) = (pi_{D,j})^beta
h_S(c_ij) = (c_ij)^theta  or  h_S(c_ij) = exp(theta^T c_ij)
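The city-pair numbers above can be reproduced directly. A minimal sketch of the simple (unparameterized) gravity score, population_i * population_j / distance^2, using the populations and distances quoted on the slide:

```python
def gravity_interaction(pop_i, pop_j, dist_miles):
    """Simple gravity score: product of populations over squared distance."""
    return pop_i * pop_j / dist_miles ** 2

# New York <-> Los Angeles, populations and distance from the slide
ny_la = gravity_interaction(20_124_377, 15_781_273, 2462)       # ~52.4 million

# El Paso <-> Tucson
ep_tucson = gravity_interaction(703_127, 790_755, 263)          # ~8.0 million
```

Note that the score is not a count of anything physical; it is a relative index of expected interaction used to rank city pairs.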

Example: Austrian call data

Phone traffic between 32 telecommunication districts in Austria. Call flow volume is plotted versus each of origin Gross Regional Product (GRP), destination GRP, and distance, with a linear regression fit (dotted) and a nonparametric smoother (solid).

Inference for Gravity Models

Focusing on the general gravity model in log-linear form:

log mu_ij = alpha_i + beta_j + theta^T c_ij

a generic iteratively re-weighted least-squares (IRLS) method can be used to fit the parameters.
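The IRLS fitting step can be sketched as follows. This is a hedged illustration of IRLS (Fisher scoring) for a generic Poisson log-linear model log(mu) = X @ beta; the design matrix and counts below are synthetic, not the Austrian call data:

```python
import numpy as np

def irls_poisson(X, z, n_iter=50):
    """Fit beta in a Poisson regression log(mu) = X @ beta by IRLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)               # current mean estimates
        w = mu                              # Poisson working weights
        y = X @ beta + (z - mu) / mu        # working response
        # weighted least-squares step: (X^T W X) beta = X^T W y
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

# synthetic check: recover a known coefficient vector
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
beta_true = np.array([1.0, 0.5])
z = rng.poisson(np.exp(X @ beta_true))
beta_hat = irls_poisson(X, z)
```

For the gravity model itself, X would encode the origin indicator (alpha_i), destination indicator (beta_j), and separation attributes c_ij as columns.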

Example: Gravity Model

Accuracy of estimates of traffic volume made by the standard (left, in blue) and general (right, in green) gravity models for the Austrian call data. The standard model tends to over-estimate with somewhat greater frequency than the general model, particularly for medium- and low-volume flows. The relative error decreases with volume.

Relative Prediction Error of the Gravity Models

Traffic Matrix Estimation

Sometimes it is not easy to monitor the flow volumes of origin-destination pairs directly. Instead, sensors are placed at entrances to on- and off-ramps, as in highway road networks. We then face the problem of predicting the Z_ij, or alternatively estimating their means, from the observed link counts:

X = (X_e), e in E

That is, we seek to invert the routing matrix B in the relation X = B Z, where B typically has many fewer rows (i.e., network links) than columns (i.e., origin-destination pairs).

A simple network illustrating the traffic matrix estimation problem.

Static Methods

Methods based on least squares and Gaussian models. A simple but commonly adopted model for the link counts X is one of the form:

X = B mu + epsilon

where mu denotes the expected flow volumes and epsilon the errors. In general, mu is not estimable in this model, but under certain conditions the expected origin and destination volumes mu_{i+} and mu_{+j} are in fact estimable.

Static Methods, cont'd

Robillard (1975) proposed a gravity model for the expected flow volumes. Unfortunately, it has been observed that in practice gravity models often fit too poorly to produce good estimates. However, in some situations an initial set of origin-destination flow volume measurements is available. We might use these measurements, rather than a gravity model, to suitably constrain our estimate. Cascetta (1984) proposed a method for doing so based on generalized least squares.
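The non-estimability of mu can be seen on a toy example. This is a hedged sketch (the network below is invented): a star network with two origins and two destinations through one hub has 4 links but 4 OD pairs, and its routing matrix has rank 3, so mu itself is not identifiable even though the origin/destination totals are:

```python
import numpy as np

# Rows: links o1->hub, o2->hub, hub->d1, hub->d2.
# Columns: OD pairs (o1,d1), (o1,d2), (o2,d1), (o2,d2).
B = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.]])
mu_true = np.array([3., 1., 2., 4.])
x = B @ mu_true                                 # observed link counts (noise-free)

mu_hat, *_ = np.linalg.lstsq(B, x, rcond=None)  # minimum-norm least squares
```

mu_hat reproduces every link count and every origin/destination total exactly, yet differs from mu_true: any multiple of the null-space direction of B can be added without changing the observations.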

Dynamic Methods

Dynamic methods of traffic matrix estimation are designed to estimate the traffic at all time periods, or sequentially. The dynamic methods proposed to date are predominantly based on principles of least squares. The majority of the methods require that the length of a typical trip, from any given origin to any given destination, be substantially shorter than the length of each time interval during which measurements are taken. This assumption has the advantage of simplifying the routing information that must be encoded in the routing matrices B(t), since it effectively allows us to ignore the possibility that trips beginning in one time period end in a different time period.

Dynamic Methods, cont'd

Sequential methods of traffic matrix estimation can be viewed as variations or extensions of Kalman filtering. In the Kalman filtering approach, the time-varying relationship among the means and the link counts is modeled through a set of equations. The so-called Kalman filter is a sequential, recursive algorithm for determining, at each time t+1, an optimal estimate of the state mu(t+1) based on the observations, where the estimate is optimal in the sense that it is unbiased and has minimum variance among all unbiased estimators.
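A minimal Kalman filter sketch for this setting, under the assumption (not from the slide) that the OD mean vector mu(t) evolves as a random walk and the observation is the link count x(t) = B mu(t) + noise; B, Q, R and the true flows are illustrative:

```python
import numpy as np

def kalman_step(mu, P, x, B, Q, R):
    """One predict/update cycle; returns filtered state and covariance."""
    P_pred = P + Q                          # predict: random-walk state model
    S = B @ P_pred @ B.T + R                # innovation covariance
    K = P_pred @ B.T @ np.linalg.inv(S)     # Kalman gain
    mu_new = mu + K @ (x - B @ mu)          # update with the link counts
    P_new = (np.eye(len(mu)) - K @ B) @ P_pred
    return mu_new, P_new

B = np.eye(2)                               # trivially observable toy routing
Q, R = 0.1 * np.eye(2), 0.01 * np.eye(2)
mu, P = np.zeros(2), np.eye(2)
truth = np.array([5.0, 3.0])
for _ in range(20):                         # feed noise-free observations
    mu, P = kalman_step(mu, P, truth, B, Q, R)
```

With a realistic wide B (fewer links than OD pairs), the same recursion applies; the filter then pools information across time to narrow down the under-determined state.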

Example: Internet Traffic Matrix Estimation

Comparison of bias-corrected tomogravity and Kalman filtering methods.

Traffic volume detection

Traffic flow volume predictions from the bias-corrected tomogravity (left, in blue) and Kalman filtering (right, in red) methods, for four flows with volumes ranging from high (top) to low (bottom). Actual flow volumes are shown in yellow.

Reduced Dimensionality

Eigenvalues of the routing matrix corresponding to the Abilene network, with 110 paths traversing just 30 directed links. The large gap between the second and third eigenvalues, and the resulting knee in the spectrum, are indicative of substantially more linear dependence among the columns of B than suggested by its nominal rank of 30. The overall decay in the spectrum of eigenvalues suggests that measurements on roughly five to ten paths, and perhaps as few as two, may be sufficient to recover useful information about path costs in this network.

Visual representation of the first four eigenvectors.

Predicting Average Path Delay

Modeling and Predicting Personal Information Dissemination Behavior
Xiaodan Song, Ching-Yung Lin, Belle Tseng, and Ming-Ting Sun -- KDD 2005

Utilizing relational and temporal information provides more insight than pure content analysis:
- What is a person's role in events?
- With whom do you discuss what is going on in the company?
- How do behaviors evolve? Interests, tastes?
- In a certain event, who played the most influential roles? Who knew the information?
- How will a person, or a group of persons, respond to a future event?

(Figure: e-mails and publications plotted along a time axis.)

Outline
- Motivation
- The Content-Time-Relation (CTR) model
- Experimental results
- Conclusions and ongoing work

Motivation

Goal: personal information management -- modeling and predicting personal behaviors.

Prior-art systems -- LinkedIn, Orkut, Friendster, Yahoo! 360:
- Share what matters to you; create your own place online; share photos; create a blog; list your favorites; send a blast, and more
- Keep your friends and family close; control who sees what; share as much as you want, with whomever you want
- Tools for visually managing personal social networks

However, in current solutions:
- Users need to manually input, update, and manage these networks
- They do not model or predict personal behaviors

Enron Dataset

A huge collection of real e-mail messages sent and received by employees of the Enron corporation: 493,391 e-mails from 154 users, collected over roughly 1999-2002. Unique messages: 166,653; a subset are intra-Enron messages.

Overview of CTR Model

Input: e-mails, e.g.
  From: sally.beck@enron.com
  To: shona.wilson@enron.com
  Subject: Re: timing of submitting information to Risk Controls
  Good memo - let me know if ...

Information extraction: people (154), role (sender/receiver), content (bag of words), and time. Example CommunityNet output: topic "California Energy", time 2000-2001.

Applications: receiver recommendation system, prediction, filtering.

The CTR model incorporates content, time, and relations in a generative probabilistic way (plate diagram with hyperparameters alpha, beta, gamma; topic z; words w; time t; receivers r; documents D; social network S).

Related Work (I): Social Network Analysis

Static social network analysis:
- Small world: six degrees of separation [Milgram 1967]
- Link analysis in information retrieval: PageRank [Brin and Page 1998], HITS [Kleinberg 1998]
- Mining communities from the web [Flake 2002]
- Mining the network value of customers [Domingos et al. 2001, Kempe et al. 2003]
- Exponential Random Graph Model (ERGM) [Wasserman et al. 1996]

Dynamic social network analysis:
- Link prediction [Liben-Nowell and Kleinberg 2003]
- Tracking network changes [Kubica et al. 2002]
- Dynamic actor-oriented social networks [Snijders 2003]

Related Work (II): Content Analysis

- Latent Semantic Analysis (LSA) [Deerwester et al. 1990]: captures the semantic concepts of documents by mapping words into a latent semantic space that accommodates the possible synonymy and polysemy of words. Based on a truncated SVD of the term-document matrix, X ~ T S D^T, with dimensions (N x M) ~ (N x K)(K x K)(K x M): the optimal least-squares projection for reducing dimensionality.
- Probabilistic LSA [Hofmann 1999]: a statistical view of LSA.
- Latent Dirichlet Allocation (LDA) [Blei et al. 2003]: a generative model that places Dirichlet priors on the class model of PLSA; assumes a document is a mixture of topics.
- Author-Topic model [Rosen-Zvi et al. 2004]: tries to recognize which part of a document is contributed by which co-author. A document with multiple authors is a mixture of the distributions associated with the authors; each author is associated with a multinomial distribution over topics, and each topic with a multinomial distribution over words.
- Author-Recipient-Topic model [McCallum et al. 2005]: given the sender and the set of receivers of an e-mail, finds senders with similar roles in events.

None of the previous models use temporal information together with social/relational information.
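The truncated SVD behind LSA can be sketched in a few lines. The tiny term-document matrix below is invented for illustration; keeping the K largest singular values yields the optimal least-squares rank-K approximation (Eckart-Young):

```python
import numpy as np

X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 3.],
              [0., 1., 2.]])                  # terms x documents (N x M)

T, s, Dt = np.linalg.svd(X, full_matrices=False)  # X = T diag(s) D^T
K = 2
X_k = T[:, :K] @ np.diag(s[:K]) @ Dt[:K, :]       # rank-K reconstruction
```

The columns of `T[:, :K]` span the K-dimensional latent semantic space; documents are compared by their coordinates `np.diag(s[:K]) @ Dt[:K, :]` rather than by raw term overlap.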

Our Contribution

Assumption: people tend to send e-mails to different groups of people during different time periods.

- Social (who knows who): people influence each other; information flows.
- Content (who knows what).
- Time: networks grow and decay; information diffuses.

Approach:
- Identify context: with whom does a user communicate regarding a given topic?
- Identify temporal evolution: how do relations change over time?

This leads to the Content-Time-Relation (CTR) model.

Content-Time-Relation Algorithm -- I

Content-Relation (CR): content topic classification integrated with a social network model. It combines content and social relation information with Dirichlet allocations, a causal Bayesian network, and an Exponential Random Graph social network model (p* model).

Author-Recipient-Topic [McCallum et al. 2005], given the sender and the set of receivers of an e-mail:
1. Pick a receiver.
2. Get the probability of a topic given the sender and receiver.
3. Get the probability of a word given the topic.

CR model, given the sender of an e-mail:
1. Get the probability of a topic given the sender.
2. Get the probability of the receiver given the sender and the topic.
3. Get the probability of a word given the topic.

(Plate diagram notation: u/a: sender/author, z: topic, r: receivers, w: content words, N: word set, T: topics, D: documents/e-mails, S: social network.)

Content-Time-Relation Algorithm -- II

Content-Time-Relation (CTR): topic + time -> event, capturing evolutionary information, again integrated with the social network model. It combines content, time, and social relation information with Dirichlet allocations, a causal Bayesian network, and an Exponential Random Graph social network model.

Given the sender and the time of an e-mail:
1. Get the probability of a topic given the sender.
2. Get the probability of the receiver given the sender and the topic.
3. Get the probability of a word given the topic.

CTR algorithm:
- Training phase. Input: old e-mails with content, sender and receiver information, and time stamps. Output: P(w | z, t_old), P(z | d, t_old), and P(u, r | z, t_old).
- Testing phase. Input: new e-mails with content and time stamps. Output: P(u, r | d, t_new), P(w | z, t_new), and P(z | d, t_new).

Adaptive CTR

Social networks dynamically change and evolve over time, so updating the model with the newest user behavior information is necessary.

Aggregative updating adds new user behavior information, including the senders and receivers, into the model:

P_hat(u, r | d, t_i) = sum_{k=1}^{K} P(u, r | z_k, t_old) P(z_k | d, t_i) + P(u, r | t_old) sum_{z in z_{t_i} \ z_{t_old}} P(z | d, t_i)

Alternatively, assume the correlation between the current data and the previous data decays over time: the more recent the data, the more important it is. A sliding window of size n chooses the data for building the prediction model, so the prediction depends only on recent data, with the influence of old data ignored.

Personal Social Network

A PSN captures whom a user contacts during a certain time period:

P(r | u) = (number of times u sends e-mails to r) / (total number of e-mails sent out by u)

(Panels: (a) Jan-99 to Dec-99, (b) Jan-00 to Jun-00, (c) Jul-00 to Dec-00.)
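The PSN estimate above is just a normalized send count per sender. A minimal sketch on an invented toy log of (sender, receiver) pairs:

```python
from collections import Counter, defaultdict

# P(r | u) = (# of e-mails u sent to r) / (total # of e-mails u sent)
log = [("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
       ("bob", "alice")]

sent = defaultdict(Counter)
for sender, receiver in log:
    sent[sender][receiver] += 1

def p_receiver(u, r):
    """P(r | u): the fraction of u's sent e-mails that went to r."""
    total = sum(sent[u].values())
    return sent[u][r] / total if total else 0.0
```

Restricting `log` to a time window before counting gives the period-specific PSNs shown in the three panels.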

CommunityNet

Given a query (e.g., "Christmas", "Energy"), apply the CTR model and visualize the personal topic community via CommunityNet.

Topic Analysis -- Hot and Cold Topics

Hot topics (regular issues):
- Meeting: meeting, plan, conference, balance, presentation, discussion
- Deal: deal, desk, book, bill, group, explore
- Petroleum: petroleum, research, dear, photo, Enron, station
- Texas: Houston, Texas, Enron, north America, street
- Document: letter, draft, attach, comment, review, mark

Cold topics (specific or sensitive issues):
- Trade: trade, London, bank, name, Mexico, conserve
- Stock: stock, earn, company, share, price, new
- Network: network, world, user, save, secure, system
- Project: court, state, India, server, project, govern
- Market: call, market, week, trade, description, respond

Topic Trends -- Yearly Repeating Events

Popularity of Topic 45 (y2000, y2001) and Topic 19 (y2000, y2001), plotted Jan through Nov. Topic 45, which concerns a scheduling issue, reaches a peak during June to September. Topic 19 concerns a meeting issue. The trends repeat from year to year.

CTR Model Finds Topic Categories, Key People, and Communities Simultaneously

Topic 61, "California Power", with its topic trend plotted Jan-00 through Oct-01. Key words include power, California, electrical, price, energy, generator, market, until, with weights such as 0.8816, 0.5594, and 0.3681.

Key people: Jeff Dasovich, James Steffes, Richard Shapiro, Mary Hain, Richard Sanders, Steven Kean.

Event: the California Energy Crisis occurred at exactly this time period, and the key people can be identified as active in this event.

Personal Topic Trends of California Power

Popularity of the topic over Jan-00 to Sep-01: the overall trend versus Jeff Dasovich and Vince Kaminski.

Predicting Receivers

- Personal social network (PSN): people tend to send e-mails to the same group of people.
- Latent Dirichlet Allocation plus personal social network (LDA-PSN): topic clusters do not change over time.
- Content-Time-Relation (CTR) model.
- Adaptive CTR model: aggregative, or with a 6-month sliding window.

The methods are compared over Jan-01 to Nov-01 using Breese evaluation metrics.

CTR Model: Predicting Receivers

Is a person's behavior predictable? Jeff Dasovich (Enron government relations executive): with whom should I discuss government issues? Prediction accuracy over Jan-01 to Nov-01 is compared for PSN, LDA-PSN, CTR, and adaptive CTR. Personal behavior and intention are somewhat predictable.

Conclusions and Ongoing Work

Conclusions:
- Automatically model and predict human behavior of receiving and disseminating information.
- Establish personal CommunityNet profiles based on the Content-Time-Relation algorithm, which incorporates contact, content, and time information simultaneously from personal communications.
- Explore many interesting results: finding the most important employees in events; predicting senders or receivers of e-mails.
- Perform better than both the social network-based and the content-based predictions.
- Personal behavior and intention are somewhat predictable.

Ongoing work:
- Incorporate nonparametric Bayesian methods, such as hierarchical LDA, with contact and time information.
- Extend the CTR model to a Content-Time-Context model for personalized retrieval and recommendation.

Personalized Recommendation Driven by Information Flow
Xiaodan Song, Belle Tseng, Ching-Yung Lin and Ming-Ting Sun -- SIGIR 2006

Recommendation by Collaborative Filtering (CF)

Given that user A adopts an item, infer whether user B will adopt it; given that B adopts, infer whether A will. Both inferences rest on "people with similar tastes" -- similarity is symmetric.

Adoptions Follow a Sequence

Number of accessed users over time (Apr. 2004 to Jul. 2005): some users are early adopters, others late adopters.

Rogers' Diffusion of Innovations Theory

Percentage over all adopters: innovators, early adopters, early majority, late majority, laggards. Users' adoption patterns differ: some users tend to adopt items earlier than others.

Recommendation Driven by Information Flow -- An Intuitive Example

If innovators adopt an item, early adopters are most likely to adopt it next; laggards adopting an item says little about whether innovators will. Even among people with similar tastes, influence is not symmetric.

Utilize Information Flow for Personalized Recommendation -- Problem Formulation

The typical CF question: what items will user U like? Our formulation: given that user U adopts item Y, who would be likely to adopt item Y next? Information flows from earlier adopters (innovators) to later adopters (laggards).

Analogy: Information Adoption as a Diffusion Process

Given that user U adopts item Y, who would be likely to adopt item Y next? In physics, a diffusion process is usually related to a random walk [R. Kondor and J.-P. Vert, Diffusion Kernels, 2004]. Information adoption is therefore modeled as a random walk, and users are ranked by the state probabilities.

Scheme Overview (I)

Leverage the asymmetric influence. From a dataset of (user ID, item ID, timestamp) records, build:
- the Information Flow network (IF), modeling the asymmetric influences between users;
- an information propagation model: if a user adopts the information, who will likely be the follower?

Application: personalized recommendation.

Scheme Overview (II)

Adoption patterns are typically category-specific, so a topic detection stage yields a Topic-Sensitive Information Flow network (TIF), modeling the asymmetric influences between users under the same topic.

IF (I)

Objective: model the asymmetric influences between users. The Early Adoption Matrix (EAM) counts, for each ordered pair of users, how many items one user adopted earlier than the other (an N x N pairwise comparison over users 1..N).

IF (II)

The IF is a random walk model over a network in which each user is a node (state), and the value on edge (i -> j) represents how likely user j will follow user i in adopting the information. The EAM is normalized into a transition probability matrix F with entries F_ij.

IF (III)

A random walk over a graph with sinks or isolated cycles does not converge, so F is modified to have a unique stationary distribution (F~ = F + random jump):

1) Make the matrix stochastic:
   F~_{u,v} = F_{u,v} / sum_v F_{u,v}  if sum_v F_{u,v} != 0,  else 1/N

2) Make the matrix irreducible:
   F' = alpha F~ + (1 - alpha) (1/N) e e^T

where N is the number of nodes and e is the all-ones vector.
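The two fixes are the same device used by PageRank. A hedged sketch (the counts and alpha below are illustrative, not from the paper):

```python
import numpy as np

def fix_transition(EAM, alpha=0.85):
    """Row-normalize EAM; give empty rows the uniform distribution
    (stochastic), then mix every row with a uniform jump (irreducible)."""
    F = EAM.astype(float)
    n = F.shape[0]
    row_sums = F.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)    # avoid 0/0 on sink rows
    F = np.where(row_sums > 0, F / safe, 1.0 / n)
    return alpha * F + (1 - alpha) / n              # alpha*F~ + (1-alpha)/n * ee^T

# toy EAM: user 3 never adopted earlier than anyone (a sink row)
F = fix_transition(np.array([[0, 3, 1],
                             [2, 0, 0],
                             [0, 0, 0]]))
```

After the fix every row sums to one and every entry is positive, so the walk has a unique stationary distribution.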

TIF

Topic detection via Latent Dirichlet Allocation [Blei et al. 2003] splits the adoption data by topic, and a separate information flow network is built for each topic, giving the TIF.

Information Propagation Models (I)

1. Summation of various propagation steps:

F_if(m) = F + F^2 + ... + F^m

As a special case, when m = N - 1 (N the number of nodes), F_if(N - 1) can be evaluated in closed form through the eigendecomposition of F (U the eigenvector matrix).
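The step-summation model can be sketched directly from its definition; the toy row-stochastic matrix below is illustrative:

```python
import numpy as np

def propagate_sum(F, m):
    """F_if(m) = F + F^2 + ... + F^m: influence reachable in up to m steps,
    where F^k weights followers exactly k adoption hops away."""
    total = np.zeros_like(F)
    Fk = np.eye(F.shape[0])
    for _ in range(m):
        Fk = Fk @ F            # F^k
        total += Fk
    return total

F = np.array([[0.0, 1.0],
              [0.5, 0.5]])
F2 = propagate_sum(F, 2)       # F + F^2
```

Row u of the result scores every other user as a potential follower of u; triggering the earliest adopters and reading off their rows yields the recommendation ranking.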

Information Propagation Models (II)

2. Exponentially weighted summation. The longer the path, the less reliable it is:

F_if(exp) = exp(beta F) = U diag(exp(beta lambda_1), exp(beta lambda_2), ..., exp(beta lambda_N)) U^T

where the eigenvalues satisfy lambda_1 = 1 > lambda_2 > ... > lambda_N, and N is the number of nodes.

Personalized Recommendation

Construct the IF or TIF network from the historical data, trigger the earliest adopting users to start the process, and predict who else will be interested in these items via the information propagation models.
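A hedged sketch of the exponentially weighted model, computed here from the matrix-exponential power series (the eigendecomposition on the slide yields the same matrix); beta and F are illustrative:

```python
import numpy as np

def expm_series(F, beta, terms=50):
    """exp(beta * F) via its power series: longer paths F^k receive the
    geometrically shrinking weight beta^k / k!."""
    out = np.eye(F.shape[0])
    Fk = np.eye(F.shape[0])
    fact = 1.0
    for k in range(1, terms):
        Fk = Fk @ F
        fact *= k
        out += (beta ** k / fact) * Fk
    return out

# diagonal toy case, where exp(beta*F) is just an elementwise exponential
F = np.diag([1.0, 0.5])
E = expm_series(F, 2.0)
```

In practice `scipy.linalg.expm` computes the same quantity more robustly; the series form is shown only to make the path-length weighting explicit.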

Experimental Setup

Sales-force dataset: Apr. 2004 to Apr. 2005 as training data, May 2005 to Jul. 2005 as test data; 133 users, 586 documents.

MovieLens dataset: 943 users, 1682 movies, 100,000 actions. The log data for the earliest 80% of disclosed movies serve as training data, the latest 20% as test data.

Evaluation: baseline is Collaborative Filtering (CF); metrics are precision and recall.

Consistency of Early Adoption Patterns

How consistent are users' pairwise adoption behaviors over time? Calculate transition probability matrices (TPMs) of both the training and test data. For each user i, compute the correlation between:
- the ith row of the test-data TPM and the uniform distribution 1/(N-1) (Baseline 1);
- the ith row of the test-data TPM and the uniform distribution 1/M, where M is the number of users used in CF (Baseline 2);
- the ith rows of the two TPMs (IF).

Histograms of the correlation values (number of users versus correlation value) are shown for both the ER and MovieLens datasets.

Experimental Results -- Recommendation Quality

Precision and recall comparison (number of triggered users = 1, propagation steps = 1) over the number of recommended (retrieved) users, for CF, EABIF, and TEABIF/TIF. Compared to Collaborative Filtering (CF):
- Precision: IF is 91% better, TIF is 108% better.
- Recall: IF is 87% better, TIF is 113% better.

Experimental Results -- Propagation Performance

Precision and recall improvement ratios over the CF baseline (number of triggered users = 1) for EABIF and TEABIF/TIF, across propagation settings m = 1, 2, 3, 4, 5, the full summation, and exponential weighting with beta = 1, 1.5, 2, 3, 4, 5, 8, 16. TIF with exponentially weighted summation (beta = 4) achieves the best performance: it improves 136% on precision and 126% on recall compared to CF.

Experimental Results -- Recommendation Quality (cont'd)

Precision and recall comparisons for number of triggered users = 1 and = 2 (propagation steps = 1), over the number of recommended users, for CF, EABIF, and TEABIF/TIF. Compared to Collaborative Filtering (CF), precision: IF is 91% better, TIF is 108% better; recall: IF is 87% better, TIF is 113% better.

Conclusions and Next Steps

Conclusions:
- Utilize sequential adoption patterns.
- Leverage asymmetric influences between users (IF).
- Leverage category-specific patterns (TIF).
- Identify how information flows through the network (information propagation models).

Next steps:
- Leverage the diffusion rate.
- Improve the information propagation models.
- Evaluate with an online user study.


More information

Recommendation System for Location-based Social Network CS224W Project Report

Recommendation System for Location-based Social Network CS224W Project Report Recommendation System for Location-based Social Network CS224W Project Report Group 42, Yiying Cheng, Yangru Fang, Yongqing Yuan 1 Introduction With the rapid development of mobile devices and wireless

More information

Practical Machine Learning Agenda

Practical Machine Learning Agenda Practical Machine Learning Agenda Starting From Log Management Moving To Machine Learning PunchPlatform team Thales Challenges Thanks 1 Starting From Log Management 2 Starting From Log Management Data

More information

Part 11: Collaborative Filtering. Francesco Ricci

Part 11: Collaborative Filtering. Francesco Ricci Part : Collaborative Filtering Francesco Ricci Content An example of a Collaborative Filtering system: MovieLens The collaborative filtering method n Similarity of users n Methods for building the rating

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

Spatial Latent Dirichlet Allocation

Spatial Latent Dirichlet Allocation Spatial Latent Dirichlet Allocation Xiaogang Wang and Eric Grimson Computer Science and Computer Science and Artificial Intelligence Lab Massachusetts Tnstitute of Technology, Cambridge, MA, 02139, USA

More information

Weighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD. Abstract

Weighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD. Abstract Weighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD Abstract There are two common main approaches to ML recommender systems, feedback-based systems and content-based systems.

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

K-Means and Gaussian Mixture Models

K-Means and Gaussian Mixture Models K-Means and Gaussian Mixture Models David Rosenberg New York University June 15, 2015 David Rosenberg (New York University) DS-GA 1003 June 15, 2015 1 / 43 K-Means Clustering Example: Old Faithful Geyser

More information

Cybersecurity is a Team Sport

Cybersecurity is a Team Sport Cybersecurity is a Team Sport Cyber Security Summit at Loyola Marymount University - October 22 2016 Dr. Robert Pittman, CISM Chief Information Security Officer National Cyber Security Awareness Month

More information

CS281 Section 9: Graph Models and Practical MCMC

CS281 Section 9: Graph Models and Practical MCMC CS281 Section 9: Graph Models and Practical MCMC Scott Linderman November 11, 213 Now that we have a few MCMC inference algorithms in our toolbox, let s try them out on some random graph models. Graphs

More information

Time Series Analysis by State Space Methods

Time Series Analysis by State Space Methods Time Series Analysis by State Space Methods Second Edition J. Durbin London School of Economics and Political Science and University College London S. J. Koopman Vrije Universiteit Amsterdam OXFORD UNIVERSITY

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Matrix Co-factorization for Recommendation with Rich Side Information HetRec 2011 and Implicit 1 / Feedb 23

Matrix Co-factorization for Recommendation with Rich Side Information HetRec 2011 and Implicit 1 / Feedb 23 Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback Yi Fang and Luo Si Department of Computer Science Purdue University West Lafayette, IN 47906, USA fangy@cs.purdue.edu

More information

MIND THE GOOGLE! Understanding the impact of the. Google Knowledge Graph. on your shopping center website.

MIND THE GOOGLE! Understanding the impact of the. Google Knowledge Graph. on your shopping center website. MIND THE GOOGLE! Understanding the impact of the Google Knowledge Graph on your shopping center website. John Dee, Chief Operating Officer PlaceWise Media Mind the Google! Understanding the Impact of the

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Lecture 27: Learning from relational data

Lecture 27: Learning from relational data Lecture 27: Learning from relational data STATS 202: Data mining and analysis December 2, 2017 1 / 12 Announcements Kaggle deadline is this Thursday (Dec 7) at 4pm. If you haven t already, make a submission

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

Sensor Tasking and Control

Sensor Tasking and Control Sensor Tasking and Control Outline Task-Driven Sensing Roles of Sensor Nodes and Utilities Information-Based Sensor Tasking Joint Routing and Information Aggregation Summary Introduction To efficiently

More information

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017 CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.

More information

Algorithms and Applications in Social Networks. 2017/2018, Semester B Slava Novgorodov

Algorithms and Applications in Social Networks. 2017/2018, Semester B Slava Novgorodov Algorithms and Applications in Social Networks 2017/2018, Semester B Slava Novgorodov 1 Lesson #1 Administrative questions Course overview Introduction to Social Networks Basic definitions Network properties

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

Graph Exploitation Testbed

Graph Exploitation Testbed Graph Exploitation Testbed Peter Jones and Eric Robinson Graph Exploitation Symposium April 18, 2012 This work was sponsored by the Office of Naval Research under Air Force Contract FA8721-05-C-0002. Opinions,

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Diffusion and Clustering on Large Graphs

Diffusion and Clustering on Large Graphs Diffusion and Clustering on Large Graphs Alexander Tsiatas Thesis Proposal / Advancement Exam 8 December 2011 Introduction Graphs are omnipresent in the real world both natural and man-made Examples of

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting

Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Yaguang Li Joint work with Rose Yu, Cyrus Shahabi, Yan Liu Page 1 Introduction Traffic congesting is wasteful of time,

More information

Collaborative Filtering using Weighted BiPartite Graph Projection A Recommendation System for Yelp

Collaborative Filtering using Weighted BiPartite Graph Projection A Recommendation System for Yelp Collaborative Filtering using Weighted BiPartite Graph Projection A Recommendation System for Yelp Sumedh Sawant sumedh@stanford.edu Team 38 December 10, 2013 Abstract We implement a personal recommendation

More information

Airside Congestion. Airside Congestion

Airside Congestion. Airside Congestion Airside Congestion Amedeo R. Odoni T. Wilson Professor Aeronautics and Astronautics Civil and Environmental Engineering Massachusetts Institute of Technology Objectives Airside Congestion _ Introduce fundamental

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Community-Based Recommendations: a Solution to the Cold Start Problem

Community-Based Recommendations: a Solution to the Cold Start Problem Community-Based Recommendations: a Solution to the Cold Start Problem Shaghayegh Sahebi Intelligent Systems Program University of Pittsburgh sahebi@cs.pitt.edu William W. Cohen Machine Learning Department

More information

Recurrent Neural Network (RNN) Industrial AI Lab.

Recurrent Neural Network (RNN) Industrial AI Lab. Recurrent Neural Network (RNN) Industrial AI Lab. For example (Deterministic) Time Series Data Closed- form Linear difference equation (LDE) and initial condition High order LDEs 2 (Stochastic) Time Series

More information

CSE 258 Lecture 8. Web Mining and Recommender Systems. Extensions of latent-factor models, (and more on the Netflix prize)

CSE 258 Lecture 8. Web Mining and Recommender Systems. Extensions of latent-factor models, (and more on the Netflix prize) CSE 258 Lecture 8 Web Mining and Recommender Systems Extensions of latent-factor models, (and more on the Netflix prize) Summary so far Recap 1. Measuring similarity between users/items for binary prediction

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Spectral Methods for Network Community Detection and Graph Partitioning

Spectral Methods for Network Community Detection and Graph Partitioning Spectral Methods for Network Community Detection and Graph Partitioning M. E. J. Newman Department of Physics, University of Michigan Presenters: Yunqi Guo Xueyin Yu Yuanqi Li 1 Outline: Community Detection

More information

Imputation for missing observation through Artificial Intelligence. A Heuristic & Machine Learning approach

Imputation for missing observation through Artificial Intelligence. A Heuristic & Machine Learning approach Imputation for missing observation through Artificial Intelligence A Heuristic & Machine Learning approach (Test case with macroeconomic time series from the BIS Data Bank) Byeungchun Kwon Bank for International

More information

Probabilistic Modeling of Leach Protocol and Computing Sensor Energy Consumption Rate in Sensor Networks

Probabilistic Modeling of Leach Protocol and Computing Sensor Energy Consumption Rate in Sensor Networks Probabilistic Modeling of Leach Protocol and Computing Sensor Energy Consumption Rate in Sensor Networks Dezhen Song CS Department, Texas A&M University Technical Report: TR 2005-2-2 Email: dzsong@cs.tamu.edu

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

E6885 Network Science Lecture 10: Graph Database (II)

E6885 Network Science Lecture 10: Graph Database (II) E 6885 Topics in Signal Processing -- Network Science E6885 Network Science Lecture 10: Graph Database (II) Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University November 18th, 2013 Course

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences

More information

Data fusion and multi-cue data matching using diffusion maps

Data fusion and multi-cue data matching using diffusion maps Data fusion and multi-cue data matching using diffusion maps Stéphane Lafon Collaborators: Raphy Coifman, Andreas Glaser, Yosi Keller, Steven Zucker (Yale University) Part of this work was supported by

More information

Supervised Random Walks

Supervised Random Walks Supervised Random Walks Pawan Goyal CSE, IITKGP September 8, 2014 Pawan Goyal (IIT Kharagpur) Supervised Random Walks September 8, 2014 1 / 17 Correlation Discovery by random walk Problem definition Estimate

More information

Ruslan Salakhutdinov and Geoffrey Hinton. University of Toronto, Machine Learning Group IRGM Workshop July 2007

Ruslan Salakhutdinov and Geoffrey Hinton. University of Toronto, Machine Learning Group IRGM Workshop July 2007 SEMANIC HASHING Ruslan Salakhutdinov and Geoffrey Hinton University of oronto, Machine Learning Group IRGM orkshop July 2007 Existing Methods One of the most popular and widely used in practice algorithms

More information

Extracting Information from Complex Networks

Extracting Information from Complex Networks Extracting Information from Complex Networks 1 Complex Networks Networks that arise from modeling complex systems: relationships Social networks Biological networks Distinguish from random networks uniform

More information

Behavioral Data Mining. Lecture 9 Modeling People

Behavioral Data Mining. Lecture 9 Modeling People Behavioral Data Mining Lecture 9 Modeling People Outline Power Laws Big-5 Personality Factors Social Network Structure Power Laws Y-axis = frequency of word, X-axis = rank in decreasing order Power Laws

More information

Lecture 27, April 24, Reading: See class website. Nonparametric regression and kernel smoothing. Structured sparse additive models (GroupSpAM)

Lecture 27, April 24, Reading: See class website. Nonparametric regression and kernel smoothing. Structured sparse additive models (GroupSpAM) School of Computer Science Probabilistic Graphical Models Structured Sparse Additive Models Junming Yin and Eric Xing Lecture 7, April 4, 013 Reading: See class website 1 Outline Nonparametric regression

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Additive hedonic regression models for the Austrian housing market ERES Conference, Edinburgh, June

Additive hedonic regression models for the Austrian housing market ERES Conference, Edinburgh, June for the Austrian housing market, June 14 2012 Ao. Univ. Prof. Dr. Fachbereich Stadt- und Regionalforschung Technische Universität Wien Dr. Strategic Risk Management Bank Austria UniCredit, Wien Inhalt

More information

Mobility Models. Larissa Marinho Eglem de Oliveira. May 26th CMPE 257 Wireless Networks. (UCSC) May / 50

Mobility Models. Larissa Marinho Eglem de Oliveira. May 26th CMPE 257 Wireless Networks. (UCSC) May / 50 Mobility Models Larissa Marinho Eglem de Oliveira CMPE 257 Wireless Networks May 26th 2015 (UCSC) May 2015 1 / 50 1 Motivation 2 Mobility Models 3 Extracting a Mobility Model from Real User Traces 4 Self-similar

More information

CS 6604: Data Mining Large Networks and Time-Series

CS 6604: Data Mining Large Networks and Time-Series CS 6604: Data Mining Large Networks and Time-Series Soumya Vundekode Lecture #12: Centrality Metrics Prof. B Aditya Prakash Agenda Link Analysis and Web Search Searching the Web: The Problem of Ranking

More information

VisoLink: A User-Centric Social Relationship Mining

VisoLink: A User-Centric Social Relationship Mining VisoLink: A User-Centric Social Relationship Mining Lisa Fan and Botang Li Department of Computer Science, University of Regina Regina, Saskatchewan S4S 0A2 Canada {fan, li269}@cs.uregina.ca Abstract.

More information

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model One-Shot Learning with a Hierarchical Nonparametric Bayesian Model R. Salakhutdinov, J. Tenenbaum and A. Torralba MIT Technical Report, 2010 Presented by Esther Salazar Duke University June 10, 2011 E.

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Edge-exchangeable graphs and sparsity

Edge-exchangeable graphs and sparsity Edge-exchangeable graphs and sparsity Tamara Broderick Department of EECS Massachusetts Institute of Technology tbroderick@csail.mit.edu Diana Cai Department of Statistics University of Chicago dcai@uchicago.edu

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #7: Recommendation Content based & Collaborative Filtering Seoul National University In This Lecture Understand the motivation and the problem of recommendation Compare

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP

More information

Query Independent Scholarly Article Ranking

Query Independent Scholarly Article Ranking Query Independent Scholarly Article Ranking Shuai Ma, Chen Gong, Renjun Hu, Dongsheng Luo, Chunming Hu, Jinpeng Huai SKLSDE Lab, Beihang University, China Beijing Advanced Innovation Center for Big Data

More information

CSE 446 Bias-Variance & Naïve Bayes

CSE 446 Bias-Variance & Naïve Bayes CSE 446 Bias-Variance & Naïve Bayes Administrative Homework 1 due next week on Friday Good to finish early Homework 2 is out on Monday Check the course calendar Start early (midterm is right before Homework

More information

Characterization and Modeling of Deleted Questions on Stack Overflow

Characterization and Modeling of Deleted Questions on Stack Overflow Characterization and Modeling of Deleted Questions on Stack Overflow Denzil Correa, Ashish Sureka http://correa.in/ February 16, 2014 Denzil Correa, Ashish Sureka (http://correa.in/) ACM WWW-2014 February

More information

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017 CPSC 340: Machine Learning and Data Mining Recommender Systems Fall 2017 Assignment 4: Admin Due tonight, 1 late day for Monday, 2 late days for Wednesday. Assignment 5: Posted, due Monday of last week

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time Today Lecture 4: We examine clustering in a little more detail; we went over it a somewhat quickly last time The CAD data will return and give us an opportunity to work with curves (!) We then examine

More information

Desiging and combining kernels: some lessons learned from bioinformatics

Desiging and combining kernels: some lessons learned from bioinformatics Desiging and combining kernels: some lessons learned from bioinformatics Jean-Philippe Vert Jean-Philippe.Vert@mines-paristech.fr Mines ParisTech & Institut Curie NIPS MKL workshop, Dec 12, 2009. Jean-Philippe

More information

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K. GAMs semi-parametric GLMs Simon Wood Mathematical Sciences, University of Bath, U.K. Generalized linear models, GLM 1. A GLM models a univariate response, y i as g{e(y i )} = X i β where y i Exponential

More information

2016 Market Update. Gary Keller and Jay Papasan Keller Williams Realty, Inc.

2016 Market Update. Gary Keller and Jay Papasan Keller Williams Realty, Inc. 2016 Market Update Gary Keller and Jay Papasan Housing Market Cycles 1. Home Sales The Numbers That Drive U.S. 2. Home Price 3. Months Supply of Inventory 4. Mortgage Rates Real Estate 1. Home Sales Nationally

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

Unsupervised Learning

Unsupervised Learning Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

More information

Nigerian Telecommunications (Services) Sector Report Q3 2016

Nigerian Telecommunications (Services) Sector Report Q3 2016 Nigerian Telecommunications (Services) Sector Report Q3 2016 24 NOVEMBER 2016 Telecommunications Data The telecommunications data used in this report were obtained from the National Bureau of Statistics

More information

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University NIPS 2008: E. Sudderth & M. Jordan, Shared Segmentation of Natural

More information

Chapter 3: Supervised Learning

Chapter 3: Supervised Learning Chapter 3: Supervised Learning Road Map Basic concepts Evaluation of classifiers Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Summary 2 An example

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Machine Learning / Jan 27, 2010

Machine Learning / Jan 27, 2010 Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman http://www.mmds.org Overview of Recommender Systems Content-based Systems Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive

More information

Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis

Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Xavier Le Faucheur a, Brani Vidakovic b and Allen Tannenbaum a a School of Electrical and Computer Engineering, b Department of Biomedical

More information

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Evaluation

More information

Optimal designs for comparing curves

Optimal designs for comparing curves Optimal designs for comparing curves Holger Dette, Ruhr-Universität Bochum Maria Konstantinou, Ruhr-Universität Bochum Kirsten Schorning, Ruhr-Universität Bochum FP7 HEALTH 2013-602552 Outline 1 Motivation

More information

Text Document Clustering Using DPM with Concept and Feature Analysis

Text Document Clustering Using DPM with Concept and Feature Analysis Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information