CS 229 Project Final Report: Learning Convention Propagation in BeerAdvocate Reviews from a Network Perspective

Abstract

We look at the way conventions propagate between reviews in the BeerAdvocate dataset, and try to predict whether a specific convention will be adopted by a user in an upcoming review. Learning and prediction of convention adoption are done based on the exposure of a review (or of a reviewer at a specific point in time) to the convention. In this project, we define the criteria for exposure of one review to another, which in turn define an implicit network structure over the reviews. We then use features extracted from this network to learn and predict convention adoption.

1. Introduction

BeerAdvocate is a website on which users write reviews of various brands of beer. The BeerAdvocate dataset contains over 1.5 million review records, made by more than 33K users. Some of the users adopt a unique jargon in their review text, and use certain conventions (specific words, phrases or abbreviations) which are shared by multiple reviewers and across multiple beers. In the scope of this project, a convention is an element from a pre-defined set of pieces of text C. We do not address the semantic meaning of a convention, or what makes a piece of text become a convention. In this project, we look at the binary classification problem of learning when a convention c ∈ C is used ("adopted"), and try to predict whether a new review r will adopt that convention. In addition to its content, a review r is characterized by three components: the reviewer user(r), the product beer(r), and the time of the review time(r). The hypothesis behind this project is that high exposure to a convention c by user(r) while reviewing beer(r) at time(r) increases the likelihood of review r adopting c. One key question is how to define exposure on a review website such as BeerAdvocate. In problems that deal with information propagation in networks (e.g.
social networks), the network is given in advance and determines whether a node (which in most cases represents a user) is exposed to another node. In contrast, a review website such as BeerAdvocate does not explicitly define a network structure. Instead, it is up to us to define when two reviews (or two reviewers) are exposed to one another, and at what times. The definition of exposure then defines an underlying exposure network that can be used to reason about information propagation between its nodes. The established exposure relations between reviews, and the network structure they define, are used to derive features for learning and prediction of convention adoption by new reviews that are added to the network. There are three categories of features (attributes) we use for convention adoption learning: the extent of the exposure the review has to the convention, user bias, and convention bias. User bias captures the tendency of the user to adopt conventions, and convention bias captures the tendency of the convention to be adopted. Features related to the embedding of the sub-graph induced by the convention-adopting reviews within the general exposure graph are included under convention bias, as an implicit measure of correlation between exposure and convention propagation for that convention.

2. Exposure Network Model

Definition: a review r is exposed to review r' if r' is either an earlier review by user(r), or one of the k preceding reviews of beer(r). Review r is exposed to a convention c if one of the reviews that r is exposed to uses c. Note that there is no requirement for usage of c by r itself. Formally, the set of reviews r is exposed to is Exp[r] = Exp_U[r] ∪ Exp_B[r], where:

Exp_U[r] = { x ∈ reviews : user(x) = user(r), time(x) < time(r) }
Exp_B[r] = { x ∈ reviews : beer(x) = beer(r), rank(r) − k ≤ rank(x) < rank(r) }

Here rank(x) is the chronological rank of review x among the reviews of beer(x). Note that rank(x) < rank(r) also implies time(x) < time(r). The reasoning behind this definition is that exposure between reviews originates either from previous usage of a convention by the same user ("user-based exposure") or from contagion from one of the immediately preceding reviews of the same product, which are immediately visible to the reviewer ("product-based exposure"). We set the product exposure parameter k to 25, the number of reviews on a page of the BeerAdvocate website. While the above definition is binary (r is either exposed to r' or not), the extent of exposure of r to the reviews is not uniform: it is a decreasing function of the time difference between the reviews in the case of same-user exposure (using the same conventions as a recent review is more probable than using a convention from an ancient one), or a decreasing function of the rank difference between the reviews in the case of same-product exposure (the reviews close in rank to the current review are more easily visible to the reviewer, and probably more relevant). The above definition of exposure induces a directed network structure G = (N, E) over the dataset, where the nodes represent reviews, and an edge (r' → r) exists in the network if and only if review r is exposed to review r'. The extent of exposure of r to r' then defines a weight for the edge (r' → r), which is a decreasing function of the time/rank difference between r and r'. We noticed that the likelihood of convention propagation decreases far more drastically for product-based exposures as rank difference increases than it does for user-based exposure as time difference increases.
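The two exposure sets above can be built in a single pass over chronologically sorted reviews. The sketch below is illustrative: the tuple layout (review id, user, beer, time) and the function name are assumptions, not the report's actual data format.

```python
from collections import defaultdict

def build_exposure_sets(reviews, k):
    """Return Exp_U[r] and Exp_B[r] for every review id.

    reviews: iterable of (review_id, user, beer, time) tuples,
             assumed sorted chronologically (oldest first).
    k: product-exposure window, the number of immediately preceding
       reviews of the same beer that are visible to the reviewer.
    """
    by_user = defaultdict(list)   # user -> ids of that user's reviews so far
    by_beer = defaultdict(list)   # beer -> ids of that beer's reviews, in rank order
    exp_u, exp_b = {}, {}
    for rid, user, beer, _time in reviews:
        # User-based exposure: all earlier reviews by the same user.
        exp_u[rid] = list(by_user[user])
        # Product-based exposure: the k immediately preceding reviews of the beer.
        exp_b[rid] = list(by_beer[beer][-k:])
        by_user[user].append(rid)
        by_beer[beer].append(rid)
    return exp_u, exp_b
```

Because the input is processed in time order, the `time(x) < time(r)` condition is enforced implicitly by only ever looking at reviews already inserted into the per-user and per-beer lists.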
Thus the edge weights are modeled in the following manner:

w(r' → r) = exp(−α (time(r) − time(r')))   if (r' → r) is due to user-based exposure
w(r' → r) = exp(−β (rank(r) − rank(r')))   if (r' → r) is due to product-based exposure

where the decay constants are set so that product-based exposure decays much faster with rank difference than user-based exposure does with time difference. The exposure and network model discussed here treat reviews as the basic entities (i.e. nodes in the network), and not reviewers. This is important in order to incorporate temporal considerations into the model. A review is an instantaneous event, and exposure for the purpose of convention propagation is only relevant at the moment of the review. Thus, we cannot discuss absolute exposure between reviewers (who write multiple reviews at different times), but only exposure at specific times, which is equivalent to discussing exposure between reviews.

3. Features

The basic features (attributes) we use for learning and predicting adoption of convention c are described in the table below. They divide into three categories: the extent of exposure of the review r to convention c (features 1, 2), the bias of user(r) at time(r) (features 3, 4), and the bias of the convention c at time(r) (features 5-9). The last category also includes features that capture the embedding of the sub-graph G_c induced by the reviews that adopted c within the general exposure network G (features 7-9). These features capture the continuity and linearity of the spread of the convention within the exposure network, as an implicit measure of correlation between exposure and convention propagation for that convention. For more compact formulation, the following sets are defined:

T_r = { x : time(x) < time(r) }
S[r, c] = { x ∈ T_r : x uses c }   for every c ∈ C
E[r] = { (x, y) ∈ T_r × T_r : (x → y) ∈ E }

These sets are in turn used for feature formulation, along with the in- and out-neighbor sets:

In[x] = { y : (y → x) ∈ E } ;  out[x] = { y : (x → y) ∈ E }
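Before listing the features, the exponential edge weights w(r' → r) defined above can be sketched as follows. The decay constants here are illustrative assumptions: the report only states that product-based (rank) decay is far steeper than user-based (time) decay, not the actual values.

```python
import math

# Illustrative decay constants (assumptions, not the report's values):
ALPHA_TIME = 0.01   # user-based exposure: decay per unit of time difference
BETA_RANK = 0.5     # product-based exposure: decay per unit of rank difference

def edge_weight(dt=None, drank=None):
    """Weight of an edge (r' -> r) in the exposure network.

    Pass dt = time(r) - time(r') for user-based exposure, or
    drank = rank(r) - rank(r') for product-based exposure.
    """
    if dt is not None:                      # user-based exposure
        return math.exp(-ALPHA_TIME * dt)
    return math.exp(-BETA_RANK * drank)     # product-based exposure
```

With these constants, an exposure one rank away already carries noticeably less weight than an exposure one time unit away, matching the observation that product-based propagation dies off much faster.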
1. Extent of user-based exposure of review r to convention c:
   score_1(r, c) = Σ_{x ∈ Exp_U[r], x uses c} w(x → r) / Σ_{x ∈ Exp_U[r]} w(x → r)

2. Extent of product-based exposure of review r to convention c:
   score_2(r, c) = Σ_{x ∈ Exp_B[r], x uses c} w(x → r) / Σ_{x ∈ Exp_B[r]} w(x → r)

3. The fraction of conventions adopted by user(r) up to time(r):
   score_3(r) = |{ c ∈ C : ∃x ∈ Exp_U[r] s.t. x uses c }| / |C|

4. The fraction of reviews by user(r) to adopt a convention up to time(r), maximized over all possible conventions:
   score_4(r) = max_{c ∈ C} |{ x ∈ Exp_U[r] : x uses c }| / |Exp_U[r]|

5. Likelihood of the convention to get adopted at time(r), i.e. the fraction of reviews that adopted c by time(r):
   score_5(r, c) = |S[r, c]| / |T_r|

6. Likelihood of propagation of c given that a review is exposed to c, i.e. the weighted fraction at time(r) of exposures (edges in the network) that represent propagations of c:
   score_6(r, c) = Σ_{(x, y) ∈ E[r], x uses c, y uses c} w(x → y) / Σ_{(x, y) ∈ E[r]} w(x → y)

7. The fraction of adoptions at time(r) that serve as sources in G_c (i.e. start a propagation flow):
   score_7(r, c) = |{ x ∈ S[r, c] : In[x] ∩ S[r, c] = ∅ }| / |S[r, c]|

8. The fraction of adoptions at time(r) that serve as sinks in G_c (i.e. end a propagation flow):
   score_8(r, c) = |{ x ∈ S[r, c] : out[x] ∩ S[r, c] = ∅ }| / |S[r, c]|

9. Average propagation fan-out at time(r), i.e. the average fraction of adopters among the out-neighbors of an adopter:
   score_9(r, c) = (1 / |S[r, c]|) · Σ_{x ∈ S[r, c]} |out[x] ∩ S[r, c]| / |out[x]|

The exposure network G is a massive graph of over 1.5M nodes and over 600M edges, with attribute data on both nodes and edges. Even in a compact binary representation, the object representing G is over 30GB in size. Therefore it is crucial that all feature extraction be performed extremely efficiently. Many of the features depend on a sum of the form Σ_{x : time(x) < time(r)} F(x). When computed naively, such a sum requires O(N²) steps, where N is the number of reviews, which makes the computation infeasible for such a large graph.
However, by visiting reviews in chronological order and using dynamic programming to compute the features incrementally, we were able to extract all features in linear time O(N). For programming convenience and computational efficiency, our analysis also only addresses reviews for which all feature values are available by traversing only the edges, i.e. reviews that serve both as a source node and as a destination node of some edge (in-degree and out-degree of at least one). This still results in a massive dataset of 829,066 reviews.
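Features of this kind are ratios of running counts over all earlier reviews, so a single chronological sweep with a pair of counters replaces the naive per-review rescan. As a minimal sketch of the incremental scheme, here is score_5 (fraction of earlier reviews that adopted c) computed in one pass; the function and argument names are hypothetical:

```python
def convention_frequency_feature(reviews_in_time_order, uses_convention):
    """score_5(r, c) = |S[r, c]| / |T_r|, computed in one chronological pass.

    reviews_in_time_order: iterable of review ids, oldest first.
    uses_convention: callable rid -> bool, whether the review's text uses c.
    Returns a dict rid -> feature value (0.0 when there are no earlier reviews).
    """
    scores = {}
    adopters = 0   # running |S[r, c]|: earlier reviews that used c
    seen = 0       # running |T_r|: all earlier reviews
    for rid in reviews_in_time_order:
        # The feature must only reflect strictly earlier reviews,
        # so record the score before counting the current review.
        scores[rid] = adopters / seen if seen else 0.0
        if uses_convention(rid):
            adopters += 1
        seen += 1
    return scores
```

Each review is visited once and each update is O(1), giving O(N) overall instead of the O(N²) cost of recomputing the sum from scratch for every review.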
4. Learning and Evaluation

We processed the dataset, constructed the exposure network G, and extracted the feature values using SNAP, a high-performance library for analysis of massive networks (http://snap.stanford.edu/snap/). We learn adoption of the following set of conventions: A, M, S, T, decent, stick, cream, grass, butter, skunk. The first four are conventions for abbreviations in the BeerAdvocate community (standing for Aroma, Mouthfeel, Smell and Taste). We only counted appearances of these abbreviations in the text when they were used as conventions: capitalized and followed by a colon or a hyphen. The rest are commonly repeated, non-obvious word roots used in beer descriptions. The frequencies of these conventions across the entire dataset are given in the following table:

A       M      S      T      decent  stick  cream  grass  butter  skunk
30.79%  8.03%  30.3%  9.98%  .34%    .04%   8.49%  7.66%  .7%     .03%

We partitioned the dataset (829,066 reviews) into a training set containing 70% of the data points (580,346 reviews) and a test set containing the remaining 30% (248,720 reviews). We then trained an SVM (a liblinear SVM with a linear kernel and L2 regularization) to learn and predict the adoption of the above conventions using the features extracted from the exposure network. Our baseline for evaluation was naive prediction based on convention frequency: a prediction scheme that considers each review r and convention c independently and predicts that r will use c with probability p, the frequency of convention c across the entire dataset. We compare accuracy, precision and recall for both prediction methods for each of the conventions. The results are described in the following figures:
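The frequency baseline described above amounts to flipping a biased coin per (review, convention) pair, with the convention's global frequency as the bias. A minimal sketch, with hypothetical names:

```python
import random

def baseline_predict(conv_freq, n_reviews, rng=None):
    """Naive frequency baseline for one convention.

    Predicts adoption (1) for each review independently with probability
    conv_freq, the convention's frequency across the entire dataset.
    Returns a list of 0/1 predictions, one per review.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    return [1 if rng.random() < conv_freq else 0 for _ in range(n_reviews)]
```

Because predictions are independent of the reviews themselves, for a convention of frequency p both the expected precision and the expected recall of this baseline are p, which is why it degrades badly on rare conventions.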
Accuracy is more significant for the higher-frequency conventions than for the lower-frequency ones. But even then, the positive and negative classes are highly skewed (the vast majority of reviews do not use a given convention). Thus, precision and recall should be taken into account as more significant performance metrics for the prediction scheme. We can see that the SVM performs surprisingly well when a convention appears frequently enough in the dataset. It also seems that the SVM predicts more conservatively (is less prone to assign a positive label) than the baseline when the convention has low frequency, which explains the low recall for infrequent conventions. This can be attributed to the fact that the SVM captures dependency between the data points, whereas our baseline predicts for each point independently. The number of training examples far exceeds the number of features used, so it makes sense to use more features and construct a richer model by mapping the existing nine features into a higher-dimensional feature space using a kernel. However, trying to do so over the entire dataset (using libsvm SVMs with a higher-degree polynomial kernel) proved extremely computationally expensive. We therefore compared the precision and recall scores for the most commonly used conventions using an SVM with a linear kernel vs. one with a 3rd-degree polynomial kernel (both using L2 regularization) on a sample of 20K reviews. Despite our initial assumption, the results showed that the performance of the two kernels is very similar (we didn't try higher-degree kernels due to long runtimes):

5. Conclusion and Potential Future Work

We were surprised to see how well classification based on exposure network features performed in comparison to naive frequency-based prediction.
Yet, we believe that exposure-network-based features can be further improved, for instance by using statistical inference to determine values for the network edge weights, or by leveraging network statistics (for instance degree distribution or weakly-connected component decomposition) of the entire graph and of the convention-adopting sub-graphs. Additionally, such prediction could be further improved by incorporating non-network-based features, such as user and product bias, and semantic/linguistic features of the conventions.