Restaurants Review Star Prediction for Yelp Dataset


Mengqi Yu, UC San Diego, mey004@eng.ucsd.edu
Meng Xie, UC San Diego, m6xe@eng.ucsd.edu
Wenjia Ouyang, UC San Diego, weoyang@eng.ucsd.edu

ABSTRACT
Yelp connects people to great local businesses. In this paper, we focus on the reviews for restaurants. We aim to predict the rating of a restaurant from previous information, such as the review text, the user's review history, as well as the restaurant's statistics. We investigate the dataset provided by the Yelp Dataset Challenge round 5. In this project, we predict the star (rating) of a review. Three machine learning algorithms are used, linear regression, random forest regression and the latent factor model, combined with sentiment analysis. After analyzing the performance of each model, the best model for predicting the ratings from reviews is the random forest regression. We also found that sentiment features are very useful for rating prediction.

Keywords
Yelp rating prediction; data mining; sentiment analysis

1. INTRODUCTION
The rating stars of businesses strongly influence people when making a choice among all available businesses. They will choose the one with the higher review rating in order to ensure the quality of service they will get from the business. In a Yelp search, a star rating is arguably the first influence on a user's judgment. Located directly beneath business names, the 5-star meter is a critical determinant of whether the user will click to find out more, or scroll on. In fact, economic research has shown that star ratings are so central to the Yelp experience that an extra half star allows restaurants to sell out 19% more frequently [1]. Currently, a Yelp star rating for a particular business is the mean of all star ratings given to that business. However, it is necessary to consider the implications of representing an entire business by a single star rating. What if one user cares only about food, but a particular restaurant's page has a 1-star rating with reviewers complaining about poor service that ruined their delicious meal? [3] The user may well continue to search for other restaurants, when the 1-star restaurant may have been ideal. In a data mining project, different feature selections change the accuracy significantly in learning models, so it is vital to find out the important features influencing the review rating star.

Although there are many datasets that could be used to study and learn models for businesses or similar items, such as Google communities and the Netflix dataset, we selected the Yelp dataset because it has more information about the users, reviews and businesses that can be investigated for feature selection and model building, which helps produce more accurate predictions. In this project, we predict the star (rating) of a review. Three machine learning algorithms are used, linear regression, random forest regression and the latent factor model, combined with sentiment analysis. We analyze the performance of each of these models to come up with the best model for predicting the ratings from reviews. We use the dataset provided by Yelp for training, validation, and testing of the models.

2. DATASET CHARACTERISTICS
2.1 Properties
In our project, we used the deep dataset of Yelp Dataset Challenge round 5, which is available at the Yelp website. The Challenge Dataset has 1.6M reviews and 500K tips by 366K users for 61K businesses; 481K business attributes, e.g., hours, parking availability, ambience; a social network of 366K users with a total of 2.9M social edges; and aggregated check-ins over time for each of the 61K businesses.
More specifically, it includes the data of businesses, reviews and users, which contains the information shown in Table 1. All the data are in JSON format.

Table 1: Yelp Dataset attributes
Name       Attributes
Business   Business Name, Id, Category, Location, etc.
User       Name, Review Count, Friends, Votes, etc.
Review     Date, Business, Stars, Text, etc.

For reviews, the data looks like:
{
  "type": "review",
  "business_id": (encrypted business id),
  "user_id": (encrypted user id),
  "stars": (star rating, rounded to half-stars),
  "text": (review text),
  "date": (date),
  "votes": {(vote type): (count)}
}
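Since each line of the review dump is a standalone JSON object, the file can be streamed with Python's standard json module. The following is a minimal sketch under our own assumptions: the file name and the choice of fields to keep are ours, not prescribed above.

import json

# Minimal sketch: stream the review file (one JSON object per line) and keep
# only the fields used later in this project. The file name is an assumption.
def load_reviews(path="yelp_academic_dataset_review.json"):
    reviews = []
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            reviews.append({
                "user_id": r["user_id"],
                "business_id": r["business_id"],
                "stars": r["stars"],
                "text": r["text"],
            })
    return reviews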

Figure 1: Restaurant locations distributed

2.2 Businesses Statistics
There are multiple types of businesses, such as Doctors, Health & Medical, Active Life, Cocktail Bars and Nightlife, but we have focused only on restaurant reviews, as they account for a major part of the dataset, holding 35.78% of total businesses. Figure 1 shows all restaurant locations globally. It gives us an initial idea about where these restaurants are: most restaurants are located in the US. Figure 2 then shows the restaurant distribution by state. It surprised us that all restaurants in the given dataset are located in only seven states, PA, NC, NV, WI, IL, AZ and SC, and that Nevada has the largest number of restaurants.

Figure 2: Restaurant locations distributed by state
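The restaurant subset and the per-state counts described above can be reproduced by scanning the business file. The sketch below is only an illustration under our assumptions about the business JSON fields ("categories", "state"), which Table 1 summarizes loosely.

import json
from collections import Counter

# Sketch: keep businesses whose category list contains "Restaurants" and count
# them per state. Field names are assumptions based on the business JSON.
def restaurant_stats(path="yelp_academic_dataset_business.json"):
    total = 0
    by_state = Counter()
    with open(path) as f:
        for line in f:
            b = json.loads(line)
            total += 1
            if "Restaurants" in (b.get("categories") or []):
                by_state[b.get("state", "??")] += 1
    share = sum(by_state.values()) / total if total else 0.0
    return share, by_state  # the paper reports a 35.78% restaurant share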

2.3 Review Statistics
The review data is the most interesting part of the dataset and includes the most information; therefore, we plan to predict the star a review will give. The restaurant reviews make up 63% of total reviews, which amounts to around 1 GB of data. First, we try to figure out the distribution of review ratings. From Figure 5, we found that the review distribution in our dataset is heavily skewed toward the 4 and 5 star categories. Together they make up around 70% of the distribution, whereas the 1, 2, and 3 star categories are each only around 10-15% at most. This is confirmed by a separate analysis by Max Woolf [8] on 1 and 5 star reviews which showed, excellent visualization aside, that Yelp reviews have started to appear more optimistically biased as time passes. This skewed distribution is reflected in our sample dataset, and the uneven class distribution will become noteworthy further on in our predictive task.

Figure 5: Rating stars distribution

We also plotted how many reviews a restaurant received by US state, as shown in Figure 6. It follows common sense that, in general, most restaurants do not receive many reviews (fewer than 50), and only some popular restaurants received more than 100 reviews.

Figure 6: Number of reviews a restaurant received by US state

Furthermore, since text is an important part of reviews, we generated the word cloud [5] of all 1-star review text and all 5-star review text, shown in Figure 7 and Figure 8 respectively. The word cloud shows the most frequent single words after removing stop words. These two figures were expected to be very different; however, we found they share some words, such as food, place, good and restaurant. This is caused by counting only single-word (unigram) frequencies, which cannot tell the difference between good and not good. Bigrams and trigrams may be a better choice for representing the content of reviews, as is sentiment analysis of the review text. We believe these could reveal more accurate features of the review text.

Figure 7: WordCloud of 1-star reviews
Figure 8: WordCloud of 5-star reviews

2.4 User Statistics
In the user dataset, it is interesting to find that 88% of users have given fewer than 60 reviews. Figure 3 shows the distribution of how many reviews a user has given. It clearly shows that most users give only a few reviews (fewer than 10), while a few users provide a huge number of reviews (more than 50), which reveals that different groups of users have different habits, so we could separate them into groups to get more appropriate features. Moreover, by analyzing the time new users joined and the time each review was written in Figure 4, we learned that most users joined Yelp around 2010, and they continue contributing an increasing number of reviews year by year.

Figure 3: Distribution of how many reviews a user has given
Figure 4: Number of new users joined and number of reviews

3. RATING PREDICTION PROBLEM SETUP
In this project, we planned to train a model to predict users' ratings of businesses. The motivation is that if we can predict how a user is going to rate a business, then we can recommend the businesses that the user is more likely to rate higher than the others. The Netflix Prize dataset has just the user id, item id and the corresponding rating; the difference is that the Yelp dataset has more information than the Netflix dataset. Therefore, we can try not only the latent factor model but also some feature-based models. In the traditional rating prediction problem, we try to make our prediction \hat{R}_{u,i} for user u and item i as close as possible to R_{u,i}, the rating user u gives. Usually, the Mean Squared Error (MSE) is employed to compare the performance of the different models:

MSE = \frac{1}{|T|} \sum_{(u,i) \in T} (\hat{R}_{u,i} - R_{u,i})^2        (1)

where T is the set of ratings being evaluated.
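Equation (1) is simple to compute once predictions have been collected; the following is a minimal numpy sketch (the function name and example values are ours).

import numpy as np

def mse(predictions, labels):
    """Mean Squared Error as in Equation (1)."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((predictions - labels) ** 2))

# Example: predicting a constant 3.7 stars for three held-out reviews.
print(mse([3.7, 3.7, 3.7], [5.0, 4.0, 2.0]))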

4. MODEL SELECTION
According to the data analysis, the rating of new reviews should be predicted. There are several features we found useful for comparing the different models and choosing the best model for prediction. The features are: the average rating stars for each user based on the review stars of the training dataset (uRate); the average rating stars for each business based on the review stars of the training dataset (bRate); the count of reviews the user has made (rCount); the length of the review text (tLen); the polarity of the sentiment of the review text (tPol); the subjectivity of the sentiment of the review text (tSub); the average rating stars for each user, which can be obtained from the dataset directly (uAvg); and the average rating stars for each business, which can be obtained from the dataset directly (bAvg). The data was separated into three parts: training dataset (80%), validation dataset (10%) and test dataset (10%).

4.1 Baseline
The baseline model is straightforward: we use a user's previous average given star to predict their future review stars. That is, based on the training data, an average star for each user is computed by averaging all their previous review stars. Then these averages are used to predict the test dataset. When a user in the test dataset has not been seen before, we use the global average star instead.

4.2 Linear Regression
Linear regression is our first model for the prediction. It calculates the weight of each feature and multiplies the weights with the features to make the prediction. The equation is:

y = X\theta + \epsilon        (2)

where y, \theta and \epsilon are vectors and X is a matrix; y is the labels, \theta is the weights, \epsilon is the noise, and X holds the features of the samples. With the features mentioned above, we applied linear regression to the training dataset. The features are rCount, tLen, tPol and tSub. The label is set to (review star - (uAvg+bAvg)/2.0), where y[i] is review star[i] and \epsilon[i] is (uAvg+bAvg)/2.0. With these features and labels, the linear regression is fit using numpy.linalg.lstsq. Then (features \cdot \theta + (uAvg+bAvg)/2.0), which corresponds to the equation above, is used to make predictions for users and businesses seen in the training dataset. Otherwise, we use (features \cdot \theta + averageRate) for the prediction, where averageRate is the average rating star over all training reviews. Furthermore, in this model we also add a correction that sets the predicted value to 5.0 if the calculated value is larger than 5.0, since rating stars are never larger than 5.

4.3 Random Forest Regression
A better but more complex way to make the prediction is random forest regression. For each review in the training data, the features are chosen as uAvg, bAvg, rCount, tLen, tPol and tSub. The label is set to the review star. With these features and labels, we applied random forest regression using sklearn.ensemble.RandomForestRegressor. For users or businesses not seen in the training data, we use averageRate instead of uAvg and bAvg, and then call predict(features) to make the prediction. In this model, we compared the validation MSE with different parameters (n_estimators and max_depth); the evaluation of the model is discussed in Section 5. Furthermore, just as for linear regression, we add a correction that sets the predicted value to 5.0 if the calculated value is larger than 5.0.
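The two feature-based models above can be sketched as follows. The feature-matrix layout and the helper names are ours; the fitting calls (numpy.linalg.lstsq and sklearn.ensemble.RandomForestRegressor), the residual label for the linear model and the 5.0 cap follow Sections 4.2 and 4.3, and the default forest parameters use the best values reported later in Section 5.3.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X_lin: columns [rCount, tLen, tPol, tSub]; offset is (uAvg + bAvg) / 2.
def fit_linear(X_lin, stars, offset):
    theta, *_ = np.linalg.lstsq(X_lin, stars - offset, rcond=None)
    return theta

def predict_linear(X_lin, theta, offset):
    pred = X_lin.dot(theta) + offset
    return np.clip(pred, None, 5.0)  # correction from Section 4.2

# X_rf: columns [uAvg, bAvg, rCount, tLen, tPol, tSub]; label is the raw star.
def fit_forest(X_rf, stars, n_estimators=200, max_depth=12):
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_rf, stars)
    return model

def predict_forest(model, X_rf):
    return np.clip(model.predict(X_rf), None, 5.0)  # same 5.0 cap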
4.4 Latent Factor Model
This model has been discussed in the lectures and is very popular in recommender systems. The basic idea behind this model is to project the user's preferences and the item's properties into a low-dimensional space. The model can be formulated as:

\hat{R}_{u,i} = \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i        (3)

where \alpha is the global average rating, \beta_u and \beta_i are the bias terms for user u and item i respectively, and \gamma_u and \gamma_i are the interaction terms for user u and item i respectively. The optimization can be formulated as:

\arg\min_{\alpha,\beta,\gamma} obj(\alpha,\beta,\gamma) = \arg\min_{\alpha,\beta,\gamma} \sum_{u,i} (\alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i - R_{u,i})^2 + \lambda_1 [\sum_u \beta_u^2 + \sum_i \beta_i^2] + \lambda_2 [\sum_u \|\gamma_u\|_2^2 + \sum_i \|\gamma_i\|_2^2]        (4)

This objective function is obviously not convex, but we can find an approximate solution. The update procedure is described below:

1. Fix \gamma_i and solve \arg\min_{\alpha,\beta,\gamma_u} obj(\alpha,\beta,\gamma).
2. Fix \gamma_u and solve \arg\min_{\alpha,\beta,\gamma_i} obj(\alpha,\beta,\gamma).
3. Repeat 1) and 2) until convergence.

The update rules for step 1 (with \gamma_i fixed) are:

\alpha^{(t+1)} = \frac{\sum_{(u,i) \in R} (R_{u,i} - \beta_u^{(t)} - \beta_i^{(t)} - \gamma_u^{(t)} \cdot \gamma_i)}{N_{train}}        (5)

\beta_u^{(t+2)} = \frac{\sum_{i \in I_u} (R_{u,i} - \alpha^{(t+1)} - \beta_i^{(t+1)} - \gamma_u^{(t+1)} \cdot \gamma_i)}{\lambda_1 + |I_u|}        (6)

\beta_i^{(t+3)} = \frac{\sum_{u \in U_i} (R_{u,i} - \alpha^{(t+2)} - \beta_u^{(t+2)} - \gamma_u^{(t+2)} \cdot \gamma_i)}{\lambda_1 + |U_i|}        (7)

\forall k \in [0 \ldots m], \quad \gamma_{u,k}^{(t+4)} = \frac{\sum_{i \in I_u} \gamma_{i,k} (R_{u,i} - \alpha^{(t+3)} - \beta_u^{(t+3)} - \beta_i^{(t+3)} - \sum_{j \neq k} \gamma_{u,j}^{(t+3)} \gamma_{i,j})}{\lambda_2 + \sum_{i \in I_u} \gamma_{i,k}^2}        (8)

where m is the number of dimensions of \gamma. The update rules for step 2 (with \gamma_u fixed) are symmetric:

\alpha^{(t+1)} = \frac{\sum_{(u,i) \in R} (R_{u,i} - \beta_u^{(t)} - \beta_i^{(t)} - \gamma_u \cdot \gamma_i^{(t)})}{N_{train}}        (9)

\beta_u^{(t+2)} = \frac{\sum_{i \in I_u} (R_{u,i} - \alpha^{(t+1)} - \beta_i^{(t+1)} - \gamma_u \cdot \gamma_i^{(t+1)})}{\lambda_1 + |I_u|}        (10)

\beta_i^{(t+3)} = \frac{\sum_{u \in U_i} (R_{u,i} - \alpha^{(t+2)} - \beta_u^{(t+2)} - \gamma_u \cdot \gamma_i^{(t+2)})}{\lambda_1 + |U_i|}        (11)

\forall k \in [0 \ldots m], \quad \gamma_{i,k}^{(t+4)} = \frac{\sum_{u \in U_i} \gamma_{u,k} (R_{u,i} - \alpha^{(t+3)} - \beta_u^{(t+3)} - \beta_i^{(t+3)} - \sum_{j \neq k} \gamma_{u,j} \gamma_{i,j}^{(t+3)})}{\lambda_2 + \sum_{u \in U_i} \gamma_{u,k}^2}        (12)

where m is the number of dimensions of \gamma, I_u is the set of items rated by user u, and U_i is the set of users who rated item i.
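For prototyping, the parameterization in Equation (3) can also be trained with plain stochastic gradient descent on the objective in Equation (4). The sketch below is that simplification, not the closed-form alternating updates above, and all names are ours.

import numpy as np
from collections import defaultdict

def train_latent_factor(ratings, m=8, lam1=1e-5, lam2=1e-4, lr=0.01, epochs=20):
    """ratings: list of (user_id, item_id, star) triples; SGD on Eq. (4)."""
    rng = np.random.default_rng(0)
    alpha = float(np.mean([r for _, _, r in ratings]))  # kept at the global mean
    beta_u = defaultdict(float)
    beta_i = defaultdict(float)
    gamma_u = defaultdict(lambda: rng.normal(0.0, 0.01, m))
    gamma_i = defaultdict(lambda: rng.normal(0.0, 0.01, m))
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = alpha + beta_u[u] + beta_i[i] + gamma_u[u].dot(gamma_i[i])
            err = r - pred
            beta_u[u] += lr * (err - lam1 * beta_u[u])
            beta_i[i] += lr * (err - lam1 * beta_i[i])
            gu_old = gamma_u[u].copy()
            gamma_u[u] += lr * (err * gamma_i[i] - lam2 * gamma_u[u])
            gamma_i[i] += lr * (err * gu_old - lam2 * gamma_i[i])
    return alpha, beta_u, beta_i, gamma_u, gamma_i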

5. EVALUATION
5.1 Baseline Model
We use the baseline model described in Section 4.1 and evaluate it on the validation dataset and the test dataset; its test MSE is reported in Table 6.

5.2 Linear Regression
Feature selection is a key step in the linear regression model. We investigated different feature sets and calculated the MSE on the validation dataset. The different feature sets and their results are listed in Table 2, which shows how well these feature representations worked. In this table, (uAvg+bAvg)/2 is used in the label, as explained in detail in Section 4.2. According to the results, the features uAvg, bAvg, tLen, rCount, tPol and tSub improve the accuracy, while the business review count does not. The minimum validation MSE occurs when the features are rCount, tLen, tPol and tSub, with the rating computed from (uAvg+bAvg)/2. We therefore chose this model for the test dataset; its test MSE is reported in Table 6.

Table 2: Feature Selection
No.  Features                        MSE (validation)
(1)  (uAvg+bAvg)/2, tLen
(2)  (1) + rCount
(3)  (1) + tPol, tSub
(4)  (uAvg+bAvg)/2, rCount
(5)  (4) + tPol, tSub
(6)  (uAvg+bAvg)/2, tPol, tSub
(7)  (6) + rCount, tLen
(8)  (7) + business review count

5.3 Random Forest Regression
For the random forest regression, we varied the parameters n_estimators and max_depth when training on the training dataset and computed the MSE on the validation dataset. As the results in Table 3 show, the larger the n_estimators and the max_depth, the better the performance. In our experiments, the minimum validation MSE is achieved with n_estimators = 200 and max_depth = 12. We therefore chose this model for the test dataset; its test MSE is reported in Table 6.

Table 3: Parameter Selection
n_estimators   max_depth   MSE (validation)

5.4 Latent Factor Model
The latent factor model is effective and widely used in prediction tasks. In this model, we have three parameters to tune: \lambda_1 and \lambda_2, the regularization parameters in Equation (4), and m, the dimension of the latent factor. First, we fix the dimension of the latent factor to 8 and tune \lambda_1 and \lambda_2. Our observation is that the result is more sensitive to the value of \lambda_2. The performance with different \lambda_1 and \lambda_2 is shown in Table 4.

Table 4: Regularization Parameter
\lambda_2 \ \lambda_1   1e-3   1e-5   1e-7   1e-9   1e-11

We then fix \lambda_1 and \lambda_2 to the pair that performs best and tune the parameter m. The results with different m are shown in Table 5.

Table 5: Dimension Selection
m   MSE (validation)

The results show that the higher the dimension, the better the performance, although the performance does not change much. Finally, we choose the parameters with the best performance, i.e. \lambda_1 = 1e-5, \lambda_2 = 1e-4 and m = 64, and test the model on the test data; its test MSE is reported in Table 6.
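The parameter searches in Sections 5.2-5.4 all follow the same recipe: fit on the training split, score MSE on the validation split, and keep the best setting. Below is a sketch of the random forest sweep behind Table 3; the grid values are illustrative assumptions, apart from the best pair (200, 12) reported above.

import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def sweep_forest(X_train, y_train, X_val, y_val,
                 n_estimators_grid=(50, 100, 200), max_depth_grid=(6, 9, 12)):
    best_params, best_mse = None, float("inf")
    for n, d in itertools.product(n_estimators_grid, max_depth_grid):
        model = RandomForestRegressor(n_estimators=n, max_depth=d)
        model.fit(X_train, y_train)
        val_mse = float(np.mean((model.predict(X_val) - y_val) ** 2))
        if val_mse < best_mse:
            best_params, best_mse = (n, d), val_mse
    return best_params, best_mse  # Section 5.3 reports (200, 12) as the best pair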

5.5 Comparison Between Models
In this project, we evaluated the baseline, linear regression, random forest regression and latent factor models. Linear regression costs the least to train and needs features to make the prediction, but it may overfit the training data. The training cost of the random forest regression model mainly depends on two parameters: 1) the depth of the trees and 2) the number of estimators. It accepts a large set of features and can also improve performance by randomly selecting features from the set. The latent factor model costs a lot, especially when the dimension of the latent factor is high, since it needs to iterate over multiple parameters until the objective function converges; on the other hand, it does not need any features to make the prediction. One disadvantage of the latent factor model is that when the dataset is not large enough to estimate all the parameters, its performance becomes worse. For each model, after comparing different parameters or features, we choose the configuration with the minimum MSE on the validation dataset to make predictions on the test dataset, in order to prevent overfitting the training dataset. The MSEs for the test dataset are shown in Table 6.

Table 6: Comparison between MSEs for test data
Model                        MSE (test data)
Baseline
Linear Regression
Random Forest Regression
Latent Factor Model

According to these statistics, the best model in this project, based on the MSE comparison on the test data, is the random forest regression model. The reasons for its good performance may include that this Yelp dataset provides a lot of useful features, and that the random forest can prevent overfitting. Besides, the sentiment features and the average rating stars of the user and the business may strongly influence the review star. Since the latent factor model does not use features, that may be why random forest regression performs better.

6. RELATED WORK
6.1 SVD++
Koren [4] proposed an algorithm called SVD++ during the Netflix Prize competition, and his team won the Netflix Prize by outperforming Netflix's own recommendation algorithm by 10%. The model can be formulated as:

\hat{R}_{u,i} = \mu + b_u + b_i + q_i^T (p_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j)        (13)

where \hat{R}_{u,i} is the estimated rating of user u for item i, b_u and b_i are the bias terms for a user and an item respectively, q_i and p_u are the latent factors for an item and a user respectively, N(u) is the set of implicit information (the set of items user u has rated), and y_j is the impact factor associated with the j-th item that user u has rated. Koren exploited both explicit and implicit feedback from users. The explicit feedback is the same as what we use in the latent factor model: users' ratings. Implicit feedback means whether users rate a restaurant and how many times they have rated; in other words, users reveal their preferences through whether and how often they rate. The cost function to minimize is:

\min \sum_{u,i} (R_{u,i} - \hat{R}_{u,i})^2 + \lambda_1 (\|p_u\|^2 + \|q_i\|^2) + \lambda_2 (b_u^2 + b_i^2) + \lambda_3 \sum_{j \in N(u)} \|y_j\|^2        (14)

We can exploit stochastic gradient descent (SGD) to train the model. With e_{u,i} = R_{u,i} - \hat{R}_{u,i} and learning rate \gamma, the update rules are listed below:

b_u \leftarrow b_u + \gamma (e_{u,i} - \lambda_2 b_u)        (15)

b_i \leftarrow b_i + \gamma (e_{u,i} - \lambda_2 b_i)        (16)

p_u \leftarrow p_u + \gamma (e_{u,i} q_i - \lambda_1 p_u)        (17)

q_i \leftarrow q_i + \gamma (e_{u,i} (p_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j) - \lambda_1 q_i)        (18)

\forall j \in N(u), \quad y_j \leftarrow y_j + \gamma (e_{u,i} q_i |N(u)|^{-\frac{1}{2}} - \lambda_3 y_j)        (19)

6.2 Sentiment Analysis
Current sentiment analysis methods can be grouped into four main categories: keyword spotting, lexical affinity, statistical methods and concept-level techniques [2]. Keyword spotting classifies text by affect categories based on the presence of unambiguous affect words such as happy, sad, afraid, and bored [6]. Lexical affinity not only detects obvious affect words, it also assigns arbitrary words a probable affinity to particular emotions [7]. Statistical methods leverage elements from machine learning. Concept-level approaches leverage elements from knowledge representation and, hence, are also able to detect semantics that are expressed in a subtle manner. The Python library TextBlob is used in this project. It is a library that helps with processing textual data and provides an API for natural language processing. For the sentiment analysis task, TextBlob returns a named tuple with float fields polarity and subjectivity. The polarity score is a float in the range [-1.0, 1.0]: the more positive the value, the more positive the input text, and vice versa, more negative input text leads to a more negative polarity score. For example, negative text may include words like hate, disappointed, dislike, annoyed, awful, etc., while positive text may include words like excellent, great, best, etc. The subjectivity score is also a float, but in the range [0.0, 1.0], where 0.0 indicates the most objective text and 1.0 the most subjective. For example, subjective text may have a high frequency of words like I, we, our, etc., while objective text has a very low frequency of those words.
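The two sentiment features (tPol and tSub) come directly from TextBlob's sentiment property; a minimal sketch (the function name and example text are ours):

from textblob import TextBlob

def sentiment_features(text):
    """Return (polarity, subjectivity), used as the tPol and tSub features."""
    s = TextBlob(text).sentiment
    return s.polarity, s.subjectivity

# Example: a clearly negative, fairly subjective review fragment.
print(sentiment_features("The service was awful and I was really disappointed."))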
7. CONCLUSION
The Yelp dataset has much more information than similar rating-based datasets, such as the Netflix dataset. It allows us to explore the correlation between each attribute and the rating, and to employ these relations to predict more accurately. In this paper, we have trained several models including linear regression, random forest and the latent factor model. We have also exploited text mining techniques such as sentiment analysis to build our features. We compared the parameter settings for each model to tune the performance and to control the model complexity. Finally, we applied the models to the test dataset and compared their performance. The results show that the random forest model outperforms the others, since it uses reasonable features extracted from the rich information in the dataset.

8. FUTURE WORK
Due to limited time, we did not have the chance to try some models such as Support Vector Machines, k-nearest neighbors and stochastic gradient descent. In the future, we will also try to develop more text mining models, such as bigram, trigram and TF-IDF features, and we may predict the review star using the review text alone. Moreover, Yelp may release larger datasets that include more businesses outside the US; then we could analyze users' habits in different countries. At the same time, we could extend our review star prediction to all businesses, not only restaurants.

9. REFERENCES
[1] M. Anderson and J. Magruder. Learning from the crowd.
[2] E. Cambria, B. Schuller, Y. Xia, and C. Havasi. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, (2):15-21.
[3] M. Fan and M. Khademi. Predicting a business star in Yelp from its reviews text alone. arXiv preprint.
[4] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
[5] A. Mueller. A wordcloud in Python. Andy's Computer Vision and Machine Learning Blog, June.
[6] A. Ortony, G. L. Clore, and A. Collins. The Cognitive Structure of Emotions. Cambridge University Press.
[7] R. A. Stevenson, J. A. Mikels, and T. W. James. Characterization of the affective norms for English words by discrete emotional categories. Behavior Research Methods, 39(4).
[8] M. Woolf. The statistical difference between 1-star and 5-star reviews on Yelp. September.
