CS545 Project: Conditional Random Fields on an ecommerce Website

Brock Wilcox
December 18, 2013

Contents

1 Conditional Random Fields
  1.1 Overview
  1.2 CRFSuite
2 Inspirational Work
  2.1 Web Page Prediction Based on CRFs
  2.2 Conditional Random Fields for Activity Recognition
  2.3 Emotion Classification Using Web Blog Corpora
3 Experiments
  3.1 General Setup
  3.2 Experiment 1 - Predicting Next-Page-Type
  3.3 Experiment 2 - Next-page Category
  3.4 Experiment 3 - Conversion Prediction
4 Conclusions

1 Conditional Random Fields

1.1 Overview

A Conditional Random Field (CRF) [6] is a machine learning model for labeling each observation in an undirected graph of observations. For the remainder of this paper I'll consider only linear-chain CRFs, which are restricted to linear sequences of observations and labels, as opposed to general CRFs, which can be used on any graph shape. A common example of CRF usage is part-of-speech tagging [8], in which each word in a sentence is labeled with its part of speech (noun, verb, etc.). Using a CRF allows for contextual labeling built on conditional probabilities.

The basic model of a CRF is the probability of an entire sequence of labels $Y$ given a sequence of observations $X$, written $P(Y \mid X)$. This model is constructed directly (discriminative), instead of indirectly as in a naive Bayes or Hidden Markov Model classifier (generative) [5]. That is, in a generative classifier we estimate $P(Y \mid X)$ by calculating $P(X \mid Y)$ and $P(Y)$ and then applying Bayes' rule. In naive Bayes this calculation rests on the assumption that the features are independent of one another given the label for an observation.
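To make the generative route concrete, the following is a minimal sketch of a naive Bayes classifier that estimates $P(Y)$ and $P(x_i \mid Y)$ from counts and then applies Bayes' rule. The toy samples, the feature strings, and the add-one smoothing are all assumptions chosen for illustration; none of them come from the experiments in this paper.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (features of the current page, label).
SAMPLES = [
    (["type=product", "cat=electronics"], "product"),
    (["type=product", "cat=toys"], "product"),
    (["type=cart"], "checkout"),
]
VOCABULARY = {f for features, _ in SAMPLES for f in features}


def train(samples):
    """Count labels and per-label feature occurrences."""
    label_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)
    for features, label in samples:
        feature_counts[label].update(features)
    return label_counts, feature_counts


def posterior(features, label_counts, feature_counts, vocabulary):
    """Return P(y | features) for every label via Bayes' rule."""
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior P(y)
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for f in features:
            # P(x_i | y) with add-one smoothing (an assumption of this sketch);
            # multiplying these terms is the independence assumption.
            score *= (feature_counts[label][f] + 1) / denom
        scores[label] = score
    evidence = sum(scores.values())  # P(X), the normalizer in Bayes' rule
    return {label: s / evidence for label, s in scores.items()}


label_counts, feature_counts = train(SAMPLES)
example = posterior(["type=cart"], label_counts, feature_counts, VOCABULARY)
```

The product over features inside `posterior` is exactly where the independence assumption enters; a discriminative model avoids modeling $P(X \mid Y)$ at all.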
The distribution that models this fully is described by Sutton and McCallum [9], and I summarize the relevant parts here. Let $X = (x_1, x_2, \ldots, x_T)$ be a sequence of observations, and $Y = (y_1, y_2, \ldots, y_T)$ be the corresponding sequence of labels, one for each observation. The overall model is

\[
p(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \right\} \qquad (1)
\]

Here $f_k(y_t, y_{t-1}, x_t)$ is a feature function, defining a value given the current and previous labels as well as the relevant features for time $t$. In the cases I'm working with, features are categorical, so $f_k$ is either 0 or 1. The features for $x_t$ don't have to come exclusively from time $t$, however; they can be drawn from any part of $X$. The $\theta_k$ are the parameters of the distribution, and $Z(X)$ is a normalization function that makes $p(Y \mid X)$ sum to 1 over all possible label sequences.

Both $\theta$ and $Z(X)$ must be computed. Written out naively, $Z(X)$ is a sum over every possible label sequence, which is an exponential number of terms. Fortunately, for a linear chain this sum can be evaluated with dynamic programming rather than by enumeration, and the parameters can be estimated with a variety of algorithms. In the CRF implementation that I'm using, L-BFGS [2] is used to estimate these parameters.

Fit into a larger context, there is a family of probabilistic models that relate to one another. Naive Bayes is the simplest of these models, assigning a single label to each test case based on independent features. Extending naive Bayes to classify a sequence of labels gives the HMM. This can be generalized even further to label a directed graph of labels (generative directed models). If the independence assumption and the directed nature of these models are removed, a corresponding set of models can be derived from the same basic probabilistic equations. Linear-chain CRFs are the conditional version of HMMs, just as logistic regression is the conditional version of naive Bayes.

1.2 CRFSuite

There are several available implementations of CRFs, both standalone and as part of larger machine learning toolkits.
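Before turning to a specific implementation, equation (1) can be checked by brute force on a toy problem. The sketch below enumerates every label sequence to compute the normalizer directly, which is only feasible for tiny inputs. The label set, the feature functions, and the weights are hand-picked assumptions for illustration, not trained values, and `START` is a hypothetical placeholder for the label before the first observation.

```python
import itertools
import math

LABELS = ["product", "exit"]
START = "<start>"  # hypothetical stand-in for y_0

# theta_k for a few binary feature functions (illustrative, not trained).
WEIGHTS = {
    ("trans", "product", "product"): 0.5,  # fires when y_{t-1}=product, y_t=product
    ("trans", "product", "exit"): 0.2,
    ("obs", "type=product", "product"): 1.0,  # fires when x_t=type=product, y_t=product
    ("obs", "type=cart", "exit"): 0.8,
}


def score(labels, observations):
    """Sum over t and k of theta_k * f_k(y_t, y_{t-1}, x_t)."""
    total = 0.0
    previous = START
    for y, x in zip(labels, observations):
        total += WEIGHTS.get(("trans", previous, y), 0.0)
        total += WEIGHTS.get(("obs", x, y), 0.0)
        previous = y
    return total


def all_label_sequences(length):
    return list(itertools.product(LABELS, repeat=length))


def sequence_probability(labels, observations):
    """p(Y | X) from equation (1), with Z(X) computed by enumeration."""
    z = sum(math.exp(score(candidate, observations))
            for candidate in all_label_sequences(len(observations)))
    return math.exp(score(labels, observations)) / z


observations = ["type=product", "type=product", "type=cart"]
candidates = all_label_sequences(len(observations))
best = max(candidates, key=lambda c: sequence_probability(c, observations))
total_probability = sum(sequence_probability(c, observations) for c in candidates)
```

The product of exponentials in equation (1) is computed here as the exponential of a summed score, which is the same quantity. With two labels and three observations there are only eight candidate sequences; the enumeration grows as the number of labels raised to the sequence length, which is why practical implementations avoid it.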
CRFSuite [7] aims to be a fast and simple-to-use implementation, while providing a variety of parameter solvers and integration points. As an example of its simplicity in use, only the relevant features need to be specified for each observation within a sequence. This is unlike another popular implementation, CRF++ [1], in which every feature must be present for every observation.

2 Inspirational Work

I looked at three papers to better understand Conditional Random Fields and to guide my own experiments. The first, Web Page Prediction Based on CRFs [4], attempts to label each item in a sequence of web page interactions with the category of the next page. This is the most similar to what I attempt in my experiments. Next I looked at Conditional Random Fields for Activity Recognition [10], in which the authors train a model to classify different activities of robot-agents in a virtual game of tag. Finally I examine Emotion Classification Using Web Blog Corpora [11], which uses emoticons and user-supplied ratings to categorize the emotions presented in individual sentences and in the overall content of blog posts.

2.1 Web Page Prediction Based on CRFs

In [4], Guo et al. use CRFs to predict the next page a user will load on a website. From that prediction the authors hope to optimize pre-fetching of pages, thereby significantly reducing latency during user interaction with a website. The authors ran a series of experiments on both Hidden Markov Models and CRFs, though I will only examine their CRF-based experiments and results.

The data from [3] was pre-processed into sequences of page views, each of which is assigned to 1 of 17 numbered page categories from the dataset. Duplicate consecutive page views of the same category are removed. Labels are then assigned as the next-page category. For example, a user sequence of 6 9 4 10 3 10
5 10 4 is mapped into an observation sequence of 6 9 4 10 3 10 5 10 (without the last page view), with labels 9 4 10 3 10 5 10 4.

Three experiments with CRFs were run. The first (CRF0) used only the immediate category as a feature for an observation. The second (CRF1) used the immediate category, one category before and one after the current observation, and a single feature combining the before and after categories. Finally, CRF2 used the two categories before and after the current observation, and a feature combining them. The authors hypothesized and demonstrated that CRF2 performed best, CRF1 second best, and CRF0 worst on their dataset. All three performed better than a similarly trained Hidden Markov Model.

A possible flaw in their experiment, however, is in the feature selection and how it maps onto their actual problem of preloading web pages. For experiments CRF1 and CRF2 the authors used the categories of pages after the current page as features. I believe that this gives their model information that would be unknowable when the trained model is used in real time. Ultimately their goal should have been to take a partial sequence and predict the next (or the next several) web pages, but instead they've constructed an algorithm that classifies the categories of a series of web pages without regard to temporal accessibility.

2.2 Conditional Random Fields for Activity Recognition

In [10], the authors model robot-agent interactions with the goal of tagging a sequence of actions with the category of activity that the robot is performing. The domain used in the paper is a simulated game of tag played between three robots. Two robots are passive, and one is the seeker. Once the seeker touches one of the other robots, the touched robot becomes the seeker and their activities change accordingly. Taking the positions of the robots as input, the goal is to label the robots at each timestep with the activity that they are performing.
Each timestep is labeled with the current seeker, and has features for the current location of all three robots. Additionally, transition features are included, which combine the features of the previous timestep with those of the current one. So if a position at $t_0$ is (0, 0) and at $t_1$ is (1, 1), then $t_1$ would have a (1, 1) feature and a transition feature combining (0, 0) and (1, 1). This allows the label at $t_1$ to be conditionally dependent both on the $t_0$ label and on the position change from $t_0$. In later experiments features for velocity, a chasing indicator, and distance thresholds were also included.

Like the Web Page Prediction paper, the authors compare CRFs with HMMs and ultimately find that CRFs perform better for their problem in all cases. Additionally, the more features that are included, the better the CRF performs. Redundant features, however, appear to cause some overfitting.

Unlike the Web Page Prediction experiments, none of the features supplied at a given point in the sequence are from future observations. I believe this makes for a fairer use of the algorithm, considering the ultimate goal of enabling an agent to recognize ongoing activities.

2.3 Emotion Classification Using Web Blog Corpora

The final paper I examined was [11], in which the authors classify both sentences and entire blog posts by the emotion they express. They used blog posts from a website that allows users to indicate the overall emotion of a blog post, and additionally used a dictionary of words and their emotional uses for sentence-level labeling.

The authors compare a Bayesian classifier, SVMs, and CRFs on this task, and find that the CRF outperforms the others. Their conclusion is that the conditional context of sentence-to-sentence emotional relations is better represented by CRFs. They even added the previous-sentence label to the features used in an SVM model, but label independence still led to worse results than using CRFs.
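Of the three papers, the label construction in Section 2.1 is the one closest to the experiments that follow, so a small sketch of it may help. It collapses consecutive duplicate categories and then pairs each remaining page view with the category that follows it. The function name and the raw input sequence with repeated categories are assumptions for illustration; only the de-duplicated sequence and its labels come from the example above.

```python
from itertools import groupby


def next_category_pairs(page_categories):
    """Drop consecutive duplicates, then label each view with the next category."""
    # groupby yields one key per run of equal adjacent values.
    deduped = [category for category, _ in groupby(page_categories)]
    observations = deduped[:-1]  # the last page view has no successor
    labels = deduped[1:]         # each label is the following category
    return observations, labels


# Hypothetical raw sequence; the repeated 9 and 10 stand in for duplicate views.
raw_sequence = [6, 9, 9, 4, 10, 3, 10, 5, 10, 10, 4]
observations, labels = next_category_pairs(raw_sequence)
```

Because the labels are just the observations shifted by one position, any feature drawn from a later position in the sequence contains information about labels that have not yet been predicted, which is the concern raised about CRF1 and CRF2.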
3 Experiments

3.1 General Setup

I took the weblogs from one day of activity on blinq.com, an ecommerce site specializing in used and open-box items. The logfile has only a limited amount of information for this particular service, and once cleaned it effectively has an IP address and a website path for each access. This includes background requests from the client-side application, in addition to user navigation. I made the assumption that an IP address can be used to attribute a set of accesses to a specific user, which is not true in general but is acceptable for this set of experiments.

Each user session, then, consists of an ordered list of page paths. Based on the path we can identify a general classification for the type of page being accessed. I initially divided these into 14 specific types of pages based on the structure of the path, and for each type extracted some identifying features. For example, with the path

/electronics/ipods-mp3-players/apple-ipod-touch-4th-gen-8gb-black-mc540ll-a/31541?condition=used-very-good

I extract [type=product, cat=electronics, subcat=ipods-mp3-players, condition=used-very-good].

There are a number of rows that don't fit any of these 14 types. Some of these are errors, but most are requests for page resources that are irrelevant to our experiments (such as fetches of .jpg image files).

Additionally there is a significant amount of redundancy. This is primarily because of the inclusion of background requests made by JavaScript. A single product page will continuously make requests to the server for updates to the available quantity of that product, for example. I will initially leave this redundancy in place.

In each case I divide the samples randomly into 80% training and 20% testing sequences.

3.2 Experiment 1 - Predicting Next-Page-Type

For my first experiment I labeled each observation with the page-type of the following observation, and the final observation with exit.
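The path parsing from the general setup and the next-page-type labeling can be sketched together as follows. This is a minimal illustration under stated assumptions: `product_features` is a hypothetical parser that handles only product paths shaped like the one shown in Section 3.1 (the rules for the other page types are not reproduced here), and the tab-separated output follows the CRFSuite data format of one item per line, label first, as best understood from its documentation.

```python
from urllib.parse import parse_qs, urlsplit


def product_features(path):
    """Hypothetical parser for /<cat>/<subcat>/<slug>/<id>?condition=<c> paths."""
    parts = urlsplit(path)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 4 or not segments[3].isdigit():
        return ["type=unknown"]  # other page types are out of scope for this sketch
    features = ["type=product", "cat=" + segments[0], "subcat=" + segments[1]]
    condition = parse_qs(parts.query).get("condition")
    if condition:
        features.append("condition=" + condition[0])
    return features


def next_page_type_labels(page_types):
    """Label each observation with the following page type; the last gets 'exit'."""
    return page_types[1:] + ["exit"]


def to_crfsuite_lines(labels, feature_lists):
    """One tab-separated line per item: label, then only the relevant features."""
    return ["\t".join([label] + features)
            for label, features in zip(labels, feature_lists)]


path = ("/electronics/ipods-mp3-players/"
        "apple-ipod-touch-4th-gen-8gb-black-mc540ll-a/31541?condition=used-very-good")
features = product_features(path)
labels = next_page_type_labels(["product-list", "product", "product", "checkout"])
lines = to_crfsuite_lines(["exit"], [features])
```

Only the features that apply to an item appear on its line, which is the property of CRFSuite noted in Section 1.2. Feature strings containing a colon would need escaping, since CRFSuite treats a colon as introducing a feature weight.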
For example, a sequence of [product-list, product, product, checkout] is labeled with [product, product, checkout, exit]. The list of features is kept minimal; no individual product IDs are included.

A typical example run produces the following statistics for each label, as provided by CRFSuite:

label              match  model   ref  precision  recall      F1
product             4137   4405  4230     0.9392  0.9780  0.9582
product-list        1861   2068  1910     0.8999  0.9743  0.9356
exit                1463   1463  1468     1.0000  0.9966  0.9983
client-error          33     55   141     0.6000  0.2340  0.3367
home                  43     60    93     0.7167  0.4624  0.5621
search                 0      0    61     0.0000  0.0000  0.0000
cart                   1      4    57     0.2500  0.0175  0.0328
cat                    0      0    29     0.0000  0.0000  0.0000
carousel               0      0    22     0.0000  0.0000  0.0000
checkout               0     11    20     0.0000  0.0000  0.0000
customer-reviews       0      0    13     0.0000  0.0000  0.0000
email-alert            0      0     9     0.0000  0.0000  0.0000
support-info           0      0     8     0.0000  0.0000  0.0000
account                0      0     5     0.0000  0.0000  0.0000

Macro-average precision, recall, F1: (0.314695, 0.261636, 0.273125)
Item accuracy: 7538 / 8066 (0.9345)
Instance accuracy: 1251 / 1468 (0.8522)

Here match is the number of correctly predicted items for a label, model is the number of times the model predicted that label, and ref is the number of times the label occurs in the test data, so precision is match/model and recall is match/ref. The overall item accuracy is high, at 93%, but the macro-averaged precision, recall, and F1 are relatively low. Looking at the labels, prediction of exit, product, and product-list are accurate, but
all other labels are rare and not well predicted. I think this comes from two sources. First, these top 3 labels overwhelm the others in the training data. Second, in the data itself there are a large number of duplicate rows, where a browser is requesting the same information over and over.

Eliminating adjacent duplicate rows decreased the item accuracy to 89%, and largely didn't affect the distribution or accuracy of individual label assignments, except for product-list (which was the most severely affected by the duplicates). The product-list F1 score went from 0.9983 before duplicates were removed down to 0.7906 on the cleaned data.

The severe disparity between the top three labels and the others shows that the more infrequent page-types are very difficult to predict.

3.3 Experiment 2 - Next-page Category

For this experiment I used the next page's top-level category for labels. For example, product-electronics is the label for an observation where the next observation has page type product and category electronics. The idea here is to predict what sort of categories a user will next be interested in as they traverse the site. The last product visited is dropped. Since many sessions only look at a single product detail page, this decreases the dataset significantly.

Accuracy in this experiment was much lower, with an item accuracy of 57% and an F-score of 0.23. Looking at the annotated guesses (which compare the actual and predicted labels on the test data), it appears that in many cases the algorithm's guess is equivalent to putting down the label of the previous observation. Still, this is spread across 17 product categories.

3.4 Experiment 3 - Conversion Prediction

Based on the previous experiments and several less formal ones, I decided to label an entire sequence instead of individual observations. This removes most of the advantage of CRFs, effectively turning this into a logistic regression.
The exercise is worthwhile, however, considering the nature of the data. Each session is labeled as either conversion or no-conversion depending on whether the checkout page type is reached at some point.

An initial execution of this showed extreme overfitting. Since the checkout page type is a feature, I believe that this allows a single label to be set to checkout, and then the weight of that on the other labels forces all of them to be checkout. To defend against this overfitting I made each sequence end on the observation before the checkout.

With this set of trimmed sessions in place I got a 95% item accuracy and a macro F-score of 0.61. This is higher than expected considering the other results, so I did some study of the trained model and the training data. CRFSuite provides a way to dump the model, and from that I found, for example, that the relationship between subcat=car-seats-baby-safety and conversion has a weight of 0.621374 associated with it. Exploring this further in the training data, I found 34 sessions with a pageview feature of subcat=car-seats-baby-safety, but only 4 of them lead to a conversion. So the 0.62 weight linking this subcategory to conversion is also being weighed against the conditional context.

Inspired by this result, I wanted to see how far ahead of the conversion I could cut off the sequence while still getting a high prediction rate. I first tried with only the first observation in a session, and the trained algorithm got every single conversion entry wrong. I then tried with the first two observations, and got much better results. There are many more non-conversions than conversions, so it is unsurprising that there was a 95% precision at guessing non-conversions. But there was also a relatively high precision for conversions, at 60%. Increasing the max sequence length to 3 increased the conversion precision to 66%, and further increasing the max sequence length to 4 raised it to 74%.
Further increases don't significantly affect the precision rate for labeling conversions.
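The session labeling and truncation used in this experiment can be sketched as below. This is a minimal illustration, not the processing code used for the results above: the function names are hypothetical, sessions are reduced to bare page-type lists, and how the session label is attached to individual observations for CRFSuite is left out.

```python
def label_session(page_types, max_length=None):
    """Return (observations, label) for one session of page types.

    A session that reaches 'checkout' is a conversion and is cut off at the
    observation before the first checkout. An optional max_length keeps only
    the first few observations.
    """
    if "checkout" in page_types:
        label = "conversion"
        observations = page_types[:page_types.index("checkout")]
    else:
        label = "no-conversion"
        observations = list(page_types)
    if max_length is not None:
        observations = observations[:max_length]
    return observations, label


def precision(predicted, actual, target):
    """Fraction of items predicted as target whose actual label is target."""
    hits = [a for p, a in zip(predicted, actual) if p == target]
    if not hits:
        return 0.0  # nothing was predicted as target
    return sum(1 for a in hits if a == target) / len(hits)


session = ["product-list", "product", "cart", "checkout", "account"]
trimmed = label_session(session)
first_two = label_session(session, max_length=2)
example_precision = precision(
    ["conversion", "conversion", "no-conversion"],
    ["conversion", "no-conversion", "no-conversion"],
    "conversion",
)
```

Trimming at the first checkout removes the feature that trivially gives away the label, and `max_length` corresponds to the cutoff of 1, 2, 3, or 4 observations tried above.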
4 Conclusions

CRFs are an excellent solution to sequential labeling. Based on my readings, taking contextual relationships into account through conditional probabilities allows CRFs to outperform HMMs in many situations. Using modern optimization algorithms such as L-BFGS allows parameter estimation to run quickly on common datasets. CRFs can handle a very large number of features since $P(X)$ is not modeled directly, though that implies that CRFs work best with a relatively small number of labels.

My own goal was to explore how CRFs could be used to better understand and predict consumer behavior by analyzing web traffic on an ecommerce site. While processing the data and extracting features provided some insights, I didn't find a satisfying use for CRFs in this context. This is possibly due to the limited amount of data available for each page request. For example, it would be interesting if different predictions for product categories could be made based on which web browser or operating system a buyer used while browsing the site. The data used was a single day of website usage; expanding this to cover a longer period of time would also be helpful. CRFs are good at what they do, but my ill-defined problem does not appear to be a good use of them in its current form.

References

[1] CRF++: yet another CRF toolkit.
[2] Limited-memory BFGS, September 2013. Page Version ID: 573644777.
[3] UCI KDD Archive. msnbc.com anonymous web data.
[4] Yong Zhen Guo, Kotagiri Ramamohanarao, and Laurence A. F. Park. Web page prediction based on conditional random fields. In ECAI, pages 251-255, 2008.
[5] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841, 2002.
[6] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[7] Naoaki Okazaki.
CRFsuite: a fast implementation of Conditional Random Fields (CRFs). 2007.
[8] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 134-141, 2003.
[9] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. arXiv preprint arXiv:1011.4088, 2010.
[10] Douglas L. Vail, Manuela M. Veloso, and John D. Lafferty. Conditional random fields for activity recognition. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 235, 2007.
[11] Changhua Yang, Kevin Hsin-Yih Lin, and Hsin-Hsi Chen. Emotion classification using web blog corpora. In Web Intelligence, IEEE/WIC/ACM International Conference on, pages 275-278, 2007.