CS545 Project: Conditional Random Fields on an ecommerce Website

Brock Wilcox

December 18, 2013

Contents

1 Conditional Random Fields
  1.1 Overview
  1.2 CRFSuite
2 Inspirational Work
  2.1 Web Page Prediction Based on CRFs
  2.2 Conditional Random Fields for Activity Recognition
  2.3 Emotion Classification Using Web Blog Corpora
3 Experiments
  3.1 General Setup
  3.2 Experiment 1 - Predicting Next-Page-Type
  3.3 Experiment 2 - Next-page Category
  3.4 Experiment 3 - Conversion Prediction
4 Conclusions

1 Conditional Random Fields

1.1 Overview

A Conditional Random Field (CRF) [6] is a machine learning model for labeling each observation in an undirected graph of observations. For the remainder of this paper I'll consider CRFs restricted to linear sequences of observations and labels, as opposed to general CRFs, which can be applied to graphs of any shape. A common example of CRF usage is part-of-speech tagging [8], in which each word in a sentence is labeled with its part of speech (noun, verb, etc.). Using a CRF allows for contextual labeling, building on conditional probabilities.

The basic model for a CRF measures the total probability of a sequence of labels Y, given a sequence of observations X, as P(Y | X). This model is constructed directly (discriminative), instead of indirectly as in a naive Bayes or Hidden Markov Model classifier (generative) [5]. That is, in a generative classifier we estimate P(Y | X) by calculating P(X | Y) and P(Y) and then applying Bayes' rule. This is built on the assumption that each feature is independent when deciding on a label for an observation.

The distribution that models this fully is described by Sutton [9], and I summarize the relevant parts here. Let X = (x_1, x_2, ..., x_n) be a sequence of observations, and Y = (y_1, y_2, ..., y_n) be a corresponding sequence of labels for each of the observations. The overall model is

    p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \right\}    (1)

Here f_k(y_t, y_{t-1}, x_t) is a feature function, defining a value given the current and previous labels, as well as the relevant features for time t. In the cases I'm working with, features are categorical, so f_k is either 0 or 1. Features for x_t don't actually have to come exclusively from time t, however; they can be drawn from any feature in X. The θ_k are the parameters of the distribution, and Z(x) is a normalization function that keeps the total probability, summed over all label sequences, equal to 1. Both θ and Z must be computed, and Z in particular is a sum over an exponential number of label sequences. Fortunately this can be done with a variety of algorithms. In the CRF implementation that I'm using, L-BFGS [2] is used to estimate these parameters.

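To make equation (1) concrete, here is a minimal sketch, of my own construction rather than from [9], showing how binary indicator feature functions contribute to the unnormalized score of one label sequence. The feature functions, weights, and page types are hypothetical; turning this score into p(y | x) would additionally require the normalizer Z(x), which sums the same quantity over every possible label sequence.

    import math

    # Hypothetical binary feature functions f_k(y_t, y_prev, x_t); each returns 0 or 1.
    def f_product_after_list(y_t, y_prev, x_t):
        return 1.0 if (y_prev == "product-list" and y_t == "product") else 0.0

    def f_checkout_from_cart(y_t, y_prev, x_t):
        return 1.0 if (y_t == "checkout" and x_t.get("type") == "cart") else 0.0

    feature_functions = [f_product_after_list, f_checkout_from_cart]
    theta = [1.2, 0.8]  # made-up weights; in practice estimated by L-BFGS

    def unnormalized_score(y, x):
        # exp of the sum over t and k of theta_k * f_k(y_t, y_{t-1}, x_t)
        total = 0.0
        for t in range(len(x)):
            y_prev = y[t - 1] if t > 0 else "START"
            total += sum(w * f(y[t], y_prev, x[t])
                         for w, f in zip(theta, feature_functions))
        return math.exp(total)

    x = [{"type": "product-list"}, {"type": "product"}, {"type": "cart"}]
    y = ["product-list", "product", "checkout"]
    print(unnormalized_score(y, x))  # divide by Z(x) to obtain p(y|x)
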
Fit into a larger context, there is a family of probabilistic models that relate to one another. Naive Bayes is the simplest of these models, assigning a single label to each test case based on independent features. Expanding naive Bayes to classify a sequence of labels gives the HMM, and this can be generalized further to label a directed graph of labels (generative directed models). If the independence assumption and the directed nature of these models are removed, a corresponding set of models can be derived from the same basic probabilistic equations: linear-chain CRFs are the conditional version of HMMs, just as logistic regression is the conditional version of naive Bayes.

1.2 CRFSuite

There are several available implementations of CRFs, both standalone and as part of larger machine learning toolkits. CRFSuite [7] aims to be a fast and simple-to-use implementation, while providing a variety of parameter solvers and integration points. As an example of its simplicity, only the relevant features need to be specified for each observation within a sequence. This is unlike another popular implementation, CRF++ [1], in which every feature must be present for every observation.

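As a small illustration, here is a sketch using the python-crfsuite binding (pycrfsuite) to train and apply a linear-chain model. The feature strings, labels, and model file name are made up for the example, and the crfsuite command-line tool can be used in the same spirit; the point is simply that each item lists only the attributes that apply to it.

    import pycrfsuite

    # Each observation is just the set of attributes that are present for it.
    xseq = [["type=product-list", "cat=electronics"],
            ["type=product", "cat=electronics", "subcat=ipods-mp3-players"],
            ["type=cart"]]
    yseq = ["product", "cart", "exit"]  # one label per observation

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.append(xseq, yseq)          # add one training sequence (normally many)
    trainer.train("example.crfsuite")   # L-BFGS is the default training algorithm

    tagger = pycrfsuite.Tagger()
    tagger.open("example.crfsuite")
    print(tagger.tag(xseq))             # predicted label sequence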

2 Inspirational Work

I looked at three papers to better understand Conditional Random Fields and to guide my own experiments. The first, Web Page Prediction Based on CRFs [4], labels a sequence of web page interactions with the category of the next page; this is the most similar to what I attempt in my experiments. Next I looked at Conditional Random Fields for Activity Recognition [10], in which the authors train a model to classify the different activities of robot agents in a virtual game of tag. Finally I examine Emotion Classification Using Web Blog Corpora [11], which uses emoticons and user-supplied ratings to categorize the emotions expressed in individual sentences and in the overall content of blog posts.

2.1 Web Page Prediction Based on CRFs

In [4], Guo et al. use CRFs to predict which page a user will load next on a website. From that prediction the authors hope to optimize pre-fetching of pages, thereby significantly reducing latency during user interaction with the site. The authors ran a series of experiments on both Hidden Markov Models and CRFs, though I will only examine their CRF-based experiments and results.

The data from [3] was pre-processed into sequences of page views, each of which is assigned to 1 of 17 numbered page categories from the dataset. Duplicate consecutive page views of the same category are removed, and labels are then assigned as the next-page category. For example, a user sequence of 6 9 4 10 3 10 5 10 4 is mapped into an observation sequence of 6 9 4 10 3 10 5 10 (without the last page view), with labels 9 4 10 3 10 5 10 4.

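A short sketch of this preprocessing as I understand it from [4] (my own reconstruction, not the authors' code): collapse consecutive duplicate categories, then label each remaining page view with the category of the page that follows it.

    def preprocess(session):
        """Map a sequence of page categories to (observations, labels) as in [4]."""
        # Remove duplicate consecutive page views of the same category.
        deduped = [c for i, c in enumerate(session) if i == 0 or c != session[i - 1]]
        observations = deduped[:-1]   # drop the last page view
        labels = deduped[1:]          # label = category of the next page
        return observations, labels

    session = [6, 9, 4, 10, 3, 10, 5, 10, 4]
    print(preprocess(session))
    # -> ([6, 9, 4, 10, 3, 10, 5, 10], [9, 4, 10, 3, 10, 5, 10, 4])
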
Three experiments with CRFs were run. The first (CRF0) used only the immediate category as a feature for an observation. The second (CRF1) used the immediate category, the one category before and after the current observation, and a single feature combining the before and after categories. Finally, CRF2 used the two categories before and after the current observation, and a feature combining them. The authors hypothesized and demonstrated that CRF2 performed best, CRF1 second best, and CRF0 worst on their dataset. All cases performed better than a similarly trained Hidden Markov Model.

A possible flaw in their experiment, however, is in the feature selection and how it maps onto their actual problem of preloading web pages. For experiments CRF1 and CRF2 the authors used the categories of pages after the current page as features. I believe this gives their model information that would be unknowable when running the trained algorithm in real time. Ultimately their goal should have been to take a partial sequence and predict the next (or the next several) web page, but instead they've constructed an algorithm that classifies the category of a series of web pages without regard to temporal accessibility.

2.2 Conditional Random Fields for Activity Recognition

In [10], the authors model robot-agent interactions with the goal of tagging a sequence of actions with the category of activity that the robot is performing. The domain used in the paper is a simulated game of tag played between three robots. Two robots are passive, and one is the seeker. Once the seeker touches one of the other robots, the touched robot becomes the seeker and their activities change accordingly. Taking the positions of the robots as input, the goal is to label the robots at each timestep with the activity that they are performing.

Each time step is labeled with the current seeker, and has features for the current location of all three robots. Additionally, transitional features are included, each a combination of the previous timestep's features and the current feature. So if the position at t_0 = (0, 0) and at t_1 = (1, 1), then t_1 would have a (1, 1) feature and a combined (0, 0)-(1, 1) transition feature. This allows the label at t_1 to be conditionally dependent both on the t_0 label and on the position change from t_0. In later experiments, features for velocity, a chasing indicator, and distance thresholds were also included.

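A minimal sketch of how such transition features could be encoded (the exact representation in the paper may differ; the string encoding here is just illustrative):

    def make_features(positions):
        """positions: list of (x, y) tuples for one robot over time."""
        sequences = []
        for t, pos in enumerate(positions):
            feats = ["pos=%d,%d" % pos]
            if t > 0:
                prev = positions[t - 1]
                # Combined previous/current feature, e.g. "trans=0,0->1,1".
                feats.append("trans=%d,%d->%d,%d" % (prev + pos))
            sequences.append(feats)
        return sequences

    print(make_features([(0, 0), (1, 1), (2, 1)]))
    # [['pos=0,0'], ['pos=1,1', 'trans=0,0->1,1'], ['pos=2,1', 'trans=1,1->2,1']]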

Like the Web Page Prediction paper, the authors compare CRFs with HMMs and ultimately find that CRFs perform better for their problem in all cases. Additionally, the more features that are included, the better the CRF performs, though redundant features appear to cause some overfitting.

Unlike the Web Page Prediction experiments, none of the features supplied at a given point in the sequence come from future observations. I believe this makes for a fairer use of the algorithm, considering the ultimate goal of enabling an agent to recognize ongoing activities.

2.3 Emotion Classification Using Web Blog Corpora

The final paper I examined was [11], in which the authors classify both sentences and entire blog posts by the emotion they express. They used blog posts from a website which allows users to indicate the overall emotion of a blog post, and additionally used a dictionary of words and their emotional uses for sentence-level labeling. The authors compare a Bayesian classifier, SVMs, and CRFs on this task, and find that CRF outperforms the others. The conclusion they come to is that the sentence-to-sentence emotional context is better captured by the conditional structure of CRFs. They even added the previous-sentence label to the features used in the SVM model, but label independence still led to worse results than using CRFs.

3 Experiments

3.1 General Setup

I took the weblogs from one day of activity on blinq.com, an ecommerce site specializing in used and open-box items. The logfile has only a limited amount of information for this particular service, and once cleaned effectively contains an IP address and website path for each access. This includes background requests from the client-side application, in addition to user navigation. I made the assumption that an IP address can be used to narrow a set of accesses to a specific user, which is not globally the case but is acceptable for this set of experiments.

Each user session, then, consists of an ordered list of page paths. Based on this path we can identify a general classification for the type of page being accessed. I initially divided this into 14 specific types of pages based on the structure of the path, and for each extracted some identifying features. For example, from the path /electronics/ipods-mp3-players/apple-ipod-touch-4th-gen-8gb-black-mc540ll-a/31541?condition=used-very-good I extract [type=product, cat=electronics, subcat=ipods-mp3-players, condition=used-very-good].

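The extraction itself is a straightforward parse of the path. The sketch below handles only the product-page case as an illustration; the real rules cover all 14 page types, and the exact conditions here are assumptions.

    from urllib.parse import urlparse, parse_qs

    def extract_features(path):
        """Turn a request path into a list of categorical features."""
        parsed = urlparse(path)
        parts = [p for p in parsed.path.split("/") if p]
        feats = []
        # A product detail page looks like /<cat>/<subcat>/<slug>/<product-id>.
        if len(parts) == 4 and parts[3].isdigit():
            feats = ["type=product", "cat=" + parts[0], "subcat=" + parts[1]]
            condition = parse_qs(parsed.query).get("condition")
            if condition:
                feats.append("condition=" + condition[0])
        return feats

    path = ("/electronics/ipods-mp3-players/apple-ipod-touch-4th-gen-8gb-black-"
            "mc540ll-a/31541?condition=used-very-good")
    print(extract_features(path))
    # ['type=product', 'cat=electronics', 'subcat=ipods-mp3-players',
    #  'condition=used-very-good']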

There are a number of rows that don't fit any of these 14 types. Some of these are errors, but most are requests for page resources that are irrelevant to our experiments (such as fetches of .jpg image files). Additionally there is a significant amount of redundancy, primarily because of background requests made by javascript; a single product page will continuously make requests to the server asking for updates to the available quantity for that product, for example. I will initially leave this redundancy in place. In each case I divide the samples randomly into 80% training and 20% testing sequences.

3.2 Experiment 1 - Predicting Next-Page-Type

For my first experiment I labeled each observation with the page type of the following observation, and the final observation with exit. For example, a sequence of [product-list, product, product, checkout] is labeled with [product, product, checkout, exit]. The list of features is kept minimal; no individual product IDs are included.

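The labeling step is then a simple shift of the page types. Here is a sketch of how a parsed session (page type plus the features from Section 3.1) could be turned into a training sequence; the helper and the sample session are illustrative, not the exact project code.

    def label_next_page_type(session):
        """session: list of (page_type, features) pairs in visit order."""
        xseq = [features for _, features in session]
        types = [page_type for page_type, _ in session]
        yseq = types[1:] + ["exit"]   # label = type of the following page view
        return xseq, yseq

    session = [("product-list", ["type=product-list", "cat=electronics"]),
               ("product", ["type=product", "cat=electronics"]),
               ("product", ["type=product", "cat=electronics"]),
               ("checkout", ["type=checkout"])]
    print(label_next_page_type(session)[1])
    # ['product', 'product', 'checkout', 'exit']
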
A typical example run produces the following statistics for each label, as provided by CRFSuite:

    label             match  model  ref   precision  recall  F1
    product           4137   4405   4230  0.9392     0.9780  0.9582
    product-list      1861   2068   1910  0.8999     0.9743  0.9356
    exit              1463   1463   1468  1.0000     0.9966  0.9983
    client-error      33     55     141   0.6000     0.2340  0.3367
    home              43     60     93    0.7167     0.4624  0.5621
    search            0      0      61    0.0000     0.0000  0.0000
    cart              1      4      57    0.2500     0.0175  0.0328
    cat               0      0      29    0.0000     0.0000  0.0000
    carousel          0      0      22    0.0000     0.0000  0.0000
    checkout          0      11     20    0.0000     0.0000  0.0000
    customer-reviews  0      0      13    0.0000     0.0000  0.0000
    email-alert       0      0      9     0.0000     0.0000  0.0000
    support-info      0      0      8     0.0000     0.0000  0.0000
    account           0      0      5     0.0000     0.0000  0.0000

    Macro-average precision, recall, F1: (0.314695, 0.261636, 0.273125)
    Item accuracy: 7538 / 8066 (0.9345)
    Instance accuracy: 1251 / 1468 (0.8522)

Here you can see that the overall item accuracy is high, at 93%, but the average precision, recall, and F1 are relatively low. Looking at the labels, prediction of exit, product, and product-list is accurate, but all other labels are rare and not well predicted. I think this comes from two sources. First, the top three labels overwhelm the others in the training data. Second, the data itself contains a large number of duplicate rows, where a browser is requesting the same information over and over. Eliminating adjacent duplicate rows decreased the item accuracy to 89%, and largely didn't affect the distribution or accuracy of individual label assignments, except for product-list, which was the most severely affected by the duplicates: its F1 score went from 0.9983 before duplicates were removed down to 0.7906 on the cleaned data. The severe disparity between the top three labels and the others shows that the more infrequent page types are very difficult to predict.

3.3 Experiment 2 - Next-page Category

For this experiment I used the next page's top-level category for labels. For example, product-electronics is the label for an observation where the next observation has page type product and category electronics. The idea here is to predict what sort of categories a user will next be interested in as they traverse the site. The last product visited is dropped; since many sessions only look at a single product detail page, this decreases the dataset significantly. Item accuracy in this experiment was much lower, at 57%, with an F-score of 0.23. Looking at the annotated guesses (which compare the test data's actual vs. predicted labels), it appears that in many cases the algorithm's guess is equivalent to repeating the label of the previous observation. Still, this is spread across 17 product categories.

3.4 Experiment 3 - Conversion Prediction

Based on the previous experiments and several less formal ones, I decided to label an entire sequence instead of individual observations. This removes most of the advantage of CRFs, effectively turning this into a logistic regression. The exercise is worthwhile, however, considering the nature of the data. Each session is labeled as either conversion or no-conversion depending on whether at some point the checkout page type is reached. An initial execution of this showed extreme overfitting: since the checkout page type is itself a feature, I believe a single observation carrying that feature is enough to fix the label, and its weight then forces the rest of the sequence to the same label. To defend against this overfitting I made each sequence end on the observation before the checkout.

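Here is a sketch of that trimming and sequence-level labeling as described above (my own reconstruction, including the sequence-length cut-off used in the experiments that follow):

    def label_conversion(types, features, max_len=None):
        """Label a whole session, trimmed to end just before the first checkout."""
        if "checkout" in types:
            cut = types.index("checkout")
            xseq, label = features[:cut], "conversion"
        else:
            xseq, label = features, "no-conversion"
        if max_len is not None:
            xseq = xseq[:max_len]   # e.g. keep only the first two observations
        # Every observation in the sequence carries the same label.
        return xseq, [label] * len(xseq)

    types = ["home", "product-list", "product", "cart", "checkout"]
    feats = [["type=home"], ["type=product-list"], ["type=product"],
             ["type=cart"], ["type=checkout"]]
    print(label_conversion(types, feats, max_len=2))
    # ([['type=home'], ['type=product-list']], ['conversion', 'conversion'])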

With this set of trimmed sessions in place I got a 95% item accuracy and a macro F-score of 0.61. This is higher than expected considering the other results, so I did some study of the trained model and the training data. CRFSuite provides a way to dump the model, and from that I found, for example, that the relationship subcat=car-seats-baby-safety -> conversion has a 0.621374 weight associated with it. Exploring this further in the training data, I found 34 sessions with a pageview feature of subcat=car-seats-baby-safety, but only 4 lead to a conversion. So the linking of this subcategory with a 0.62 conversion weight is also being weighed by the conditional context.

Inspired by this result, I wanted to see how far ahead of the conversion I could cut off the sequence while still getting a high prediction rate. I first tried with only the first observation in a session, and the trained algorithm got every single conversion entry wrong. I then tried with the first two observations, and got much better results. There are many more non-conversions than conversions, so it is unsurprising that there was a 95% precision at guessing non-conversions. But there was also a relatively high precision for conversions, at 60%. Increasing the max sequence length to 3 increased the conversion precision to 66%, and further increasing the max sequence length to 4 brought it to a 74% precision rate. Further increases don't significantly affect the precision rate for labeling conversions.

4 Conclusions

CRFs are an excellent solution for sequential labeling. Based on my readings, taking contextual relationships into account through conditional probabilities allows CRFs to outperform HMMs in many situations. Using modern optimization algorithms such as L-BFGS allows the parameters to be estimated quickly on common datasets. CRFs can handle a very large number of features, since P(X) is not modeled directly, though that implies that CRFs work best with a relatively small number of labels.

My own goal was to explore how CRFs could be used to better understand and predict consumer behavior by analyzing web traffic on an ecommerce site. While processing the data and extracting features provided some insights, I didn't find a satisfying use for CRFs in this context. This is possibly due to the limited amount of data available for each page request. For example, it would be interesting if different predictions for product categories could be made based on which web browser or operating system a buyer used while browsing the site. The data used was also a single day of website usage; expanding this to cover a longer period of time would be helpful. CRFs are good at what they do, but my ill-defined problem does not appear to be a good fit for them in its current form.

References

[1] CRF++: Yet another CRF toolkit.

[2] Limited-memory BFGS. Wikipedia, September 2013. Page Version ID: 573644777.

[3] UCI KDD Archive. msnbc.com anonymous web data.

[4] Yong Zhen Guo, Kotagiri Ramamohanarao, and Laurence A. F. Park. Web page prediction based on conditional random fields. In ECAI, pages 251-255, 2008.

[5] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841, 2002.

[6] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), 2001.

[7] Naoaki Okazaki. CRFsuite: A fast implementation of Conditional Random Fields (CRFs). 2007.

[8] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 134-141, 2003.

[9] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. arXiv preprint arXiv:1011.4088, 2010.

[10] Douglas L. Vail, Manuela M. Veloso, and John D. Lafferty. Conditional random fields for activity recognition. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 235, 2007.

[11] Changhua Yang, Kevin Hsin-Yih Lin, and Hsin-Hsi Chen. Emotion classification using web blog corpora. In Web Intelligence, IEEE/WIC/ACM International Conference on, pages 275-278, 2007.