Randolph E. Bucklin Catarina Sismeiro The Anderson School at UCLA. Overview. Introduction/Clickstream Research Conceptual Framework


A Model of Web Site Browsing Behavior Estimated on Clickstream Data
Randolph E. Bucklin, Catarina Sismeiro
The Anderson School at UCLA

Overview
- Introduction/Clickstream Research
- Conceptual Framework
  - Within-site browsing behavior for one web site
  - Examine aspects related to stickiness
- Modeling Approach
  - Page request decision (binary probit)
  - Page view duration (proportional hazard)
- Data and Estimation
- Results and Implications
- Next Steps

Clickstream Research
- Internet Customization
  - Recommendation systems (Ansari et al 2000)
  - Customization of email (Ansari and Mela 2000)
- Site Visit Behavior (Media Metrix)
  - Johnson et al (2000)
  - Moe and Fader (2000, 2001)
- Site Navigation and Browsing
  - Huberman et al (1998)
  - Adar and Huberman (1999)
- Purchasing
  - Moe and Fader (2000)
  - Montgomery et al (2001)

Objectives
- Use clickstream data to understand within-site browsing behavior and the implications for site performance (e.g., for site stickiness)
- How is within-site browsing behavior influenced by the following factors:
  - Depth of visit
  - Repeat visitation
  - Other covariates (content, response time, errors)
- What implications might be drawn?
  - Behavioral: Evidence for lock-in and learning
  - Managerial: Site metrics, design and customization

Understanding Browsing Behavior

                      Depth of Site Visit              Repeat Visitation
                      (# Pages Viewed)                 (# Visits to the Site)
Page Requests         Involvement, Time Constraints    Involvement, Learning
Page View Duration    Involvement, Time Constraints    Involvement, Learning

Aggregate Statistics
[Figure: four panels plotting Page Requests (top row) and Page View Duration (bottom row) against # Pages Viewed (left column) and # Visits to the Site (right column).]

Conceptual Modeling Framework
- Model Two Elements of Browsing Behavior at the Individual Level
  - Additional page request versus site exit
  - Page view durations
- Estimate Two Models
  - Page request model: binary probit for probability of staying or exiting the site
  - Duration model: proportional hazard
  - Include heterogeneity and covariates
- What We are Not Modeling...
  - Link choices and navigation pathway
  - Effects of specific page content on browsing

Page Requests Model
- Binary probit with possible visitor heterogeneity
- The utility of an additional page request j for visitor i is given by:

  $$y_{ij} = X_{ij}\beta_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, 1), \qquad j = 1, \dots, J_i, \quad i = 1, \dots, N$$

  Where:

  $$C_{ij} = \begin{cases} 1 & \text{if } y_{ij} > 0 \text{ (additional page request)} \\ 0 & \text{if } y_{ij} < 0 \text{ (website exit)} \end{cases}$$

- Random effects approach to heterogeneity
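The page request model implies that the probability of requesting another page is P(C_ij = 1) = P(y_ij > 0) = Φ(X_ij β_i), where Φ is the standard normal CDF. A minimal Python sketch of that calculation follows. The function names, the two-element covariate vector, and the coefficient values are illustrative assumptions, not estimates from the talk.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def stay_probability(x_ij, beta_i) -> float:
    """P(C_ij = 1) = P(X_ij * beta_i + eps_ij > 0) with eps_ij ~ N(0, 1)."""
    utility = sum(x * b for x, b in zip(x_ij, beta_i))
    return normal_cdf(utility)


# Hypothetical visitor-level coefficients: intercept and ln(cumulative pages).
beta_i = [1.5, -0.5]

for pages in (1, 5, 20):
    x_ij = [1.0, math.log(pages)]
    print(pages, round(stay_probability(x_ij, beta_i), 3))
```

With a negative coefficient on the log of cumulative page views, the sketch yields a stay probability that falls as the visit gets deeper.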

Duration Model
- Conditional proportional hazards model with possible visitor heterogeneity
- Weibull baseline hazard:

  $$h_0(t_{ij}) = \lambda \alpha t_{ij}^{\alpha - 1}, \qquad t_{ij} > 0,\ \alpha > 0,\ \lambda > 0$$

  - Positive duration dependence: $\alpha > 1$
  - Negative duration dependence: $\alpha < 1$
  - No duration dependence (exponential): $\alpha = 1$
- Gamma heterogeneity ("frailties") with unit mean:

  $$\omega_i \sim \mathrm{Gamma}(\kappa, \kappa), \qquad i = 1, \dots, N$$

Duration Model (cont'd)
- Hazard depends on covariates and expected page view utility
- The hazard of a page view j for visitor i is then given by:

  $$h(t_{ij} \mid \omega_i, z_{ij}, y_{ij}) = h_0(t_{ij}) \exp\big(z_{ij}\phi_1 + g(y_{ij}, \phi_2)\big)\, \omega_i$$

  - $g(\cdot)$ is a polynomial of the expected utilities
  - Covariates do not include intercept
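The duration model can be sketched numerically as follows. The code evaluates the Weibull baseline hazard λαt^(α-1), scales it by exp(zφ1 + g) and the frailty ω, and draws unit-mean gamma frailties. All function names and parameter values are illustrative assumptions. Gamma(κ, κ) is read here as shape κ and rate κ, which is what gives the unit mean stated on the slide.

```python
import math
import random


def weibull_baseline_hazard(t: float, alpha: float, lam: float) -> float:
    """h0(t) = lam * alpha * t**(alpha - 1) for t, alpha, lam > 0."""
    if t <= 0 or alpha <= 0 or lam <= 0:
        raise ValueError("t, alpha and lam must all be positive")
    return lam * alpha * t ** (alpha - 1.0)


def conditional_hazard(t, alpha, lam, z, phi1, g_value, omega) -> float:
    """Baseline hazard scaled by exp(z*phi1 + g) and the visitor frailty omega."""
    linear = sum(zk * pk for zk, pk in zip(z, phi1)) + g_value
    return weibull_baseline_hazard(t, alpha, lam) * math.exp(linear) * omega


def duration_dependence(alpha: float) -> str:
    """Classify the Weibull shape parameter as on the slide."""
    if alpha > 1:
        return "positive"
    if alpha < 1:
        return "negative"
    return "none (exponential)"


def draw_frailties(kappa: float, n: int, seed: int = 0):
    """Draw omega_i ~ Gamma(shape=kappa, rate=kappa); gammavariate takes a scale."""
    rng = random.Random(seed)
    return [rng.gammavariate(kappa, 1.0 / kappa) for _ in range(n)]


frailties = draw_frailties(kappa=5.0, n=20000)
print(duration_dependence(0.9), sum(frailties) / len(frailties))
```

A negative covariate coefficient lowers the hazard of ending the page view, which corresponds to a longer expected page view duration.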

Data
- Online reseller of automotive vehicles
- Data covers month of October 1999
- W3C Extended Log File format
  - Includes cookies; for first hit, cookie was inferred using IP address and time lag
  - Cookies identify returning visitors and contain information on previous vehicle configurations
- Hits aggregated to page views
- Log files also record
  - Bytes transferred and server response time
  - Errors and reloads

Data Filtering
- Only page requests with cookies to identify:
  - Unique users
  - Multiple sessions per user (Repeat)
  - Cars configured
- Only sessions with more than one page view
  - Session defined as sequence of page views from a unique cookie
  - If next page requested by user is more than 30 minutes apart, it is assumed to start new session
- Eliminate Web Crawlers, Company Web Administrators, etc.
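The session rule described above (a gap of more than 30 minutes starts a new session, and single-page sessions are dropped) can be sketched in Python as below. The function names and the sample timestamps are hypothetical. The sketch assumes the page views have already been grouped by cookie.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)


def split_sessions(timestamps):
    """Split one cookie's page-view times into sessions on gaps over 30 minutes."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)   # within 30 minutes: same session
        else:
            sessions.append([ts])     # first view, or gap over 30 minutes
    return sessions


def keep_multi_page(sessions):
    """Keep only sessions with more than one page view."""
    return [s for s in sessions if len(s) > 1]


# Hypothetical page-view times for a single cookie.
t0 = datetime(1999, 10, 15, 9, 0)
views = [t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=50),
         t0 + timedelta(minutes=55), t0 + timedelta(hours=3)]

sessions = split_sessions(views)
kept = keep_multi_page(sessions)
print(len(sessions), len(kept))
```

Here the 40-minute gap after the second view and the long gap before the last view each open a new session, and the final one-page session is filtered out.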

Summary Statistics for Browsing

                                 October 1999    Random Sample, October 15-31
Number of Visitors                    169,910                           5,000
Number of Sessions                    245,782                           6,630
Number of Requests                  1,462,457                          40,560
Sessions per Visitor                    1.447                           1.326
Pages per Visitor                       8.607                           8.112
Pages per Session                       5.950                           6.118
Average Page View Duration*           102.692                         117.156

*in seconds

Distribution of Page Views
[Figure]
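The per-visitor and per-session rows in the summary statistics follow directly from the three counts. A short Python check, using only the counts reported on the slide (the dictionary layout and function name are illustrative):

```python
# Counts as reported on the summary statistics slide.
counts = {
    "October 1999": {"visitors": 169_910, "sessions": 245_782, "requests": 1_462_457},
    "Random Sample": {"visitors": 5_000, "sessions": 6_630, "requests": 40_560},
}


def browsing_ratios(c):
    """Derive the three ratio rows from the raw counts."""
    return {
        "sessions_per_visitor": c["sessions"] / c["visitors"],
        "pages_per_visitor": c["requests"] / c["visitors"],
        "pages_per_session": c["requests"] / c["sessions"],
    }


ratios = {name: browsing_ratios(c) for name, c in counts.items()}
for name, r in ratios.items():
    print(name, {k: round(v, 3) for k, v in r.items()})
```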

Predictor Variables
- Visit Depth
  - Cumulative number of page views in the visit at the current stay/exit decision point
- Repeat Visitation
  - Number of previous visits to the site
- Additional Covariates
  - Information sent (bytes between server and client)
  - Server response time
  - Errors occurring on a page download (0/1)
  - Reload occurrence (0/1)
  - Presence of dynamic content (0/1)
  - Previous product configuration (0/1)
  - Previous page view duration (page request model only)
  - Expected utility from page request model (duration only)

Summary Statistics for the Covariates

              Mean    Std. Dev.
BYTES        0.275        0.238
CMAKE        0.201        0.817
CPAGE        5.886        6.186
CSESSION     1.921        2.208
DYNAMIC      0.126        0.332
ERROR        0.081        0.273
FIRSTMAKE    0.146        0.354
ORDER        0.005        0.069
PREVDUR      1.633        2.909
RELOAD       0.233        0.423
SRVRESP      0.003        0.041

Estimation
- Bayesian estimation with diffuse priors
- Gibbs sampling and Metropolis-Hastings
  - Check for convergence
  - Eliminate first 20,000 draws
  - Save each 4th draw of remaining 8,000
- Log of pseudo Bayes factors for model comparison and selection
  - Computed using the saved 2,000 draws
  - Intercept only model as base case

Model Fits: Probit Model

                              Model 1      Model 9      Model 9 (No Heter.)
INTERCEPT                        x            x            x
BYTES                                         x            x
SRVRESP                                       x            x
RELOAD                                        x            x
ERROR                                         x            x
LN(PREVDUR + 1)                               x            x
LN(CPAGE)                                     x            x
LN(CSESSION)                                  x            x
LN(CMAKE + 1)                                 x            x
Contribution to Log PSBF     -18,111.6    -16,703.8    -17,453.6
Number of Covariates             1            9            9
Log PSBF                                    1,407.8        658.0
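The sampling schedule and the model comparison arithmetic can be illustrated with a short Python sketch. It assumes a single chain of 28,000 draws (20,000 discarded plus 8,000 retained for thinning) and that the reported Log PSBF is each model's contribution minus that of the intercept-only base case, which is consistent with the numbers in the probit table. The function names are hypothetical.

```python
def retained_draw_indices(total_draws: int, burn_in: int, thin: int):
    """0-based indices of draws kept after burn-in, taking every thin-th draw."""
    return list(range(burn_in + thin - 1, total_draws, thin))


def log_psbf(contribution: float, base_contribution: float) -> float:
    """Log pseudo Bayes factor of a model relative to the base case."""
    return contribution - base_contribution


kept = retained_draw_indices(total_draws=28_000, burn_in=20_000, thin=4)
print(len(kept))  # 2,000 saved draws

base = -18_111.6  # Model 1, intercept only
print(round(log_psbf(-16_703.8, base), 1))  # Model 9
print(round(log_psbf(-17_453.6, base), 1))  # Model 9 without heterogeneity
```

On this reading, a larger Log PSBF indicates a better fit relative to the base case, so including heterogeneity improves the probit model.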

Model Fits (cont.): Duration Model

                                                        Joint Choice & Duration
                              Model 1       Model 7     Model 9     Model 10
                              (No Covar.)                           (No Heter.)
BYTES                                          x           x           x
SRVRESP                                        x           x           x
RELOAD                                         x           x           x
ERROR                                          x           x           x
LN(CPAGE)                                      x           x           x
DYNAMIC                                        x           x           x
EXPECTED UTILITY                                           x           x
Contribution to Log PSBF     -193,627      -192,933    -192,917    -194,013
Number of Covariates            --             6           7           7
Log PSBF                                      694         710        -386

Model Parameters

                   Choice (1)                      Duration (2)
                   Mean     95% Prob. Interval     Mean     95% Prob. Interval
BYTES              0.211    [ 0.126,  0.308]      -0.351    [-0.400, -0.287]
LN(CMAKE + 1)      0.186    [ 0.105,  0.269]       n.s.     n.s.
LN(CPAGE)         -0.711    [-0.755, -0.668]      -0.027    [-0.046, -0.008]
LN(CSESSION)      -0.194    [-0.236, -0.151]       n.s.     n.s.
DYNAMIC            n.s.     n.s.                   0.544    [ 0.479,  0.598]
ERROR              0.427    [ 0.317,  0.542]      -0.246    [-0.303, -0.188]
LN(PREVDUR + 1)   -0.220    [-0.250, -0.189]       n.a.     n.a.
RELOAD            -0.050    [-0.102, -0.006]      -0.241    [-0.274, -0.208]
SRVRESP           -8.485    [-9.434, -7.445]      -1.883    [-2.072, -1.686]
INTERCEPT          1.885    [ 1.834,  1.943]       n.a.     n.a.
EXP UTILITY        n.a.     n.a.                   0.071    [ 0.047,  0.093]
?                  n.a.     n.a.                   0.988    [ 0.980,  0.996]
?                  n.a.     n.a.                   4.993    [ 4.635,  5.360]
?                  n.a.     n.a.                   0.012    [ 0.012,  0.013]

(1) cross-sectional means; (2) model parameters; n.a. = not applicable; n.s. = not significant
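The parameter table reports a posterior mean and a 95% probability interval for each coefficient. A minimal sketch of how such a summary could be computed from saved MCMC draws is given below. It assumes an equal-tailed interval and treats "not significant" as an interval that covers zero. Neither convention is stated on the slides, and the draws here are synthetic, not the study's output.

```python
import math
import random
import statistics


def posterior_summary(draws, mass: float = 0.95):
    """Posterior mean and equal-tailed probability interval from MCMC draws."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return statistics.fmean(ordered), lower, upper


def covers_zero(lower: float, upper: float) -> bool:
    """Assumed reading of 'n.s.': the interval includes zero."""
    return lower <= 0.0 <= upper


# Synthetic stand-in for 2,000 saved draws of one coefficient.
rng = random.Random(42)
draws = [rng.gauss(0.2, 0.05) for _ in range(2000)]

mean, lower, upper = posterior_summary(draws)
print(round(mean, 3), round(lower, 3), round(upper, 3), covers_zero(lower, upper))
```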

Individual Level Elasticities

             Choice                      Duration - Direct           Duration - Total
             Min      Max      Mean      Min      Max      Mean      Min      Max      Mean
BYTES       -0.184    0.136    0.010     0.001    0.665    0.105     0.001    0.667    0.101
CMAKE       -0.064    0.374    0.067     ---      ---      ---      -0.049    0.007   -0.016
CPAGE       -1.212   -0.000   -0.238     0.026    0.026    0.026     0.024    0.109    0.076
CSESSION    -0.358    0.027   -0.067     ---      ---      ---      -0.012    0.039    0.015
DYNAMIC      ---      ---      ---      -0.562    0.000   -0.094    -0.562    0.000   -0.094
ERROR       -0.009    0.127    0.002     0.000    0.257    0.025     0.000    0.245    0.023
PREVDUR     -0.524    0.077   -0.090     ---      ---      ---      -0.020    0.046    0.016
RELOAD      -0.166    0.050   -0.016     0.000    0.239    0.057     0.000    0.260    0.058
SRVRESP     -8.418    0.027   -0.044     0.000    0.914    0.007     0.000    0.970    0.008

Findings for CPAGE and CSESSION
[Diagram with columns # Pages Viewed and # Visits to the Site]
- Page Requests [P(Stay)]: Time Constraints + Involvement Learning
- Page View Duration: Involvement Learning + Involvement Compensating?

Additional Behavioral Findings
- A Possible Dynamic Tradeoff Between Page Views and Duration
  - Higher expected utility from another page view implies shorter expected duration of next page view
  - Higher previous page view duration lowers page request probability
- Cross Sectional Patterns
  - Users who are more sensitive to CPAGE and PREVDUR also have shorter page view durations (as revealed by correlations of the individual-level parameter estimates with the frailties)
- Costs vs Benefits

Implications for Managers
- Customization
  - Significant interest in customization of Internet experiences among managers and academics (e.g., Ansari and Mela 2000, Adar and Huberman 1999)
  - Effects consistent with across-site learning suggest limits on the benefits of site customization
- Site Design
  - Question of clutter versus clarity
    - Information per page
    - Number of page views needed to complete transaction
  - Findings for bytes per page and cumulative page views support a move to fewer pages and more information per page (site condensation)

Implications for Managers (cont'd)
- Site Metrics
  - Aggregate-level metrics of site stickiness differ from individual-level modeling results
  - Mix of new and returning visitors can change site usage and affect metrics

Next Steps
- Combine analysis of browsing behavior with prediction of purchase
- Diagnose site-related drivers of order decisions
- Develop a sequential model approach
  - Order conditional upon product configuration
  - Product configuration conditional upon search
  - Search conditional upon initial browsing experience
- Preliminary analysis reveals differences between those that configure or order a car and those that do not