Data Mining for Web Personalization

Similar documents
CS4491/CS 7265 BIG DATA ANALYTICS

Data Mining Classification: Alternative Techniques. Imbalanced Class Problem

Classification. Instructor: Wei Ding

List of Exercises: Data Mining 1 December 12th, 2015

Web Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Classification Part 4

Introduction to Data Mining

The MapReduce Framework

Weka ( )

CS 584 Data Mining. Classification 3

Data Mining Classification: Bayesian Decision Theory

NETWORK FAULT DETECTION - A CASE FOR DATA MINING

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Contents. Preface to the Second Edition

Evaluating Machine Learning Methods: Part 1

Evaluating Machine-Learning Methods. Goals for the lecture

Model s Performance Measures

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

9. Conclusions. 9.1 Definition KDD

IJITKMSpecial Issue (ICFTEM-2014) May 2014 pp (ISSN )

Probabilistic Classifiers DWML, /27

Evaluating Classifiers

ECLT 5810 Evaluation of Classification Quality

Part I: Data Mining Foundations

Chapter 1, Introduction

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Statistics 202: Statistical Aspects of Data Mining

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

Detecting DGA Malware Traffic Through Behavioral Models. Erquiaga, María José Catania, Carlos García, Sebastían

10 Classification: Evaluation

Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity.

Evaluating Classifiers

Agglomerative clustering on vertically partitioned data

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

ECE521 Lecture 18 Graphical Models Hidden Markov Models

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING

CS145: INTRODUCTION TO DATA MINING

INF 4300 Classification III Anne Solberg The agenda today:

Information Retrieval and Organisation

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

ECG782: Multidimensional Digital Signal Processing

DATA MINING LECTURE 11. Classification Basic Concepts Decision Trees Evaluation Nearest-Neighbor Classifier

Web Usage Mining from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer Chapter written by Bamshad Mobasher

Classification Algorithms in Data Mining

Lecture 25: Review I

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

EVALUATIONS OF THE EFFECTIVENESS OF ANOMALY BASED INTRUSION DETECTION SYSTEMS BASED ON AN ADAPTIVE KNN ALGORITHM

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

Evaluating Classifiers

Logistic Regression: Probabilistic Interpretation

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY

Part I. Classification & Decision Trees. Classification. Classification. Week 4 Based in part on slides from textbook, slides of Susan Holmes

Data Mining and Knowledge Discovery Practice notes 2

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

Survey on Process in Scalable Big Data Management Using Data Driven Model Frame Work

CS570: Introduction to Data Mining

Association Rule Mining and Clustering

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

A PROPOSED HYBRID BOOK RECOMMENDER SYSTEM

CS 6240: Parallel Data Processing in MapReduce: Module 1. Mirek Riedewald

Machine Learning in Action

Introduction to Information Retrieval

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Machine Learning in Python. Rohith Mohan GradQuant Spring 2018

Information Retrieval. Lecture 7

Data Mining and Knowledge Discovery: Practice Notes

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Can t you hear me knocking

Data Analytics on RAMCloud

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Network Traffic Measurements and Analysis

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

Performance Evaluation of Different Classifier for Big data in Data mining Industries

Data Mining Course Overview

DATA MINING LECTURE 9. Classification Decision Trees Evaluation

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Supervised Learning. Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

Data Mining and Knowledge Discovery: Practice Notes

Data Mining Concepts & Techniques

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Hierarchical Clustering 4/5/17

Introduction to MapReduce Algorithms and Analysis

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

DATA MINING LECTURE 9. Classification Basic Concepts Decision Trees Evaluation

Pattern Classification based on Web Usage Mining using Neural Network Technique

CISC 4631 Data Mining

Transcription:

Data Mining for Web Personalization Patrick Dudas Outline Personalization Data mining Examples Web mining MapReduce Data Preprocessing Knowledge Discovery Evaluation Information High 1

Personalization Goal of data mining approach is for automatic personalization Automatic Personalization: Content-based Collaborative Rule-based Rule-based (Brief overview) Create decision rules Implicitly/Explicitly Highly domain dependent Rules nontransferable Profiles are based on user input Biased Static Degrade over time 2

Content-based (Brief overview) Profile based on users past experiences and their interest (ratings) Think Amazon, Pandora, ebay.. Vector similarities based on cosine similarity Bayesian classification Remember: Ratings = Profile = Recommendation Collaborative (Brief overview) Creating groups of users based on ratings Nearest neighbor approach Once grouped, recommendation based on the other neighbors are presented More users or items = more dimensions of data Dynamic or real-time not applicable 3

Data Mining Data rich descriptions Large volumes of data reliable models Automated data collection Evaluate results/make decisions Integration with existing data sources Examples of Large Datasets http://aws.amazon.com/datasets Featured data sets: Illumina - Jay Flatley (CEO of Illumina) Human Genome Data Setcience 315(5814): 972. 350 GB YRI Trio Dataset 700GB Sloan Digital Sky Survey DR6 Subset 160 GB Genome, survey data, Google Books n-gram corpuses, traffic statistics, OpenStreetMap dataset, Wikipedia traffic 4

Data Mining Web Personalization Recommendations based on Web objects: Items Pages Documents Navigation by links Web mining Pros: Personalization (duh.), real-time, more enriched datasets Cons: Privacy issues, building complex systems that misrepresent the individual Extend the Data Mining Paradigm 3) Recommendation 1) Data Preparation and Transformation 2) Pattern Discovery 5

Data Preparation and Transformation Web logs Date/time usage Site information Resource requested (image, video, etc.) Site files/meta-data The power of the cookie Server-side cookies! Data Preparation and Transformation (cont.) Pageview: User actions (where they clicked and the path) User events (what they are trying to accomplish) Session: Sequence of page views 6

MapReduce Google design Hoodop implemented C++, C#, Erlang, Java, Ocaml, Perl, Python, Ruby, F#, R.. Example 7

Usage Data Pre-Processing Pattern Discovery We have data! Now what? Cluster Classification Association Rule Discovery Sequential pattern Discovery Markov Models Latent Variable Model 8

Clustering Partitioning Split your data into groups K-means Hierarchical Divisive (top-down) Start with everything, find groups Agglomerative (bottom-up) Start with a cluster and add additional information Model-based Building a model for the data (best fit) K-means 9

User-Based Clustering Start with the user profile Partition into k-groups of profiles Based on similarity Association Discovery Support min(support) Confidence min(confidence) 10

Evaluation (Personalization Model) Challenges: Recommendation algorithms may require unique set of evaluation metrics Personalization actions may be different Domain Intended application Data gathered Check for overfitting data Training set ROC Curve ROC Curve TPR = TP / (TP + FN) FPR = FP / (FP + TN) "Yes" "No" SN N.95.05.20.80 11

Information High Information is addictive Information can be misleading Information ethics Information is power, sometimes too powerful Personal Suggestions Develop a hypothesis Figure out what data is needed Make informed decisions Don t trust just your judgment Experts are experts for a reason! Develop a way to validate based on experience Then get more data if needed 12

Sources Kohavi, R. and F. Provost (2001). "Applications of data mining to electronic commerce." Data Mining and Knowledge Discovery 5(1): 5-10. Dean, J. and S. Ghemawat (2008). "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51(1): 107-113. Mobasher, B. (2007). "Data mining for web personalization." The adaptive web: 90-135. http://en.wikipedia.org/wiki/file:frequentitems.png Zaharia, M., A. Konwinski, et al. (2008). Improving mapreduce performance in heterogeneous environments, USENIX Association. Witten, I. H. and E. Frank (2002). "Data mining: practical machine learning tools and techniques with Java implementations." ACM SIGMOD Record 31 (1): 76-77. Senthil kumar, A. (2011). Knowledge discovery practices and emerging applications of data mining : trends and new domains. Hershey, PA, Information Science Reference. Thank you! Questions? 13