Collective Intelligence in Action

Collective Intelligence in Action
SATNAM ALAG
MANNING
Greenwich (74° w. long.)

contents

foreword
preface
acknowledgments
about this book

PART 1 GATHERING DATA FOR INTELLIGENCE

1 Understanding collective intelligence
  1.1 What is collective intelligence?
  1.2 CI in web applications
      Collective intelligence from the ground up: a sample application
      Benefits of collective intelligence
      CI is the core component of Web 2.0
      Harnessing CI to transform from content-centric to user-centric applications
  1.3 Classifying intelligence
      Explicit intelligence
      Implicit intelligence
      Derived intelligence
  1.4 Summary
  1.5 Resources

2 Learning from user interactions
  2.1 Architecture for applying intelligence
      Synchronous and asynchronous services
      Real-time learning in an event-driven system
      Polling services for non-event-driven systems
      Advantages and disadvantages of event-based and non-event-based architectures
  2.2 Basics of algorithms for applying CI
      Users and items
      Representing user information
      Content-based analysis and collaborative filtering
      Representing intelligence from unstructured text
      Computing similarities
      Types of datasets
  2.3 Forms of user interaction
      Rating and voting
      Emailing or forwarding a link
      Bookmarking and saving
      Purchasing items
      Click-stream
      Reviews
  2.4 Converting user interaction into collective intelligence
      Intelligence from ratings via an example
      Intelligence from bookmarking, saving, purchasing items, forwarding, click-stream, and reviews
  2.5 Summary
  2.6 Resources

3 Extracting intelligence from tags
  3.1 Introduction to tagging
      Tag-related metadata for users and items
      Professionally generated tags
      User-generated tags
      Machine-generated tags
      Tips on tagging
      Why do users tag?
  3.2 How to leverage tags
      Building dynamic navigation
      Innovative uses of tag clouds
      Targeted search
      Folksonomies and building a dictionary
  3.3 Extracting intelligence from user tagging: an example
      Items related to other items
      Items of interest for a user
      Relevant users for an item
  3.4 Scalable persistence architecture for tagging
      Reviewing other approaches
      Recommended persistence architecture
  3.5 Building tag clouds
      Persistence design for tag clouds
      Algorithm for building a tag cloud
      Implementing a tag cloud
      Visualizing a tag cloud

  3.6 Finding similar tags
  3.7 Summary
  3.8 Resources

4 Extracting intelligence from content
  4.1 Content types and integration
      Classifying content
      Architecture for integrating content
  4.2 The main CI-related content types
      Blogs
      Wikis
      Groups and message boards
  4.3 Extracting intelligence step by step
      Setting up the example
      Naive analysis
      Removing common words
      Stemming
      Detecting phrases
  4.4 Simple and composite content types
  4.5 Summary
  4.6 Resources

5 Searching the blogosphere
  5.1 Introducing the blogosphere
      Leveraging the blogosphere
      RSS: the publishing format
      Blog-tracking companies
  5.2 Building a framework to search the blogosphere
      The searcher
      The search parameters
      The query results
      Handling the XML response
      Exception handling
  5.3 Implementing the base classes
      Implementing the search parameters
      Implementing the result objects
      Implementing the searcher
      Parsing XML response
      Extending the framework
  5.4 Integrating Technorati
      Technorati search API overview
      Implementing classes for integrating Technorati
  5.5 Integrating Bloglines
      Bloglines search API overview
      Implementing classes for integrating Bloglines
  5.6 Integrating providers using RSS
      Generalizing the query parameters
      Generalizing the blog searcher
      Building the RSS 2.0 XML parser
  5.7 Summary
  5.8 Resources

6 Intelligent web crawling
  6.1 Introducing web crawling
      Why crawl the Web?
      The crawling process
      Intelligent crawling and focused crawling
      Deep crawling
      Available crawlers
  6.2 Building an intelligent crawler step by step
      Implementing the core algorithm
      Being polite: following the robots.txt file
      Retrieving the content
      Extracting URLs
      Making the crawler intelligent
      Running the crawler
      Extending the crawler
  6.3 Scalable crawling with Nutch
      Setting up Nutch
      Running the Nutch crawler
      Searching with Nutch
      Apache Hadoop, MapReduce, and Dryad
  6.4 Summary
  6.5 Resources

PART 2 DERIVING INTELLIGENCE

7 Data mining: process, toolkits, and standards
  7.1 Core concepts of data mining
      Attributes
      Supervised and unsupervised learning
      Key learning algorithms
      The mining process
  7.2 Using an open source data mining framework: WEKA
      Using the WEKA application: a step-by-step tutorial
      Understanding the WEKA APIs
      Using the WEKA APIs via an example
  7.3 Standard data mining API: Java Data Mining (JDM)
      JDM architecture
      Key JDM objects
      Representing the dataset
      Learning models
      Algorithm settings
      JDM tasks
      JDM connection
      Sample code for accessing DME
      JDM models and PMML
  7.4 Summary
  7.5 Resources

8 Building a text analysis toolkit
  8.1 Building the text analyzers
      Leveraging Lucene
      Writing a stemmer analyzer
      Writing a TokenFilter to inject synonyms and detect phrases
      Writing an analyzer to inject synonyms and detect phrases
      Putting our analyzers to work

  8.2 Building the text analysis infrastructure
      Building the tag infrastructure
      Building the term vector infrastructure
      Building the Text Analyzer class
      Applying the text analysis infrastructure
  8.3 Use cases for applying the framework
  8.4 Summary
  8.5 Resources

9 Discovering patterns with clustering
  9.1 Clustering blog entries
      Defining the text clustering infrastructure
      Retrieving blog entries from Technorati
      Implementing the k-means algorithms for text processing
      Implementing hierarchical clustering algorithms for text processing
      Expectation maximization and other examples of clustering high-dimension sparse data
  9.2 Leveraging WEKA for clustering
      Creating the learning dataset
      Creating the clusterer
      Evaluating the clustering results
  9.3 Clustering using the JDM APIs
      Key JDM clustering-related classes
      Clustering settings using the JDM APIs
      Creating the clustering task using the JDM APIs
      Executing the clustering task using the JDM APIs
      Retrieving the clustering model using the JDM APIs
  9.4 Summary
  9.5 Resources

10 Making predictions
  10.1 Classification fundamentals
      Learning decision trees by example
      Naive Bayes' classifier
      Belief networks
  10.2 Classifying blog entries using WEKA APIs
      Building the dataset for classifying blog entries
      Building the classifier class
  10.3 Regression fundamentals
      Linear regression
      Multi-layer perceptron (MLP)
      Radial basis functions (RBF)
  10.4 Regression using WEKA

  10.5 Classification and regression using JDM
      Key JDM supervised learning related classes
      Supervised learning settings using the JDM APIs
      Creating the classification task using the JDM APIs
      Executing the classification task using the JDM APIs
      Retrieving the classification model using the JDM APIs
  10.6 Summary
  10.7 Resources

PART 3 APPLYING INTELLIGENCE IN YOUR APPLICATION

11 Intelligent search
  11.1 Search fundamentals
      Search architecture
      Core Lucene classes
      Basic indexing and searching via example
  11.2 Indexing with Lucene
      Understanding the index format
      Modifying the index
      Incremental indexing
      Accessing the term frequency vector
      Optimizing indexing performance
  11.3 Searching with Lucene
      Understanding Lucene scoring
      Querying Lucene
      Sorting search results
      Querying on multiple fields
      Filtering
      Searching multiple indexes
      Using a HitCollector
      Optimizing search performance
  11.4 Useful tools and frameworks
      Luke
      Solr
      Compass
      Hibernate search
  11.5 Approaches to intelligent search
      Augmenting search with classifiers and predictors
      Clustering search results
      Personalizing results for the user
      Community-based search
      Linguistic-based search
      Data search
  11.6 Summary
  11.7 Resources

12 Building a recommendation engine
  12.1 Recommendation engine fundamentals
      Introducing the recommendation engine
      Item-based and user-based analysis
      Computing similarity using content-based and collaborative techniques
      Comparison of content-based and collaborative techniques

  12.2 Content-based analysis
      Finding similar items using a search engine (Lucene)
      Building a content-based recommendation engine
      Related items for document clusters
      Personalizing content for a user
  12.3 Collaborative filtering
      k-nearest neighbor
      Packages for implementing collaborative filtering
      Dimensionality reduction with latent semantic indexing
      Implementing dimensionality reduction
      Probabilistic model-based approach
  12.4 Real-world solutions
      Amazon item-to-item recommendation
      Google News personalization
      Netflix and the BellKor Solution for the Netflix Prize
  12.5 Summary
  12.6 Resources

index