Competitive Intelligence and Web Mining: Domain Specific Web Spiders

American University in Cairo (AUC)
CSCE 590: Seminar 1 Report
Dr. Ahmed Rafea

Khalid Magdy Salama

Table of Contents

Introduction to Web Mining
    Web Mining
    Types of Web Mining
Web Crawlers
    What is a Web Crawler
    Types of Web Crawlers
    Properties of Web Crawlers
Competitive Intelligence
    Introducing Competitive Intelligence
    Kinds of information to look for
    Places where information can be found
Proposal
    Problem Definition
    Motivation
    Objective
    Approach
References & Related Work

Introduction to Web Mining

Web Mining

Web mining aims to discover useful information or knowledge from the Web's hyperlink structure, page content, and usage data. It draws on many data mining techniques, such as supervised learning (classification), unsupervised learning (clustering), association rule mining, and sequential pattern mining. The Web has many unique characteristics that make mining useful information and knowledge from it a fascinating and challenging task. Some of these characteristics are as follows:

1. The amount of data/information on the Web is huge and still growing.
2. Data of all types exist on the Web, e.g., structured tables, semi-structured Web pages, unstructured texts, and multimedia files (images, audio, and video).
3. Information on the Web is heterogeneous.
4. Information on the Web is linked.
5. The information on the Web is noisy.
6. The Web is also about services.
7. The Web is dynamic: information on the Web changes constantly.
8. The Web is a virtual society: people, organizations, and automated systems interact through the Web.

Types of Web Mining

1. Web Usage Mining: refers to the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites. The goal is to capture, model, and analyze the behavioral patterns and profiles of users interacting with a Web site. The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common needs or interests.

2. Web Content Mining: extracts or mines useful information or knowledge from Web page contents. For example, we can automatically classify and cluster Web pages according to their topics. These tasks are similar to those in traditional data mining. However, we can also discover patterns in Web pages to extract useful data such as descriptions of products, postings of forums, etc., for many purposes. Furthermore, we can mine customer reviews and forum postings to discover consumer sentiments. These are not traditional data mining tasks.
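As a concrete illustration of topic-based classification in Web content mining, here is a minimal sketch in Python. The topic keyword sets are invented for the example; a real system would learn a classifier (e.g., over TF-IDF features) from labeled pages.

```python
# Toy topic profiles; a real content-mining system would learn these
# from a labeled collection of Web pages.
TOPICS = {
    "finance": {"stock", "market", "investment", "earnings"},
    "sports": {"match", "team", "score", "league"},
}

def classify(page_text):
    """Assign a page to the topic whose keyword set overlaps its words the most."""
    words = set(page_text.lower().split())
    return max(TOPICS, key=lambda topic: len(TOPICS[topic] & words))

print(classify("The team won the match with a record score"))          # sports
print(classify("Stock market earnings beat investment forecasts"))     # finance
```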

3. Web Structure Mining: discovers useful knowledge from hyperlinks (or links for short), which represent the structure of the Web. For example, from the links we can discover important Web pages, which, incidentally, is a key technology used in search engines. We can also discover communities of users who share common interests.

Web Crawlers

What is a Web Crawler

Web crawlers, also known as spiders or robots, are programs that automatically download Web pages and analyze them looking for any referenced Web pages. The pages discovered are in turn analyzed, and the process continues ad infinitum or until some stopping criterion is met. The pages discovered by the crawling process are usually treated as input to another system that further analyzes them for a variety of purposes, including updating the indexes of search engines, email harvesting, and website monitoring. Over the course of the paper, we will discuss crawler basics, the most common types of crawlers, the issues faced by contemporary crawlers, and finally shed light on some of the hot research topics regarding crawlers.
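The download-analyze-enqueue loop described above can be sketched compactly. This is an illustrative simulation only: the `WEB` dictionary stands in for real HTTP fetches, and links are extracted with Python's standard-library `HTMLParser`.

```python
from collections import deque
from html.parser import HTMLParser

# Toy stand-in for the Web: URL -> HTML body (a real crawler fetches over HTTP).
WEB = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="c">C</a>',
    "c": '<a href="a">A</a>',
}

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for (name, value) in attrs if name == "href"]

def crawl(seeds, max_pages=10):
    """Breadth-first crawl: fetch a page, extract its links, enqueue the new ones."""
    frontier = deque(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited or url not in WEB:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(WEB[url])
        frontier.extend(parser.links)   # newly discovered pages join the tail
    return visited

print(crawl(["a"]))  # ['a', 'b', 'c']
```

The `max_pages` cap plays the role of the stopping criterion mentioned above.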

Types of Web Crawlers

1. Universal Crawlers: Universal crawlers manage their frontiers as first-in-first-out (FIFO) queues; in this case the crawler acts as a breadth-first crawler. Pages to be analyzed are extracted from the head of the queue, while newly discovered pages are added to the tail.

2. Preferential Crawlers: Preferential crawlers manage their frontiers as priority queues; in this case they act as best-first crawlers. As the crawler adds a newly discovered URL to the frontier, the URL is assigned a priority based on the in-degree of the target page, content properties of the target page, the proximity of keywords to the discovered URL in the source page, or some other predefined measure. The URL with the highest priority currently in the frontier is the one de-queued for crawling. As with universal crawlers, preferential crawlers are sensitive to the seed pages with which they are initialized.

Properties of Web Crawlers

1. Quality (interesting objects): the retrieved objects are relevant to the user's focus of interest.
2. Volume: to retrieve as many quality objects as possible.
3. Freshness: how recent and up-to-date the retrieved objects are.
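The two frontier disciplines map directly onto standard data structures: a FIFO queue for universal crawlers and a priority queue for preferential ones. A minimal sketch, assuming the relevance scores have already been computed by whatever priority measure the crawler uses:

```python
import heapq
from collections import deque

def fifo_order(urls):
    """Universal crawler: de-queue URLs in discovery (breadth-first) order."""
    frontier = deque(urls)
    return [frontier.popleft() for _ in range(len(frontier))]

def best_first_order(scored_urls):
    """Preferential crawler: de-queue URLs highest-score first.
    heapq is a min-heap, so scores are negated to pop the best URL first."""
    frontier = [(-score, url) for url, score in scored_urls]
    heapq.heapify(frontier)
    return [heapq.heappop(frontier)[1] for _ in range(len(frontier))]

print(fifo_order(["a", "b", "c"]))                              # ['a', 'b', 'c']
print(best_first_order([("a", 0.2), ("b", 0.9), ("c", 0.5)]))   # ['b', 'c', 'a']
```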

Competitive Intelligence

Introducing Competitive Intelligence

Competitive intelligence (CI) refers to the process of gathering and analyzing information about products, domain constituents, customers, and competitors for the short-term and long-term planning needs of an organization. Many major companies, such as Ernst & Young and General Motors, have formal, well-organized CI units that enable managers to make informed decisions about critical business matters such as investment, marketing, and strategic planning. Traditionally, CI relied upon published company reports and other kinds of printed information. In recent years, the Internet has rapidly become an extremely good source of information about the competitive environment of companies; a 1997 Futures Group survey reported it to be one of the top five sources for CI professionals.

Kinds of information to look for:
1. Competitor profiling (background, finance, marketing, personnel).
2. Market products, product features, and performance.
3. New accounts, proposals, contracts, and financial incidents.
4. Customer preferences and opinions.
5. New technologies, R&D, and patents.

Places where information can be found:
1. Competitor Web sites
2. Supplier and customer Web sites
3. News Web sites
4. Data providers' Web sites
5. Community articles and blogs
6. Social networks

Proposal

Problem Definition

Dedicating human resources to harvesting relevant information from the World Wide Web has proven inefficient due to the vast amount of information available and its distribution over numerous online resources.

Motivation

Delivering a domain-specific crawler that can collect useful, separate pieces of information from around the Web automatically and efficiently can reduce the cost of the competitive intelligence process and give the firm using such a tool better insight.

Objective

The objective is to develop a CI Spider with the following capabilities:
1. Domain-aware spidering: continuously detecting business entities similar and/or related to the domain.
2. Automatic topic tracking: autonomously discovering emerging topics and gathering relevant content as it is added.
3. Information organization: assembling the gathered information into topic-based groups to facilitate analysis by domain experts.

Approach

1. Make use of domain knowledge. This can be achieved by utilizing ontologies and ontology-based search, making the spider domain-aware and focused on the entities of the specific domain it was designed for.
2. Use meta-search for seed links. Use the results of well-known search engines such as Google, Yahoo, and MSN as seeds for the CI Spider to crawl the Web looking for relevant information.
3. Integrate content-based and link-based similarity ranking. As a large amount of information is retrieved, it should be ranked by quality (relevance to the topic). The ranking should be based on both the content of the extracted page and the links related to it.
4. Relevance feedback. The CI Spider should be able to accept feedback from the user on whether a piece of information is relevant, so it can adjust itself accordingly.
5. Extract information from social networks. Mining social networks is a challenge in itself, but it is very useful for obtaining a competitive edge by gathering people's opinions, suggestions, and comments about a product of your company or a competitor.

References & Related Work

1. S. Chakrabarti, M. van den Berg, B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery, Proceedings of the 8th International World Wide Web Conference, Toronto, Canada, May 1999.
2. H. Chen, Y. Chung, M. Ramsey, C.C. Yang, An intelligent personal spider (agent) for dynamic Internet/Intranet searching, Decision Support Systems 23 (1) (1998) 41-58.
3. Dutka, Competitive Intelligence for the Competitive Edge, NTC Business Books, Chicago, IL, 1998.
4. T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, 1995.
5. C. Lin, H. Chen, J. Nunamaker, Verifying the proximity and size hypothesis for self-organizing maps, Journal of Management Information Systems 16 (3) (1999-2000) 61-73.

6. P. Maes, Agents that reduce work and information overload, Communications of the ACM 37 (7) (July 1994) 31-40.
7. J.J. McGonagle, C.M. Vella, The Internet Age of Competitive Intelligence, Quorum Books, London, 1999.
8. C.C. Yang, J. Yen, H. Chen, Intelligent Internet searching agent based on hybrid simulated annealing, Decision Support Systems 28 (2000) 269-277.
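Approach step 3 above, integrating content-based and link-based similarity ranking, can be sketched as a weighted combination. The Jaccard content measure and the 0.7/0.3 weighting below are illustrative assumptions, not choices prescribed by this report:

```python
def content_similarity(topic_terms, page_terms):
    """Content-based score: Jaccard overlap between topic and page vocabularies."""
    topic, page = set(topic_terms), set(page_terms)
    return len(topic & page) / len(topic | page) if topic | page else 0.0

def combined_score(content_sim, link_sim, alpha=0.7):
    """Blend content-based and link-based relevance; alpha weights content.
    link_sim could come from, e.g., the in-degree or neighborhood of the page."""
    return alpha * content_sim + (1 - alpha) * link_sim

content = content_similarity(["web", "mining"], ["web", "crawler", "mining"])  # 2/3
print(round(combined_score(content, link_sim=0.5), 2))  # 0.62
```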