In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,
|
|
- Helena Roberts
- 5 years ago
- Views:
Transcription
1 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to provide the web user with the appropriate content in response to his or her query. In most cases, the user is flooded with thousands of web pages in response to his or her query and it is common knowledge that not many users go past the first few web pages. In spite of the multitude of the pages returned, most of the time, the average user does not find what he or she is looking for in the first few pages he or she manages to examine. It is really debatable as to how useful or meaningful it is for any search engine to return thousands of web pages in response to a user query. In spite of the sophisticated page ranking algorithms employed by the search engines, the pages the user actually needs may actually get lost in the huge amount of information returned. Since most users of the web are not experts, grouping of the web pages into categories helps them to navigate quickly to the category they are actually interested and subsequently to the specific web page. This will reduce the search space for the user to a great extent. Web page categorization is the main focus of this thesis. It is strongly believed and felt that the experience of a person using a web search engine is enhanced multifold if the results are nicely categorized
2 2 as against the case where the results are displayed in a structure less, flat manner. 1.2 Information Retrieval Web search owes its roots to the traditional information retrieval systems, even though it differs from it in some ways. Information retrieval systems work in controlled environments and it is relatively easier to build effective systems. Web search engines, on the other hand, have to work in a relatively uncontrolled environment, where the content is dynamic in nature and also stored in different formats and also unstructured. Information Retrieval is the problem of selection of relevant information from a document database in response to search queries given by a user [1],[6]. Information Retrieval systems (IRSs) deal with document data bases that usually consist of textual information and process user queries to provide the user with access to relevant information within a reasonably acceptable time interval. An IRS consists of three main components: A document collection, which stores the actual documents and the representation of their information contents. Text documents are typically represented in terms of the index terms which act as the content identifiers of the documents.
3 3 A query subsystem, which allows the users to formulate their queries and presents the relevant documents retrieved by the system to them. To do so, it includes a query language that collects the rules to generate legitimate queries and procedures to select the relevant documents. A matching mechanism, which evaluates the degree to which the document representations match the requirements expressed in the query, and retrieves those documents that are deemed to be relevant to it. The retrieval model of most of the commercially available IRSs is the boolean one, which is a robust and well formulated model albeit with some limitations. For example, it does not consider partial relevance and is not able to rank the retrieved documents by relevance. Due to this fact, some paradigms have been designed to extend this retrieval model and overcome these problems, with the vector space model [6] being the most popular. 1.3 Web Page Retrieval Although the traditional IR techniques are a few decades old, they still constitute the base of modern Web search engines. The popularity of the Web has transformed traditional IRSs into newer and more powerful search tools for locating content on the Internet. However, there are several differences due to the special characteristics of the World Wide
4 4 Web environment. The problem of searching the Web has become far more complex than it was in the past[4] mainly due to the increase in the size of the search space by several orders of magnitude and due to the multimedia nature of Web documents. The main differences between Web Page retrieval and traditional IR are listed as follows [3]: (1) The HTML-based nature of Web documents, that make them present a structure defined by the HTML tags. (2) The diversity of Web documents in terms of: (i) (ii) (iii) length, structure, writing style language and domains and existing information formats (3) The dynamic nature of many Web pages, in the sense that the content may keep changing, that makes it difficult to retrieve them. The previous aspects clearly show how Web retrieval has to extend traditional IR in order to deal with the special nature of Web documents. However, this usually makes Web engines focus more on the efficiency of the response than on the retrieval efficacy.
5 5 1.4 Classification and Clustering Classification and clustering are the two tasks which have been traditionally carried out by human beings who are experts in the given domain. But in this electronic age, with the explosion in the amount of information available on the net, it is becoming increasingly difficult for human experts to classify or cluster all the documents available on the World Wide Web. Hence, it is increasingly evident that machine learning techniques be used instead of human experts, to carry out the tasks of web document classification and clustering, as part of the activity of categorizing them Document Classification Given a document, the task of a classifier is to identify to which class (category) the document belongs to, from a set of pre defined classes. It is an example of supervised learning. In order to carry out text classification, a classifier needs to be trained first, with sufficient number of representative samples from each of the classes. The samples or examples as they are often called are usually picked by human experts. One important application of text classification is to help build a vertical search engine. Vertical search engines restrict searches to a particular topic. For example, the query computer science on a vertical search engine for the topic India would return a list of Indian computer science departments with higher precision and recall than the query
6 6 computer science India on a general purpose search engine. Some other applications include several of the preprocessing steps necessary for indexing: detecting a document s encoding (ASCII, Unicode, UTF-8) word segmentation identifying the language of a document The automatic detection of spam pages Many classification tasks have traditionally been solved manually. But manual classification is expensive to scale to problems of larger size. A second approach is to carry out classification by the use of rules, which are often equivalent to Boolean expressions. A rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating the rules and maintaining them over time is labor-intensive. A technically skilled person can create rule sets that are very accurate. But often, it can be very difficult to find someone with this specialized skill. A third approach to text classification is based on machine learning. In machine learning, the set of rules or, more generally, the decision criterion of the text classifier is learned automatically from training data. This approach is also called statistical text classification if the learning method is statistical. In statistical text classification, a number of good example documents from each pre defined class are required for training the classifier. The human element is still not
7 7 eliminated since the training documents come from a person who is an expert in the area Document Clustering Document clustering is the process of grouping a set of documents into subsets or clusters. The goal is to create clusters that are coherent internally, but are different from each other. In other words, documents within a cluster should be as similar as possible and documents in one cluster should be as dissimilar as possible from documents in other clusters. Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes unlike in classification. In clustering, it is the data distribution that will determine the cluster membership. The key input to a clustering algorithm is the similarity measure. In document clustering, the similarity or distance measure is usually the vector space similarity or distance. Different similarity or distance measures give rise to different cluster formations. Thus, the similarity or distance measure is very critical to the outcome of clustering. The cluster hypothesis states the fundamental assumption that is made when using clustering in information retrieval: Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs. The hypothesis states that if there is a document from a cluster that is relevant to a search
8 8 request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many words. 1.5 Meta Search Engines A meta search engine does not crawl or index a document collection. Instead, it takes a query from the user, submits it to several search engines and then organizes the outputs received from all those search engines before displaying the results to the user. Since a given search engine covers only a part of the entire expanse of the world wide web at best, it seems like a good idea to collect the responses from several search engines for a given user query. This way, a bigger part of the World Wide Web can be accessed. Also, by collating the results from several search engines a much better ranking to the web pages[176] can be provided as they are displayed in the browser for the benefit of the user.
9 9 Figure1.1 Clustered search results shown by Clusty of Vivisimo The Vivísimo Clustering Engine[177], known by the name Clusty, uses clustering as the core technology. Document clustering methods never need to touch or know about the larger collection from which search results are taken, or undergo any other pre-processing steps. Organizing the search results occurs just before a user is shown the long list of search results. The final output is a hierarchy (or tree) on the left of a split screen with the search results on the right. The interaction is based on the familiar Microsoft Explorer style of interacting with a file system.
10 Motivation for Clustering the Search Results It is found that people will not search for long on the web. There is a limited average time they will spend before giving up, or becoming very upset with the search technology available to them. On average, Webrage is uncaged (sic) after twelve minutes of fruitless searching, although about seven percent of the 566 people surveyed by Roper Starch Worldwide say ire starts rising within three minutes [179]. When users become frustrated while searching the internet, they either try another search engine, re-formulate the search, or quit. For users, seeing clustered search results has three benefits[178]: Brings into easy view those search results that otherwise would remain invisible because they are far down the list. Allows users to examine nearly double the number of relevant documents than in the case of result lists of commercial search engines Leads to effortless knowledge discovery as the user learns the types or subtopics of available information relating to the query. Provides context by placing related documents within a single folder for joint viewing All of these factors have significant impact on a user s search productivity.
11 Objectives of the Research Work Categorization of search results returned by a search engine is the main focus of this thesis. The following are the main objectives of the work reported herein. To try and find a method that automatically determines the number of natural clusters (K) present in a set of web search results with reasonable accuracy. Once clustering is done, which is to know essentially which search results (web pages) belong together as a group, each web page can be labeled with the corresponding cluster number. This means that the individual web pages are automatically annotated with a label instead of a human expert doing that job. When the popular vector space model is used to represent web pages, the resulting dimensionality is usually very high, a few thousand terms, even for a moderately large data collection. In that context, feature selection for the purpose of dimensionality reduction becomes very essential. As rough set reducts are known to uncover data dependencies very well, another main objective of this work is to investigate their applicability in their current form to the domain of web page categorization and suggest any modifications if found necessary. Classification of instances of a dataset is carried out by a classifier after it learns the model from a training dataset. The training data
12 12 generally consists of instances which are labeled by a human expert. The labels are the classes into which the instances of the dataset are divided and are fixed by the human expert. The essence is that human intervention is required in the form of preparing the training data. Clustering of large datasets is universally accepted as a difficult problem, since it tries to group instances together, without the helping hand of a human supervisor. Also, algorithms such as K-Medoids have a very high time complexity and have runtimes which are unacceptably high, even for moderately large datasets. The third objective of the thesis is to integrate both clustering and classification in the context of web page categorization. In the proposed approach, clustering would be used instead of a human expert in preparing the training data for the classifier. 1.8 Organization of the Thesis The contents of the thesis are organized and presented in the following manner. Chapter 2 presents the literature review on the existing approaches for web page categorization. In chapter 3, a review of the vector space model is presented. Vector space model is a popular document representation model and it has been used in this thesis. Further, a brief introduction is given to
13 13 some of the existing similarity measures and clustering methods that are generally used in webpage categorization. Chapter 4 presents a brief overview of the rough set theory and the QuickReduct algorithm. QuickReduct algorithm can be used for the purpose of dimensionality reduction through feature selection. A review of the existing work related to QuickReduct is also presented. Chapter 5 presents the Find-K algorithm, which is one of the main contributions of this thesis. Along with the description of the algorithm, results and discussion are also presented. A graph based mathematical model is also presented here. In chapter 6, a modification to the existing rough set based QuickReduct algorithm is proposed. Chapter 7 presents a new approach to web page categorization, called Integrated Machine Learning Approach (IMLA). This method puts together clustering and classification by integrating the work reported in chapter 5 and 6 and thereby proposes a novel approach to web page categorization. Chapter 8 presents discussion and conclusions, limitations of the current work and future scope.
INTRODUCTION. Chapter GENERAL
Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which
More informationChapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction
CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle
More informationInternational ejournals
Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT
More informationDomain Specific Search Engine for Students
Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationA Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2
A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor
More informationA B2B Search Engine. Abstract. Motivation. Challenges. Technical Report
Technical Report A B2B Search Engine Abstract In this report, we describe a business-to-business search engine that allows searching for potential customers with highly-specific queries. Currently over
More informationInformation Discovery, Extraction and Integration for the Hidden Web
Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk
More informationCHAPTER 7 CONCLUSION AND FUTURE WORK
CHAPTER 7 CONCLUSION AND FUTURE WORK 7.1 Conclusion Data pre-processing is very important in data mining process. Certain data cleaning techniques usually are not applicable to all kinds of data. Deduplication
More informationText Mining. Representation of Text Documents
Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,
More informationautomatic digitization. In the context of ever increasing population worldwide and thereby
Chapter 1 Introduction In the recent time, many researchers had thrust upon developing various improvised methods of automatic digitization. In the context of ever increasing population worldwide and thereby
More informationChapter 2 BACKGROUND OF WEB MINING
Chapter 2 BACKGROUND OF WEB MINING Overview 2.1. Introduction to Data Mining Data mining is an important and fast developing area in web mining where already a lot of research has been done. Recently,
More informationModelling Structures in Data Mining Techniques
Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor
More informationTERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES
TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.
More informationAn Approach To Web Content Mining
An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research
More informationComparison of FP tree and Apriori Algorithm
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti
More informationCSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)
CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for
More informationEmpirical Analysis of Single and Multi Document Summarization using Clustering Algorithms
Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationCHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES
188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two
More informationFlat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017
Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:
More informationA Content Vector Model for Text Classification
A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.
More informationApplication of rough ensemble classifier to web services categorization and focused crawling
With the expected growth of the number of Web services available on the web, the need for mechanisms that enable the automatic categorization to organize this vast amount of data, becomes important. A
More informationUsing Clusters on the Vivisimo Web Search Engine
Using Clusters on the Vivisimo Web Search Engine Sherry Koshman and Amanda Spink School of Information Sciences University of Pittsburgh 135 N. Bellefield Ave., Pittsburgh, PA 15237 skoshman@sis.pitt.edu,
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationPORTAL RESOURCES INFORMATION SYSTEM: THE DESIGN AND DEVELOPMENT OF AN ONLINE DATABASE FOR TRACKING WEB RESOURCES.
PORTAL RESOURCES INFORMATION SYSTEM: THE DESIGN AND DEVELOPMENT OF AN ONLINE DATABASE FOR TRACKING WEB RESOURCES by Richard Spinks A Master s paper submitted to the faculty of the School of Information
More informationAN EFFICIENT PROCESSING OF WEBPAGE METADATA AND DOCUMENTS USING ANNOTATION Sabna N.S 1, Jayaleshmi S 2
AN EFFICIENT PROCESSING OF WEBPAGE METADATA AND DOCUMENTS USING ANNOTATION Sabna N.S 1, Jayaleshmi S 2 1 M.Tech Scholar, Dept of CSE, LBSITW, Poojappura, Thiruvananthapuram sabnans1988@gmail.com 2 Associate
More informationShrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A SURVEY ON WEB CONTENT MINING DEVEN KENE 1, DR. PRADEEP K. BUTEY 2 1 Research
More informationSSV Criterion Based Discretization for Naive Bayes Classifiers
SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,
More informationCitation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness
UvA-DARE (Digital Academic Repository) Exploring topic structure: Coherence, diversity and relatedness He, J. Link to publication Citation for published version (APA): He, J. (211). Exploring topic structure:
More informationInformation Retrieval May 15. Web retrieval
Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically
More informationData Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005
Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate
More information5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search
Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page
More informationText Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering
Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationTop Required Skills for SEO Specialists. Worldwide Research
Top Required Skills for SEO Specialists Worldwide Research FEBRUARY 2019 SEO jobs worldwide research In 2019 SEMrush Academy analyzed around 3000 SEO vacancies on Monster and Indeed, two large job search
More informationClustering of Data with Mixed Attributes based on Unified Similarity Metric
Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1
More informationDocument Clustering For Forensic Investigation
Document Clustering For Forensic Investigation Yogesh J. Kadam 1, Yogesh R. Chavan 2, Shailesh R. Kharat 3, Pradnya R. Ahire 4 1Student, Computer Department, S.V.I.T. Nasik, Maharashtra, India 2Student,
More informationOverview of Web Mining Techniques and its Application towards Web
Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous
More informationCHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM
CHAPTER 4 K-MEANS AND UCAM CLUSTERING 4.1 Introduction ALGORITHM Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering
More informationText Classification in Electronic Discovery. Disclaimer
Text in Electronic Discovery Dave Lewis, Ph.D. David D. Lewis Consulting, LLC www.daviddlewis.com Slides for lecture via Skype on February 23, 2012 in LBSC 708/INFM 718 (Seminar on E-Discovery, Univ. of
More informationSemantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman
Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information
More informationWEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS
1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,
More informationEfficient Content Based Image Retrieval System with Metadata Processing
IJIRST International Journal for Innovative Research in Science & Technology Volume 1 Issue 10 March 2015 ISSN (online): 2349-6010 Efficient Content Based Image Retrieval System with Metadata Processing
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationProvided by TryEngineering.org -
Provided by TryEngineering.org - Lesson Focus Lesson focuses on exploring how the development of search engines has revolutionized Internet. Students work in teams to understand the technology behind search
More informationTexture Image Segmentation using FCM
Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M
More informationHow are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments?
How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? A. Hossein Farajpahlou Professor, Dept. Lib. and Info. Sci., Shahid Chamran
More informationKeywords Data alignment, Data annotation, Web database, Search Result Record
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web
More informationUNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.
UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index
More informationhttp://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationWeb Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India
Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the
More informationWhat to come. There will be a few more topics we will cover on supervised learning
Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression
More informationA Comparative study of Clustering Algorithms using MapReduce in Hadoop
A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering
More informationSentiment analysis under temporal shift
Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely
More informationReview on Text Mining
Review on Text Mining Aarushi Rai #1, Aarush Gupta *2, Jabanjalin Hilda J. #3 #1 School of Computer Science and Engineering, VIT University, Tamil Nadu - India #2 School of Computer Science and Engineering,
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationHybrid Recommendation System Using Clustering and Collaborative Filtering
Hybrid Recommendation System Using Clustering and Collaborative Filtering Roshni Padate Assistant Professor roshni@frcrce.ac.in Priyanka Bane B.E. Student priyankabane56@gmail.com Jayesh Kudase B.E. Student
More informationHigh Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore
High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on
More informationA Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2
A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,
More informationText Mining: A Burgeoning technology for knowledge extraction
Text Mining: A Burgeoning technology for knowledge extraction 1 Anshika Singh, 2 Dr. Udayan Ghosh 1 HCL Technologies Ltd., Noida, 2 University School of Information &Communication Technology, Dwarka, Delhi.
More informationInformation Retrieval
Introduction to Information Retrieval SCCS414: Information Storage and Retrieval Christopher Manning and Prabhakar Raghavan Lecture 10: Text Classification; Vector Space Classification (Rocchio) Relevance
More informationWebBiblio Subject Gateway System:
WebBiblio Subject Gateway System: An Open Source Solution for Internet Resources Management 1. Introduction Jack Eapen C. 1 With the advent of the Internet, the rate of information explosion increased
More informationFull Website Audit. Conducted by Mathew McCorry. Digimush.co.uk
Full Website Audit Conducted by Mathew McCorry Digimush.co.uk 1 Table of Contents Full Website Audit 1 Conducted by Mathew McCorry... 1 1. Overview... 3 2. Technical Issues... 4 2.1 URL Structure... 4
More informationIJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:
IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T
More informationCHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES
70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically
More informationSTRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE
STRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE Wei-ning Qian, Hai-lei Qian, Li Wei, Yan Wang and Ao-ying Zhou Computer Science Department Fudan University Shanghai 200433 E-mail: wnqian@fudan.edu.cn
More informationInformation Retrieval (Part 1)
Information Retrieval (Part 1) Fabio Aiolli http://www.math.unipd.it/~aiolli Dipartimento di Matematica Università di Padova Anno Accademico 2008/2009 1 Bibliographic References Copies of slides Selected
More informationA Content Based Image Retrieval System Based on Color Features
A Content Based Image Retrieval System Based on Features Irena Valova, University of Rousse Angel Kanchev, Department of Computer Systems and Technologies, Rousse, Bulgaria, Irena@ecs.ru.acad.bg Boris
More informationSemi supervised clustering for Text Clustering
Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering
More informationPatent Image Retrieval
Patent Image Retrieval Stefanos Vrochidis IRF Symposium 2008 Vienna, November 6, 2008 Aristotle University of Thessaloniki Overview 1. Introduction 2. Related Work in Patent Image Retrieval 3. Patent Image
More informationNext Generation LMS Evaluation
Next Generation LMS Evaluation Summary of Individual Steering Committee Member Evaluations April 20th 2017 Participation The summary data here represents nine of the anticipated twelve individual evaluations
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems
More informationINF-3981 Master s Thesis in Computer Science. Inferring Image Semantics from Collection Information
INF-3981 Master s Thesis in Computer Science Inferring Image Semantics from Collection Information by Idar Thorvaldsen 06-02-2009 Faculty of Science Department of Computer Science University of Tromsø
More informationResPubliQA 2010
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
More informationMURDOCH RESEARCH REPOSITORY
MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout
More informationDiscovery of Agricultural Patterns Using Parallel Hybrid Clustering Paradigm
IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 PP 10-15 www.iosrjen.org Discovery of Agricultural Patterns Using Parallel Hybrid Clustering Paradigm P.Arun, M.Phil, Dr.A.Senthilkumar
More informationIncluding the Size of Regions in Image Segmentation by Region Based Graph
International Journal of Emerging Engineering Research and Technology Volume 3, Issue 4, April 2015, PP 81-85 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) Including the Size of Regions in Image Segmentation
More informationAutomated Online News Classification with Personalization
Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798
More informationEnhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering Recommendation Algorithms
International Journal of Mathematics and Statistics Invention (IJMSI) E-ISSN: 2321 4767 P-ISSN: 2321-4759 Volume 4 Issue 10 December. 2016 PP-09-13 Enhanced Web Usage Mining Using Fuzzy Clustering and
More informationClustering. Supervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationUnsupervised Learning
Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,
More informationWorld Wide Web has specific challenges and opportunities
6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has
More informationRanking Clustered Data with Pairwise Comparisons
Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances
More informationWordPress SEO. Basic SEO Practices Using WordPress. Leo Wadsworth LeoWadsworth.com
Basic SEO Practices Using WordPress Leo Wadsworth LeoWadsworth.com Copyright 2012, by Leo Wadsworth, all rights reserved. Unless you have specifically purchased additional rights, this work is for personal
More informationRanking models in Information Retrieval: A Survey
Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor
More informationElection Analysis and Prediction Using Big Data Analytics
Election Analysis and Prediction Using Big Data Analytics Omkar Sawant, Chintaman Taral, Roopak Garbhe Students, Department Of Information Technology Vidyalankar Institute of Technology, Mumbai, India
More informationIteration Reduction K Means Clustering Algorithm
Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster
More informationMachine Learning in Digital Security
Machine Learning in Digital Security White Paper www.seqrite.com Table of Contents 1. Introduction 2. Introduction to Machine Learning 3. Machine Learning usage in Security Industry 4. Clustering Samples
More informationAnalytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.
Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied
More informationSimilarity search in multimedia databases
Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:
More informationChapter 10. Conclusion Discussion
Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with
More informationCS47300 Web Information Search and Management
CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page
More information70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing
70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY 2004 ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing Jianping Fan, Ahmed K. Elmagarmid, Senior Member, IEEE, Xingquan
More informationEfficient Indexing and Searching Framework for Unstructured Data
Efficient Indexing and Searching Framework for Unstructured Data Kyar Nyo Aye, Ni Lar Thein University of Computer Studies, Yangon kyarnyoaye@gmail.com, nilarthein@gmail.com ABSTRACT The proliferation
More informationInformation Retrieval Spring Web retrieval
Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic
More informationKeywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More information