Project Report on winter
1 Project Report on winter
Yaxin Li, Xiaofeng Liu
October 17, 2017
Li, Liu October 17, / 31
2 Outline
- Introduction
- Basic Search Engine with Improvements
- Features
  - PageRank
  - Classification
  - Clustering
  - Word2vec & Doc2vec
3 Introduction
- Project Summary
  - Developed a basic search engine oriented mainly to academic paper search
  - Basic functions: information arrangement, query processing
  - Improvements: fast search, search over different attributes, more datasets, etc.
  - Added PageRank, a Naive Bayes classifier, LSI, K-Means, HAC, word2vec, and doc2vec to the project
- Software Configuration
  - Eclipse Neon Release (4.6.0)
  - Java packages: Lucene, Dom4j, La4j
  - Tomcat 8.0
  - RStudio
  - LaTeX
4 Basic Search Engine
- Data Source
- How to build the index
- How to search the index
- Search Engine:
  - backend: Tomcat + Servlet
  - frontend: JSP + HTML + CSS
5 Basic Search Engine - Data Sources
- CiteSeer: full text & meta data
  - .txt files ranged from
  - xml files - meta data
- SIGMOD, ICSE, VLDB: citation graph & meta data
  - papers, edges
6 Basic Search Engine - Organization
7 Basic Search Engine - Build Index
- Data Source
  - docid, title, authors, year, conference, fulltext, citation, simdocs
- Create a writer
    Directory dir = FSDirectory.open(Paths.get(indexPath));
    StandardAnalyzer analyzer = new StandardAnalyzer();
    analyzer.setVersion(Version.LUCENE_6_3_0);
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
    IndexWriter writer = new IndexWriter(dir, iwc);
8 Basic Search Engine - Build Index Cont.
- Create a new document
    Document doc = new Document();
    doc.add(new Field("title", PTitle, TextField.TYPE_STORED));
    doc.add(new Field("pages", pages == null ? "" : pages, TextField.TYPE_STORED));
    ...
- Add a document to the index
    writer.addDocument(doc);
- Write the documents into the index
    writer.forceMerge(1);
    writer.close();
9 Basic Search Engine - Search Index
- Create a searcher
    Directory dir = FSDirectory.open(Paths.get(indexPath));
    DirectoryReader ireader = DirectoryReader.open(dir);
    IndexSearcher isearcher = new IndexSearcher(ireader);
- Create a query
    QueryParser parser = new QueryParser("fulltext", analyzer);
    String q = "entropy";
    Query query = parser.parse(q);
- Get the results
    ScoreDoc[] hits = isearcher.search(query, ).scoreDocs;
    for (int i = 0; i < hits.length; i++) {
        ireader.document(hits[i].doc).get("id");
        ireader.document(hits[i].doc).get("title");
    }
10 Basic Search Engine - Improvements
- Keyword Highlights (Context Around Keywords)
- Searching for Different Attributes
- Searching for Phrases
- Speed Up Searching
11 Basic Search Engine - Improvements Cont.
- Highlight Keywords
    ScoreDoc[] hits = searcher.search(query, ).scoreDocs;
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<span><b>", "</b></span>");
    SimpleFragmenter fragmenter = new SimpleFragmenter();
    fragmenter.setFragmentSize(100);
    Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
    highlighter.setTextFragmenter(fragmenter);
- Speed up searching
  - Ranking & Return Limit (return at most the top 50 results for each query)
  - Pagination (get only 10 results for each search request)
  - Optimize the Index
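The return limit and pagination described above can be sketched in plain Java. This is a minimal illustration, not the project's servlet code: the class name `PaginationSketch`, the `page` helper, and the 1-based page numbering are assumptions; only the limits of 50 and 10 come from the slide.

```java
import java.util.ArrayList;
import java.util.List;

final class PaginationSketch {
    static final int MAX_RESULTS = 50; // return limit from the slide
    static final int PAGE_SIZE = 10;   // results per page from the slide

    /** Returns the slice of already ranked hits for a 1-based page number. */
    public static <T> List<T> page(List<T> rankedHits, int pageNumber) {
        int limit = Math.min(rankedHits.size(), MAX_RESULTS);
        int from = (pageNumber - 1) * PAGE_SIZE;
        if (pageNumber < 1 || from >= limit) {
            return new ArrayList<>(); // page is outside the capped result list
        }
        int to = Math.min(from + PAGE_SIZE, limit);
        return new ArrayList<>(rankedHits.subList(from, to));
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 73; i++) {
            hits.add(i); // stand-in for ranked document ids
        }
        System.out.println(page(hits, 5));
        System.out.println(page(hits, 6));
    }
}
```

Capping before slicing means a request for a page beyond the 50-result limit yields an empty list rather than deeper, lower-ranked hits.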
12 Features
- PageRank
- Naïve Bayes Classification
- Latent Semantic Indexing
- Clustering
  - K-Means
  - HAC
13 Features - PageRank
- Data Source
  - ICSE, VLDB, SIGMOD citation graph
- PageRank Calculation
  - Spider trap and dead end
- Sort papers using PageRank values, or
- Combine PageRank with the Lucene similarity score and sort papers by the combined scores
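The slides do not say how spider traps and dead ends were handled. A common approach, sketched here as an assumption, is power iteration with a damping (teleport) factor for spider traps and uniform redistribution of the rank held by dead-end papers. The class name, the adjacency-array input, and the damping value 0.85 are illustrative choices, not taken from the project.

```java
import java.util.Arrays;

final class PageRankSketch {
    /**
     * links[i] lists the papers cited by paper i (outgoing edges).
     * damping is the probability of following a citation instead of teleporting.
     */
    public static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    danglingMass += rank[i]; // dead end: no outgoing citations
                } else {
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;
                    }
                }
            }
            for (int i = 0; i < n; i++) {
                // teleport term counters spider traps; dead-end mass is spread evenly
                next[i] = (1.0 - damping) / n + damping * (next[i] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // papers 0 and 1 cite paper 2; paper 2 cites nothing (a dead end)
        int[][] links = {{2}, {2}, {}};
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

Because the dead-end mass is redistributed, the ranks keep summing to 1 after every iteration, which makes the values comparable across papers.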
14 Sort the Papers with PageRank
- Changes in Index Building
    doc.add(new DoubleDocValuesField("pagerank", pagerank));
- Changes in Searching
    // create a sort criterion
    SortedNumericSortField sf = new SortedNumericSortField("pagerank", SortField.Type.DOUBLE, true);
    Sort sort = new Sort(sf);
    // search the index according to the sorting criterion
    ScoreDoc[] hits = isearcher.search(query, 10000, sort).scoreDocs;
15 Sort the Papers with Combined Value
- When searching an index, Lucene gives each returned document a similarity score according to its text relevance
- We can also rank the documents by combining the text relevance and the link relevance
    Field titlef = new Field("title", title, TextField.TYPE_STORED);
    titlef.setBoost((float) pagerank);
    doc.add(titlef);
16 Features - PageRank Combine PageRank with Search Engine
17 Features - Naïve Bayes Classification
- Data Sources: ICSE and VLDB
- Definition of Terms: Bigrams & Unigrams
- Feature Selection
  - Mutual Information
    - Feature Size: 10, 100, 1000, 10,000, 100,000
  - χ² Feature Selection
    - set p value to 0.01, 0.05, 0.1 and 0.5
- Figure: Bi-chi2
- Figure: Uni-chi2
- Figure: Bi-MI
- Figure: Uni-MI
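The χ² feature selection step can be illustrated with the standard 2x2 contingency statistic for one term and one class. This is a sketch under assumptions: the count names `a`, `b`, `c`, `d` and the class name are hypothetical, and the slides do not show how the project computed the statistic.

```java
final class ChiSquareSketch {
    /**
     * a: documents in the class that contain the term
     * b: documents outside the class that contain the term
     * c: documents in the class without the term
     * d: documents outside the class without the term
     */
    public static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double diff = (double) a * d - (double) b * c;
        double denom = (double) (a + b) * (c + d) * (a + c) * (b + d);
        return denom == 0 ? 0.0 : n * diff * diff / denom;
    }

    /** Keeps a term when its statistic exceeds the critical value for the chosen p value. */
    public static boolean keep(double chi2, double criticalValue) {
        return chi2 > criticalValue;
    }

    public static void main(String[] args) {
        double independent = chiSquare(10, 10, 10, 10); // term spread evenly over both classes
        double dependent = chiSquare(20, 0, 0, 20);     // term occurs only inside the class
        System.out.println(independent + " " + dependent);
        System.out.println(keep(dependent, 6.635));
    }
}
```

With one degree of freedom, the critical values matching the p values on the slide are about 6.635 (p = 0.01), 3.841 (p = 0.05), 2.706 (p = 0.1), and 0.455 (p = 0.5), so a larger p value keeps more features.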
18 Evaluation of Classification
- 10-fold Cross Validation
- Results of Each Experiment
- Table: χ²
  - columns (p value): <0.01, <0.05, <0.1, <0.5
  - rows: uni-chi2-norm, bi-chi2-norm, uni-chi2-norm-rmsw, bi-chi2-norm-rmsw
- Table: Mutual Information
  - columns: FeatureNum
  - rows: uni-chi2-norm, bi-chi2-norm, uni-chi2-norm-rmsw, bi-chi2-norm-rmsw
19 Evaluation of Classification Cont.
- Figure: Evaluations of Classification, CHI2 (x-axis: P value at <0.01, <0.05, <0.1, <0.5; y-axis: F1 Value; series: bigram chi2 norm, bigram chi2 norm+rmsw, unigram chi2 norm, unigram chi2 norm+rmsw)
- Figure: Evaluations of Classification, Mutual Information (x-axis: Number of Features; y-axis: F1 Value; series: bigram mi norm, bigram mi norm+rmsw, unigram mi norm, unigram mi norm+rmsw)
20 Apply Classification on Website
21 Features - LSI
- Data Source
  - ICSE + VLDB
  - 1653 (words) × 997 (documents) tf-idf matrix
- SVD
- Clustering
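A term-by-document tf-idf matrix of the kind described above can be built as follows. The weighting variant used here (raw term frequency times ln(N/df)) is an assumption, since the slides do not state which tf-idf formula the project used; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class TfIdfSketch {
    /** Returns a matrix with one row per vocabulary term and one column per document. */
    public static double[][] tfIdf(List<List<String>> docs, List<String> vocabulary) {
        int n = docs.size();
        double[][] matrix = new double[vocabulary.size()][n];
        for (int t = 0; t < vocabulary.size(); t++) {
            String term = vocabulary.get(t);
            int[] tf = new int[n];
            int df = 0;
            for (int d = 0; d < n; d++) {
                tf[d] = Collections.frequency(docs.get(d), term);
                if (tf[d] > 0) {
                    df++; // number of documents containing the term
                }
            }
            if (df == 0) {
                continue; // term never occurs: row stays all zeros
            }
            double idf = Math.log((double) n / df);
            for (int d = 0; d < n; d++) {
                matrix[t][d] = tf[d] * idf;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("query", "index", "query"),
                Arrays.asList("index"));
        double[][] m = tfIdf(docs, Arrays.asList("query", "index"));
        System.out.println(Arrays.deepToString(m));
    }
}
```

A term that appears in every document gets weight zero under this variant, which is why very common words contribute nothing to the later SVD.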
22 LSI - SVD
- SVD calculation with la4j.jar
  - U: SVD term matrix
  - D: singular value matrix
  - V: SVD doc matrix
- D_k V is used for clustering the documents
  - D_k keeps only the k largest singular values
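In standard notation, the decomposition and the reduced document representation read as follows. The transposes are an assumption about convention: the slide writes the product simply as D_k V, and whether the library returns V or its transpose is not stated.

```latex
\begin{align}
A &= U D V^{\top}, \qquad A \in \mathbb{R}^{1653 \times 997} \\
A_k &= U_k D_k V_k^{\top}, \qquad D_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad \sigma_1 \ge \dots \ge \sigma_k \\
\hat{d}_j &= D_k V_k^{\top} e_j \in \mathbb{R}^{k}
\end{align}
```

Here A is the tf-idf matrix (terms by documents), and \hat{d}_j is the k-dimensional vector for document j, that is, column j of D_k V_k^{\top}. These columns are the vectors passed to the clustering step.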
23 D_k V, when k = 100
24 Features - Clustering
- K-Means
  - distance measurement:
    - Euclidean distance
    - Normalized Euclidean distance (Euclidean distance on length-normalized vectors), which ranks pairs the same way as cosine similarity
  - computed more than 10,000 times
- Hierarchical Clustering
  - distance measurement: cosine similarity
  - method: single link, complete link, centroid link
- Table: summary
    Data Source            Method        Distance               Purity
    Vectors from SVD       HAC-single    Cosine similarity
    Vectors from SVD       HAC-centroid  Cosine similarity
    Vectors from SVD       HAC-complete  Cosine similarity
    Vectors from SVD       K-Means       Euclidean
    Vectors from SVD       K-Means       Normalized Euclidean
    Vectors from doc2vec   K-Means       Normalized Euclidean
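Purity, the measure reported in the summary table, is the fraction of documents that belong to the majority class of their cluster. A minimal sketch, with hypothetical integer cluster ids and class labels (for example 0 = ICSE, 1 = VLDB):

```java
import java.util.HashMap;
import java.util.Map;

final class PuritySketch {
    /** purity = (1/N) * sum over clusters of the size of the cluster's majority class. */
    public static double purity(int[] clusterIds, int[] classLabels) {
        if (clusterIds.length != classLabels.length) {
            throw new IllegalArgumentException("one cluster id and one label per document");
        }
        if (clusterIds.length == 0) {
            return 0.0;
        }
        // cluster id -> (class label -> number of documents)
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        for (int i = 0; i < clusterIds.length; i++) {
            counts.computeIfAbsent(clusterIds[i], k -> new HashMap<>())
                  .merge(classLabels[i], 1, Integer::sum);
        }
        int majoritySum = 0;
        for (Map<Integer, Integer> perClass : counts.values()) {
            int best = 0;
            for (int c : perClass.values()) {
                best = Math.max(best, c);
            }
            majoritySum += best;
        }
        return (double) majoritySum / clusterIds.length;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 0, 1, 1, 1};
        int[] labels = {0, 0, 1, 1, 1, 0};
        System.out.println(purity(clusters, labels));
    }
}
```

Purity rises trivially as the number of clusters grows, so it is most informative when all methods are compared at the same cluster count.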
25 Hierarchical Clustering
- data source: the vectors from SVD
- distance: cosine similarity
- single link - purity:
26 Hierarchical Clustering
- data source: the vectors from SVD
- distance: cosine similarity
- centroid - purity:
27 Hierarchical Clustering
- data source: the vectors from SVD
- distance: cosine similarity
- complete - purity:
28 K-Means Clustering
- data source: the vectors from SVD
- distance: Euclidean distance
- purity:
29 K-Means Clustering
- data source: the vectors from SVD
- distance: normalized Euclidean distance, which is equivalent to cosine similarity
- purity:
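The equivalence claimed above follows from the identity ||u - v||² = 2 - 2·cos(a, b), where u and v are a and b scaled to unit length. The sketch below checks the identity numerically; the class and method names are illustrative and the sample vectors are arbitrary.

```java
final class CosineEuclideanSketch {
    public static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    /** Scales v to unit length; a zero vector is returned unchanged. */
    public static double[] normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = norm == 0 ? 0.0 : v[i] / norm;
        }
        return out;
    }

    public static double cosine(double[] a, double[] b) {
        double denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return denom == 0 ? 0.0 : dot(a, b) / denom;
    }

    public static double squaredEuclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        double lhs = squaredEuclidean(normalize(a), normalize(b));
        double rhs = 2.0 - 2.0 * cosine(a, b);
        System.out.println(lhs + " vs " + rhs);
    }
}
```

Since the squared distance is a decreasing function of the cosine, K-Means on unit-length vectors assigns each point to the same nearest centroid direction that a cosine-based comparison would pick.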
30 K-Means Clustering
- data source: the vectors from doc2vec
- distance: normalized Euclidean distance
- purity:
31 Features - Word2vec & Doc2vec
- Word2vec
  - similar words have similar vectors
  - recommend similar queries to users
  - train the data with C code from GitHub, and use the vectors from Java
- Doc2vec
  - similar documents have similar vectors
  - paper recommendation
  - implemented with Python (gensim)
- Figure: recommend queries
- Figure: recommend docs
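Once the vectors are loaded into Java, recommending similar queries (or similar papers) reduces to a nearest-neighbour lookup by cosine similarity. This is a minimal sketch with made-up two-dimensional vectors; the class name, the in-memory map, and the brute-force scan are assumptions, not the project's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SimilarWordsSketch {
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        double denom = Math.sqrt(na) * Math.sqrt(nb);
        return denom == 0 ? 0.0 : dot / denom;
    }

    /** Returns up to topN other keys, ordered by decreasing cosine similarity to the query. */
    public static List<String> mostSimilar(String query, Map<String, double[]> vectors, int topN) {
        List<String> candidates = new ArrayList<>();
        double[] q = vectors.get(query);
        if (q == null) {
            return candidates; // unknown word: nothing to recommend
        }
        for (String w : vectors.keySet()) {
            if (!w.equals(query)) {
                candidates.add(w);
            }
        }
        candidates.sort(Comparator.comparingDouble((String w) -> -cosine(q, vectors.get(w))));
        return new ArrayList<>(candidates.subList(0, Math.min(topN, candidates.size())));
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = new HashMap<>();
        vectors.put("entropy", new double[]{1.0, 0.0});      // toy vectors, not trained
        vectors.put("information", new double[]{0.9, 0.1});
        vectors.put("database", new double[]{0.0, 1.0});
        System.out.println(mostSimilar("entropy", vectors, 2));
    }
}
```

The same lookup works for doc2vec by keying the map on document ids instead of words.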
32 Demo
CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John
More informationLUCENE - TERMRANGEQUERY
LUCENE - TERMRANGEQUERY http://www.tutorialspoint.com/lucene/lucene_termrangequery.htm Copyright tutorialspoint.com Introduction TermRangeQuery is the used when a range of textual terms are to be searched.
More informationIndexing in Search Engines based on Pipelining Architecture using Single Link HAC
Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily
More informationInformation Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.
Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.
More informationAutomated Identification of Computer Science Research Papers
University of Windsor Scholarship at UWindsor Electronic Theses and Dissertations 2016 Automated Identification of Computer Science Research Papers Tong Zhou University of Windsor Follow this and additional
More informationCS290H Graph Laplacians and Spectra. Final Project Report. Categorization of biomedical articles with spectral clustering. By Arvind C.
CS290H Graph Laplacians and Spectra Final Project Report Categorization of biomedical articles with spectral clustering By Arvind C. Rajasekaran Abstract Clustering is the process of grouping together
More informationCSE 494: Information Retrieval, Mining and Integration on the Internet
CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:
More informationBibliometrics: Citation Analysis
Bibliometrics: Citation Analysis Many standard documents include bibliographies (or references), explicit citations to other previously published documents. Now, if you consider citations as links, academic
More information