n-gram A Large-scale N-gram Search System Using Inverted Files Susumu Yata, 1, 2 Kazuhiro Morita, 1 Masao Fuketa 1 and Jun-ichi Aoe 1 TSUBAKI 2)

Size: px
Start display at page:

Download "n-gram A Large-scale N-gram Search System Using Inverted Files Susumu Yata, 1, 2 Kazuhiro Morita, 1 Masao Fuketa 1 and Jun-ichi Aoe 1 TSUBAKI 2)"

Transcription

1 n-gram 1, n-gram n-gram A Large-scale N-gram Search System Using Inverted Files Susumu Yata, 1, 2 Kazuhiro Morita, 1 Masao Fuketa 1 and Jun-ichi Aoe 1 Recently, a search system has been proposed for large-scale n-gram data. The system uses tries for indexing n-gram data because of its retrieval performance. However, the disk usage and the construction cost of the trie index are very large. In contrast, inverted files could be easily constructed but require random access to secondary storage in retrieval. Therefore, this paper reports the actual performance of inverted files for large-scale n-gram data. 1. 1) 1 University of Tokushima 2 JSPS Research Fellow TSUBAKI 2) Wikipedia Google n-gram 3),4) n-gram 5) 7) 8) 9) 10) Document Frequency DF 11) n-gram n-gram 100 GB n-gram 1-gram 6) MySQL 9) n-gram Minimal Prefix MP 12) 13),14) 1-gram n-gram k-gram k C k/ gram 4) GB 15) n-gram n-gram 2 3 n-gram 4 1 c 2009 Information Processing Society of Japan

2 1 D Table 1 Word segmentation of documents in D ID Document Segmented words (separated by / ) 1 <S> / / / / / / </S> 2 <S> / / / / / / </S> 3 <S> / / / / / / </S> ID DocID 1 </S> 1, 2, 3 2 <S> 1, 2, 3 3 1, 2, 3 2 Table 2 An inverted index using segmented words ID DocID 4 1, 2, 3 5 1, 2, 3 6 1, 2, ID DocID n-gram D = {d 1, d 2, d 3 } ID ID DocID 2 ID ID ID DocID 2 DocID 1, 2, 3 DocID 1 d 1 DocID Variable Byte Code DocID 15) 2.2 n-gram n-gram 4) 1-gram 3 D 2-gram Table 3 2-gram data extracted from documents in D ID ID Frequency 1 2, 6 <S> 3 2 3, 1 </S> 3 3 4, , , 7 1 ID ID Frequency 6 5, , , , , 4 1 n-gram D 2-gram 3 1-gram n-gram DocID n-gram n-gram n-gram n-gram 3 Solid State Disk SSD 4 3. n-gram 3.1 n-gram n-gram 2 c 2009 Information Processing Society of Japan

3 n-gram 40% ( 1 ) 1-gram ID ( 2 ) n-gram ID ( 3 ) n-gram n-gram 1 n-gram ( i ) Dictionary lookup : ID ( ii ) Index lookup : n-gram ( iii ) N-gram lookup : n-gram ( iv ) Decode : n-gram Index lookup n-gram n-gram 1 N-gram database : n-gram n-gram Inverted index : N-gram database n-gram ID ID dictionary : ID ID 16) ID 3.2 n-gram n-gram n-gram n-gram Query Inverted index Fig. 1 Dictionary lookup ID Index lookup dictionary N-gram addresses Encoded n-grams N-gram lookup Decode 1 n-gram An n-gram search system using an inverted index N-grams N-gram database Search result n-gram DocID n-gram 1-gram n-gram n-gram n-gram 3 n-gram n-gram 2 1 Index lookup N-gram lookup Database lookup Database lookup : n-gram 2 Inverted database 1 Inverted Index N-gram database Inverted database : n-gram ID ID 3 c 2009 Information Processing Society of Japan

4 4 n-gram kb Query Dictionary lookup ID dictionary Database lookup Decode Encoded n-grams N-grams Search result Table 4 Disk usage of large-scale n-gram search systems (kb) Dic Inverted Idx N-gram DB Inverted DB Total (a) Uncompressed 76,818 29,757, ,221, ,055,762 (b) Compressed 148,126 26,969,347 29,033, ,151,370 (c) Extended 148, ,720, ,720,463 Fig. 2 Inverted database 2 n-gram An n-gram search system using an inverted database 5 Table 5 Average, median and maximum search time (s) 1-gram 3-gram 5-gram Ave Med Max Ave Med Max Ave Med Max (a) hdd (b) hdd (b ) ssd (c) hdd n-gram n-gram n-gram n-gram 64-bit Linux Google n-gram 4) 1-gram 7-gram 32 n-gram 3 4 ( a ) Uncompressed : n-gram ( b ) Compressed : n-gram ( c ) Extended : n-gram 1-gram, 3-gram, 5-gram 1,000 Core 2 Quad 2.83 GHz, 8 GB RAM, 1.5 TB HDDx2 (RAID 0), 256 GB SSD 24 HDD Compressed SSD SSD b SSD b n-gram SSD a b n-gram 40 % % b b HDD SSD % SSD b c b 3 4 c 2009 Information Processing Society of Japan

5 10 3 Fig. 3 Inverted indices 10-4 (a) Uncompressed (b ) Compressed Inverted database 10-4 (c) Extended gram n-gram Dependencies between the number of results and the search time for unigram queries Inverted indices (a) Uncompressed (b ) Compressed 10 3 Fig. 4 Inverted database (c) Extended gram n-gram Dependencies between the number of results and the search time for 5-gram queries c b 1/10 b n-gram c HDD SSD b 1-gram n-gram 3 a b n-gram c n-gram 5-gram 4 3 a b n-gram 13) 6 Table 6 6 Differences between an inverted database and a trie index 13) 1-gram 7-gram 32 7-gram GB 490 GB 1 GB 24 4 GB gram AND gram AND SSD n-gram 5. n-gram 5 c 2009 Information Processing Society of Japan

6 n-gram 13) 1) Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, London, England, 6 edition, ) Keiji Shinzato, Tomohide Shibata, Daisuke Kawahara, Chikara Hashimoto, and Sadao Kurohashi. TSUBAKI: An Open Search Engine Infrastructure for Developing New Information Access Methodology. In IJCNLP2008, pp , Hyderabad, India, January ) Thorsten Brants and Alex Franz. Web 1T 5-gram Version 1. upenn.edu/catalog/catalogentry.jsp?catalogid=ldc2006t13, ),. Web N 1. GSK2007-C/catalog.html, ) Andrew Finch, Etienne Denoual, Hideo Okuma, Michael Paul, Hirofumi Yamamoto, Keiji Yasuda, Ruiqiang Zhang, and Eiichiro Sumita. The NICT/ATR Speech Translation System for IWSLT In IWSLT 2007, pp , Trento, Italy, October ) Marcello Federico and Mauro Cettolo. Efficient Handling of N-gram Language Models for Statistical Machine Translation. In the 2nd Workshop on Statistical Machine Translation, pp , Prague, Czech Republic, June ) Jing Zheng, Wen Wang, and Necip Fazil Ayan. Development of SRI s Translation Systems for Broadcast News and Broadcast Conversations. In Interspeech 2008, Brisbane, Australia, September ) Ghinwa F. Choueiter, Stephanie Seneff, and James R. Glass. New Word Acquisition Using Subword Modeling. In Interspeech 2007, pp , Antwerp, Belgium, August ) Andrew Carlson and Ian Fette. Memory-Based Context-Sensitive Spelling Correction at Web Scale. In ICMLA 2007, pp , Cincinnati, Ohio, USA, December ) Steven Bethard and James H. Martin. Learning Semantic Links from a Corpus of Parallel Temporal and Causal Relations. In ACL-08: HLT, Short Papers (Companion Volume), pp , Columbus, Ohio, USA, June ) Martin Klein and Michael L. Nelson. Correlation of Count and Document Frequency for Google N-Grams. In ECIR 2009, pp , Toulouse, France, April ) John A. Dundas III. Implementing Dynamic Minimal Prefix Tries. Software Practice and Experience, Vol.21, No.10, pp , October ). N Google 7. 14, pp , March ) Satoshi Sekine. A Linguistic Knowledge Discovery Tool: Very Large Ngram Database Search with Arbitrary Wildcards. In Coling 2008: Companion volume: Demonstrations, pp , Manchester, UK, August ) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, ).., Vol. J71 D, No.9, pp , c 2009 Information Processing Society of Japan

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

A Retrieval Method for Double Array Structures by Using Byte N-Gram

A Retrieval Method for Double Array Structures by Using Byte N-Gram A Retrieval Method for Double Array Structures by Using Byte N-Gram Masao Fuketa, Kazuhiro Morita, and Jun-Ichi Aoe Abstract Retrieving keywords requires speed and compactness. A trie is one of the data

More information

Summarizing Search Results using PLSI

Summarizing Search Results using PLSI Summarizing Search Results using PLSI Jun Harashima and Sadao Kurohashi Graduate School of Informatics Kyoto University Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan {harashima,kuro}@nlp.kuee.kyoto-u.ac.jp

More information

Comparisons of Efficient Implementations for DAWG

Comparisons of Efficient Implementations for DAWG Comparisons of Efficient Implementations for DAWG Masao Fuketa, Kazuhiro Morita, and Jun-ichi Aoe Abstract Key retrieval is very important in various applications. A trie and DAWG are data structures for

More information

Experimental Observations of Construction Methods for Double Array Structures Using Linear Functions

Experimental Observations of Construction Methods for Double Array Structures Using Linear Functions Experimental Observations of Construction Methods for Double Array Structures Using Linear Functions Shunsuke Kanda*, Kazuhiro Morita, Masao Fuketa, Jun-Ichi Aoe Department of Information Science and Intelligent

More information

Improvement of Web Search Results using Genetic Algorithm on Word Sense Disambiguation

Improvement of Web Search Results using Genetic Algorithm on Word Sense Disambiguation Volume 3, No.5, May 24 International Journal of Advances in Computer Science and Technology Pooja Bassin et al., International Journal of Advances in Computer Science and Technology, 3(5), May 24, 33-336

More information

TSUBAKI: An Open Search Engine Infrastructure for Developing New Information Access Methodology

TSUBAKI: An Open Search Engine Infrastructure for Developing New Information Access Methodology TSUBAKI: An Open Search Engine Infrastructure for Developing New Information Access Methodology Keiji Shinzato, Tomohide Shibata, Daisuke Kawahara, Chikara Hashimoto and Sadao Kurohashi Graduate School

More information

Efficient Data Structures for Massive N-Gram Datasets

Efficient Data Structures for Massive N-Gram Datasets Efficient ata Structures for Massive N-Gram atasets Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy giulio.pibiri@di.unipi.it Rossano Venturini University of Pisa and ISTI-CNR Pisa, Italy

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

Pushing the Limits. ADSM Symposium Sheelagh Treweek September 1999 Oxford University Computing Services 1

Pushing the Limits. ADSM Symposium Sheelagh Treweek September 1999 Oxford University Computing Services 1 Pushing the Limits ADSM Symposium Sheelagh Treweek sheelagh.treweek@oucs.ox.ac.uk September 1999 Oxford University Computing Services 1 Overview History of ADSM services at Oxford October 1995 - started

More information

AKIKO MANADA. The University of Electro-Communications 1-5-1, Chofugaoka, Chofu, Tokyo, , JAPAN

AKIKO MANADA. The University of Electro-Communications 1-5-1, Chofugaoka, Chofu, Tokyo, , JAPAN Curriculum Vitæ AKIKO MANADA The University of Electro-Communications 1-5-1, Chofugaoka, Chofu, Tokyo, 182-8585, JAPAN Email: amanada@uec.ac.jp WORK EXPERIENCE Assistant Professor: February 2012 Present

More information

Temporal Information Access (Temporalia-2)

Temporal Information Access (Temporalia-2) Temporal Information Access (Temporalia-2) NTCIR-12 Core Task http://ntcirtemporalia.github.io Hideo Joho (University of Tsukuba) Adam Jatowt (Kyoto University) Roi Blanco (Yahoo! Research) Haitao Yu (University

More information

Information Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007

Information Retrieval. Lecture 3 - Index compression. Introduction. Overview. Characterization of an index. Wintersemester 2007 Information Retrieval Lecture 3 - Index compression Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Dictionary and inverted index:

More information

Faster and Smaller N-Gram Language Models

Faster and Smaller N-Gram Language Models Faster and Smaller N-Gram Language Models Adam Pauls Dan Klein Computer Science Division University of California, Berkeley {adpauls,klein}@csberkeleyedu Abstract N-gram language models are a major resource

More information

CS 102. Big Data. Spring Big Data Platforms

CS 102. Big Data. Spring Big Data Platforms CS 102 Big Data Spring 2016 Big Data Platforms How Big is Big? The Data Data Sets 1000 2 (5.3 MB) Complete works of Shakespeare (text) 1000 3 (~5-500 GB) Your data 1000 4 (10 TB) Library of Congress (text)

More information

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report

More information

HITSZ-ICRC: An Integration Approach for QA TempEval Challenge

HITSZ-ICRC: An Integration Approach for QA TempEval Challenge HITSZ-ICRC: An Integration Approach for QA TempEval Challenge Yongshuai Hou, Cong Tan, Qingcai Chen and Xiaolong Wang Department of Computer Science and Technology Harbin Institute of Technology Shenzhen

More information

Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval David Guthrie Computer Science Department University of Sheffield D.Guthrie@dcs.shef.ac.uk Mark Hepple Computer Science

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Department of Information, Evidence & Research. International Clinical Trials Registry Platform. ICTRP Search Portal Revisions Document

Department of Information, Evidence & Research. International Clinical Trials Registry Platform. ICTRP Search Portal Revisions Document Department of Information, Evidence & Research International Clinical Trials Registry Platform ICTRP Search Portal Revisions Document Created on 18/02/2009 Last update on 11/1/2018 Table of contents Introduction..

More information

Lattice Rescoring for Speech Recognition Using Large Scale Distributed Language Models

Lattice Rescoring for Speech Recognition Using Large Scale Distributed Language Models Lattice Rescoring for Speech Recognition Using Large Scale Distributed Language Models ABSTRACT Euisok Chung Hyung-Bae Jeon Jeon-Gue Park and Yun-Keun Lee Speech Processing Research Team, ETRI, 138 Gajeongno,

More information

A Compression Technique Based On Optimality Of LZW Code (OLZW)

A Compression Technique Based On Optimality Of LZW Code (OLZW) 2012 Third International Conference on Computer and Communication Technology A Compression Technique Based On Optimality Of LZW (OLZW) Utpal Nandi Dept. of Comp. Sc. & Engg. Academy Of Technology Hooghly-712121,West

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Language Models Language models are distributions over sentences N gram models are built from local conditional probabilities Language Modeling II Dan Klein UC Berkeley, The

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval WS 2008/2009 25.11.2008 Information Systems Group Mohammed AbuJarour Contents 2 Basics of Information Retrieval (IR) Foundations: extensible Markup Language (XML)

More information

QuickSpecs. PCIe Solid State Drives for HP Workstations

QuickSpecs. PCIe Solid State Drives for HP Workstations Introduction Storage technology with NAND media is outgrowing the bandwidth limitations of the SATA bus. New high performance Storage solutions will connect directly to the PCIe bus for revolutionary performance

More information

Dashboard. 16 Jun Jun 2009 Comparing to: 16 Jun Jun % Bounce Rate. 62,921 Visits. 407,003 Page Views

Dashboard. 16 Jun Jun 2009 Comparing to: 16 Jun Jun % Bounce Rate. 62,921 Visits. 407,003 Page Views JAWA CZ Owners Club Dashboard 16 Jun 2008-16 Jun 2009 Comparing to: 16 Jun 2007-15 Jun 2008 Previous: Visits Visits 400 400 200 200 16 June 2008 19 July 2008 21 August 2008 23 September 26 October 200

More information

Compression of the Stream Array Data Structure

Compression of the Stream Array Data Structure Compression of the Stream Array Data Structure Radim Bača and Martin Pawlas Department of Computer Science, Technical University of Ostrava Czech Republic {radim.baca,martin.pawlas}@vsb.cz Abstract. In

More information

Kyoto-NMT: a Neural Machine Translation implementation in Chainer

Kyoto-NMT: a Neural Machine Translation implementation in Chainer Kyoto-NMT: a Neural Machine Translation implementation in Chainer Fabien Cromières Japan Science and Technology Agency Kawaguchi-shi, Saitama 332-0012 fabien@pa.jst.jp Abstract We present Kyoto-NMT, an

More information

Multilingual Information Retrieval

Multilingual Information Retrieval Proposal for Tutorial on Multilingual Information Retrieval Proposed by Arjun Atreya V Shehzaad Dhuliawala ShivaKarthik S Swapnil Chaudhari Under the direction of Prof. Pushpak Bhattacharyya Department

More information

Information Retrieval II

Information Retrieval II Information Retrieval II David Hawking 30 Sep 2010 Machine Learning Summer School, ANU Session Outline Ranking documents in response to a query Measuring the quality of such rankings Case Study: Tuning

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many

More information

GUJARAT TECHNOLOGICAL UNIVERSITY

GUJARAT TECHNOLOGICAL UNIVERSITY GUJARAT TECHNOLOGICAL UNIVERSITY INFORMATION TECHNOLOGY DATA COMPRESSION AND DATA RETRIVAL SUBJECT CODE: 2161603 B.E. 6 th SEMESTER Type of course: Core Prerequisite: None Rationale: Data compression refers

More information

How SPICE Language Modeling Works

How SPICE Language Modeling Works How SPICE Language Modeling Works Abstract Enhancement of the Language Model is a first step towards enhancing the performance of an Automatic Speech Recognition system. This report describes an integrated

More information

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages

Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Spoken Document Retrieval (SDR) for Broadcast News in Indian Languages Chirag Shah Dept. of CSE IIT Madras Chennai - 600036 Tamilnadu, India. chirag@speech.iitm.ernet.in A. Nayeemulla Khan Dept. of CSE

More information

CALENDAR OF FILING DEADLINES AND SEC HOLIDAYS

CALENDAR OF FILING DEADLINES AND SEC HOLIDAYS CALENDAR OF FILING S AND SEC HOLIDAYS INFORMATION IN THIS CALENDAR HAS BEEN OBTAINED BY SOURCES BELIEVED TO BE RELIABLE, BUT CANNOT BE GUARANTEED FOR ACCURACY. PLEASE CONSULT WITH PROFESSIONAL COUNSEL

More information

Distributed Implementation of a Self-Organizing. Appliance Middleware

Distributed Implementation of a Self-Organizing. Appliance Middleware Distributed Implementation of a Self-Organizing Appliance Middleware soc-eusai 2005 Conference of Smart Objects & Ambient Intelligence October 12th-14th 2005 Grenoble, France Oral Session 6 - Middleware

More information

SSD. DEIM Forum 2014 D8-6 SSD I/O I/O I/O HDD SSD I/O

SSD.    DEIM Forum 2014 D8-6 SSD I/O I/O I/O HDD SSD I/O DEIM Forum 214 D8-6 SSD, 153 855 4-6-1 135 8548 3-7-5 11 843 2-1-2 E-mail: {keisuke,haya,yokoyama,kitsure}@tkl.iis.u-tokyo.ac.jp, miyuki@sic.shibaura-it.ac.jp SSD SSD HDD 1 1 I/O I/O I/O I/O,, OLAP, SSD

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 00 Motivation What is Information Retrieval? The meaning of the term Information Retrieval (IR) can be

More information

QuickSpecs. PCIe Solid State Drives for HP Workstations

QuickSpecs. PCIe Solid State Drives for HP Workstations Solid State Drives for HP Workstations Overview Solid State Drives for HP Workstations Introduction Storage technology with NAND media is outgrowing the bandwidth limitations of the SATA bus. New high

More information

Information Extraction of Important International Conference Dates using Rules and Regular Expressions

Information Extraction of Important International Conference Dates using Rules and Regular Expressions ISBN 978-93-84422-80-6 17th IIE International Conference on Computer, Electrical, Electronics and Communication Engineering (CEECE-2017) Pattaya (Thailand) Dec. 28-29, 2017 Information Extraction of Important

More information

OpenSFS Test Cluster Donation. Dr. Nihat Altiparmak, Assistant Professor Computer Engineering & Computer Science Department University of Louisville

OpenSFS Test Cluster Donation. Dr. Nihat Altiparmak, Assistant Professor Computer Engineering & Computer Science Department University of Louisville OpenSFS Test Cluster Donation Dr. Nihat Altiparmak, Assistant Professor Computer Engineering & Computer Science Department University of Louisville 4/24/2018 Donation Details Jan 5, 2018 OpenSFS announced

More information

Workshops. 1. SIGMM Workshop on Social Media. 2. ACM Workshop on Multimedia and Security

Workshops. 1. SIGMM Workshop on Social Media. 2. ACM Workshop on Multimedia and Security 1. SIGMM Workshop on Social Media SIGMM Workshop on Social Media is a workshop in conjunction with ACM Multimedia 2009. With the growing of user-centric multimedia applications in the recent years, this

More information

Compressing and Decoding Term Statistics Time Series

Compressing and Decoding Term Statistics Time Series Compressing and Decoding Term Statistics Time Series Jinfeng Rao 1,XingNiu 1,andJimmyLin 2(B) 1 University of Maryland, College Park, USA {jinfeng,xingniu}@cs.umd.edu 2 University of Waterloo, Waterloo,

More information

2009. October. Semiconductor Business SAMSUNG Electronics

2009. October. Semiconductor Business SAMSUNG Electronics 2009. October Semiconductor Business SAMSUNG Electronics Why SSD performance is faster than HDD? HDD has long latency & late seek time due to mechanical operation SSD does not have both latency and seek

More information

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok Index extensions Manning, Raghavan and Schütze, Chapter 2 Daniël de Kok Today We will discuss some extensions to the inverted index to accommodate boolean processing: Skip pointers: improve performance

More information

The i-visto Gateway XG Uncompressed HDTV Multiple Transmission Technology for 10-Gbit/s Networks

The i-visto Gateway XG Uncompressed HDTV Multiple Transmission Technology for 10-Gbit/s Networks The Gateway XG Uncompressed HDTV Multiple Transmission Technology for Networks Takeaki Mochida, Tetsuo Kawano, Tsuyoshi Ogura, and Keiji Harada Abstract The is a new product for the (Internet video studio

More information

A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions

A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions Marcin Junczys-Dowmunt Faculty of Mathematics and Computer Science Adam Mickiewicz University ul. Umultowska 87, 61-614

More information

Lightweight Client-Side Chinese/Japanese Morphological Analyzer Based on Online Learning

Lightweight Client-Side Chinese/Japanese Morphological Analyzer Based on Online Learning Lightweight Client-Side Chinese/Japanese Morphological Analyzer Based on Online Learning Masato Hagiwara Satoshi Sekine Rakuten Institute of Technology, New York 215 Park Avenue South, New York, NY {masato.hagiwara,

More information

Processing Posting Lists Using OpenCL

Processing Posting Lists Using OpenCL Processing Posting Lists Using OpenCL Advisor Dr. Chris Pollett Committee Members Dr. Thomas Austin Dr. Sami Khuri By Radha Kotipalli Agenda Project Goal About Yioop Inverted index Compression Algorithms

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep

More information

Marcello Federico MMT Srl / FBK Trento, Italy

Marcello Federico MMT Srl / FBK Trento, Italy Marcello Federico MMT Srl / FBK Trento, Italy Proceedings for AMTA 2018 Workshop: Translation Quality Estimation and Automatic Post-Editing Boston, March 21, 2018 Page 207 Symbiotic Human and Machine Translation

More information

Advance Indexing. Limock July 3, 2014

Advance Indexing. Limock July 3, 2014 Advance Indexing Limock July 3, 2014 1 Papers 1) Gurajada, Sairam : "On-line index maintenance using horizontal partitioning." Proceedings of the 18th ACM conference on Information and knowledge management.

More information

Introduction & Administrivia

Introduction & Administrivia Introduction & Administrivia Information Retrieval Evangelos Kanoulas ekanoulas@uva.nl Section 1: Unstructured data Sec. 8.1 2 Big Data Growth of global data volume data everywhere! Web data: observation,

More information

Proceedings of the Meeting & workshop on Development of a National IT Strategy Focusing on Indigenous Content Development

Proceedings of the Meeting & workshop on Development of a National IT Strategy Focusing on Indigenous Content Development Ministry of Science, Research & Technology Iranian Information & Documentation Center (Research Center) Proceedings of the Meeting & workshop on Development of a National IT Strategy Focusing on Indigenous

More information

Mitel for Microsoft Dynamics CRM Client V5 Release Notes

Mitel for Microsoft Dynamics CRM Client V5 Release Notes Mitel for Microsoft Dynamics CRM Client V5 Release Notes February 08, 2018. Mitel for Microsoft Dynamics CRM Client V5 Release Notes Description: This Application Note Consists of the dates and version

More information

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems Ingo Müller, Cornelius Ratsch, Franz Faerber / KIT / SAP AG March 26, 2014 EDBT, Athens, Greece Motivation: Column-Store

More information

CS506/606 - Topics in Information Retrieval

CS506/606 - Topics in Information Retrieval CS506/606 - Topics in Information Retrieval Instructors: Class time: Steven Bedrick, Brian Roark, Emily Prud hommeaux Tu/Th 11:00 a.m. - 12:30 p.m. September 25 - December 6, 2012 Class location: WCC 403

More information

Data Block. Data Block. Copy A B C D P HDD 0 HDD 1 HDD 2 HDD 3 HDD 4 HDD 0 HDD 1

Data Block. Data Block. Copy A B C D P HDD 0 HDD 1 HDD 2 HDD 3 HDD 4 HDD 0 HDD 1 RAID Network RAID File System 1) Takashi MATSUMOTO 1) ( 101-8430 2{1{2 E-mail:tmatsu@nii.ac.jp) ABSTRACT. The NRFS is a brand-new kernel-level subsystem for a low-cost distributed le system with fault-tolerant

More information

Example. Section: PS 709 Examples of Calculations of Reduced Hours of Work Last Revised: February 2017 Last Reviewed: February 2017 Next Review:

Example. Section: PS 709 Examples of Calculations of Reduced Hours of Work Last Revised: February 2017 Last Reviewed: February 2017 Next Review: Following are three examples of calculations for MCP employees (undefined hours of work) and three examples for MCP office employees. Examples use the data from the table below. For your calculations use

More information

Network Traffic Reduction by Hypertext Compression

Network Traffic Reduction by Hypertext Compression Network Traffic Reduction by Hypertext Compression BEN CHOI & NAKUL BHARADE Computer Science, College of Engineering and Science Louisiana Tech University, Ruston, LA 71272, USA pro@benchoi.org Abstract.

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval

A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval Simon Jonassen and Svein Erik Bratsberg Department of Computer and Information Science Norwegian University of

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Australian Journal of Basic and Applied Sciences Journal home page: www.ajbasweb.com A Review on Raid Levels Implementation and Comparisons P. Sivakumar and K. Devi Department of Computer

More information

A Statistical Study of ANY Resource Record Based DNS Query Request Packet Traffic

A Statistical Study of ANY Resource Record Based DNS Query Request Packet Traffic A Statistical Study of ANY Resource Record Based DNS Query Request Packet Traffic A Statistical Study of ANY Resource Record Based DNS Query Request Packet Traffic Yasuo Musashi,* Yuto Takeda,** Nobuhiro

More information

Introduction to Information Retrieval. Hongning Wang

Introduction to Information Retrieval. Hongning Wang Introduction to Information Retrieval Hongning Wang CS@UVa What is information retrieval? 2 Why information retrieval Information overload It refers to the difficulty a person can have understanding an

More information

DNS Query Access and Backscattering SMTP Distributed Denial-of-Service Attack

DNS Query Access and Backscattering SMTP Distributed Denial-of-Service Attack DNS Query Access and Backscattering SMTP Distributed Denial-of-Service Attack Yasuo Musashi, Ryuichi Matsuba, and Kenichi Sugitani Center for Multimedia and Information Technologies, Kumamoto University,

More information

Record Placement Based on Data Skew Using Solid State Drives

Record Placement Based on Data Skew Using Solid State Drives BPOE-5 Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1,2, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories,

More information

PI SERVER 2012 Do. More. Faster. Now! Copyri g h t 2012 OSIso f t, LLC.

PI SERVER 2012 Do. More. Faster. Now! Copyri g h t 2012 OSIso f t, LLC. PI SERVER 2012 Do. More. Faster. Now! Copyri g h t 2012 OSIso f t, LLC. AUGUST 7, 2007 APRIL 14, 2010 APRIL 24, 2012 Copyri g h t 2012 OSIso f t, LLC. 2 PI SERVER 2010 PERFORMANCE 2010 R3 Max Point Count

More information

White Paper. File System Throughput Performance on RedHawk Linux

White Paper. File System Throughput Performance on RedHawk Linux White Paper File System Throughput Performance on RedHawk Linux By: Nikhil Nanal Concurrent Computer Corporation August Introduction This paper reports the throughput performance of the,, and file systems

More information

Stabilizing Minimum Error Rate Training

Stabilizing Minimum Error Rate Training Stabilizing Minimum Error Rate Training George Foster and Roland Kuhn National Research Council Canada first.last@nrc.gc.ca Abstract The most commonly used method for training feature weights in statistical

More information

Personalized Search for TV Programs Based on Software Man

Personalized Search for TV Programs Based on Software Man Personalized Search for TV Programs Based on Software Man 12 Department of Computer Science, Zhengzhou College of Science &Technology Zhengzhou, China 450064 E-mail: 492590002@qq.com Bao-long Zhang 3 Department

More information

Iterative CKY parsing for Probabilistic Context-Free Grammars

Iterative CKY parsing for Probabilistic Context-Free Grammars Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST

More information

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

More information

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap to to Information Retrieval Index Construct Ruixuan Li Huazhong University of Science and Technology http://idc.hust.edu.cn/~rxli/ October, 2012 1 2 How to construct index? Computerese term document docid

More information

New Concept based Indexing Technique for Search Engine

New Concept based Indexing Technique for Search Engine Indian Journal of Science and Technology, Vol 10(18), DOI: 10.17485/ijst/2017/v10i18/114018, May 2017 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 New Concept based Indexing Technique for Search

More information

Data acquisition system of COMPASS experiment - progress and future plans

Data acquisition system of COMPASS experiment - progress and future plans Data acquisition system of COMPASS experiment - progress and future plans Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague & CERN COMPASS experiment COMPASS experiment

More information

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based

More information

IATF Stakeholder Conference

IATF Stakeholder Conference IATF Stakeholder Conference 13 September 2017 Oberursel, Germany Rüdiger Funke (BMW Group) Number of certified sites against ISO/TS 16949 (and IATF 16949) 70,000 60,000 50,000 40,000 30,000 30,156 50,071

More information

Ternary Tree Tree Optimalization for n-gram for n-gram Indexing

Ternary Tree Tree Optimalization for n-gram for n-gram Indexing Ternary Tree Tree Optimalization for n-gram for n-gram Indexing Indexing Daniel Robenek, Jan Platoš, Václav Snášel Department Daniel of Computer Robenek, Science, Jan FEI, Platoš, VSB Technical Václav

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Year 10 OCR GCSE Computer Science (9-1)

Year 10 OCR GCSE Computer Science (9-1) 01 4 th September 02 11 th September 03 18 th September Half Term 1 04 25 th September 05 2 nd October 06 9 th October 07 16 th October NA Students on in school Thursday PM and Friday Only Unit 1, Lesson

More information

Research Topics in Information Retrieval

Research Topics in Information Retrieval Research Topics in Information Retrieval Cristina Ribeiro Sérgio Nunes FEUP / INESC TEC Information Systems Research Group http://infolab.fe.up.pt Information Retrieval "Information retrieval (IR) is finding

More information

GTRC Hosting Infrastructure Reports

GTRC Hosting Infrastructure Reports GTRC Hosting Infrastructure Reports GTRC 2012 1. Description - The Georgia Institute of Technology has provided a data hosting infrastructure to support the PREDICT project for the data sets it provides.

More information

Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson

Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson Index Construction Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Index Construction Overview Introduction Hardware

More information

Service withdrawal: Selected IBM ServicePac offerings

Service withdrawal: Selected IBM ServicePac offerings Announcement ZS09-0086, dated April 21, 2009 Service withdrawal: Selected IBM offerings Table of contents 1 Overview 9 Announcement countries 8 Withdrawal date Overview Effective April 21, 2009, IBM will

More information

From The European Library to The European Digital Library. Jill Cousins Inforum, Prague, May 2007

From The European Library to The European Digital Library. Jill Cousins Inforum, Prague, May 2007 From The European Library to The European Digital Library Jill Cousins Inforum, Prague, May 2007 Timeline Past to Present Started as TEL a project funded by the EU and led by The British Library now fully

More information

Adaptec MaxIQ SSD Cache Performance Solution for Web Server Environments Analysis

Adaptec MaxIQ SSD Cache Performance Solution for Web Server Environments Analysis Adaptec MaxIQ SSD Cache Performance Solution for Web Server Environments Analysis September 22, 2009 Page 1 of 7 Introduction Adaptec has requested an evaluation of the performance of the Adaptec MaxIQ

More information

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures

More information

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture By Gaurav Sheoran 9-Dec-08 Abstract Most of the current enterprise data-warehouses

More information

QuickSpecs. PCIe Solid State Drives for HP Workstations

QuickSpecs. PCIe Solid State Drives for HP Workstations Overview Introduction Storage technology with NAND media is outgrowing the bandwidth limitations of the SATA bus. New high performance Storage solutions will connect directly to the PCIe bus for revolutionary

More information

Dawood Public School Computer Studies Course Outline for Class V

Dawood Public School Computer Studies Course Outline for Class V Dawood Public School Computer Studies Course Outline for 2017-2018 Class V Course book- Keyboard Computer Science with Application Software (V) Second edition (Oxford University Press) Month wise distribution

More information

CMMI Version 1.2. Josh Silverman Northrop Grumman

CMMI Version 1.2. Josh Silverman Northrop Grumman CMMI Version 1.2 Josh Silverman Northrop Grumman Topics The Concept of Maturity: Why CMMI? CMMI Overview/Aspects Version 1.2 Changes Sunsetting of Version 1.1 Training Summary The Concept of Maturity:

More information

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION Ms. Nikita P.Katariya 1, Prof. M. S. Chaudhari 2 1 Dept. of Computer Science & Engg, P.B.C.E., Nagpur, India, nikitakatariya@yahoo.com 2 Dept.

More information

Before you install a product, verify that no empty directory exists with the same name as the installation directory.

Before you install a product, verify that no empty directory exists with the same name as the installation directory. Informatica Identity Resolution Version 9.5.3 Release Notes September 2013 Copyright (c) 1998-2013 Informatica Corporation. All rights reserved. Contents Installation... 1 Installing to a Subdirectory....

More information

CS290N Summary Tao Yang

CS290N Summary Tao Yang CS290N Summary 2015 Tao Yang Text books [CMS] Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Publisher: Addison-Wesley, 2010. Book website. [MRS] Christopher

More information

Assimilation process for shortened constables pay scale produced by Greater Manchester Police

Assimilation process for shortened constables pay scale produced by Greater Manchester Police Assimilation process for shortened constables pay scale produced by Greater Manchester Police The current constables pay scale of 11 points will be reduced by 3 points over a 2 year period to create a

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

Improving Throughput in Cloud Storage System

Improving Throughput in Cloud Storage System Improving Throughput in Cloud Storage System Chanho Choi chchoi@dcslab.snu.ac.kr Shin-gyu Kim sgkim@dcslab.snu.ac.kr Hyeonsang Eom hseom@dcslab.snu.ac.kr Heon Y. Yeom yeom@dcslab.snu.ac.kr Abstract Because

More information