On Scalable Information Retrieval Systems


1 On Scalable Information Retrieval Systems Ophir Frieder 1

2 Scalable Search Structured Semi-structured Text, video, etc. Answer Engine 2

3 Scalable Information Systems: Characteristics
- Ingest data from multiple sources: duplicate document detection
- Process multiple types of data sources: structured and unstructured data integration (SIRE)
- Use scalable (parallel) technology systems: Parallel SIRE
- Integrate retrieved data to yield answers: IIT Mediator

4 Duplicate Document Detection
- The union of data obtained from multiple sources often contains duplicates.
- Duplicates affect both retrieval effectiveness and retrieval efficiency.
- Duplicate detection is either syntactic or semantic; semantic detection is far more challenging.

5 What is a Duplicate Document?
Semantic similarity: if a document contains roughly the same semantic content, it is a duplicate whether or not it is a precise syntactic match.

6 Duplicate Detection Techniques
Main duplicate detection approaches:
- Hash-based approaches (syntactic)
- Information retrieval techniques
- Resemblance: $r(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$, where $S(D)$ is the set of shingles of document $D$

7 Duplicate Detection with IR
- Using documents as queries, rank all documents in the collection with similar terms.
- Documents with equivalent weights are duplicates.
- For each query term, the corresponding posting list entries must be retrieved; for large collections, the I/O costs are prohibitive.

8 Duplicate Detection with Resemblance
- Calculate the resemblance of each document to every other document with matching features.
- Divide the document into shingles (X terms each); each shingle is used to create a unique hash.
- Calculate the resemblance based on hashes rather than terms.
- $O(N^2)$ comparison approaches are not feasible for large collections.
- Optimizations filter which shingles to use, e.g., every 25th shingle or a combination of multiple shingles.
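The shingle-based resemblance described above can be sketched as follows. This is a minimal illustration, not the evaluated implementation: the shingle size of 3 terms, the use of SHA-1, and the function names `shingles` and `resemblance` are assumptions made for the example.

```python
import hashlib

def shingles(text, size=3):
    """Return the set of hashed word shingles (size consecutive terms each)."""
    terms = text.lower().split()
    grams = (" ".join(terms[i:i + size]) for i in range(len(terms) - size + 1))
    return {hashlib.sha1(g.encode("utf-8")).hexdigest() for g in grams}

def resemblance(a, b, size=3):
    """Shared shingles divided by all distinct shingles of the two documents."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # assumption: two documents too short to shingle count as identical
    return len(sa & sb) / len(sa | sb)

doc_a = "the nikkei stock average closed up on the tokyo stock exchange"
doc_b = "the nikkei stock average closed down on the tokyo stock exchange"
score = resemblance(doc_a, doc_b)
```

Comparing every pair of documents this way is the quadratic cost the slide warns about, which is why shingle filtering is used in practice.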

9 Issues with Prior Approaches
- Hash techniques: not resilient to small changes in document representation.
- IR techniques: slow for large collections.
- Resemblance: documents are clustered into multiple clusters; because of this partitioning, duplicate classification is difficult.

10 Combined (I-Match) Algorithm
1. Tokenize the document.
2. Create a list of unique tokens.
3. Filter tokens. (What to filter?)
4. Create a unique hash of the remaining tokens.
5. Search the collection for duplicate hashes.

11 Filtration Based On Collection Statistics
1. Sort according to $idf = \log(N / n)$, where N is the number of documents in the collection and n is the number of documents in which the term occurs.
2. Filter unwanted components.
Ranges shown on the slide: Hi & Low 25%, Low 25%, High 25%, Mid 50%.
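The I-Match steps and the idf-based filtration can be sketched together as below. This is an illustrative sketch under stated assumptions: whitespace tokenization, natural logarithm, SHA-1 as the hash, and the idf cutoffs `low` and `high` are all choices made for the example, and the names `idf_table` and `imatch_hash` are hypothetical.

```python
import hashlib
import math
from collections import Counter

def idf_table(docs):
    """idf = log(N / n): N documents in the collection, n documents containing the term."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d.lower().split()))
    return {t: math.log(n_docs / n) for t, n in df.items()}

def imatch_hash(doc, idf, low, high):
    """Hash the sorted unique tokens whose idf falls inside [low, high]."""
    kept = sorted({t for t in doc.lower().split() if low <= idf.get(t, 0.0) <= high})
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

docs = [
    "the nikkei average closed higher in tokyo",
    "nikkei average closed higher tokyo",            # near-duplicate: common words dropped
    "the chunnel runs under the english channel in europe",
    "the senate voted in favor of the bill",
]
idf = idf_table(docs)
hashes = [imatch_hash(d, idf, low=0.5, high=2.0) for d in docs]
```

Because the very common terms (`the`, `in`) fall below the lower cutoff, the first two documents produce the same hash, and duplicates can then be found with a single hash lookup per document.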

12 LA Times Collection
Create random duplicates to test effectiveness:
1. For every i-th word, pick a random number from 1 to 10.
2. If the number is higher than the random threshold (call it alpha), pick a number from 1 to 3.
3. If the number chosen is 1, remove the word.
4. If it is 2, swap the word with the word at position i+1.
5. If it is 3, add a word (picked at random from the term list).
6. Insert the duplicate into the collection.
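The duplicate-generation procedure can be sketched as follows. The slide does not say how positions are tracked after a removal or insertion, so this sketch simply keeps stepping through the modified list; that choice, the function name `perturb`, and the sample data are assumptions.

```python
import random

def perturb(words, i_step, alpha, term_list, rng):
    """Create a near-duplicate by possibly editing every i_step-th word."""
    out = list(words)
    pos = i_step - 1
    while pos < len(out):
        if rng.randint(1, 10) > alpha:          # edit only above the threshold
            action = rng.randint(1, 3)
            if action == 1:
                del out[pos]                     # remove the word
            elif action == 2 and pos + 1 < len(out):
                out[pos], out[pos + 1] = out[pos + 1], out[pos]  # swap with next word
            elif action == 3:
                out.insert(pos, rng.choice(term_list))           # add a random term
        pos += i_step
    return out

words = "the nikkei stock average closed higher on wednesday".split()
terms = ["tokyo", "exchange", "points"]
unchanged = perturb(words, 2, 10, terms, random.Random(0))  # alpha = 10: never edits
changed = perturb(words, 2, 0, terms, random.Random(0))     # alpha = 0: always edits
```

A seeded `random.Random` instance is passed in so that a test collection can be regenerated exactly.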

13 Document Clusters Formed
[Table: columns Document, Resemblance, Resemblance-Opt, Combined; rows for ten LA Times documents and their average]
I-Match did not produce any false positives, while Resemblance did.

14 Processing Time (2GB)
[Table: columns Algorithm, Mean Time, Std Deviation, Median Time; rows Resemblance, Resemblance-Opt, I-Match, Syntactic (mean 65, std deviation N/A, median N/A)]

15 Scalable Information Systems: Characteristics
- Ingest data from multiple sources: duplicate document detection
- Process multiple types of data sources: structured and unstructured data integration (SIRE)
- Use scalable (parallel) technology systems: Parallel SIRE
- Integrate retrieved data to yield answers: IIT Mediator

16 SIRE Goals
- Integrate structured and semi-structured data using a framework that also integrates unstructured data.
- Improve accuracy of retrieved results.
- Support scalability: data volume and retrieval speeds.
- Support legacy data.

17 Portability
The information retrieval prototype was implemented on the following relational platforms:
- NCR Teradata DBC-machines
- Microsoft SQL Server
- Sybase
- Oracle
- IBM DB2 and SQL/DS

18 Relational Inverted Index
All inverted index entries have the form <term> <list of documents>, e.g., vehicle D1, D3, D4, which results in:
term     docid
vehicle  D1
vehicle  D3
vehicle  D4

19 Text Retrieval Conference (TREC) Sample Document
<DOC>
<DOCNO> AP </DOCNO>
<FILEID>AP-NR EST</FILEID>
<FIRST>u i BC-Japan-Stocks </FIRST>
<SECOND>BC-Japan-Stocks,0026</SECOND>
<HEAD>Stocks Up In Tokyo</HEAD>
<DATELINE>TOKYO (AP) </DATELINE>
<TEXT>
The Nikkei Stock Average closed at 29, points up points on the Tokyo Stock Exchange Wednesday.
</TEXT>
</DOC>

20 Relational Document Representation (Term Processing)
DOCUMENT
docid  docname  headline            dateline
28     AP       Stocks Up In Tokyo  TOKYO (AP)
INDEX
docid  termcnt  term
28     1        nikkei
28     2        stock
28     1        average
28     1        closed
28     2        points
28     1        up
28     1        tokyo
28     1        exchange
28     1        wednesday
TERM (columns: term, df, idf)
Rows: average, closed, exchange, nikkei, points, stock, tokyo, up, wednesday

21 Simplistic Models: Keyword and Boolean Searches 21

22 Relational Approach: Keyword Search Techniques
Keyword search:
    select i.docid
    from INDEX i, QUERY q
    where i.term = q.term
Keyword search with stop word list:
    select i.docid
    from INDEX i, QUERY q
    where i.term = q.term
      and i.term not in (select s.term from STOPLIST s)
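The keyword queries can be tried out with SQLite from the Python standard library, as in the sketch below. The table names `doc_index`, `qry`, and `stoplist` stand in for INDEX, QUERY, and STOPLIST (INDEX is a reserved word in SQLite), and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc_index (docid INTEGER, termcnt INTEGER, term TEXT);
    CREATE TABLE qry (term TEXT, termcnt INTEGER);
    CREATE TABLE stoplist (term TEXT);
    INSERT INTO doc_index VALUES (28, 1, 'nikkei'), (28, 2, 'stock'), (28, 1, 'up'),
                                 (29, 1, 'up'), (29, 1, 'senate');
    INSERT INTO qry VALUES ('stock', 1), ('up', 1);
    INSERT INTO stoplist VALUES ('up'), ('the');
""")

KEYWORD = "SELECT DISTINCT i.docid FROM doc_index i, qry q WHERE i.term = q.term"
# Stop words are excluded with NOT IN; joining on inequality would let a term
# through whenever the stop list holds any other term.
KEYWORD_STOP = KEYWORD + " AND i.term NOT IN (SELECT term FROM stoplist)"

def docids(sql):
    """Run a query and return the matching document ids in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))

plain = docids(KEYWORD)          # documents matching 'stock' or 'up'
filtered = docids(KEYWORD_STOP)  # 'up' is a stop word, so only 'stock' matches
```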

23 Relational Approach: Boolean Search Techniques
OR query, written with union:
    select docid from INDEX where term = term1
    union
    select docid from INDEX where term = term2
    union
    select docid from INDEX where term = term3
    ...
    union
    select docid from INDEX where term = termn
OR query, written with a single where clause:
    select docid
    from INDEX
    where term = term1 OR
          term = term2 OR
          term = term3 OR
          ...
          term = termn

24 Relational Approach: Boolean Search Techniques
AND query, written with intersect:
    select docid from INDEX where term = term1
    intersect
    select docid from INDEX where term = term2
    intersect
    select docid from INDEX where term = term3
    ...
    intersect
    select docid from INDEX where term = termn
AND query, written as a self-join:
    select a.docid
    from INDEX a, INDEX b, INDEX c, ... INDEX n
    where a.term = term1 AND
          b.term = term2 AND
          c.term = term3 AND
          ...
          n.term = termn AND
          a.docid = b.docid AND
          b.docid = c.docid AND
          ...
          n-1.docid = n.docid

25 Fixed Join-Count AND Queries
Find all documents that contain all of the terms found in the QUERY relation:
    select i.docid
    from INDEX i, QUERY q
    where i.term = q.term
    group by i.docid
    having count(distinct i.term) = (select count(*) from QUERY)

26 TAND Queries
Find all documents that contain at least X of the terms found in the QUERY relation:
    select i.docid
    from INDEX i, QUERY q
    where i.term = q.term
    group by i.docid
    having count(distinct i.term) >= X
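The fixed join-count AND query and the TAND query differ only in their HAVING condition, which the following SQLite sketch makes explicit. The table names `doc_index` and `qry` stand in for INDEX and QUERY, and the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc_index (docid INTEGER, termcnt INTEGER, term TEXT);
    CREATE TABLE qry (term TEXT);
    INSERT INTO doc_index VALUES (1, 1, 'nikkei'), (1, 2, 'stock'), (1, 1, 'exchange'),
                                 (2, 1, 'stock'), (2, 1, 'exchange'), (3, 1, 'nikkei');
    INSERT INTO qry VALUES ('nikkei'), ('stock'), ('exchange');
""")

BASE = """
    SELECT i.docid FROM doc_index i, qry q
    WHERE i.term = q.term
    GROUP BY i.docid
    HAVING COUNT(DISTINCT i.term) {condition}
    ORDER BY i.docid
"""

def and_query():
    """Documents containing every query term (one join, whatever the query length)."""
    sql = BASE.format(condition="= (SELECT COUNT(*) FROM qry)")
    return [row[0] for row in conn.execute(sql)]

def tand_query(x):
    """Documents containing at least x of the query terms."""
    sql = BASE.format(condition=">= ?")
    return [row[0] for row in conn.execute(sql, (x,))]

all_terms = and_query()
at_least_two = tand_query(2)
```

Unlike the self-join form of the AND query, the number of joins here does not grow with the number of query terms.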

27 Relevance Ranking: Vector Space & Probabilistic Models 27

28 Vector Space Model
- Term frequency ($tf_{ik}$): number of occurrences of term $t_k$ in document i
- Document frequency ($df_j$): number of documents that contain $t_j$
- Inverse document frequency ($idf_j$): $\log(d / df_j)$, where d is the total number of documents
Notes:
- idf is a measure of the uniqueness of a term across the collection
- tf is the frequency of a term in a given document
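The three statistics can be computed directly from a small collection, as sketched below. The slide does not state the base of the logarithm; base 10 is an assumption here, as are whitespace tokenization and the function name `collection_stats`.

```python
import math
from collections import Counter

def collection_stats(docs):
    """Return (tf, df, idf) for a mapping of document id to text."""
    # tf[doc_id][term]: occurrences of the term in that document
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # df[term]: number of documents containing the term
    df = Counter(term for counts in tf.values() for term in counts)
    d = len(docs)
    # idf[term] = log(d / df[term]); base 10 is an assumption
    idf = {term: math.log10(d / n) for term, n in df.items()}
    return tf, df, idf

docs = {1: "stock stock exchange", 2: "stock market"}
tf, df, idf = collection_stats(docs)
```

A term that occurs in every document (`stock` here) gets an idf of zero, which is the sense in which idf measures uniqueness across the collection.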

29 Vector Space Model: Sample Relational Query
List all documents in the order of their similarity coefficient, where the coefficient is computed using the dot product.
    SELECT d.docid, d.docname, SUM(i.termcnt * t.idf * q.termcnt * t.idf)
    FROM DOCUMENT d, QUERY q, INDEX i, TERM t
    WHERE q.term = i.term AND q.term = t.term AND d.docid = i.docid
    GROUP BY d.docid, d.docname
    ORDER BY 3 DESC
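The ranking query can be run against SQLite as a quick check. In this sketch the tables `document`, `doc_index`, `term_stats`, and `qry` stand in for DOCUMENT, INDEX, TERM, and QUERY, and the document names, term counts, and idf values are made-up illustrative numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (docid INTEGER, docname TEXT);
    CREATE TABLE doc_index (docid INTEGER, termcnt INTEGER, term TEXT);
    CREATE TABLE term_stats (term TEXT, df INTEGER, idf REAL);
    CREATE TABLE qry (term TEXT, termcnt INTEGER);
    INSERT INTO document VALUES (1, 'A'), (2, 'B');
    INSERT INTO doc_index VALUES (1, 2, 'stock'), (1, 1, 'nikkei'), (2, 1, 'stock');
    INSERT INTO term_stats VALUES ('stock', 2, 0.5), ('nikkei', 1, 2.0);
    INSERT INTO qry VALUES ('nikkei', 1), ('stock', 1);
""")

# Document weight (termcnt * idf) times query weight (termcnt * idf), summed per document.
RANK = """
    SELECT d.docid, d.docname, SUM(i.termcnt * t.idf * q.termcnt * t.idf)
    FROM document d, qry q, doc_index i, term_stats t
    WHERE q.term = i.term AND q.term = t.term AND d.docid = i.docid
    GROUP BY d.docid, d.docname
    ORDER BY 3 DESC
"""
ranked = conn.execute(RANK).fetchall()
```

Document A scores 2·0.5·1·0.5 + 1·2.0·1·2.0 = 4.5 and document B scores 1·0.5·1·0.5 = 0.25, so A is ranked first.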

30 Similarity Coefficients
Several similarity coefficients based on the query vector X and the document vector Y are defined:
Inner product: $\sum_{i=1}^{t} x_i y_i$
Cosine coefficient: $\frac{\sum_{i=1}^{t} x_i y_i}{\sqrt{\sum_{i=1}^{t} x_i^2 \sum_{i=1}^{t} y_i^2}}$
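Both coefficients are short to write out over sparse vectors. The sketch below represents each vector as a dictionary from term to weight, which is an implementation choice for the example rather than anything the slides prescribe.

```python
import math

def inner_product(x, y):
    """Sum of x_i * y_i over the terms the two vectors share."""
    return sum(x[t] * y[t] for t in x.keys() & y.keys())

def cosine(x, y):
    """Inner product normalized by the lengths of both vectors."""
    norm = math.sqrt(sum(v * v for v in x.values()) * sum(v * v for v in y.values()))
    return inner_product(x, y) / norm if norm else 0.0

query = {"nikkei": 3.0, "stock": 4.0}
doc_same = {"nikkei": 3.0, "stock": 4.0}
doc_other = {"senate": 1.0}
```

The cosine coefficient removes the effect of vector length, so a long document does not outrank a short one merely by repeating terms.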

31 SQL for Probabilistic Similarity Measure
$\sum_{i=1}^{num\_terms} \log\left(\frac{(numdocs - df_i) + 0.5}{df_i + 0.5}\right) \cdot \frac{2.2 \, tf_{id}}{0.3 + 0.75 \, (doclength / avgdoclength) + tf_{id}} \cdot qtf_i$
    SELECT d.docid, d.docname,
           SUM( LOG(((NumDocs - t.df) + 0.5) / (t.df + 0.5))
                * ((2.2 * i.tf) / (.3 + ((.75 * d.doclen) / avgdoclen) + i.tf))
                * q.termcnt )
    FROM INDEX i, TERM t, DOCUMENT d, QUERY q
    WHERE i.term = t.term AND i.docid = d.docid AND t.term = q.term
    GROUP BY d.docid, d.docname
    ORDER BY 3 DESC;
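The per-document score computed by the SQL can be mirrored in a few lines of Python, which makes the three factors easier to see. The constants 2.2, 0.3, and 0.75 are taken from the query as written; the natural logarithm, the function name `prob_score`, and the sample numbers are assumptions.

```python
import math

def prob_score(query_tf, doc_tf, doc_len, avg_doc_len, df, num_docs):
    """Sum over query terms of idf-like weight * tf component * query term count."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue  # the SQL join drops terms absent from the document
        weight = math.log(((num_docs - df[term]) + 0.5) / (df[term] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + (0.75 * doc_len) / avg_doc_len + tf)
        score += weight * tf_part * qtf
    return score

score = prob_score(
    query_tf={"stock": 1},
    doc_tf={"stock": 2},
    doc_len=100, avg_doc_len=100,
    df={"stock": 10}, num_docs=100,
)
```

The length ratio in the denominator dampens term frequency for documents longer than average.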

32 Relational Document Representation (Term Processing)
DOCUMENT
docid  docname  headline            dateline
28     AP       Stocks Up In Tokyo  TOKYO (AP)
INDEX
docid  termcnt  term
28     1        nikkei
28     2        stock
28     1        average
28     1        closed
28     2        points
28     1        up
28     1        tokyo
28     1        exchange
28     1        wednesday
TERM (columns: term, df, idf)
Rows: average, closed, exchange, nikkei, points, stock, tokyo, up, wednesday

33 Relational Query Representation (Term Processing)
ORIGINAL QUERY: nikkei stock exchange american stock exchange
QUERY
term      termcnt
nikkei    1
stock     2
exchange  2
american  1
SQL (Query Weight * Document Weight):
    SELECT d.docid, d.docname, SUM(a.termcnt * c.idf * b.termcnt * c.idf)
    FROM QUERY a, INDEX b, TERM c, DOCUMENT d
    WHERE a.term = b.term AND a.term = c.term AND b.docid = d.docid
    GROUP BY d.docid, d.docname
    ORDER BY 3 DESC

34 Sample Term Query Result (Inner/Dot Product)
[Table: columns Term, Q-Termcnt, Q-Weight, D-Termcnt, D-Weight, Q-Wt * D-Wt; rows nikkei, stock, exchange, american; final row Similarity Coefficient]

35 Simple Phrase Parsing
Simple phrase parser with the following rules:
- Phrases do not include stop terms
- Phrases do not span across punctuation
Example: The Nikkei Stock Average closed at 29, points up points, on the Tokyo Stock Exchange Wednesday.
Phrases: nikkei stock; stock average; average closed; points up; tokyo stock; stock exchange; exchange wednesday
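A two-word phrase parser following those rules might look like the sketch below. The stop list is a small illustrative one, treating numbers as phrase breaks is an assumption, and the figures 100 and 5 in the sample sentence are placeholders, not values from the original document.

```python
import re

STOP_TERMS = {"the", "at", "on", "a", "an", "of", "in"}  # illustrative stop list

def parse_phrases(text):
    """Return adjacent-term pairs that contain no stop term and cross no punctuation."""
    phrases = []
    # Split on punctuation first so that no phrase can span it.
    for segment in re.split(r"[.,;:!?()]", text.lower()):
        prev = None
        for token in segment.split():
            if token in STOP_TERMS or not token.isalpha():
                prev = None   # stop terms and numbers break the current phrase
                continue
            if prev is not None:
                phrases.append(prev + " " + token)
            prev = token
    return phrases

sentence = ("The Nikkei Stock Average closed at 100 points up 5 points, "
            "on the Tokyo Stock Exchange Wednesday.")
phrases = parse_phrases(sentence)
```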

36 Relational Document Representation (Phrase Processing)
DOCUMENT
docid  docname  headline            dateline
28     AP       Stocks Up In Tokyo  TOKYO (AP)
INDEX
docid  termcnt  phrase
28     1        nikkei stock
28     1        stock average
28     1        average closed
28     1        points up
28     1        tokyo stock
28     1        stock exchange
28     1        exchange wednesday
PHRASE (columns: phrase, df, idf)
Rows: average closed, exchange wednesday, nikkei stock, points up, stock average, stock exchange, tokyo stock

37 Enhancing Accuracy With Relevance Feedback 37

38 Relevance Feedback
The modification of the search process so as to improve accuracy by incorporating information obtained from prior relevance judgments.
[Diagram: Q0 -> database search -> top relevant documents -> new terms; Q0 plus new terms = Q1 -> database search -> matching documents]

39 Relevance Feedback Example
Q0: tunnel under English Channel
Top-ranked document: The tunnel under the English channel is often called a Chunnel
Q1: tunnel under English Channel Chunnel
[Diagram: for each query run against the document collection, the sets of documents retrieved, relevant retrieved, and not relevant]

40 Feedback Mechanisms
- Manual: relevant documents are identified manually, and new terms are selected either manually or automatically.
- Automatic: relevant documents are identified automatically by assuming that the top-ranked documents are relevant, and new terms are selected automatically.
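Automatic feedback can be sketched as a single query-expansion step. The slides do not fix a term selection rule at this point, so this sketch scores each candidate term by n × idf, where n is the number of top-ranked documents containing it; that rule, the parameter defaults, the function name `expand_query`, and the idf values are all assumptions for illustration.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, idf, top_docs=1, new_terms=1):
    """Assume the top-ranked documents are relevant and add their best new terms."""
    n = Counter()
    for doc_terms in ranked_docs[:top_docs]:
        # Count each candidate term once per top-ranked document.
        n.update(set(doc_terms) - set(query_terms))
    best = sorted(n, key=lambda t: (-n[t] * idf.get(t, 0.0), t))[:new_terms]
    return list(query_terms) + best

q0 = ["tunnel", "under", "english", "channel"]
top_ranked = ["the tunnel under the english channel is often called a chunnel".split()]
idf = {"the": 0.1, "is": 0.1, "a": 0.1, "often": 1.0, "called": 1.2, "chunnel": 3.0}
q1 = expand_query(q0, top_ranked, idf)
```

With these values the rare term `chunnel` is the one added, and the expanded query is then run against the collection again.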

41 Relevance Feedback Parameters
Various techniques can be used to improve the relevance feedback process:
- Number of top-ranked documents
- Number of feedback terms
- Feedback term selection techniques
- Term weighting
- Document clustering
- Relevance feedback thresholding
- Term frequency cutoff points
- Query expansion using a thesaurus

42 Relevance Feedback Evaluation
[Figure: improvement from relevance feedback with nidf weights, plotted at recall levels 0.00, 0.20, 0.40, 0.60, 0.80, and 1.00 for two runs: nidf with no feedback, and nidf with feedback of 10 terms]

43 Comparative TREC-8 Results
[Table: columns Run, IIT Avg. Precision, # Above Median, # At Median, # Below Median, # Best, # Worst; rows iit00t, iit00td, iit00tde, iit00m]

44 Performance Optimizations: Query Thresholds & Clustered Indexes 44

45 Query Thresholds
- Consider a query with terms t_1, t_2, t_3, ..., t_n.
- Sort the terms by their frequency across the collection (least frequent terms appear first).
- Define a threshold as the percentage of the original query's terms that are kept in a newly created reduced query.
[Figure: a ten-term query (Term 1 to Term 10) with cutoffs marked at Threshold = 20, Threshold = 50, and Threshold = 80]
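Building the reduced query is a sort followed by a cut, as in the sketch below. Rounding the number of kept terms up is an assumption (the slide's ten-term example divides evenly), and the function name `reduce_query` and the frequencies are illustrative.

```python
import math

def reduce_query(terms, collection_freq, threshold_percent):
    """Keep the least frequent threshold_percent of the query terms."""
    # Least frequent terms first; ties broken alphabetically for determinism.
    ordered = sorted(terms, key=lambda t: (collection_freq[t], t))
    keep = math.ceil(len(ordered) * threshold_percent / 100)
    return ordered[:keep]

terms = ["t%d" % i for i in range(1, 11)]
collection_freq = {"t%d" % i: i * 100 for i in range(1, 11)}  # t1 is the rarest
reduced_20 = reduce_query(terms, collection_freq, 20)
reduced_50 = reduce_query(terms, collection_freq, 50)
```

Dropping the most frequent terms removes the longest posting lists from the query, which is where most of the processing time goes.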

46 Relevant Retrieved as a Function of Query Thresholds
[Figure: relevant documents retrieved (y-axis) versus query threshold in percent (x-axis)]

47 Run Time as a Function of Query Thresholds
[Figure: CPU time (y-axis) versus query threshold in percent (x-axis)]

48 Relevant New Documents Per CPU Cycle
[Table: columns Threshold, Relevant Retrieved, CPU Cycles, New Relevant Docs per Cycle]

49 Caveat: Logical Design versus Physical Implementation
While the logical design shown replicates the document identifier, the physical implementation actually uses clustered tables. That is, attribute values that are logically repeated many times are physically clustered by attribute value, which eliminates the replication by storing only one copy of each unique attribute value. (Note clustered tables in Oracle implementations.) The I/O to retrieve a posting list is a grouped block read, as opposed to a retrieval across distributed storage.

50 Clustered Indexes: Posting List Processing
TRADITIONAL
term   docid  tf
stock  1      1
stock  28     2
(seven further stock rows)
CLUSTERED
term   (docid tf) pairs
stock  { ( 1 1 ), ( 28 2 ), ( ), ( ), ( ), ( ), ( ), ( ), ( ) }
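The difference between the two layouts can be mimicked in memory by grouping rows on the repeated attribute. This is only an analogy for what a clustered table does on disk, and the function name `cluster_postings` and the sample rows are illustrative.

```python
from collections import defaultdict

def cluster_postings(rows):
    """Group (term, docid, tf) rows so each term is stored once with its posting list."""
    clustered = defaultdict(list)
    for term, docid, tf in rows:
        clustered[term].append((docid, tf))
    return dict(clustered)

traditional = [("stock", 1, 1), ("stock", 28, 2), ("nikkei", 28, 1)]
clustered = cluster_postings(traditional)
```

Fetching the posting list for `stock` is then a single lookup that returns all of its (docid, tf) pairs together.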

51 Clustered Indexes: Relevance Feedback Processing
TRADITIONAL
docid  termcnt  term
28     1        nikkei
28     2        stock
28     1        average
28     1        closed
28     2        points
28     1        up
28     1        tokyo
28     1        exchange
28     1        wednesday
CLUSTERED
docid  (termcnt term) pairs
28     { ( 1 nikkei ), ( 2 stock ), ( 1 average ), ( 1 closed ), ( 2 points ), ( 1 up ), ( 1 tokyo ), ( 1 exchange ), ( 1 wednesday ) }

52 Technology Transfer
Concern: an academic solution without a public, commercial-world problem.
Resolution: National Institutes of Health's National Center for Complementary and Alternative Medicine Citation Index.

53 NIH-NCCAM Application
- Short query lengths necessitate query expansion.
- Advanced search techniques are needed, since they are used by roughly 30% of the users.
- Efficient processing is critical; filtration is an option.
- Scalability is not currently a concern, but future needs may dictate it.

54 NCCAM - Search Interface 54

55 NCCAM Results Page Interface 55

56 System Architecture Internet Servlet Engine User Oracle DBMS HTTP Server Sun ES

57 Servlet Architecture HTTP Search Request Generic Search Servlet Dynamic HTML MRU Cache Query Engine RDBMS Connection Pool RDBMS 57

58 Scalable Information Systems: Characteristics
- Ingest data from multiple sources: duplicate document detection
- Process multiple types of data sources: structured and unstructured data integration (SIRE)
- Use scalable (parallel) technology systems: Parallel SIRE
- Integrate retrieved data to yield answers: IIT Mediator

59 Scalability via Parallelism: Not My Problem, It's the Database Vendors' Problem

60 Parallel Information Retrieval Most parallel information retrieval systems require custom hardware and software which reduces portability across systems. Parallelism in a relational information retrieval system is a function of the database management system and does not require custom hardware and software. 60

61 TPC-C Benchmarks
TPC-C benchmark results, Revision 3.X, valid as of 7/5/2000 9:46:09 AM.
[Table: columns Company, System, Spec. Revision, tpmC, $/tpmC, Total System Cost, Currency, Database, Operating System, TP Monitor, Server CPU, number of server CPUs, Cluster, number of front ends, Date Submitted, Availability; rows for ALR, Acer, Amdahl, Bull, and Compaq systems]

62 More Vendors: TPC-C Benchmarks
[Table, same columns as the preceding TPC-C slide; rows for DG, Dell, Fujitsu Si, and Fujitsu/ICL systems]

63 More Vendors: TPC-C Benchmarks
[Table, same columns as the preceding TPC-C slides; rows for HP, IBM, Intergraph, Itautec, Motorola, and NEC systems]

64 More Vendors: TPC-C Benchmarks
[Table, same columns as the preceding TPC-C slides; rows for NEC, SGI, Sequent, Sun, Sybase, Tandem, and Unisys systems]

65 Experimental Platform: NCR Teradata DBC/1012
[Diagram components: Work Station, Client Computer (client resident software), Channels, LAN, COP, IFP, PE, YNET, AMP (with disks), AP (UNIX, disk, LAN)]
Legend:
- IFP: Interface Processor
- COP: Communications Processor
- Ynet: Interprocessor Bus
- AMP: Access Module Processor
- AP: Application Processor
- PE: Parsing Engine

66 Sample Relevance Ranking Query
    SELECT c.qryid, b.docid,
           SUM(((1+LOG(a.termcnt))/((b.logavgtf)* ( (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf))))
    FROM trec6$d5$idx a, trec6$d5$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
    WHERE a.docid = b.docid AND c.qryid = d.qryid AND a.term = c.term AND c.qryid = 301
    GROUP BY c.qryid, b.docid
    UNION
    SELECT c.qryid, b.docid,
           SUM(((1+LOG(a.termcnt))/((b.logavgtf)* ( (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf))))
    FROM trec6$d4$idx a, trec6$d4$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
    WHERE a.docid = b.docid AND c.qryid = d.qryid AND a.term = c.term AND c.qryid = 301
    GROUP BY c.qryid, b.docid
    ORDER BY 3 DESC;

67 Breakdown of DBC/1012 Processing Steps
Step 1 - A read lock is placed on tables trec6$d5$idx, trec6$d5$docavgtf, trec6$d4$idx, and trec6$d4$docavgtf.
Step 2 - A single processor is used to select and join rows via a merge join from trec6$q6$qrynidf and trec6$q6$qryavgtf where the value of qryid = 301. The results are stored on spool 3.
Step 3 - An all-processor join is used to select and join rows via a row hash match scan from trec6$d5$idx and spool 3 where the values of the term attribute match. The results are sorted and stored on spool 4.
Step 4 - An all-processor join is used to select and join rows via a row hash match scan from trec6$d5$docavgtf and spool 4 where the values of the docid attribute match. The results are stored on spool 2, which is built locally on each processor.
Step 5 - The SUM value for the aggregate function is calculated from the data on spool 2, and the results are stored on spool 5.
(The next two steps, 6a and 6b, are executed in parallel.)
Step 6a - The data from spool 5 is retrieved and distributed via a hash code to spool 1, which encompasses all processors.
Step 6b - An all-processor join is used to select and join rows via a row hash match scan from trec6$d4$idx and spool 3 where the values of the term attribute match. The results are sorted and stored on spool 9.
Step 7 - An all-processor join is used to select and join rows via a row hash match scan from trec6$d4$docavgtf and spool 9 where the values of the docid attribute match. The results are stored on spool 7, which is built locally on each processor.
Step 8 - The SUM value for the aggregate function is calculated from the data on spool 7, and the results are stored on spool 10.
Step 9 - The data from spool 10 is retrieved and distributed via a hash code to spool 1, which encompasses all processors. A sort is then done to remove duplicates from the data on spool 1.
Step 10 - An END TRANSACTION step is sent to all processors involved, and the contents of spool 1 are sent back to the user.

68 Parallel Performance
Parallel Efficiency measures how efficiently the processors are distributing the workload.
Parallel Efficiency (PE) = avg_cpu / max_cpu
- max_cpu = maximum CPU time across all processors
- avg_cpu = average CPU time across all processors
[Table: columns Configuration, Average CPU time per processor, Maximum CPU time per processor, Data storage imbalance across processors, Parallel Efficiency]
Configuration                                         Parallel Efficiency
4 processors; disk 2 only; primary index on term      84.9%
4 processors; disk 2 and; primary index on term       86.9%
24 processors; disk 2 only; primary index on term     40.6%
24 processors; disk 2 and; primary index on term      44.2%
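The parallel efficiency ratio is simple enough to compute from a list of per-processor CPU times. The sketch below uses made-up timings, and the function name `parallel_efficiency` is chosen for the example.

```python
def parallel_efficiency(cpu_times):
    """Average CPU time across processors divided by the maximum CPU time."""
    avg_cpu = sum(cpu_times) / len(cpu_times)
    max_cpu = max(cpu_times)
    return avg_cpu / max_cpu

balanced = parallel_efficiency([10.0, 10.0, 10.0, 10.0])   # every processor equally busy
skewed = parallel_efficiency([10.0, 20.0])                  # one processor does twice the work
```

A value of 1.0 means the work is spread perfectly evenly; the more one processor dominates, the closer the ratio falls toward 1 divided by the number of processors.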

69 Hashing Algorithm
Term hashing across 4 processors:
A, E, I, M, Q, U, Y -> Proc #1
B, F, J, N, R, V, Z -> Proc #2
C, G, K, O, S, W -> Proc #3
D, H, L, P, T, X -> Proc #4
Term hashing across 24 processors:
A -> Proc #1
B -> Proc #2
C -> Proc # ... V -> Proc #22
WX -> Proc #23
YZ -> Proc #24

70 Term Distribution
[Figure: distribution of terms based on starting letter; x-axis: starting letter (a to z), y-axis: number of terms]

71 Parallel Performance
Parallel Efficiency before and after balancing data storage across 24 processors
[Table: columns Configuration, Average CPU time per processor, Maximum CPU time per processor, Data storage imbalance across processors, Parallel Efficiency]
Configuration                                                 Parallel Efficiency
24 processors; disk 2 only; primary index on term             40.6%
24 processors; disk 2 and; primary index on term              44.2%
24 processors; disk 2 only; primary index on docid and term   91.6%
24 processors; disk 2 and; primary index on docid and term    93.8%
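Why the choice of primary index changes the balance can be illustrated with a toy placement experiment. In this sketch `zlib.crc32` is a stand-in for the database's row hash, and the postings (one very frequent term and one rare term) are hypothetical data.

```python
import zlib
from collections import Counter

def rows_per_processor(rows, key, processors):
    """Count the rows that land on each processor when hashed on key(row)."""
    counts = Counter()
    for row in rows:
        counts[zlib.crc32(key(row).encode("utf-8")) % processors] += 1
    return [counts[p] for p in range(processors)]

# Hypothetical postings as (docid, term) pairs.
rows = [(docid, "stock") for docid in range(1000)]
rows += [(docid, "nikkei") for docid in range(20)]

# Hashing on the term alone sends every posting of a term to one processor.
by_term = rows_per_processor(rows, lambda r: r[1], 24)
# Hashing on docid and term together spreads a long posting list out.
by_docid_term = rows_per_processor(rows, lambda r: "%d|%s" % r, 24)
```

With the term as the only key, the processor holding `stock` stores at least 1000 rows while most others store none; the composite key removes that hot spot.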

72 Scalable Information Systems: Characteristics
- Ingest data from multiple sources: duplicate document detection
- Process multiple types of data sources: structured and unstructured data integration (SIRE)
- Use scalable (parallel) technology systems: Parallel SIRE
- Integrate retrieved data to yield answers: IIT Mediator

73 Current Enterprise Portals 73

74 Next Generation Search Engines 74


79 IIT Production Mediator Logs
271 queries, 142 with user feedback:
- Happy: 38%
- Satisfied: 13%
- Ok: 13%
- Dissatisfied: 13%
- Unhappy: 23%

80 Technology Transfer
Industrial:
- America Online
- BIT Systems
- Harris Corporation
- IITRI
- NCR
- Unnamed others (proprietary)
- Assorted dot-dead companies (hopefully not due to our technology!)
Government:
- National Institutes of Health
- Additional others

81 Information Retrieval Laboratory
Faculty members: O. Frieder, D. Grossman, N. Goharian, X. Li, P. Wan
Senior affiliates: A. Chowdhury (AOL), D. Holmes (NCR), M. C. McCabe (US Gov.)
Students: Steven Beitzel, Rebecca Cathey, Ankit Jain, Eric Jensen, Vincent Nguyen, Angelo Pilotto, Michael Saelee, Chih-Wei Yi, Wang Yu



More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe CHAPTER 19 Query Optimization Introduction Query optimization Conducted by a query optimizer in a DBMS Goal: select best available strategy for executing query Based on information available Most RDBMSs

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

VERITAS Storage Foundation 4.0 TM for Databases

VERITAS Storage Foundation 4.0 TM for Databases VERITAS Storage Foundation 4.0 TM for Databases Powerful Manageability, High Availability and Superior Performance for Oracle, DB2 and Sybase Databases Enterprises today are experiencing tremendous growth

More information

Information Retrieval

Information Retrieval Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid

More information

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU !2 MapReduce Overview! Sometimes a single computer cannot process data or takes too long traditional serial programming is not always

More information

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012 Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 6 April 12, 2012 1 Acknowledgements: The slides are provided by Nikolaus Augsten

More information

New Magic Quadrant Definitions

New Magic Quadrant Definitions Markets, A. Butler Research Note 11 September 2003 Magic Quadrant for Enterprise Servers, 2003 This new Magic Quadrant addresses the changes in server workloads in large organizations. It covers rack-optimized

More information