On Scalable Information Retrieval Systems

1 On Scalable Information Retrieval Systems Ophir Frieder 1

2 Scalable Search Structured Semi-structured Text, video, etc. Answer Engine 2

3 Scalable Information Systems: Characteristics Ingest data from multiple sources Duplicate document detection Process multiple type data sources Structured & unstructured data integration (SIRE) Use scalable (parallel) technology systems Parallel SIRE Integrate retrieved data to yield answers IIT Mediator 3

4 Duplicate Document Detection Union of data obtained from multiple sources often contains duplicates Duplicates affect both retrieval effectiveness and retrieval efficiency Duplicate detection is either syntactic or semantic, where semantic is far more challenging. 4

5 What is a Duplicate Document? Semantic Similarity If a document contains roughly the same semantic content it is a duplicate whether or not it is a precise syntactic match. 5

6 Duplicate Detection Techniques
Main duplicate detection approaches:
- Hash-based approaches (syntactic)
- Information retrieval techniques
- Resemblance: $r(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$, where $S(D)$ is the set of shingles of document $D$

7 Duplicate Detection with IR
- Using documents as queries, rank all documents in the collection with similar terms.
- Documents with equivalent weights are duplicates.
- For each query term, the corresponding posting list entries must be retrieved; for large collections, the I/O costs are prohibitive.

8 Duplicate Detection with Resemblance
- Calculate the resemblance of each document to every other document with matching features.
- Divide the document into shingles (X terms each), each used to create a unique hash.
- Calculate the resemblance based on hashes rather than terms.
- $N^2$ comparison approaches are not feasible for large collections.
- Optimizations: filter which shingles to use, e.g., every 25th shingle or a combination of multiple shingles.
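The shingle-based resemblance described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation evaluated in the talk: the shingle size of 3, the SHA-1 hash, and the function names are all assumptions, and no shingle filtering is applied.

```python
import hashlib


def shingles(text, size=3):
    """Return the set of hashed shingles (runs of `size` consecutive terms)."""
    terms = text.lower().split()
    return {
        hashlib.sha1(" ".join(terms[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(terms) - size + 1)
    }


def resemblance(a, b, size=3):
    """Shared shingles divided by all distinct shingles of the two documents."""
    sa, sb = shingles(a, size), shingles(b, size)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Comparing every pair of documents this way is the quadratic cost the slide warns about, which is why shingle filtering is used in practice.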

9 Issues with Prior Approaches
- Hash techniques: not resilient to small changes in document representation.
- IR techniques: slow for large collections.
- Resemblance: documents are clustered into multiple clusters due to partitioning; duplicate classification is difficult.

10 Combined (I-Match) Algorithm
1. Tokenize the document.
2. Create a list of unique tokens.
3. Filter tokens. (What to filter?)
4. Create a unique hash of the remaining tokens.
5. Search the collection for duplicate hashes.

11 Filtration Based On Collection Statistics
1. Sort terms according to $idf = \log(N / n)$, where N = number of documents in the collection and n = number of documents the term occurs in.
2. Filter unwanted components.
[Figure labels, candidate filters: high and low 25%, low 25%, high 25%, mid 50%]
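The combined algorithm and its idf-based filtration might look roughly like the following sketch. It is written under stated assumptions: the filter keeps terms whose idf falls inside a caller-supplied `[low, high]` band (the slides describe percentile bands instead), SHA-1 stands in for the unspecified hash, and every function name is hypothetical.

```python
import hashlib
import math


def idf_table(docs):
    """docs: dict of docid -> text. Returns term -> log(N / n)."""
    df = {}
    for text in docs.values():
        for term in set(text.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(len(docs) / n) for term, n in df.items()}


def imatch_hash(text, idf, low, high):
    """Hash the sorted unique tokens whose idf lies in [low, high]."""
    kept = sorted({t for t in text.lower().split() if low <= idf.get(t, 0.0) <= high})
    if not kept:
        return None  # nothing survived filtering, so no signature
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def find_duplicates(docs, low, high):
    """Group docids that share the same filtered-token hash."""
    idf = idf_table(docs)
    groups = {}
    for docid, text in docs.items():
        signature = imatch_hash(text, idf, low, high)
        if signature is not None:
            groups.setdefault(signature, []).append(docid)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Because the hash is taken over a sorted set of filtered tokens, word order and changes to very common words do not alter the signature, and each document is hashed only once.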

12 LA Times Collection
Create random duplicates to test effectiveness:
1. For every ith word, pick a random number from 1 to 10.
2. If the number is higher than the random threshold (call it alpha), pick a number from 1 to 3.
3. If that number is 1, remove the word.
4. If it is 2, swap the word with the word at position i+1.
5. If it is 3, add a word (picked at random from the term list).
6. Insert the duplicate into the collection.
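The perturbation procedure above could be sketched as follows. The function name, the `seed` parameter, and the choice to insert an added word before the current word are assumptions made for illustration; the slide does not specify them.

```python
import random


def make_near_duplicate(words, term_list, alpha, seed=None):
    """Return a perturbed copy of `words` following the remove/swap/add rules."""
    rng = random.Random(seed)
    words = list(words)
    out = []
    i = 0
    while i < len(words):
        if rng.randint(1, 10) > alpha:          # perturb this position
            action = rng.randint(1, 3)
            if action == 1:                     # 1: remove the word
                i += 1
                continue
            if action == 2 and i + 1 < len(words):
                # 2: swap with the word at position i+1
                words[i], words[i + 1] = words[i + 1], words[i]
            elif action == 3:                   # 3: add a random term
                out.append(rng.choice(term_list))
        out.append(words[i])
        i += 1
    return out
```

With `alpha=10` no position is ever perturbed, and lower values of `alpha` produce progressively noisier duplicates.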

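13 Document Clusters Formed
[Table: number of document clusters formed per test document under Resemblance, Resemblance-Opt, and Combined (I-Match), for ten LA Times documents plus their average; cell values missing]
I-Match did not produce any false positives, while Resemblance did.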

14 Processing Time (2GB)
Algorithm         Mean Time   Std Deviation   Median Time
Resemblance
Resemblance-Opt
I-Match
Syntactic         65          N/A             N/A

15 Scalable Information Systems: Characteristics Ingest data from multiple sources Duplicate document detection Process multiple type data sources Structured & unstructured data integration (SIRE) Use scalable (parallel) technology systems Parallel SIRE Integrate retrieved data to yield answers IIT Mediator 15

16 SIRE Goals
- Integrate structured and semi-structured data using a framework that also integrates unstructured data.
- Improve the accuracy of retrieved results.
- Support scalability: data volume and retrieval speeds.
- Support legacy data.

17 Portability The information retrieval prototype was implemented on the following relational platforms: NCR Teradata DBC-machines Microsoft SQL Server Sybase Oracle IBM DB2 and SQL/DS 17

18 Relational Inverted Index
All inverted index entries have the form <term> <list of documents>. For example, vehicle: D1, D3, D4 results in one row per (term, docid) pair:
term     docid
vehicle  D1
vehicle  D3
vehicle  D4
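As a rough illustration, the flattening of posting lists into relational rows can be tried with Python's built-in `sqlite3` module. The table is named `idx` here because `INDEX` is a reserved word in SQLite; that name, the function name, and the extra `truck` entry are assumptions for the example.

```python
import sqlite3


def load_inverted_index(postings):
    """Store each (term, docid) pair of a posting list as its own row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idx (term TEXT, docid TEXT)")
    conn.executemany(
        "INSERT INTO idx VALUES (?, ?)",
        [(term, docid) for term, docids in postings.items() for docid in docids],
    )
    return conn


conn = load_inverted_index({"vehicle": ["D1", "D3", "D4"], "truck": ["D3"]})
rows = conn.execute(
    "SELECT term, docid FROM idx WHERE term = ? ORDER BY docid", ("vehicle",)
).fetchall()
```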

19 Text Retrieval Conference (TREC) Sample Document
<DOC>
<DOCNO> AP </DOCNO>
<FILEID>AP-NR EST</FILEID>
<FIRST>u i BC-Japan-Stocks </FIRST>
<SECOND>BC-Japan-Stocks,0026</SECOND>
<HEAD>Stocks Up In Tokyo</HEAD>
<DATELINE>TOKYO (AP) </DATELINE>
<TEXT>
The Nikkei Stock Average closed at 29, points up points on the Tokyo Stock Exchange Wednesday.
</TEXT>
</DOC>

20 Relational Document Representation (Term Processing)
DOCUMENT
docid  docname  headline            dateline
28     AP       Stocks Up In Tokyo  TOKYO (AP)
INDEX
docid  termcnt  term
28     1        nikkei
28     2        stock
28     1        average
28     1        closed
28     2        points
28     1        up
28     1        tokyo
28     1        exchange
28     1        wednesday
TERM (columns: term, df, idf; df and idf values missing)
average, closed, exchange, nikkei, points, stock, tokyo, up, wednesday

21 Simplistic Models: Keyword and Boolean Searches 21

22 Relational Approach: Keyword Search Techniques
Keyword search:
    select i.docid
    from INDEX i, QUERY q
    where i.term = q.term
Keyword search with stop word list:
    select i.docid
    from INDEX i, QUERY q
    where i.term = q.term
      and i.term not in (select s.term from STOPLIST s)
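A small runnable sketch of keyword search with a stop list, using Python's built-in `sqlite3` module. The table names `idx` and `qry` (chosen to avoid SQLite keywords), the function name, and the `DISTINCT` and `NOT IN` formulation are assumptions for this example rather than the prototype's exact SQL.

```python
import sqlite3


def keyword_search(index_rows, query_terms, stop_terms):
    """Return sorted docids matching any query term that is not a stop word."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE idx (docid INTEGER, term TEXT);
        CREATE TABLE qry (term TEXT);
        CREATE TABLE stoplist (term TEXT);
    """)
    conn.executemany("INSERT INTO idx VALUES (?, ?)", index_rows)
    conn.executemany("INSERT INTO qry VALUES (?)", [(t,) for t in query_terms])
    conn.executemany("INSERT INTO stoplist VALUES (?)", [(t,) for t in stop_terms])
    rows = conn.execute("""
        SELECT DISTINCT i.docid
        FROM idx i, qry q
        WHERE i.term = q.term
          AND i.term NOT IN (SELECT term FROM stoplist)
        ORDER BY i.docid
    """).fetchall()
    conn.close()
    return [r[0] for r in rows]
```

Excluding stop words with a subquery avoids joining against every row of the stop list, which would otherwise let a stop word through whenever the list holds more than one entry.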

23 Relational Approach: Boolean Search Techniques
OR query, written as a union:
    select docid from INDEX where term = term1
    union
    select docid from INDEX where term = term2
    union
    select docid from INDEX where term = term3
    ...
    union
    select docid from INDEX where term = termn
OR query, written as a single select:
    select docid
    from INDEX
    where term = term1 OR
          term = term2 OR
          term = term3 OR
          ...
          term = termn

24 Relational Approach: Boolean Search Techniques
AND query, written as an intersection:
    select docid from INDEX where term = term1
    intersect
    select docid from INDEX where term = term2
    intersect
    select docid from INDEX where term = term3
    ...
    intersect
    select docid from INDEX where term = termn
AND query, written as an n-way self-join:
    select a.docid
    from INDEX a, INDEX b, INDEX c, ... INDEX n
    where a.term = term1 AND
          b.term = term2 AND
          c.term = term3 AND
          ...
          n.term = termn AND
          a.docid = b.docid AND
          b.docid = c.docid AND
          ...
          (n-1).docid = n.docid

25 Fixed Join-Count AND Queries
Find all documents that contain all of the terms found in the QUERY relation:
    select i.docid
    from INDEX i, QUERY q
    where i.term = q.term
    group by i.docid
    having count(distinct i.term) = (select count(*) from QUERY)

26 TAND Queries
Find all documents that contain at least X of the terms found in the QUERY relation:
    select i.docid
    from INDEX i, QUERY q
    where i.term = q.term
    group by i.docid
    having count(distinct i.term) >= X
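The fixed join-count AND query and the TAND query differ only in their HAVING clause, which the sketch below makes explicit using Python's built-in `sqlite3` module. The table names `idx` and `qry`, the function name, and the `min_terms` parameter are assumptions for illustration.

```python
import sqlite3


def match_documents(index_rows, query_terms, min_terms=None):
    """Docids containing all query terms, or at least `min_terms` of them (TAND)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE idx (docid INTEGER, term TEXT);
        CREATE TABLE qry (term TEXT);
    """)
    conn.executemany("INSERT INTO idx VALUES (?, ?)", index_rows)
    conn.executemany("INSERT INTO qry VALUES (?)", [(t,) for t in set(query_terms)])
    if min_terms is None:
        having, params = "= (SELECT COUNT(*) FROM qry)", ()   # AND query
    else:
        having, params = ">= ?", (min_terms,)                 # TAND query
    rows = conn.execute(f"""
        SELECT i.docid
        FROM idx i, qry q
        WHERE i.term = q.term
        GROUP BY i.docid
        HAVING COUNT(DISTINCT i.term) {having}
        ORDER BY i.docid
    """, params).fetchall()
    conn.close()
    return [r[0] for r in rows]
```

Unlike the intersect and self-join formulations, the number of joins here stays fixed no matter how many terms the query contains.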

27 Relevance Ranking: Vector Space & Probabilistic Models 27

28 Vector Space Model
- Term Frequency ($tf_{ik}$): number of occurrences of term $t_k$ in document $i$
- Document Frequency ($df_j$): number of documents which contain $t_j$
- Inverse Document Frequency ($idf_j$): $\log(d / df_j)$, where $d$ is the total number of documents
Notes:
- idf is a measure of the uniqueness of a term across the collection
- tf is the frequency of a term in a given document

29 Vector Space Model: Sample Relational Query
List all documents in the order of their similarity coefficient, where the coefficient is computed using the dot product.
    SELECT d.docid, d.docname, SUM(i.termcnt * t.idf * q.termcnt * t.idf)
    FROM DOCUMENT d, QUERY q, INDEX i, TERM t
    WHERE q.term = i.term AND
          q.term = t.term AND
          d.docid = i.docid
    GROUP BY d.docid, d.docname
    ORDER BY 3 DESC
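The dot-product ranking query can be exercised end to end with Python's built-in `sqlite3` module. In this sketch the idf values are computed in Python with a base-10 logarithm (the slides do not state the base), the DOCUMENT join is omitted for brevity, and the table names `idx` and `qry` and the function name are assumptions.

```python
import math
import sqlite3


def rank_documents(docs, query):
    """docs: docid -> list of terms; query: list of terms. Best match first."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE idx (docid INTEGER, term TEXT, termcnt INTEGER);
        CREATE TABLE term (term TEXT, idf REAL);
        CREATE TABLE qry (term TEXT, termcnt INTEGER);
    """)
    df = {}
    for docid, terms in docs.items():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
            conn.execute("INSERT INTO idx VALUES (?, ?, ?)", (docid, t, terms.count(t)))
    conn.executemany(
        "INSERT INTO term VALUES (?, ?)",
        [(t, math.log10(len(docs) / n)) for t, n in df.items()],
    )
    conn.executemany(
        "INSERT INTO qry VALUES (?, ?)", [(t, query.count(t)) for t in set(query)]
    )
    rows = conn.execute("""
        SELECT i.docid, SUM(i.termcnt * t.idf * q.termcnt * t.idf) AS score
        FROM qry q, idx i, term t
        WHERE q.term = i.term AND q.term = t.term
        GROUP BY i.docid
        ORDER BY score DESC
    """).fetchall()
    conn.close()
    return rows
```

Documents that share no term with the query never enter the join, so they are simply absent from the result instead of appearing with a score of zero.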

30 Similarity Coefficients
Several similarity coefficients based on the query vector X and the document vector Y are defined:
Inner Product: $\sum_{i=1}^{t} x_i y_i$
Cosine Coefficient: $\frac{\sum_{i=1}^{t} x_i y_i}{\sqrt{\sum_{i=1}^{t} x_i^2 \, \sum_{i=1}^{t} y_i^2}}$
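For reference, the two coefficients can be written as a short Python sketch over equal-length weight vectors. The function names are illustrative, and returning 0.0 for a zero-length vector is an assumption made to avoid a division by zero.

```python
import math


def inner_product(x, y):
    """Sum of pairwise products of the query and document weights."""
    return sum(xi * yi for xi, yi in zip(x, y))


def cosine(x, y):
    """Inner product normalized by the lengths of both vectors."""
    denom = math.sqrt(inner_product(x, x) * inner_product(y, y))
    return inner_product(x, y) / denom if denom else 0.0
```

The cosine coefficient removes the advantage that long documents get under the plain inner product.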

31 SQL for Probabilistic Similarity Measure
$\sum_{i=1}^{num\_terms} \log\left(\frac{(numdocs - df_i) + 0.5}{df_i + 0.5}\right) \cdot \frac{2.2 \, tf_{id}}{0.3 + 0.75 \, (doclength / avgdoclength) + tf_{id}} \cdot qtf_i$
    SELECT d.docid, d.docname,
           SUM( LOG(((NumDocs - t.df) + 0.5) / (t.df + 0.5)) *
                ((2.2 * i.tf) / (.3 + ((.75 * d.doclen) / avgdoclen) + i.tf)) *
                q.termcnt )
    FROM INDEX i, TERM t, DOCUMENT d, QUERY q
    WHERE i.term = t.term AND
          i.docid = d.docid AND
          t.term = q.term
    GROUP BY d.docid, d.docname
    ORDER BY 3 DESC;
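The same measure, transcribed term by term from the SQL above into Python as a sketch for a single document. The constants 2.2, 0.3, and 0.75 are taken directly from the slide's query; the function and parameter names are assumptions.

```python
import math


def probabilistic_score(query_tf, doc_tf, df, num_docs, doc_len, avg_doc_len):
    """Score one document; the *_tf arguments map term -> frequency."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue  # the SQL join drops terms absent from the document
        idf_part = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.75 * doc_len / avg_doc_len + tf)
        score += idf_part * tf_part * qtf
    return score
```

Note that the first factor becomes negative for a term that occurs in more than half of the collection, a property the SQL version shares.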

33 Relational Query Representation (Term Processing)
ORIGINAL QUERY: nikkei stock exchange american stock exchange
QUERY
term      termcnt
nikkei    1
stock     2
exchange  2
american  1
SQL (Query Weight * Document Weight):
    SELECT d.docid, d.docname, SUM(a.termcnt * c.idf * b.termcnt * c.idf)
    FROM QUERY a, INDEX b, TERM c, DOCUMENT d
    WHERE a.term = b.term AND
          a.term = c.term AND
          b.docid = d.docid
    GROUP BY d.docid, d.docname
    ORDER BY 3 DESC

34 Sample Term Query Result (Inner/Dot Product)
[Table: columns Term, Q-Termcnt, Q-Weight, D-Termcnt, D-Weight, Q-Wt * D-Wt; rows nikkei, stock, exchange, american; final row Similarity Coefficient; cell values missing]

35 Simple Phrase Parsing
Simple phrase parser with the following rules:
- Phrases do not include stop terms
- Phrases do not span across punctuation
Example: The Nikkei Stock Average closed at 29, points up points, on the Tokyo Stock Exchange Wednesday.
Phrases: nikkei stock, stock average, average closed, points up, tokyo stock, stock exchange, exchange wednesday
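A two-word phrase parser obeying both rules might be sketched as follows. The stop list, the treatment of every non-word character as punctuation, and the function name are assumptions; the slides do not say how the prototype tokenizes.

```python
import re

STOP_TERMS = {"the", "at", "on", "in", "a", "of"}  # hypothetical stop list


def parse_phrases(text, stop_terms=STOP_TERMS):
    """Return adjacent two-word phrases that avoid stop terms and punctuation."""
    phrases = []
    # Punctuation ends a segment, so no phrase can span across it.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        previous = None
        for token in segment.split():
            if token in stop_terms:
                previous = None  # a stop term breaks the current run
                continue
            if previous is not None:
                phrases.append(f"{previous} {token}")
            previous = token
    return phrases
```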

36 Relational Document Representation (Phrase Processing)
DOCUMENT
docid  docname  headline            dateline
28     AP       Stocks Up In Tokyo  TOKYO (AP)
INDEX
docid  termcnt  phrase
28     1        nikkei stock
28     1        stock average
28     1        average closed
28     1        points up
28     1        tokyo stock
28     1        stock exchange
28     1        exchange wednesday
PHRASE (columns: phrase, df, idf; df and idf values missing)
average closed, exchange wednesday, nikkei stock, points up, stock average, stock exchange, tokyo stock

37 Enhancing Accuracy With Relevance Feedback 37

38 Relevance Feedback
The modification of the search process so as to improve accuracy by incorporating information obtained from prior relevance judgments.
[Diagram: Q0 -> database search -> top relevant documents -> new terms added to Q0 give Q1 -> database search -> matching documents]

39 Relevance Feedback Example
Q0: tunnel under English Channel
Top-ranked document: "The tunnel under the English channel is often called a Chunnel"
Q1: tunnel under English Channel Chunnel
[Diagram: for each query, the documents retrieved from the document collection, split into relevant retrieved and not relevant]

40 Feedback Mechanisms
- Manual: relevant documents are identified manually, and new terms are selected either manually or automatically.
- Automatic: relevant documents are identified automatically by assuming that the top-ranked documents are relevant, and new terms are selected automatically.
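Automatic feedback can be sketched as below. The term selection rule, which scores each candidate by the number of top-ranked documents containing it times its idf, is an assumption chosen for illustration, as are the function and parameter names; the slides list term selection only as a tunable parameter.

```python
import math


def expand_query(query_terms, ranked_docs, df, num_docs, top_docs=2, new_terms=1):
    """ranked_docs: term lists, best first. Returns the query plus feedback terms."""
    counts = {}
    for terms in ranked_docs[:top_docs]:      # assume the top-ranked docs are relevant
        for term in set(terms):
            if term not in query_terms:
                counts[term] = counts.get(term, 0) + 1

    def weight(term):
        # Assumed rule: documents containing the term, scaled by idf.
        return counts[term] * math.log(num_docs / df[term])

    best = sorted(counts, key=lambda t: (-weight(t), t))[:new_terms]
    return list(query_terms) + best
```

On the Chunnel example, a rare term shared by the top documents outranks frequent words such as "the", so it is the term added to the expanded query.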

41 Relevance Feedback Parameters Various techniques can be used to improve the relevance feedback process. Number of Top-Ranked Documents Number of Feedback Terms Feedback Term Selection Techniques Term Weighting Document Clustering Relevance Feedback Thresholding Term Frequency Cutoff Points Query Expansion Using a Thesaurus 41

42 Relevance Feedback Evaluation
[Figure: improvement from relevance feedback with nidf weights; x-axis: Recall, from 0.00 to 1.00 in steps of 0.20; series: "nidf, no feedback" and "nidf, feedback 10 terms"]

43 Comparative TREC-8 Results
[Table: columns Run, IIT Avg. Precision, # Above Median, # At Median, # Below Median, # Best, # Worst; rows iit00t, iit00td, iit00tde, iit00m; cell values missing]

44 Performance Optimizations: Query Thresholds & Clustered Indexes 44

45 Query Thresholds
- Consider a query with terms $t_1, t_2, t_3, \ldots, t_n$.
- Sort the terms by their frequency across the collection (least frequent terms appear first).
- Define the threshold as the percentage of the original query's terms that are kept in a newly created, reduced query.
[Figure: ten sorted terms, Term 1 through Term 10, with cutoffs marked at Threshold = 20, Threshold = 50, and Threshold = 80]
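Query reduction by threshold is simple enough to sketch directly. The function name and the decision to round a fractional term count up are assumptions; the slides do not say how partial terms are handled.

```python
import math


def reduce_query(terms, df, threshold):
    """Keep the rarest `threshold` percent of the query terms."""
    ordered = sorted(terms, key=lambda t: df[t])   # least frequent first
    keep = math.ceil(len(ordered) * threshold / 100)
    return ordered[:keep]
```

With ten terms, thresholds of 20, 50, and 80 keep the two, five, and eight rarest terms respectively, matching the cutoffs in the figure.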

46 Relevant Retrieved as a Function of Query Thresholds
[Figure: y-axis: Relevant Retrieved; x-axis: Query Threshold (Percent)]

47 Run Time as a Function of Query Thresholds
[Figure: y-axis: CPU; x-axis: Query Threshold (Percent)]

48 Relevant New Documents Per CPU Cycle
[Table: columns Threshold, Relevant Retrieved, CPU Cycles, New Relevant Docs per Cycle; cell values missing]

49 Caveat: Logical Design versus Physical Implementation
While the design shown replicates the document identifier, the physical implementation actually uses clustered tables. That is, attribute values that are logically repeated many times are physically clustered by attribute value, which eliminates the replication by storing only one copy of each unique attribute value. (Note the clustered tables in Oracle implementations.) The I/O to retrieve a posting list is achieved via a grouped block read, as opposed to retrieval across distributed storage.

50 Clustered Indexes: Posting List Processing
TRADITIONAL (one row per posting):
term   docid  tf
stock  1      1
stock  28     2
stock  (seven more rows; docid and tf values missing)
CLUSTERED (one row per term):
term   (docid tf) pairs
stock  { (1 1), (28 2), ( ), ( ), ( ), ( ), ( ), ( ), ( ) }

51 Clustered Indexes: Relevance Feedback Processing
TRADITIONAL (one row per term):
docid  termcnt  term
28     1        nikkei
28     2        stock
28     1        average
28     1        closed
28     2        points
28     1        up
28     1        tokyo
28     1        exchange
28     1        wednesday
CLUSTERED (one row per document):
docid  (termcnt term) pairs
28     { (1 nikkei), (2 stock), (1 average), (1 closed), (2 points), (1 up), (1 tokyo), (1 exchange), (1 wednesday) }

52 Technology Transfer
Concern: An Academic Solution Without a Public Commercial World Problem
Resolution: National Institutes of Health's National Center for Complementary and Alternative Medicine Citation Index

53 NIH-NCCAM Application
- Short query lengths necessitate query expansion.
- Advanced search techniques are needed, since they are used by roughly 30% of the users.
- Efficient processing is critical; filtration is an option.
- Scalability is not currently a concern, but future needs may make it one.

54 NCCAM - Search Interface 54

55 NCCAM Results Page Interface 55

56 System Architecture Internet Servlet Engine User Oracle DBMS HTTP Server Sun ES

57 Servlet Architecture HTTP Search Request Generic Search Servlet Dynamic HTML MRU Cache Query Engine RDBMS Connection Pool RDBMS 57

58 Scalable Information Systems: Characteristics Ingest data from multiple sources Duplicate document detection Process multiple type data sources Structured & unstructured data integration (SIRE) Use scalable (parallel) technology systems Parallel SIRE Integrate retrieved data to yield answers IIT Mediator 58

59 Scalability via Parallelism: Not My Problem: It's the Database Vendors' Problem

60 Parallel Information Retrieval Most parallel information retrieval systems require custom hardware and software which reduces portability across systems. Parallelism in a relational information retrieval system is a function of the database management system and does not require custom hardware and software. 60

61 TPC- C Benchmarks TPC-C BENCHMARK RESULTS These results are valid as of date 7/5/2000 9:46:09 AM TPC-C Results - Revision 3.X Company System Spec. RevitpmC $/tpmc Total Sys. Currency Database Operating TP MonitoServer CPU# Server CPCluster # Front En Date SubmAvailability ALR Revolution US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 6 N 6 11/6/97 12/31/97 ALR ALR Revol US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 6 N 5 7/9/97 11/30/97 ALR Revolution US $ Microsoft Microsoft W-none- Intel Pentiu 4 N 4 4/4/97 4/30/97 Acer AcerAltos US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 5/17/99 5/17/99 Acer AcerAltos US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 4 N 5 2/16/98 2/16/98 Amdahl EnVista Fr US $ Microsoft Microsoft W-none- Intel Pentiu 4 N 6 3/28/97 3/31/97 Bull Escala EP US $ Oracle 8i EIBM AIX 4.3WebshpereIBM RS64-8 N 6 6/20/00 9/30/00 Bull Escala T US $ Oracle 8i VIBM AIX 4.3WebshpereIBM RS64-6 N 5 6/20/00 6/20/00 Bull EPC 440 c Euros Oracle8i 8IBM AIX 4.3IBM TXSer IBM RS64-4 N 8 12/2/99 12/2/99 Bull Escala EP US $ Oracle 8i VIBM AIX 4.3IBM TXSer IBM RS64 24 N 15 11/5/99 3/1/00 Bull Express US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 8 N 8 3/26/98 6/15/98 Bull ESCALA P US $ Oracle7 7.IBM AIX 4.1Tuxedo ETMotorola P 32 Y 16 1/16/97 6/30/97 Bull ESCALA D US $ Sybase SQIBM AIX 4.2Tuxedo 4.2Motorola P 8 N 6 11/15/96 1/31/97 Bull ESCALA D US $ Oracle7 7.IBM AIX 4.1Tuxedo 4.2Motorola P 8 N 7 6/3/96 11/30/96 Bull ESCALA D US $ Oracle7 7.IBM AIX 4.1Tuxedo 4.2Motorola P 8 N 5 2/1/96 7/30/96 Bull ESCALA D US $ Informix OIBM AIX 4.1Tuxedo 4.2Motorola P 4 N 3 5/9/95 5/1/95 Bull ESCALA D US $ Informix OIBM AIX 4.1Tuxedo 4.2Motorola P 8 N 5 5/9/95 6/1/95 Bull ESCALA R US $ Informix OIBM AIX 4.1Tuxedo 4.2Motorola P 4 N 3 5/9/95 5/1/95 Bull ESCALA R US $ Informix OIBM AIX 4.1Tuxedo 4.2Motorola P 8 N 5 5/9/95 6/1/95 Compaq ProLiant DL US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 4 N 4 6/23/00 8/1/00 Compaq ProLiant US $ Microsoft Microsoft WMicrosoft 
CIntel Pentiu 6 N 4 4/7/00 4/7/00 Compaq ML US $ Microsoft Microsoft WCompaq DIntel Pentiu 2 N 2 3/8/00 8/1/00 Compaq ProLiant PD US $ Oracle 8i VMicrosoft WCompaq DIntel Pentiu 48 Y 6 2/11/00 3/31/00 Compaq AlphaServe US $ Sybase AdCompaq TrApplication Alphachip 4 N 4 2/9/00 3/17/00 Compaq PDC/O US $ Oracle 8i VMicrosoft WCompaq DIntel Pentiu 48 Y 12 12/23/99 3/31/00 Compaq ProLiant US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 8 N 5 12/20/99 12/31/99 Compaq ProLiant US $ Microsoft Microsoft WMicrosoft C Intel Pentiu 1 N 1 10/12/99 12/31/99 Compaq ProLiant US $ Microsoft Microsoft WMicrosoft C Intel Pentiu 2 N 3 9/29/99 12/31/99 Compaq ProLiant US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 4 N 4 9/20/99 12/31/99 61

62 More Vendors TPC- C Benchmarks DG AViiON AV US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 4 3/4/99 3/25/99 DG AViiON AV US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 4 1/22/99 3/25/99 DG AViiON AV US $ Microsoft Microsoft WTopEnd v2 Intel Pentiu 8 N 5 3/2/98 5/31/98 DG AViiON US $ Microsoft Microsoft WTopEnd v2 Intel Pentiu 6 N 9 11/21/97 2/28/98 Dell PowerEdge US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 4 N 4 5/26/00 8/1/00 Dell PowerEdge US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 4 N 4 5/26/00 8/1/00 Dell PowerEdge US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 8 N 5 11/22/99 12/31/99 Dell PowerEdge US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 5/28/99 9/1/99 Dell PowerEdge US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 5/28/99 9/1/99 Dell PowerEdge US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 3/28/99 6/1/99 Dell PowerEdge US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 3/28/99 6/1/99 Dell PowerEdge US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 2 1/5/98 1/5/98 Dell PowerEdge US $ Microsoft Microsoft W-none- Intel Pentiu 4 N 4 3/12/97 4/11/97 Fujitsu Si Primergy N Euros Microsoft Microsoft WMicrosoft CIntel Pentiu 8 N 8 7/5/00 10/1/00 Fujitsu Si Primergy K Euros Microsoft Microsoft WMicrosoft CIntel Pentiu 8 N 5 12/13/99 1/1/00 Fujitsu Si Primergy US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 4 3/17/99 4/26/99 Fujitsu Si Primergy US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 4 12/23/98 2/28/99 Fujitsu Si Primergy US $ Microsoft Microsoft WOpen UTMIntel Pentiu 4 N 7 9/28/98 12/29/98 Fujitsu Si RM600 mo US $ Informix OSNI ReliantOpenUTM MIPS R N 14 3/14/98 3/1/98 Fujitsu Si Primergy US $ Microsoft Microsoft WSNI openuintel Pentiu 4 N 6 12/1/97 1/1/98 Fujitsu Si RM600 Mo US $ Informix OPyramid ReOpenUTM MIPS R100 8 N 7 7/11/97 7/31/97 Fujitsu Si Primergy US $ Microsoft Microsoft W-none- Intel Pentiu 4 N 5 2/14/97 3/31/97 Fujitsu Si RM400-C US $ Oracle7 7.SNI ReliantOpenUTM MIPS R100 1 N 2 
2/14/97 3/31/97 Fujitsu Si RM600 Mo US $ Informix OSNI Relian OpenUTM MIPS R N 7 12/20/96 6/30/97 Fujitsu Si RM600 Mo US $ Informix OSNI Relian Open UTMMIPS R N 6 9/9/96 9/9/96 Fujitsu Si RM 600 Mo US $ Informix OSNI SINIX-Tuxedo 4.2MIPS R N 5 1/5/96 5/1/96 Fujitsu Si RM 600 Mo US $ Informix OSNI SINIX-Tuxedo MIPS R440 8 N 3 11/13/95 5/1/96 Fujitsu Si RM 400 Mo US $ Informix OSNI SINIX- Tuxedo 4.2MIPS R440 1 N 1 7/20/95 11/1/95 Fujitsu/ICLGRANPOW US $ Fujitsu/ICLMicrosoft WBEA TuxedIntel Pentiu 4 N 6 7/19/99 1/16/00 62

63 More Vendors TPC- C Benchmarks HP HP NetServ US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 4 7/8/97 11/30/97 HP NetServer L US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 4 4/3/97 4/3/97 HP NetServer L US $ Microsoft Microsoft WTuxedo 4.1Intel Pentiu 2 N 2 12/16/96 2/28/97 IBM Netfinity US $ IBM DB2 UMicrosoft WMicrosoft CIntel Pentiu 4 Y 96 7/3/00 12/7/00 IBM RS/6000 E US $ Oracle 8i EIBM AIX 4.3WebshpereIBM RS64-8 N 6 5/31/00 9/30/00 IBM RS/6000 E US $ Oracle8i EIBM AIX 4.3WebshpereIBM RS64-6 N 5 5/9/00 6/9/00 IBM Netfinity US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 8 N 6 2/25/00 2/25/00 IBM RISC Syste US $ Oracle 8i VIBM AIX 4.3IBM TXSer IBM RS64 24 N 15 10/29/99 3/1/00 IBM RS 6000 E US $ Oracle OraIBM AIX 4.3IBM TXSer IBM RS64 60 Y 60 6/30/99 6/30/99 IBM AS/400e M US $ IBM DB2 foibm OS/40BEA TuxedIBM Power 12 N 6 6/7/99 6/1/99 IBM RISC Syste US $ Oracle v8ibm AIX 4.3IBM TXSer IBM RS64-4 N 8 5/28/99 11/19/99 IBM AS/400e Se US $ IBM DB2 foibm OS/40CICS for A IBM Power 12 N 97 9/1/98 9/11/98 IBM RS/6000 E US $ Oracle OraIBM AIX 4.3IBM TXSer IBM Power 12 N 15 8/11/98 1/21/99 IBM RS/6000 E US $ Oracle OraIBM AIX 4.3IBM TXSer IBM Power 12 N 8 3/3/98 9/2/98 IBM RISC Syste US $ Sybase AdIBM AIX 4.2BEA TuxedIBM Power 4 N 5 2/12/98 2/12/98 IBM AS/400e S US $ IBM DB2 foibm OS/40CICS for A IBM AS A3 12 N 64 8/18/97 8/29/97 IBM RS/6000 W US $ Sybase AdIBM AIX 4.2BEA TuxedIBM Power 4 N 3 5/12/97 9/30/97 IBM RS/6000 E US $ Sybase SQIBM AIX 4.2BEA TuxedIBM Power 8 N 3 5/6/97 9/30/97 IBM RS/6000 E US $ Sybase SQIBM AIX 4.2BEA TuxedIBM Power 8 N 3 5/6/97 9/30/97 IBM RS/6000 W US $ Sybase SQIBM AIX 4.2BEA TuxedIBM Power 4 N 3 4/7/97 4/25/97 IBM RISC Syste US $ Oracle7 7.IBM AIX 4.1Tuxedo ETIBM Power 32 Y 16 12/10/96 6/30/97 IBM RISC Syste US $ Oracle7 7.IBM AIX 4.1Tuxedo ETIBM Power 32 Y 16 12/10/96 6/30/97 IBM RS6000 Po US $ Sybase SQIBM AIX 4.1BEA TuxedMotorola P 8 N 6 7/23/96 12/15/96 IBM RS6000 Po US $ Sybase SQIBM AIX 4.1BEA TuxedMotorola P 8 N 6 7/23/96 
12/15/96 IntergraphInterServe US $ Microsoft Microsoft W-none- Intel Pentiu 2 N 4 7/30/97 7/1/97 IntergraphInterServe US $ Microsoft Microsoft W-none- Intel Pentiu 2 N 2 3/5/97 3/31/97 IntergraphInterServe US $ Microsoft Microsoft W-none- Intel Pentiu 1 N 1 3/5/97 3/31/97 Itautec InfoSERVE Brazil $ Microsoft Microsoft W-none- Intel Pentiu 4 N 6 6/30/97 3/1/97 Motorola Motorola S US $ Oracle7 7.IBM AIX 4.1Tuxedo 4.2Motorola P 8 N 5 4/24/96 7/30/96 NEC Express US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 4 N 4 6/23/00 9/29/00 NEC Express US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 4 N 4 6/19/00 9/29/00 NEC Express US $ Oracle8i EMicrosoft WBEA TuxedIntel Pentiu 32 Y 8 6/29/99 11/30/99 NEC Express US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 4 N 5 8/5/98 12/29/98 63

64 More Vendors TPC- C Benchmarks NEC UP 4800/ E+08 Yen Informix ONEC UP-UTuxedo (R4MIPS R440 2 N 3 1/27/95 4/21/95 NEC UP 4800/ E+08 Yen Informix ONEC UP-UTuxedo (R4MIPS R440 6 N 7 1/27/95 6/21/95 SGI Origin US $ Sybase AdSGI IRIX 6.BEA TuxedMIPS R100 2 N 2 4/23/98 7/31/98 SGI Origin US $ Informix OSGI IRIX 6.BEA TuxedMIPS R N 26 4/30/97 10/29/97 Sequent NUMACen US $ Oracle OraDYNIX/ptx BEA TuxedIntel Xeon 64 N 8 12/18/98 6/15/99 Sequent NUMACen US $ Oracle OraSequent DYBEA TuxedIntel Xeon 32 N 4 10/13/98 3/15/99 Sun Enterprise US $ Fujitsu/ICLSun SolariBEA Tuxed Ultra SPA 4 N 7 2/2/00 7/28/00 Sun Enterprise US $ Sybase AdSun SolariBEA Tuxed Ultra SPA 14 N 15 11/23/99 3/30/00 Sun Enterprise US $ Oracle8i ESun SolarisBEA Tuxed Ultra SPA 96 Y 40 9/24/99 1/31/00 Sun Starfire Ent US $ Oracle 8i vsun SolariBEA Tuxed Ultra SPA 64 N 32 3/24/99 8/22/99 Sybase Digital Alph US $ Sybase SQDigital UNIXITI Tuxedo Digital DEC 10 N 10 12/21/95 3/1/96 Tandem Integrity NR US $ Informix OSGI IRIX 6.IMC TuxedMIPS R N 7 11/10/95 2/28/96 Unisys e-@ction E US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 8 N 4 2/16/00 7/13/00 Unisys Unisys e-@ US $ Oracle8i EUnixWare 7Tuxedo 6.4Intel Pentiu 8 N 5 12/13/99 6/1/00 Unisys Aquanta ES US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 4 N 3 10/27/99 12/31/99 Unisys Aquanta ES US $ Microsoft Microsoft WMicrosoft CIntel Pentiu 8 N 5 10/12/99 12/31/99 Unisys Aquanta ES US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 2 N 1 9/7/99 9/30/99 Unisys Aquanta ES US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 8 N 4 6/22/99 9/30/99 Unisys Aquanta ES US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 5/11/99 5/11/99 Unisys Aquanta ES US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 5/7/99 5/7/99 Unisys Aquanta ES US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 4/1/99 3/31/99 Unisys Aquanta ES US $ Microsoft Microsoft WBEA TuxedIntel Pentiu 4 N 3 3/17/99 3/17/99 Unisys Aquanta Q US $ Microsoft Microsoft WTuxedo 6.3Intel Xeon 4 4 N 3 1/5/99 
1/5/99 Unisys Aquanta Q US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 4 N 3 12/4/98 12/29/98 Unisys Aquanta Q US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 4 N 3 11/11/98 12/29/98 Unisys Aquanta Q US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 4 N NEC UP 4800/6 3 Unisys Aquanta Q US $ Microsoft Microsoft WTuxedo 6.3Intel Pentiu 4 N NEC UP 4800/6 3 64

65 Experimental Platform: NCR Teradata DBC/1012
[Diagram labels: Work Station, Client Computer (client resident software), Channels, LAN, COP, IFP, PE, YNET, AMP (with disks), AP (with UNIX disk and LAN)]
Legend:
IFP - Interface Processor
COP - Communications Processor
Ynet - Interprocessor Bus
AMP - Access Module Processor
AP - Application Processor
PE - Parsing Engine

66 Sample Relevance Ranking Query
    SELECT c.qryid, b.docid,
           SUM(((1+LOG(a.termcnt))/((b.logavgtf)* ( (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf))))
    FROM trec6$d5$idx a, trec6$d5$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
    WHERE a.docid = b.docid AND
          c.qryid = d.qryid AND
          a.term = c.term AND
          c.qryid = 301
    GROUP BY c.qryid, b.docid
    UNION
    SELECT c.qryid, b.docid,
           SUM(((1+LOG(a.termcnt))/((b.logavgtf)* ( (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf))))
    FROM trec6$d4$idx a, trec6$d4$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
    WHERE a.docid = b.docid AND
          c.qryid = d.qryid AND
          a.term = c.term AND
          c.qryid = 301
    GROUP BY c.qryid, b.docid
    ORDER BY 3 DESC;

67 Breakdown of DBC/1012 Processing Steps
Step 1 - A read lock is placed on tables trec6$d5$idx, trec6$d5$docavgtf, trec6$d4$idx, and trec6$d4$docavgtf.
Step 2 - A single processor is used to select and join rows via a merge join from trec6$q6$qrynidf and trec6$q6$qryavgtf where the value of qryid = 301. The results are stored on spool 3.
Step 3 - An all processor join is used to select and join rows via a row hash match scan from trec6$d5$idx and spool 3 where the values of the term attribute match. The results are sorted and stored on spool 4.
Step 4 - An all processor join is used to select and join rows via a row hash match scan from trec6$d5$docavgtf and spool 4 where the values of the docid attribute match. The results are stored on spool 2, which is built locally on each processor.
Step 5 - The SUM value for the aggregate function is calculated from the data on spool 2 and the results are stored on spool 5.
(The next two steps, 6a and 6b, are executed in parallel.)
Step 6a - The data from spool 5 is retrieved and distributed via a hash code to spool 1, which encompasses all processors.
Step 6b - An all processor join is used to select and join rows via a row hash match scan from trec6$d4$idx and spool 3 where the values of the term attribute match. The results are sorted and stored on spool 9.
Step 7 - An all processor join is used to select and join rows via a row hash match scan from trec6$d4$docavgtf and spool 9 where the values of the docid attribute match. The results are stored on spool 7, which is built locally on each processor.
Step 8 - The SUM value for the aggregate function is calculated from the data on spool 7 and the results are stored on spool 10.
Step 9 - The data from spool 10 is retrieved and distributed via a hash code to spool 1, which encompasses all processors. A sort is then done to remove duplicates from data on spool 1.
Step 10 - An END TRANSACTION step is sent to all processors involved and the contents of spool 1 are sent back to the user.

68 Parallel Performance
Parallel Efficiency measures how efficiently the processors are distributing the workload:
$\text{Parallel Efficiency (PE)} = \frac{avg\_cpu}{max\_cpu}$
where max_cpu = maximum CPU time across all processors and avg_cpu = average CPU time across all processors.
[Table: the columns for average CPU time per processor, maximum CPU time per processor, and data storage imbalance across processors have lost their values; only Parallel Efficiency survives]
Configuration                                               Parallel Efficiency
4 processors, disk 2 only, primary index on term            84.9%
4 processors, disk 2 and [...], primary index on term       86.9%
24 processors, disk 2 only, primary index on term           40.6%
24 processors, disk 2 and [...], primary index on term      44.2%
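The parallel efficiency definition translates into a one-line helper. This is a sketch with a hypothetical function name; the input is the list of per-processor CPU times for one query.

```python
def parallel_efficiency(cpu_times):
    """Average CPU time across processors divided by the maximum CPU time."""
    return (sum(cpu_times) / len(cpu_times)) / max(cpu_times)
```

A perfectly balanced run yields 1.0, while a run in which one processor does far more work than the rest approaches 1/p for p processors.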

69 Hashing Algorithm
4 processors (starting letter of term -> processor):
A, E, I, M, Q, U, Y   Proc #1
B, F, J, N, R, V, Z   Proc #2
C, G, K, O, S, W      Proc #3
D, H, L, P, T, X      Proc #4
24 processors (starting letter of term -> processor):
A    Proc #1
B    Proc #2
C    Proc #
...
V    Proc #22
WX   Proc #23
YZ   Proc #24
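The two letter-to-processor mappings can be sketched as follows to see how uneven the resulting load is. Two assumptions are made: terms start with an ASCII letter, and in the 24-processor case the letters between C and V map to consecutive processors, which the slide elides. The function names are hypothetical.

```python
from collections import Counter


def processor_4(term):
    """A, E, I, ... -> 1; B, F, J, ... -> 2; and so on (letter index mod 4)."""
    return (ord(term[0].lower()) - ord("a")) % 4 + 1


def processor_24(term):
    """A..V -> 1..22 (assumed consecutive), W and X -> 23, Y and Z -> 24."""
    index = ord(term[0].lower()) - ord("a")
    if index <= 21:
        return index + 1
    return 23 if index <= 23 else 24


def load_per_processor(terms, assign):
    """Count how many terms each processor receives under a mapping."""
    return Counter(assign(t) for t in terms)
```

Because far more terms start with some letters than with others, the 24-processor mapping leaves some processors with much more data than the average.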

70 Term Distribution
[Figure: distribution of terms based on starting letter; y-axis: Number of Terms; x-axis: Starting Letter, a through z]

71 Parallel Performance
Parallel Efficiency before and after balancing data storage across 24 processors
[Table: the columns for average CPU time per processor, maximum CPU time per processor, and data storage imbalance across processors have lost their values; only Parallel Efficiency survives]
Configuration                                                      Parallel Efficiency
24 processors, disk 2 only, primary index on term                  40.6%
24 processors, disk 2 and [...], primary index on term             44.2%
24 processors, disk 2 only, primary index on docid and term        91.6%
24 processors, disk 2 and [...], primary index on docid and term   93.8%

72 Scalable Information Systems: Characteristics Ingest data from multiple sources Duplicate document detection Process multiple type data sources Structured & unstructured data integration (SIRE) Use scalable (parallel) technology systems Parallel SIRE Integrate retrieved data to yield answers IIT Mediator 72

73 Current Enterprise Portals 73

74 Next Generation Search Engines 74


79 IIT Production Mediator Logs
271 queries, 142 with user feedback:
Satisfied 13%, Ok 13%, Dissatisfied 13%, Happy 38%, Unhappy 23%

80 Technology Transfer
Industrial:
- America Online
- BIT Systems
- Harris Corporation
- IITRI
- NCR
- Unnamed Others (Proprietary)
- Assorted dot-dead companies (hopefully not due to our technology!!!)
Government:
- National Institutes of Health
- Additional Others

81 Information Retrieval Laboratory Faculty Members: O. Frieder D. Grossman N. Goharian X. Li P. Wan Senior Affiliates A. Chowdhury AOL D. Holmes NCR M. C. McCabe - US Gov. Students: Steven Beitzel Rebecca Cathey Ankit Jain Eric Jensen Vincent Nguyen Angelo Pilotto Michael Saelee Chih-Wei Yi Wang Yu 81

82 References
- D. A. Grossman, O. Frieder, D. O. Holmes, and D. C. Roberts, "Integrating Structured Data and Text: A Relational Approach," Journal of the American Society of Information Science, February.
- C. Lundquist, O. Frieder, D. Holmes, and D. Grossman, "A Parallel Relational Database Management System Approach to Relevance Feedback in Information Retrieval," Journal of the American Society of Information Science, April.
- O. Frieder, D. Grossman, A. Chowdhury, and G. Frieder, "Efficiency Considerations in Very Large Information Retrieval Servers," Journal of Digital Information (British Computer Society), 1(5), April. Invited Paper.
- A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe, "Analyses of Multiple-Evidence Combinations for Retrieval Strategies," ACM Twentieth SIGIR, New Orleans, Louisiana, September.
- D. Grossman, S. Beitzel, E. Jensen, and O. Frieder, "IIT Intranet Mediator: Bringing Data Together on a Corporate Intranet," IEEE IT PRO, January/February.
- A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe, "Collection Statistics for Fast Duplicate Document Detection," ACM Transactions on Information Systems (TOIS), April.
