PROBLEM 4: TERM WEIGHTING SCHEMES IN INFORMATION RETRIEVAL

MARY PAT CAMPBELL†, GRACE E. CHO‡, SUSAN NELSON§, CHRIS ORUM¶, JANELLE V. REYNOLDS-FLEMING‖, AND ILYA ZAVORINE

Problem Presenter: Laura Mather, National Security Agency

Abstract. Information retrieval is the process of evaluating a user's query, or information need, against a set of documents (books, journal articles, web pages, etc.) to determine which of the documents satisfies the query. With the advent of the World Wide Web, there is suddenly a need to query enormous sets of documents both efficiently and accurately. In the vector space model of information retrieval, documents are represented by sparse vectors, each component of which corresponds to a term, usually a word, in the document set. In the simplest case, the components of these vectors are the raw frequency counts of each term in each document. More sophisticated term weighting schemes are used to improve information retrieval accuracy. We study a specific term weighting scheme (log-entropy weighting) to determine its effectiveness on different aspects of retrieval. New approaches to term weighting are also examined. In addition, we describe our workshop experience and some of our technical work.

1. Introduction and Motivation. Browsing through the pages on the World Wide Web has become one of the fastest growing hobbies in the world. The World Wide Web is also a great source of information when conducting personal research. It is estimated that there are currently 50 million web pages on the internet. To that end, search engines for the World Wide Web and other information sources have become a tool to help the querent in the search for information. Search engines are based on certain information retrieval models. Some examples are the boolean model, the probabilistic model, and the vector space model. There are advantages and disadvantages to working with any of these models. Their main purpose, however, is to retrieve relevant documents specific to a search. We work with the vector space model. This model uses a storage matrix whose columns represent the documents in a collection and whose rows represent the term frequencies among the documents. We are interested in ad-hoc querying only, i.e. when dynamic queries are compared against a static document database in order to find documents closest to the query. When conducting a query, one method is to search through the storage matrix and match the query terms with row terms, producing the document closest to the query. Many times, ad-hoc querying requires more sophisticated measures of searching, such as term weighting schemes, which attempt to prioritize documents according to relevant terms. We concentrate on a term weighting scheme called log-entropy. Our goal is to answer the question "Is log-entropy weighting optimal for querying?" As part of the answer to this question, we need to develop other term weighting schemes to compare with log-entropy weighting. The group as a whole created the term weighting schemes used in this paper. When certain aspects of the problem were handled individually, those involved are credited individually.

2. Definitions and Mathematical Framework. A document database is a collection of documents which are relevant to certain subjects. Documents consist of words, or terms, which to a greater or lesser extent identify these subjects. For instance, the word "and" may appear frequently in all documents but it says almost nothing about the contents of these documents.
Users are interested in certain subjects and would like to search for and find those, and only those, documents relevant to these subjects. In order to find these documents, the user generates concise queries which consist of words relevant to these subjects. Search engines then use these queries to find pertinent documents. For the purposes of this paper, we define a search engine by three components. First, the static database of documents needs to be characterized using a representation model. Second, a query processor is needed to convert incoming (dynamic) queries into a format compatible with the representation model. Finally, a relevance measure is used to compare converted queries against documents.

(This work was supported in part by the NSA and NSF. Author affiliations: † New York University; ‡ North Carolina State University; § Georgia Institute of Technology; ¶ University of Utah; ‖ Texas A&M University; and University of Maryland at College Park.)

In this paper we develop more efficient representation models for the collection of documents while keeping the query processor and the relevance measure fixed. We use the vector space model for the database. A good introduction to the model is given in [4]. This model is used because it yields relatively simple and inexpensive computations while being easy to understand. Another attractive feature of the vector space model is that it is amenable to the techniques of linear algebra.

We consider a finite collection of documents d_1, ..., d_ndocs, where ndocs is the cardinality of this collection. The total number of distinct terms appearing in all the documents is nterms. The database is represented by an nterms x ndocs matrix A = [a_ij], with nterms >> ndocs, where a_ij is some function of the f_ij. The quantity f_ij is simply the number of times the term i appears in document j. In the simplest case a_ij = f_ij. In this case, the matrix A is called the raw frequency matrix. Each document is represented by a vector of size nterms. A query is also represented by a vector of size nterms whose entries are one or zero, representing the presence or absence of a search term. Information is retrieved by comparing query vectors with document vectors and determining which documents are "near" the queries in some sense. Example 2.1 shows a simple database and a query, as well as the corresponding raw frequency matrix. Notice that the word which appears most often, "the", is least informative.

Example 2.1:
d_1 = "the first short document"
d_2 = "another one"
d_3 = "the longest document in the database"
query = "document"
nterms = 9, ndocs = 3

                 d_1   d_2   d_3   query
    another       0     1     0      0
    database      0     0     1      0
    document      1     0     1      1
    first         1     0     0      0
    in            0     0     1      0
    longest       0     0     1      0
    one           0     1     0      0
    short         1     0     0      0
    the           1     0     2      0

In the above example, the raw frequency matrix is represented by the 9 rows and the first 3 columns of the matrix above. Notice that all entries are integers. The vector space model, however, allows for scaling these to non-integer values. This may be accomplished through various weighting schemes. A discussion of some weighting schemes and their uses can be found in [1]. One such weighting scheme that has received attention is log-entropy, discussed in [2], where it is stated to be the best of several different weighting schemes. Before discussing entropy and log-entropy, we fix some notation. Let p_ij and q_ij be the word counts normalized to sum to one over the rows and columns respectively:

    p_{ij} = \frac{f_{ij}}{\sum_{j=1}^{ndocs} f_{ij}}, \qquad q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{nterms} f_{ij}}.

A single occurrence of term i selected uniformly at random from all appearances of term i in the document set has probability p_ij of coming from the j-th document. A single word selected uniformly at random from the j-th document has probability q_ij of being the i-th term.
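To make the example concrete, the following sketch builds the raw frequency matrix of Example 2.1 and the normalized counts p_ij and q_ij. It is written in modern MATLAB/Octave syntax (the paper's own experiments used MATLAB 4.2c and a C word counter), and the variable names F, P, Q and the query vector q are ours, not the paper's.

```matlab
% Raw frequency matrix F for Example 2.1: rows are terms (alphabetical),
% columns are the documents d_1, d_2, d_3.
F = [0 1 0;   % another
     0 0 1;   % database
     1 0 1;   % document
     1 0 0;   % first
     0 0 1;   % in
     0 0 1;   % longest
     0 1 0;   % one
     1 0 0;   % short
     1 0 2];  % the
[nterms, ndocs] = size(F);

% Binary query vector for the query "document" (term 3).
q = zeros(nterms, 1);  q(3) = 1;

% p_ij: each row of F normalized to sum to one (where a term's occurrences live).
P = F ./ repmat(sum(F, 2), 1, ndocs);
% q_ij: each column of F normalized to sum to one (term distribution within a document).
Q = F ./ repmat(sum(F, 1), nterms, 1);
```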

The log-entropy weighting scheme operates row-wise and replaces each number f_ij with the positive real number

    \log(1 + f_{ij}) \left( 1 - \frac{-\sum_{j=1}^{ndocs} p_{ij} \log p_{ij}}{\log(ndocs)} \right).

Here the row entropy is the quantity -\sum_{j=1}^{ndocs} p_{ij} \log p_{ij}. It measures uncertainty of location (in terms of documents) when a single occurrence of term i is selected at random. Its minimum value is zero, and this occurs when term i appears in at most one document. Its largest possible value is log(ndocs), occurring when term i appears the same number of times in all ndocs documents. Since the maximum entropy is an increasing function of ndocs, this functional dependence is eliminated by using the normalized entropy

    \frac{-\sum_{j=1}^{ndocs} p_{ij} \log p_{ij}}{\log(ndocs)},

which takes values between zero and one. In log-entropy weighting, a term whose appearance tends to be equally likely among the documents is given low weight across a row, and a term whose appearance is concentrated in a few documents is given higher weight in those documents where it appears. The effect of the expression log(1 + f_ij) is to dampen large variations in the raw term frequency. Log-entropy is a weighting scheme that operates row-by-row, with no interaction between terms.

We also consider a column weighting scheme that uses the entropy calculated from the column probabilities q_ij. We have a heuristic that documents with low column entropy (high repetition of a few terms) are less informative than documents with high column entropy (documents with a large variety of terms), as stated below. This rests on the idea that the information conveyed in a document increases as the uncertainty of the content increases (as measured by the uniformity of the probabilities q_ij). This column weighting scheme operates column-by-column, replacing each f_ij by the positive number

    f_{ij} \cdot \frac{-\sum_{i=1}^{nterms} q_{ij} \log q_{ij}}{\log(nterms)}.
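A small sketch of the two weightings just defined, applied to a raw frequency matrix F such as the one from Example 2.1. The handling of 0 log 0 = 0 is the usual convention; this is a reading aid under our own variable names, not the authors' code.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
[nterms, ndocs] = size(F);

% Row probabilities p_ij and row entropies, with the convention 0*log(0) = 0.
P = F ./ repmat(sum(F, 2), 1, ndocs);
PlogP = P .* log(P);  PlogP(P == 0) = 0;
Hrow = -sum(PlogP, 2);                                  % nterms-by-1 row entropies

% Log-entropy weighting: log(1 + f_ij) * (1 - Hrow_i / log(ndocs)).
Wlogent = log(1 + F) .* repmat(1 - Hrow / log(ndocs), 1, ndocs);

% Column-entropy weighting: f_ij * (normalized column entropy of document j).
Q = F ./ repmat(sum(F, 1), nterms, 1);
QlogQ = Q .* log(Q);  QlogQ(Q == 0) = 0;
Hcol = -sum(QlogQ, 1);                                  % 1-by-ndocs column entropies
Wcolent = F .* repmat(Hcol / log(nterms), nterms, 1);
```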
3. Model Development. In order to test the log-entropy weighting scheme, we developed several alternative weighting schemes. We discussed the assumptions that motivate weighting schemes and how they apply to our problem. We organized our weighting ideas into categories for clarity and systematic testing and, finally, developed a mathematical model for each idea. The assumptions can be divided into two categories: Standard Assumptions, which are well established by the information retrieval community, and Additional Assumptions, which are original ideas that motivated some of the alternative weighting schemes.

3.1. Standard Assumptions.
1. Most terms in a document do not contribute to the content.
2. "Relevant" terms are those that occur with moderate frequency.
3. "Irrelevant" terms are those that occur either the most or the least frequently.
4. If no other term but the query term appears in the document, then the document is considered irrelevant to the querent.
5. The classic definitions of precision and recall will be used to determine the efficiency of weighting schemes.
6. The inner product will be used as the measure of similarity.

3.2. Additional Assumptions.
1. The length of a document contributes to the relevance of the subject matter. For example, the longer a document is, the more relevant it is to any subject because it contains more words, whereas the shorter the document, the less relevant it is to any subject.
2. Medium document length is most useful to the searcher, in that the searcher is looking for the most relevant documents that are the quickest to read and to understand. This idea of medium document length can be represented by a unimodal function of length (one possible form is sketched after this list).
3. Important "content" material appears at the beginning of a document.
4. Documents are considered to be more useful to the reader when the term frequencies in a document are more evenly distributed.
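Additional Assumption 2 calls for a unimodal weight on document length, but the paper does not commit to a particular function. Purely as an illustration, one possible choice is a Gaussian bump in log-length centered on an assumed "medium" length L0; both L0 and the width sigma are hypothetical tuning parameters, not values from the paper.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
% One possible unimodal document-length weight (hypothetical, not from the paper):
% a Gaussian bump in log-length, centered on an assumed preferred length L0.
len   = sum(F, 1);                                       % document lengths (word counts)
L0    = median(len);                                     % assumed "medium" length
sigma = 1;                                               % assumed width in log-length units
wlen  = exp(-((log(len) - log(L0)).^2) / (2 * sigma^2)); % equals 1 at len == L0
Wlen  = F .* repmat(wlen, size(F, 1), 1);                % applied as a column scaling
```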

Development of Weighting Schemes. Our ultimate goal is to develop an information retrieval system based on the vector space model that would allow us to find the most relevant information pertinent to our search. To achieve this goal we seek representations of the knowledge base which are more efficient than the standard "raw frequency" representation. In response to our goal, we divided our weighting schemes into three categories: local weightings, column normalizations, and row normalizations. Local weighting centers on the manipulation of the f_ij term in some manner. Column and row normalizations are schemes that use an entire column or row. Normalization of the j-th column involves elementwise multiplication of the column vector d_j by a scalar-valued function of f_{1,j}, ..., f_{nterms,j}, that is, just the entries of the j-th column. Normalization of the i-th row involves elementwise multiplication of that row vector by a scalar-valued function of f_{i,1}, ..., f_{i,ndocs}. We consistently apply the same ordering to the weighting: first, column normalization is performed simultaneously on all columns, and then the result of this operation is used as input to the local weighting scheme and the row normalization. The results of local weighting and row normalization are multiplied to obtain the final matrix.

Column Normalizations. The column normalizations use all the information in the columns to prioritize the terms in the documents in some way; a code sketch of these normalizations follows this list.
1. Identity: The identity is the operation that does nothing to the word count data; the column vector does not change.
2. l_inf: The l_inf normalization divides each column entry by the maximum of all the column entries. This brings the raw frequency counts of each term to lie between zero and one. The effect is that the column vectors are scaled to have the same l_inf norm. This causes all documents to be pulled "closer" in the term space. The l_inf normalization gives the ratio of the frequency of a given word to that of the word of highest frequency.
3. Entropy: One of our assumptions is that documents with low term entropy (high repetition of few words) are less useful to the searcher than documents with high term entropy (all words equally likely). This rests on the idea that the information conveyed in a document increases as the uncertainty of the content increases. When normalizing a column by entropy we therefore multiply each raw frequency count in the column vector by (-\sum_{i=1}^{nterms} q_{ij} \log q_{ij}) / \log(nterms).
4. Median centering: When searching on documents of different lengths, very long or very short documents are problematic, since very long ones tend to be judged to be more relevant than they actually are and very short ones tend to be judged to be less relevant than they actually are, as per our assumptions. The intention of median centering is to correct this problem and yet preserve some variation in the length of the documents. To this end we compute the l_2 norm of each column and find the median of these values and their deviation from the median. Let med_d be the median and dev_{d_i} be the deviation of the l_2 norm of document i from med_d. We then multiply each column entry by med_d - sgn(dev_{d_i}) log(log|dev_{d_i}| + 1), where a scaling parameter controls the size of the logarithmic correction.
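The three non-trivial column normalizations can be sketched as follows (modern MATLAB/Octave, our own variable names). In the median-centering step, the exact placement of the scaling parameter alpha and the guard against very small deviations are our assumptions, not details given in the paper.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
[nterms, ndocs] = size(F);

% 2. l_inf normalization: divide each column by its largest entry.
Clinf = F ./ repmat(max(F, [], 1), nterms, 1);

% 3. Entropy normalization: multiply each column by its normalized entropy.
Q = F ./ repmat(sum(F, 1), nterms, 1);
QlogQ = Q .* log(Q);  QlogQ(Q == 0) = 0;
Hcol = -sum(QlogQ, 1) / log(nterms);                   % normalized entropies in [0, 1]
Cent = F .* repmat(Hcol, nterms, 1);

% 4. Median centering: scale each column by med_d - sgn(dev)*alpha*log(log|dev| + 1),
%    where dev is the deviation of the column's l2 norm from the median norm.
alpha   = 1;                                           % scaling parameter (placement assumed)
norms   = sqrt(sum(F .^ 2, 1));                        % l2 norm of each column
med_d   = median(norms);
dev     = norms - med_d;
dev_mag = max(abs(dev), 1);                            % our guard so the double log stays real
factor  = med_d - sign(dev) .* (alpha * log(log(dev_mag) + 1));
Ccent   = F .* repmat(factor, nterms, 1);
```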
Local Weightings. The local weightings refer to information specific to an individual term in one particular document. This information will be used after column normalization is done; a code sketch follows this list.
1. Raw frequency (Identity): The raw frequency f_ij is the number of times that term i appears in document j. It represents the pure information collected from the document.
2. Log-frequency: The log-frequency scales the raw frequency using the formula log(f_ij + 1). The scaling is done as a means for comparing both large and small values of raw frequency.
3. x log(1/x): One of our assumptions is that relevant words that actually suggest the content of documents occur with moderate frequency, as explained in [4]. In support of this idea, we have developed a non-symmetric, quadratic-like, concave-down function with a long right tail, x log(1/x), where x = f_ij / max_i(f_ij) is the relative frequency. The graph of x log(1/x) is shown in Fig. 3.1. The idea behind the function is that as the frequency of the term increases, the value of the term with respect to content increases until a certain maximum. Once the maximum has occurred, the value of the term decreases with frequency, but at a much slower rate.
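A minimal sketch of the two non-identity local weightings; x log(1/x) is taken to be 0 at x = 0, its limiting value. The input matrix C would normally be the output of a column normalization; here we start from the raw counts for illustration.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
C = F;                                   % input matrix (possibly already column-normalized)
[nterms, ndocs] = size(C);

% 2. Log-frequency.
Wlog = log(C + 1);

% 3. x*log(1/x), with x the entry relative to the largest entry in its column.
X = C ./ repmat(max(C, [], 1), nterms, 1);
Wxlx = X .* log(1 ./ X);
Wxlx(X == 0) = 0;                        % 0*log(1/0) taken as its limit, 0
```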

Row Normalizations. The row normalizations use all the information in the rows to determine which terms have priority in the documents.
1. Identity: The identity leaves the row vectors unchanged, i.e. we make no attempt to correct for words that occur equally in almost all documents, nor to emphasize words that are concentrated in just a few documents.
2. l_1: The l_1 normalization divides each row entry by the total sum of the entries in the row. This makes the value of each term lie between zero and one. For an average row, it is unlikely that any entry will be close to one, so it is the rows that are concentrated in one document that end up getting the most weight with this normalization.
3. Entropy: In documents (column vectors), high normalized entropy is considered good and low normalized entropy is considered bad. Here the weighting scheme operates row-wise and multiplies each locally weighted entry in row i by the positive real number

    1 - \frac{-\sum_{j=1}^{ndocs} p_{ij} \log p_{ij}}{\log(ndocs)}.

4. Testing Procedure. A database of 134 document files and 25 query files was created using a C program written by Ilya. The program goes through the documents and queries one by one and counts frequencies for all distinct words. It creates an alphabetical list of terms and a raw frequency matrix in the MATLAB (Version 4.2c) sparse matrix format. Of the 134-document database, 109 files were downloaded from the World Wide Web using standard search engines and clustered around 5 subjects that may or may not have some overlap. The five subjects were "film noir", "sports", "broken ankle", "salmon", and "dolphins." The other 25 document files were artificially generated in order to test certain normalization and weighting schemes. For example, in order to measure the performance of column entropy normalization, several documents were created which contain nothing but certain query words repeated many times. As pointed out in the assumptions section, these files have no essential information in them and should not be retrieved during the search. In conjunction with the document database, a relevance matrix was created that rated the relevance of each document to a particular search. Mary Pat and Grace created a program in MATLAB (Version 4.2c) to apply certain combinations of weighting schemes to the sparse document database. Using the order in which we conduct our weighting schemes, there are a total of 36 different combinations. Since we are using precision and recall as our basis for efficiency, a precision/recall graph is drawn indicating the average for each weighting scheme. The combined document/query database produced a sparse matrix. It took approximately 70 seconds to generate this matrix on a Sun Sparc 20. It took approximately 300 seconds on average to perform the experiments with one combination of a weighting and normalizations on a Sun Sparc 20.

5. Performance of the Information-Retrieval System. We tested all 36 combinations of column/row normalizations and local weighting to compare the performance.
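Each combination is assembled in the order described earlier: column normalization first, then local weighting and row normalization applied to its output and multiplied together, and documents are then ranked by the inner product with the query vector. The sketch below spells this out for one combination (l_inf column normalization, x log(1/x) local weighting, no row normalization); it is our own illustrative code, not the workshop's MATLAB program, and the variable names are assumptions.

```matlab
F = [0 1 0; 0 0 1; 1 0 1; 1 0 0; 0 0 1; 0 0 1; 0 1 0; 1 0 0; 1 0 2];  % Example 2.1
[nterms, ndocs] = size(F);
q = zeros(nterms, 1);  q(3) = 1;         % binary query vector for "document"

% Step 1: column normalization (here l_inf).
C = F ./ repmat(max(F, [], 1), nterms, 1);

% Step 2: local weighting applied to the column-normalized matrix (here x*log(1/x)).
X = C ./ repmat(max(C, [], 1), nterms, 1);
L = X .* log(1 ./ X);  L(X == 0) = 0;

% Step 3: row normalization factors (here the identity, i.e. all ones).
R = ones(nterms, ndocs);

% Final weighted matrix and inner-product relevance scores for the query.
W = L .* R;
scores = W' * q;                                     % one score per document
[sortedScores, order] = sort(scores, 'descend');     % documents ranked best first
```

Any of the other combinations is obtained by swapping the three steps for the alternatives listed in the previous section.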
There are two parameters used to measure performance: recall and precision. Recall is the proportion of all relevant documents in the collection that are retrieved by the system, and precision is the proportion of relevant documents in the set returned to the user:

    Recall    = (number of retrieved relevant documents) / (number of relevant documents),
    Precision = (number of retrieved relevant documents) / (number of retrieved documents).
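A minimal sketch of how these two measures are computed from a ranked result list and a 0/1 relevance vector; the example data below are made up for illustration (in the experiments, relevance judgments come from the relevance matrix described in Section 4).

```matlab
% A tiny made-up example: 5 documents ranked by score; documents 2 and 4 are relevant.
order = [2 5 4 1 3];   % document indices, best first (e.g., from the previous sketch)
rel   = [0 1 0 1 0];   % rel(d) = 1 if document d is relevant to the query
k     = 3;             % number of documents returned to the user

retrieved         = order(1:k);              % top-ranked documents: 2, 5, 4
num_rel_retrieved = sum(rel(retrieved));     % 2 of them are relevant
recall    = num_rel_retrieved / sum(rel)     % = 2/2 = 1.0
precision = num_rel_retrieved / k            % = 2/3
```

The example quoted next in the text (100 returned documents of which 40 are relevant, giving precision 0.4) follows from the same formula.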

For example, when a user gets 100 documents returned by a system and 40 of them are relevant to the query, then the precision is 0.4. For each precision/recall graph, we calculated several levels of recall, and for each recall level, precision is calculated and averaged across the queries.

Fig. 5.1 is the recall-precision graph for the combination none-none-none. In this scheme, we use the raw frequency matrix without any column/row normalization or local weighting. This curve shows the typical behavior of recall-precision: when recall is small, precision increases as recall increases, and once precision reaches its maximum it starts decreasing slowly. This behavior is typical because, as there is a fixed number of relevant documents in the collection, when a small number of documents are returned to a user, the chance of more relevant documents being returned increases as the number of returned documents increases. However, once all relevant documents have been retrieved, requesting more documents makes the precision go down. Fig. 5.2 is the recall-precision graph for our baseline none-log-entropy. Here, the scheme uses log(f_ij + 1) for local weighting, entropy for row normalization, and no column normalization. The basic shape of the curve is the same as Fig. 5.1, except precision is slightly higher everywhere. The last graph, Fig. 5.3, is a little different. It is the recall-precision graph for the combination none-x log(1/x)-none: x log(1/x) is used for local weighting, and no column/row normalization. First, precision increases rapidly at the beginning. Secondly, the maximum precision is about 30% higher than the none-none-none combination, and about 20% higher than our baseline none-log-entropy.

We analyzed the performance of all 36 different weighting schemes. As the measure of performance, the maximum precision of each weighting scheme is used. The summary of our analysis is shown in Table 5.1.

    Column normalization (relative to none):
        entropy      centering    l_inf
        +1%          +8%          +10%      (with local weighting none or log(f+1))
        0%           0%           0%        (with local weighting x log(1/x))

    Local weighting (relative to none, with any column/row normalization):
        log(f+1)     x log(1/x)
        +1-8%        +24-29%

    Row normalization (relative to none):
        entropy      l_1
        +0-3%        +2-6%                  (with local weighting none or log(f+1))
        0 to -1%     -3 to -4%              (with local weighting x log(1/x))

    Table 5.1. Performance of Weighting Schemes

First, for column normalization, when none or log(f_ij + 1) is used as local weighting, median centering and l_inf normalization performed about 8-10% better than when we use no column normalization, whereas entropy does not seem to make a big difference. However, with x log(1/x) local weighting, column normalization does not affect the performance. Secondly, for local weighting, our new weighting x log(1/x) made a visible improvement, about 24-29%; log(f_ij + 1) improved the performance only modestly. Finally, for row normalization, when none or log(f_ij + 1) is used as local weighting, l_1 normalization performed best. On the other hand, when x log(1/x) is used, the performance was best when no row normalization was done. This suggests that we should not use any row normalization when x log(1/x) is used as local weighting.

Among all 36 weighting schemes we tested, the best performance was achieved by the combinations c-x log(1/x)-none, where c is any column normalization. In that case, precision was about 70%. The worst combination in performance was none-none-none, i.e., when no normalization or weighting is done. Precision in this case was about 43%. To summarize, our new local weighting x log(1/x) performed substantially better than log(f_ij + 1). When using x log(1/x), it is recommended that no row normalization be performed.
Column normalization does not improve or worsen the performance when x log(1/x) is used. Finally, the baseline none-log-entropy had a maximum precision of 49%, which is only slightly better than the performance of none-none-none and about 20% lower than c-x log(1/x)-none. Thus, the none-log-entropy weighting scheme does not seem to be optimal.

6. Workshop Experience. We spent approximately 7 full days working on our problem. The first two days were spent familiarizing ourselves with the basics of information retrieval and how it applied to our problem.

Some of our learning came from group sessions where Laura presented background and answered questions. Other learning came from independent research in the library and independent reading of papers suggested by Laura. In addition, we tried to outline specific goals for the remainder of the workshop. After the goals were decided, we worked both individually and in our group to address them. The group sessions served a two-fold purpose. One purpose was to generate new ideas, but another was to give members a chance to sound out their ideas and get feedback from the group. Some of these ideas had already been conceived and implemented. Others could not be further developed at the workshop for lack of time. We decided to focus on developing different variations of the representation model, partially because we believed that the remaining time would be sufficient for us to thoroughly test our ideas. The most time-consuming part of the testing phase was coding. It took about a day to fully debug and fine-tune the word counter code and about a day to get the MATLAB programs running.

7. Conclusions. We have shown that the particular weighting scheme known as log-entropy is not optimal for ad-hoc querying. As discussed in the previous section, the log-entropy weighting scheme performs at a precision rate of about 50%. A new weighting scheme consisting of c-x log(1/x)-none, where c is any column normalization, performs at a precision rate of about 70%. It would appear that the weighting scheme that has the greatest effect on performance is the local weighting.

There are three improvements that could be made in future work. First, a different similarity measure could be explored. As stated in the Assumptions Section 3.1, we employed the inner-product similarity measure. There are other types of similarity measures, as presented in [3], but each of these measures uses the inner product as a basis and creates a more complicated measure from there. Second, the assumption from Section 3.2 could be applied: that more important content material appears at the beginning of the document. In this case, the whole document would not have to be searched. The first two or three paragraphs could be searched, and this would reduce the size of our representation matrix. Finally, the query terms could be weighted. Using a more clearly defined search would limit the number of "useless" documents that were returned.

We pushed the limits of MATLAB while experimenting with our database. Considering the size of the documents, the computation times were rather slow. In order to test more realistic and therefore larger databases, it would be necessary to use a lower-level language like C or FORTRAN, or possibly to employ a parallel computer. Finally, our group still believes that column normalizations should have some effect on the search engine. We would like to explore alternate schemes to prove the validity of column normalizations.

Our group contained six people who each brought different skills to the problem-solving endeavor. We were able to work well with each other and respect each other's opinions. While this workshop was one week of intense problem solving, it is safe to say that we learned something about our problem and about each other.

REFERENCES

[1] C. Buckley and G. Salton, Term weighting approaches in automatic text retrieval, tech. rep., Cornell University, 1987.
[2] S. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments and Computers, 23 (1991), pp. 229-236.
[3] W. Jones and G. Furnas, Pictures of relevance: a geometric analysis of similarity measures, Journal of the American Society for Information Science, 38 (1987), pp. 420-442.
[4] G. Salton, A. Wong, and C. Yang, A vector space model for automatic indexing, Communications of the ACM, 18 (1975), pp. 613-620.

Fig. 3.1. The x log(1/x) local weighting scheme.

Fig. 5.1. No weighting scheme: none-none-none (precision vs. recall).

Fig. 5.2. Baseline weighting scheme: none-log-entropy (precision vs. recall).

Fig. 5.3. Best weighting combination: any column-x log(1/x)-none (precision vs. recall).


More information

A Parallel Intermediate Representation based on. Lambda Expressions. Timothy A. Budd. Oregon State University. Corvallis, Oregon.

A Parallel Intermediate Representation based on. Lambda Expressions. Timothy A. Budd. Oregon State University. Corvallis, Oregon. A Parallel Intermediate Representation based on Lambda Expressions Timothy A. Budd Department of Computer Science Oregon State University Corvallis, Oregon 97331 budd@cs.orst.edu September 20, 1994 Abstract

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a Multi-Layer Incremental Induction Xindong Wu and William H.W. Lo School of Computer Science and Software Ebgineering Monash University 900 Dandenong Road Melbourne, VIC 3145, Australia Email: xindong@computer.org

More information

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

TABLES AND HASHING. Chapter 13

TABLES AND HASHING. Chapter 13 Data Structures Dr Ahmed Rafat Abas Computer Science Dept, Faculty of Computer and Information, Zagazig University arabas@zu.edu.eg http://www.arsaliem.faculty.zu.edu.eg/ TABLES AND HASHING Chapter 13

More information

4. Write sets of directions for how to check for direct variation. How to check for direct variation by analyzing the graph :

4. Write sets of directions for how to check for direct variation. How to check for direct variation by analyzing the graph : Name Direct Variations There are many relationships that two variables can have. One of these relationships is called a direct variation. Use the description and example of direct variation to help you

More information

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of California, San Diego CA 92093{0114, USA Abstract. We

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

FSRM Feedback Algorithm based on Learning Theory

FSRM Feedback Algorithm based on Learning Theory Send Orders for Reprints to reprints@benthamscience.ae The Open Cybernetics & Systemics Journal, 2015, 9, 699-703 699 FSRM Feedback Algorithm based on Learning Theory Open Access Zhang Shui-Li *, Dong

More information

CHAPTER 2: SAMPLING AND DATA

CHAPTER 2: SAMPLING AND DATA CHAPTER 2: SAMPLING AND DATA This presentation is based on material and graphs from Open Stax and is copyrighted by Open Stax and Georgia Highlands College. OUTLINE 2.1 Stem-and-Leaf Graphs (Stemplots),

More information

At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs

At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs DATA PRESENTATION At the end of the chapter, you will learn to: Present data in textual form Construct different types of table and graphs Identify the characteristics of a good table and graph Identify

More information

2 The Service Provision Problem The formulation given here can also be found in Tomasgard et al. [6]. That paper also details the background of the mo

2 The Service Provision Problem The formulation given here can also be found in Tomasgard et al. [6]. That paper also details the background of the mo Two-Stage Service Provision by Branch and Bound Shane Dye Department ofmanagement University of Canterbury Christchurch, New Zealand s.dye@mang.canterbury.ac.nz Asgeir Tomasgard SINTEF, Trondheim, Norway

More information

2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an

2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an Competitive Neural Trees for Vector Quantization Sven Behnke and Nicolaos B. Karayiannis Department of Mathematics Department of Electrical and Computer Science and Computer Engineering Martin-Luther-University

More information