Markov Models for Clusters in Concordance Compression - Extended Abstract*


A. Bookstein, Center for Information and Language Studies, University of Chicago, Chicago, IL, a-bookstein@uchicago.edu
S. T. Klein, Dept. of Math. & CS, Bar Ilan University, Ramat-Gan, Israel, tomi@bimacs.cs.biu.ac.il
T. Raita, Comp. Sci. Dept., University of Turku, Turku, Finland, raita@euroni.cs.utu.fi

Abstract

An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. In this paper, the earlier models are extended to take the possibility of clustering into account. We suggest several models adapted to the concordances of large full-text information retrieval systems, which are generally subject to clustering.

1 Introduction and Background

The development of optical disk technology has made it possible to distribute large, full-text databases widely. But large as the capacity of CD-ROMs may be, it still does not match our ambitions for storing data [5]. For example, it is often overlooked that to be able to access and manipulate text, auxiliary data structures must also be created and stored, and these often occupy as much space as the original text itself. Thus, to distribute a functional, full-text information retrieval system, consideration must be given to how these data structures can be stored efficiently.

Most large information retrieval systems are based on inverted files. In this approach, query processing does not directly involve the original text files (in which keywords might be located using some pattern matching technique), but rather the auxiliary dictionary and concordance files. The dictionary is a list of all the different words appearing in the text and is usually ordered alphabetically.

* Two of the authors (A.B. and S.T.K.) wish to acknowledge that the material in this paper is based upon research supported by the U.S. National Science Foundation under award number IRI, and by a grant from the United States - Israel Binational Science Foundation (BSF), Jerusalem, Israel. T.R. acknowledges support by the Academy of Finland under grant No. 4964/30/.

For each entry in the dictionary, there is a pointer into the concordance, which lists each occurrence of the word. Every occurrence of a word in the database can be uniquely characterized by a sequence of numbers that gives its exact position in the text. Typically, such a sequence consists of the document number d, the paragraph number p (in the document), the sentence number s (in the paragraph) and the word number w (in the sentence). The quadruple (d, p, s, w) is the coordinate of the occurrence. The concordance contains, for every word of the dictionary, the lexicographically ordered list of all its coordinates in the text.

The concordance is generally of the same order of magnitude as the text itself, its exact size depending on the omission or inclusion of the most frequent words, the so-called stop-words. Compressing it not only saves space, but also saves processing time by reducing the number of I/O operations needed to fetch parts of the concordance into main memory (see [5], [2], [8]).

We note that in a static information retrieval system, compression and decompression are not symmetrical tasks. Compression is done only once, while building the system, whereas decompression is needed during the processing of every query and directly affects the response time. One may thus use extensive and costly preprocessing for compression, provided reasonably fast decompression methods are possible. Moreover, in an information retrieval system, while we compress full files (text, concordance, etc.), we decompress only (possibly many) short pieces on demand; these may be accessed at random by means of pointers to their exact locations. This limits the value of adaptive methods based on tables that systematically change from the beginning to the end of the file.

Oddly, compared to full text, concordance compression has received relatively little attention. Some ad hoc methods have been used [3], [7], as well as simple models assuming a structured entity in which term occurrences were distributed independently [1], [8]. In this paper we generalize the independence models to incorporate a tendency of the terms to cluster.
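To make the dictionary-plus-concordance structure described above concrete, here is a minimal Python sketch (our own illustration; the name build_concordance and the nested input layout are assumptions, not part of the original paper) that maps each word of a toy collection to its lexicographically ordered list of (d, p, s, w) coordinates:

    from collections import defaultdict

    def build_concordance(documents):
        # documents: list of documents; a document is a list of paragraphs,
        # a paragraph is a list of sentences, a sentence is a list of words.
        concordance = defaultdict(list)
        for d, doc in enumerate(documents, start=1):
            for p, par in enumerate(doc, start=1):
                for s, sent in enumerate(par, start=1):
                    for w, word in enumerate(sent, start=1):
                        # (d, p, s, w) is the coordinate of this occurrence;
                        # the scanning order yields sorted coordinate lists.
                        concordance[word].append((d, p, s, w))
        return dict(concordance)

    docs = [[[["the", "cat"], ["the", "dog"]]], [[["a", "cat"]]]]
    print(build_concordance(docs)["cat"])   # [(1, 1, 1, 2), (2, 1, 1, 2)]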

2 Modeling the Concordance

For our model of a textual database, we assume that the text is divided into documents and the documents are made up of words. We thus use only a two-level hierarchy to identify the location of a word, which makes the exposition easier. The methods can, however, be readily adapted to more complex concordance structures, like the 4-level hierarchy mentioned above. In our present model, the conceptual concordance consists, for each word, of a series of (d, w) pairs, d standing for a document number, and w for the index, or offset, of a word within the given document. It is sometimes convenient to translate this model to an equivalent one, in which we indicate 1) the index of the next document containing the word, 2) the number of times the word occurs in the document, followed by 3) the list of word indices of the various occurrences:

    word_i : (d, m; w_1, ..., w_m), ...
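To illustrate this equivalent representation, the following minimal sketch (the name to_records is our own assumption) groups a word's sorted (d, w) pairs into one (d, m; w_1, ..., w_m) record per document containing the word:

    def to_records(pairs):
        records = []          # one entry per document containing the word
        for d, w in pairs:    # pairs are lexicographically sorted (d, w)
            if records and records[-1][0] == d:
                records[-1][1].append(w)
            else:
                records.append((d, [w]))
        # emit (d, m, [w_1, ..., w_m]) triples
        return [(d, len(ws), ws) for d, ws in records]

    print(to_records([(3, 5), (3, 9), (7, 1)]))  # [(3, 2, [5, 9]), (7, 1, [1])]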

Our task is to model each of the components of the coordinate, and to use standard compression methods to compress each entity. Below we assume that we know (from the dictionary) the total number of times a word occurs in the database, the number of different documents in which it occurs (N), and (from a separate table) the number of words in each document. We also know the value of D, the number of documents in the collection.

The model of [1] assumed that all documents are approximately of the same size, that there is no between-document clustering, and that within a single document, words are independently distributed. On the basis of these assumptions, three probability distributions were derived for each term, giving the location of the next document containing the term, the number of times the term occurs in this document, and the locations of the word within the document. Finally, Huffman, arithmetic, or Shannon-Fano codes were generated for these probability distributions, and the codeword corresponding to the actually appearing value d, m or w was used in the encoding.

We now turn to the more realistic case in which the independence assumptions do not hold. For example, if the documents of an information retrieval system are grouped by author or some other criterion, the d-values of many terms will tend to appear in clusters because of the specific style of the author, or because adjacent documents might treat similar subjects. Similarly, term occurrences will cluster within documents, reflecting content variations over a document. In the following section, we model the bit-generation process in a cluster-prone environment by means of a Markov chain, and suggest how to exploit it for the compression of concordances. In Section 4 we report on some experimental results obtained by applying these methods to the concordance of the Hebrew Bible.

3 Models of Clustering

In our earlier examination of concordances, we first looked at the pattern of term occurrences over documents, and then at the distribution of terms within a given document. Formally, the analyses were identical. Since this remains true when clustering is considered, we shall here examine only the first case explicitly: that is, we have D documents, of which N are known to have at least one occurrence of the term being considered; our goal is to analyze the patterns of the gaps separating documents that contain the term. Formally, we shall represent each term by a bitmap, with a position for each document. If the term is present in a document, that document's bit position will be set to one; otherwise it will be zero. Below it is convenient to talk in terms of bitmaps rather than in terms of the concordance of which they are a component.

Should the tendency for clustering be pronounced, codes based on assumptions of term independence will produce poor compression. In this section we offer several models of clustering that can potentially be used to improve compression.

3.1 Full Markov Model

We investigate a family of n-state Markov chains as models of how our bitmap was generated: as we traverse the bitmap from beginning to end, at each position we are in a state, and that state determines the probability of being in any given state at the next location. The nature of the transition determines whether the next bit is a one or a zero.

The simplest model is the independence model. This has two states, which, to be consistent with what follows, we denote by C (within cluster) and B (between clusters). In the independence model we assume the probabilities of the transitions C → C and B → C are equal. In this model, as in all the Markov models we consider, we assume that we generate a term only when making a transition into a cluster, even if the cluster consists of a single document.

A simple, true Markov model is based on the same two states C and B, but in the true two-state model, the probability of a transition to C (that is, the probability of turning on a bit) differs depending on the state we are in. The state C indicates that we are in a cluster, and thus are more likely to generate occurrences of the designated term. We enter B as soon as we leave the cluster (that is, generate a zero-bit).

A limitation of the two-state model is that it doesn't recognize the possibility of spurious zeroes within a cluster: that is, we would like to incorporate the possibility that we may be in a cluster and generate a single zero-bit without leaving the cluster. (For simplicity, we are neglecting the symmetric problem of spurious ones outside of a cluster; this can be dealt with in an identical manner.) We accommodate this possibility by introducing transitional states. For illustration, we describe in detail a 3-state model. The states are: 1. Cluster State (C); 2. Transitional State (X); and 3. Between-Cluster State (B). The state X of the three-state model is introduced to permit us to be in a cluster and yet not generate any terms; it is a "maybe" state. However, if two consecutive documents do not contain the term, we have left the cluster and are in state B, between clusters. We can generalize this model by introducing a larger number of transitional states X1, X2, etc. Within the three-state model, a term is generated whenever we enter state C. The transition probabilities are:

    C → C : γ_C;   X → C : γ_X;   B → C : γ_B;
    C → X : 1 - γ_C;   X → B : 1 - γ_X;   B → B : 1 - γ_B.
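The generative process just described is easy to simulate. The sketch below (our own illustration; generate_bitmap and the sample parameter values are assumptions) emits a one-bit exactly on transitions into C and a zero-bit on the transitions C → X, X → B, and B → B:

    import random

    def generate_bitmap(D, gamma, seed=1):
        # gamma maps each state to its probability of moving to C
        # (i.e., of emitting a one-bit).
        rng = random.Random(seed)
        zero_next = {"C": "X", "X": "B", "B": "B"}
        state, bits = "B", []          # generation starts in state B
        for _ in range(D):
            if rng.random() < gamma[state]:
                state = "C"            # entering C generates a term
                bits.append(1)
            else:
                state = zero_next[state]
                bits.append(0)
        return bits

    bits = generate_bitmap(20, {"C": 0.6, "X": 0.5, "B": 0.05})
    print("".join(map(str, bits)))     # one-bits tend to appear in clusters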

We can encode each bit in a bitmap individually, using arithmetic coding based on the probability of a one-bit at each stage; alternatively, it may be more convenient to encode the gaps between one-bits. To do this, we need the distribution of gap sizes, which is easily derived from the above model. Once we are in a document that contains an occurrence of the designated term, the probability p_n that the next n documents do not have an occurrence is:

    p_0 = γ_C;   p_1 = (1 - γ_C) γ_X;   and for n ≥ 2:   p_n = (1 - γ_C)(1 - γ_X)(1 - γ_B)^(n-2) γ_B.

The γ's are our model's basic parameters. But we will also consider a suggestive reparameterization: γ_C ≡ γ, γ_X ≡ θ_X γ_C, and γ_B ≡ θ_B γ_X, which defines the parameters γ, θ_X, and θ_B. These parameters satisfy simpler constraints (0 ≤ θ_X, θ_B ≤ 1) independently of one another and of γ, which will ease problems of estimation. We also expect that regularities in the data will be more easily expressed in terms of the θ's than in terms of the γ's. It is also useful to define θ ≡ θ_X θ_B, so that γ_B = θ γ_C. θ is a single value reflecting the strength of clustering: it indicates the relative likelihood of a term appearing if we are inside a cluster as compared to outside one. More applications of Markov models to data compression can be found in [4] and [6].

3.2 Parameter Estimation

We can estimate the parameters directly. We assume that before generating the bitmap, we are in state B. The sequence of ones and zeroes making up the bitmap completely determines the state at any bitmap position. Thus it is easy to tabulate the number, and hence the probability, of each type of transition. For illustration, consider the following bitmap:

    0 0 1 0 1 1 0 0 ...

Since we begin in state B, the initial zero indicates that the first transition is back to state B. Continuing in this manner, we find the sequence of states corresponding to the above bitmap is given as follows (with the initial B preceding the colon):

    B : B B C X C C X B ...

In this sequence, we are in state B four times, for which we have three transitions. Of these three transitions, one is to state C, so on the basis of the information given, we would estimate γ_B ≈ 1/3; similarly γ_C ≈ 1/3, and γ_X ≈ 1/2. Thus each parameter is easily evaluated, and these parameters can be used as the basis for compressing the bitmap. Further, the standard deviation of this estimate for state S is given by σ_S = sqrt(γ_S (1 - γ_S) / N_S), if we observe N_S transitions from state S in the pertinent bitmap. The standard deviations make it possible to compute confidence intervals and to test hypotheses.

For a large bitmap, the γ-values can be stored with the bitmap to permit decompression. But several mechanisms can be tried to reduce the cost of storing these parameters. For example, we can save the space for storing these parameters by evaluating them adaptively, beginning with reasonable initial values.
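The tabulation procedure can be stated compactly in code. The following sketch (estimate_gammas is our own name) replays a bitmap from the initial state B, counts the transitions out of each state, and reproduces the estimates of the worked example above:

    def estimate_gammas(bits):
        # gamma_S is estimated as (transitions from S into C) / (transitions from S)
        zero_next = {"C": "X", "X": "B", "B": "B"}
        to_c = {"C": 0, "X": 0, "B": 0}
        total = {"C": 0, "X": 0, "B": 0}
        state = "B"                    # we assume we start in state B
        for b in bits:
            total[state] += 1
            if b == 1:
                to_c[state] += 1
                state = "C"            # a one-bit is a transition into C
            else:
                state = zero_next[state]
        return {s: to_c[s] / total[s] for s in total if total[s] > 0}

    # The bitmap 00101100 of the worked example yields
    # gamma_C = 1/3, gamma_X = 1/2, gamma_B = 1/3:
    print(estimate_gammas([0, 0, 1, 0, 1, 1, 0, 0]))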

We can also lower storage costs if we find that the θ's are related in a regular manner over the terms. For example, since state X is intermediate between states C and B, we might find that γ_X is reasonably approximated by the average of γ_C and γ_B, that is, that θ_X = (1 + θ)/2. It may also be possible, without serious deterioration in performance, to divide the parameters into a small number of categories. Thus, we might divide our terms into four clustering classes: one class would represent no clustering (θ = 1); the other three classes would represent varying clustering strengths, with the values of the clustering parameter θ for these classes determined empirically. This simplification allows the clustering strength to be represented at a cost of just two bits per term. We will explore the value of such simplifications.

3.3 Model Testing

The model allows us to make a number of predictions about the properties of the bitmaps, and these can be used to test the model.

Test 1: One such test predicts the density of one-bits, N/D; this can be compared to the actual values of N and D, which are available to us without extra cost. The relationship is derived as follows. We assume we first have a span of documents without the designated term; the size of this span is E. This is followed by N spans, each made up of a document containing the term followed by an inter-term gap of G documents, and by a terminal span of approximate size E. To derive our estimate, we assume that each internal span has a length equal to its expected value, and that the lengths of the terminal spans are of comparable size (say E ≈ aG, for a ≈ 1). As N gets large, the contribution of E becomes negligible, allowing the following large-N relationship between D, N and the γ's:

    N/D = γ_B / (γ_B + (1 - γ_C)(1 - γ_X + γ_B)).   (1)

This relation, which can also be derived more formally from Markov chain theory, provides our first test of the model. The general theory of Markov chains allows us to derive the steady-state probabilities π_C, π_X and π_B of being in states C, X, and B:

    π_C = γ_B / (γ_B + (1 - γ_C)(1 - γ_X + γ_B)),   (2)
    π_X = (1 - γ_C) γ_B / (γ_B + (1 - γ_C)(1 - γ_X + γ_B)),   (3)
    π_B = (1 - γ_C)(1 - γ_X) / (γ_B + (1 - γ_C)(1 - γ_X + γ_B)).   (4)

Since π_C D = N, equations (1) and (2) are very similar.
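For reference, the closed forms (2)-(4) are easily checked numerically; the sketch below (steady_state is our own name, and the sample parameter values are arbitrary) computes the three probabilities and confirms that they sum to one:

    def steady_state(gc, gx, gb):
        # Closed forms of eqs. (2)-(4); pi_C also predicts the
        # one-bit density N/D of eq. (1).
        denom = gb + (1 - gc) * (1 - gx + gb)
        return gb / denom, (1 - gc) * gb / denom, (1 - gc) * (1 - gx) / denom

    pi_c, pi_x, pi_b = steady_state(0.6, 0.5, 0.05)
    print(pi_c, pi_x, pi_b, pi_c + pi_x + pi_b)   # the probabilities sum to 1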

Test 2: As part of the development of Test 1, we computed the average number of zeroes between two consecutive one-bits. A similar argument yields the useful relation that the average size, SZ, of an actual run of zeroes (that is, assuming at least one zero in the run) is given by

    SZ = γ_X + (1 - γ_X)(1 + γ_B)/γ_B.

If there are NZ zeroes in the bitmap, we expect approximately NZ/SZ runs of zeroes. Further, since the number of runs of ones is within one of the number of runs of zeroes, NZ/SZ is also an estimate of the number of runs of one-bits. Each of these relations can be used as a test of the model.

3.4 Performance Estimates

If we are in state S ∈ {B, X, C}, we expect to encode the next bitmap element in H(γ_S) bits, where H(x) = -(x log(x) + (1 - x) log(1 - x)). Using eqs. (2), (3), and (4), we estimate that the D bits of the bitmap can be reduced to

    B_M = (π_C H(γ_C) + π_B H(γ_B) + π_X H(γ_X)) D

bits. This estimate of performance can be used as a rough test of the Markov model. But even if the model is valid, we are left with the question of whether the savings of the full model justify its additional complexity relative to the independence model. The relative performance of the two models is easily computed. Under the assumption of the independence model, the appearance of a one-bit is determined by a single parameter, p, and the size of the bitmap is reduced to B_I = H(p) D bits. But the probability p of generating a one-bit depends on which state S we are in (governed by the known probabilities π_S), and on the probability of a transition to C from S. If the Markov model is valid, we could combine these to find:

    p = π_B γ_B + π_X γ_X + π_C γ_C = π_C.

The equation p = π_C could have been asserted directly: the probability of a one-bit is just the probability of going to state C from any of the states; but the long-term probability of going to C is the same as the probability of being in C, as the last equation asserts. The ratio B_M/B_I gives the relative advantage of using the full Markov model.
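Putting the pieces together, the ratio B_M/B_I can be computed directly from the γ's; a small sketch (compression_ratio is our own name, and the sample parameters are arbitrary):

    from math import log2

    def H(x):
        # binary entropy in bits, with H(0) = H(1) = 0
        return 0.0 if x in (0.0, 1.0) else -(x * log2(x) + (1 - x) * log2(1 - x))

    def compression_ratio(gc, gx, gb):
        # B_M / B_I: expected bits per bitmap position under the Markov
        # model versus the independence model with p = pi_C.
        denom = gb + (1 - gc) * (1 - gx + gb)
        pi_c = gb / denom
        pi_x = (1 - gc) * gb / denom
        pi_b = (1 - gc) * (1 - gx) / denom
        b_m = pi_c * H(gc) + pi_x * H(gx) + pi_b * H(gb)
        b_i = H(pi_c)
        return b_m / b_i

    print(compression_ratio(0.6, 0.5, 0.05))  # about 0.67: clustering helps here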

4 Partial Experiment

We decided for technical reasons to continue testing our methods on the concordance of the Hebrew Bible, as we did in our earlier paper. The words of the Bible are partitioned into 929 chapters, which serve as the documents of our model; the coordinates are of the form (d, m; w_1, ..., w_m). The first step in applying the Markov model is to evaluate the values of the γ's and θ. This was done by scanning the bitmaps generated from the concordance, simulating the B, X and C states, and counting the transitions.

There is of course no sense in applying the Markov model to the full set of terms; most of these appear in very few documents, and 57% of the terms appear in one document only. Using the Markov model, one would need 9.35 bits on average to encode the d-fields of terms that appear in fewer than 60 documents, whereas one could simply use a fixed encoding of 10 bits to designate in which of the 929 documents the term appears. We therefore restricted our test to the 426 terms appearing in at least 60 documents. Though these are only 1% of the terms, they account for 48% of the coordinates.

Table 1: Distribution of γ values (for each range, the number of terms with γ_S in that range, S ∈ {C, X, B}, for N ≥ 1 and for N ≥ 60)

Table 1 shows the distribution of the values of the estimated parameters γ_C, γ_X, and γ_B, both for the full set of coordinates and for the most frequent words. The table shows, for each range, the number of terms having their value of γ_S in the given range, for S ∈ {C, X, B}. We see a tendency to have more terms with γ_C in the higher ranges than with γ_X or γ_B. The large majority of terms have very low γ values (less than 0.02), because these terms are so rare that hardly any cluster is found. By comparing the two γ_B columns, one sees that no term with N < 60 has a γ_B value larger than 0.1.

The average values, γ̄_S, of the γ_S in the given sets are displayed in Table 2. The last line of Table 2 gives the ratio B_M/B_I, which estimates the relative advantage of using the Markov model. The fact that these values are so close to 1 might at first sight indicate that the expected gain is very small. It should, however, be noted that this is probably due to the fact that our test concordance is much too small. Most of the terms, even most of those appearing in more than 60 documents, are still relatively rare; at the other extreme, those few terms appearing in many documents are probably not content-bearing words, and therefore will not exhibit any clustering tendency. But even if the B_M/B_I ratio is not very impressive, there is still evidence that clustering occurs, as suggested by the fact that γ̄_C > γ̄_X > γ̄_B for both sets considered; if terms were distributed randomly, the various γ-values would tend to be equal.

Table 2: Parameter values (γ̄_C, γ̄_X, γ̄_B and the ratio B_M/B_I, for N ≥ 1 and for N ≥ 60)

The compression results are listed in Table 3, giving the average number of bits needed to encode the document field of a coordinate. The first line corresponds to the static independent model of [1], that is, N and D are fixed for each term. The second line gives the results of the dynamic independent model of [1], where N and D are updated after each encoded d-field. The last line corresponds to the Markov model.

Table 3: Comparison of bit-generation models (average bits per document field for the static independent, dynamic independent, and Markov models, for N ≥ 1 and for N ≥ 60)

The first column gives the averages on the full set. We see that using the Markov model would actually incur a loss. The second column gives the averages for the 426 terms appearing in at least 60 documents. These terms are not necessarily clustered, but because of their relatively high density, the Markov model shows a moderate improvement here over the earlier models. The denser the bitmaps, the better the Markov model performs. For instance, for the 79 words appearing in at least 200 documents, only 2.24 bits are needed on average to encode the value of the document field.

5 Final Remarks

As noted already in our earlier paper, the Bible is not really a good example of a database on which to apply our new methods. The Bible is too small (about 1.5 MB), so that hardly any concordance is needed. On real-life concordances, of the size of hundreds of MB, the compression savings are usually much more substantial (about 40% by method POM of [3] on large databases, versus 16% by POM here), so we expect our method to perform much better there as well. The fact, however, that restricting our attention to the most frequent words still covers a large part of the concordance holds also for large systems. For instance, the 100 most frequent words of the Trésor de la Langue Française (TLF) account for 57% of the text. We are currently experimenting with the other tests and models, and shall apply these to the database of the TLF.

6 References

1. Bookstein A., Klein S.T., Raita T., Model based concordance compression, Proc. Data Compression Conference DCC-92, Snowbird, Utah (1992).
2. Bookstein A., Klein S.T., Ziff D.A., A systematic approach to compressing a full text retrieval system, Information Processing & Management 28 (1992).
3. Choueka Y., Fraenkel A.S., Klein S.T., Compression of concordances in full-text retrieval systems, Proc. 11th ACM-SIGIR Conference, Grenoble (1988).
4. Cormack G.V., Horspool R.N., Data compression using dynamic Markov modeling, The Computer Journal 30 (1987).
5. Klein S.T., Bookstein A., Deerwester S., Storing text retrieval systems on CD-ROM: compression and encryption considerations, ACM Trans. on Information Systems 7 (1989).
6. Llewellyn J.A., Data compression for a source with Markov characteristics, The Computer Journal 30 (1987).
7. Wisniewski J.L., Compression of index term dictionary in an inverted-file oriented database: some efficient algorithms, Information Processing & Management 22 (1986).
8. Witten I.H., Bell T.C., Nevill C.G., Models for compression in full-text retrieval systems, Proc. Data Compression Conference DCC-91, Snowbird, Utah (1991).
