Volume 114 No. 12 2017, 145-154
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

An Effective Approach to Improve Storage Efficiency Using Variable-Bit Representation

1 R. Anoop, 2 Subhadra G. Varma and 3 V.R. Rajalakshmi
1,2,3 Department of Computer Science & IT, School of Arts and Sciences, Amrita University, Kochi.

Abstract

Compression techniques can decrease the cost of data storage and transmission by reducing the redundancy within a dataset. Data compression encodes information using fewer bits than the original representation. These techniques may be either lossy or lossless. In lossless compression, the original data can be completely reconstructed from the compressed data, whereas in lossy compression the data cannot be fully restored to its initial state. In this paper, we propose a lossless compression method based on the concept of variable bits. The distinct repetitive values of an attribute in a database are represented as binary values, where the data item with the highest frequency is assigned the lowest binary value (0), the next item is assigned the next value (1), and so on. The bit length is further reduced by removing redundant leading zeroes wherever possible. This ensures that the most repeated data item is assigned the lowest possible bit value with the shortest bit length.

Key Words: Database, compression, variable-bit, lossless, PL/SQL.
1. Introduction

A relational database is a set of data items systematically organized as tables, from which data can be retrieved or reassembled without restructuring the tables. The data items can be large in number, with unique and repetitive values, and so require a large amount of storage space at high cost. The storage efficiency of a database can be improved by incorporating data compression, which reduces the quantity of data used to represent a file without excessively compromising the quality of the original data.

Lossless compression: Lossless compression is applied where it is essential that the data derived after decompression be exactly identical to the original data; that is, no portion of the data is lost during compression. Lossless compression focuses more on preserving the integrity of the data than on compression efficiency.

Lossy compression: In lossy data compression, the data extracted after decompression may not be exactly the same as the original, i.e. some fraction of the data might be lost (hence the name lossy), but the result is good enough to be useful for specific purposes. When the compressed message is decoded, it does not give back the original message intact. Lossy compression thus focuses more on maximal compression than on the integrity of the data.

2. Related Works

Nimisha et al. [1] proposed a lossless compression method that uses binary values to represent every distinct attribute value in a database. In this method, the count of distinct attribute values (n) was found and the number of bits needed to represent them was calculated (by the general rule, with n bits, 2^n combinations can be represented). A new table was created with the unique attribute values and their corresponding bit values, and the original table was updated with those bit values.
However, the frequencies of the attribute values were not taken into consideration, and an equal number of bits was used to represent every distinct attribute value. As an example, 4 unique values can be represented with 2-bit combinations (2^2 = 4): the combinations are 00, 01, 10 and 11.

S.R. Kodituwakku et al. [2] performed experimental comparisons of different lossless compression algorithms for text data. Although the algorithms were tested on different types of files, the main interest was in different test patterns. Considering the compression times, decompression times and saving percentages of all the algorithms, the Shannon-Fano algorithm was judged the most efficient among the selected ones.
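The fixed-width assignment of [1] can be illustrated with a short sketch. This is our reconstruction in Python for illustration only (the original work operates on database tables, and the function name here is ours):

```python
# Reconstruction of the fixed-width scheme of Nimisha et al. [1]: every
# distinct value receives the same number of bits, ceil(log2(n)) bits
# for n distinct values.
import math

def fixed_width_codes(distinct_values):
    """Assign equal-length binary codes to the given distinct values."""
    n = len(distinct_values)
    width = max(1, math.ceil(math.log2(n)))  # n-bit codes cover 2^n combinations
    return {v: format(i, f"0{width}b") for i, v in enumerate(distinct_values)}

# 4 unique values fit in 2-bit combinations: 00, 01, 10, 11.
print(fixed_width_codes(["A", "B", "C", "D"]))
```

Every code has the same width, which is exactly the property the variable-bit approach in this paper improves upon.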
Amandeep Singh et al. [3] developed a dynamic bit reduction algorithm to compress and decompress text data using a lossless approach. Experiments were conducted on different datasets such as Random, Alphanumeric, Numeral and Special Characters. The results of the proposed system were compared with existing techniques (Bit Reduction and Huffman Coding) using Compression Ratio and Saving Percentage as parameters, and the proposed system showed very good compression results on both measures.

Rupinder Singh et al. [4] also proposed a new bit reduction algorithm for compressing text data based on existing compression algorithms. The algorithm employs the technique of saving bits. The compression step takes O(n) time, where n is the total number of characters in the file; since the differential breaking follows a divide-and-conquer policy, it takes O(n log n) time, so the total computation time required is proportional to O(n log n).

Shrusti Porwal et al. [5] compared lossless data compression techniques (Huffman and arithmetic encoding) and their performance. Stepwise algorithmic processes and various performance measures were analysed to determine which technique is better. Performance was evaluated on compression ratio, compression speed, decompression speed, memory required, compressed pattern matching and support for random access. Arithmetic encoding was observed to yield the best compression ratio compared to Huffman coding.

Figure 1: Lossless Compression
Figure 2: Lossy Compression

3. Methodology

A. PL/SQL

PL/SQL is a procedural language that encompasses SQL statements within its syntax. It was developed by Oracle to extend the features of SQL and to embed the features of procedural programming within SQL. There are generally six types of SQL commands:

a. Data Definition Language (DDL): used for purposes such as creating database objects and restructuring them. Common examples are CREATE TABLE, DROP TABLE and ALTER TABLE.
b. Data Manipulation Language (DML): used to insert, delete and modify data. The DML commands are INSERT, UPDATE and DELETE.
c. Data Query Language (DQL): allows data to be accessed/retrieved from the database. The basic command is SELECT.
d. Data Control Language (DCL): grants privileges to users to regulate data access within the database. The commands are ALTER PASSWORD, GRANT, REVOKE and CREATE SYNONYM.
e. Data administration commands: enable users to diagnose system performance by carrying out audits within the database. Examples are START AUDIT and STOP AUDIT.
f. Transactional control commands (TCL): used for managing database transactions. A few of these commands are COMMIT, ROLLBACK, SAVEPOINT and SET TRANSACTION.

B. PL/SQL Dynamic SQL

Dynamic SQL enables creating and running SQL statements at run time. It is useful for:
a. writing general-purpose and flexible programs such as ad hoc query systems;
b. writing programs that must run data definition language (DDL) statements;
c. cases where the data type or number of input and output variables, or the full text of an SQL statement, is unavailable at compile time.

4. Proposed Work

Step 1: Select the count of distinct data items from the selected column.
Example: SELECT COUNT(DISTINCT col_name) FROM table_name;
Step 2: Calculate the count (n) of bits needed to represent the attribute (by the general rule, with n bits we can represent 2^n unique combinations).
Step 3: Find the unique data items for every attribute in the table and sort them in decreasing order of their frequency.
Example: SELECT col_name, COUNT(col_name) FROM table_name GROUP BY col_name ORDER BY COUNT(col_name) DESC;
Step 4: Create a table (index table) with the data items and their frequency along with their corresponding binary values.
Attributes are assigned bit values in decreasing order of their frequencies.
Step 5: Eliminate the leading zeroes from the binary values.
Step 6: Update the database by replacing the data items with their corresponding V-bit (variable-bit) values.
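The steps above can be sketched in plain Python. This is an illustrative stand-in for the paper's PL/SQL procedures, and the sample column values are our own assumptions:

```python
# Illustrative Python sketch of Steps 1-6 (the paper implements these in
# PL/SQL; the sample column data below is an assumption).
from collections import Counter

def build_index_table(column_values):
    """Steps 1-5: map each distinct value to a variable-bit code,
    most frequent value first."""
    freq = Counter(column_values)                  # Steps 1 and 3: frequency counts
    ranked = sorted(freq, key=lambda v: -freq[v])  # sort by decreasing frequency
    # Steps 4-5: bin(rank) carries no leading zeroes, so the most frequent
    # value gets the shortest code: 0, 1, 10, 11, 100, ...
    return {value: bin(rank)[2:] for rank, value in enumerate(ranked)}

def compress_column(column_values, index_table):
    """Step 6: replace each data item with its variable-bit code."""
    return [index_table[v] for v in column_values]

column = ["Office Supplies"] * 4 + ["Furniture"] * 3 + ["Technology"] * 2
index = build_index_table(column)
print(index)  # {'Office Supplies': '0', 'Furniture': '1', 'Technology': '10'}
```

Because Python's sort is stable, ties in frequency fall back to encounter order, matching the sequential assignment for equal frequencies described later in the paper.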
The following steps are performed when data manipulation operations occur:

INSERT: When new values are inserted, calculate the count of repeated values, recalculate the bit values and find the binary equivalents.
DELETE: When values are deleted, perform the same process as for insertion: calculate the count, recalculate the bit values and find the binary equivalents.
UPDATE: Update the index table with the new value.
SELECT: Values can be selected and checked from the index table instead of the original table.

Here, a table is created at run time with the attribute values and their corresponding frequencies, along with an index table mapping attribute values to their binary values. The data item with the highest count is assigned the lowest binary value (0), the next item is allocated the next value (1), and so on. The leading zeroes are eliminated from the binary values (resulting in variable-length bit values) to reduce the storage space, thus increasing the storage efficiency.

5. Experiments and Results

A sample Superstore Sales dataset, which contains distinct repetitive values, is used for the experiments in this paper. The table structure is as follows:

Figure 3: Dataset: store sales
From the given dataset, we select the attributes that have repetitive values and would therefore require a considerable amount of storage space. We then calculate the count of distinct repetitive values in each selected column. For example, the count of distinct values in the column Product Category is 3. By the general rule, with n bits we can represent 2^n unique combinations; here we have 3 unique values, so a maximum of two bits is needed to represent all of them. The fixed-width bit values for these combinations are 00, 01 and 10. By further applying the concept of variable bits, the values are truncated and represented simply as 0, 1 and 10. We then substitute these values for the attribute values and create an index table of attribute values and their corresponding bit values, assigned based on frequency. When frequencies are equal, the assignment is simply done sequentially. The index table created for the single attribute "Product Category" is shown in the table below.

Table 1: Index table for Product Category

Similarly, we generate index tables for every attribute in the dataset. These also serve as reference tables to re-create the original dataset during decompression. After creating the index tables for all columns, we update the original table by replacing the actual data values with their corresponding bit values. The updated table is shown below. In the original table, the column Product Category requires 28 bits for the attribute Office Supplies, 18 bits for Furniture and 20 bits for Technology. In the compressed form, only 1 bit for Office Supplies, 1 bit for Furniture and 2 bits for Technology are required, a total of 4 bits.
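The code-length arithmetic above can be checked with a short script (Python for illustration; only the lengths of the assigned codes are compared, not the byte-level cost of the original strings):

```python
# Check of the code lengths for Product Category (3 distinct values):
# fixed-width coding needs 2 bits per value; the variable-bit codes are 0, 1, 10.
fixed = {"Office Supplies": "00", "Furniture": "01", "Technology": "10"}
variable = {"Office Supplies": "0", "Furniture": "1", "Technology": "10"}

fixed_total = sum(len(code) for code in fixed.values())        # 3 x 2 = 6 bits
variable_total = sum(len(code) for code in variable.values())  # 1 + 1 + 2 = 4 bits
print(fixed_total, variable_total)  # 6 4
```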
This is a sizeable improvement over the similar work of Nimisha et al. [1], which uses six bits in the same scenario, since they generated binary values of fixed width without eliminating the leading zeroes. The refinement becomes even more evident in situations where the number of data items is significantly high.
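As noted earlier, the index tables double as reference tables for decompression. A minimal sketch (Python for illustration; the mapping below repeats the Product Category index table):

```python
# Decompression sketch: invert the index table (value -> code) and map each
# stored code back to its original data item.
index = {"Office Supplies": "0", "Furniture": "1", "Technology": "10"}
decode = {code: value for value, code in index.items()}

compressed = ["0", "10", "1", "0"]          # codes as stored in the column
restored = [decode[c] for c in compressed]
print(restored)
```

Because each code is stored in its own table cell rather than concatenated into a single bit stream, the codes 0, 1 and 10 need not be prefix-free for decoding to remain unambiguous.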
Figure 4: Dataset updated with bit values (compressed dataset)
Figure 5: Bar chart representing the storage space needed before and after compression

6. Conclusion

In this paper, we propose a variable-bit data type for the purpose of compression in a database. This extends the earlier work of Nimisha et al. [1], which uses a uniform-length bit representation, by truncating the redundant leading zeroes in the binary values used during compression. In addition, bits are assigned to the data items based on their frequency of occurrence, which ensures that the lowest binary number is allotted to the data item with the highest frequency and, likewise, the highest binary value to the item with the lowest frequency. This results in an overall reduction in the number of
bits used during compression. This technique can be applied to any dataset, and, considering the actual number of bits used, variable-bit representation is a much better alternative to conventional fixed-width bit representation for compression.

Acknowledgement

Special thanks to Rajalakshmi V.R., Assistant Professor, Department of Computer Science & IT, Amrita School of Arts and Sciences, Kochi, for her guidance and helpful comments on database management and data compression. We also thank the anonymous reviewers for their helpful and constructive comments.

References

[1] Nimisha E., Shyama P., Rajalakshmi V.R., A New Approach to Increase the Storage Efficiency of Databases Using BIT Representation, Amrita Vishwa Vidyapeetham, Department of Computer Science and IT, Kochi, India (2016).
[2] Kodituwakku S.R., Amarasinghe U.S., Comparison of Lossless Data Compression Algorithms for Text Data, Indian Journal of Computer Science and Engineering 1(4) (2010), 416-425.
[3] Amandeep Singh Sidhu, Meenakshi Garg, Research Paper on Text Data Compression Algorithm using Hybrid Approach, International Journal of Computer Science and Mobile Computing (IJCSMC) 3(12) (2014), 01-10.
[4] Rupinder Singh Brar, Bikramjeet Singh, A Survey on Different Compression Techniques and Bit Reduction Algorithm for Compression of Text/Lossless Data, International Journal of Advanced Research in Computer Science and Software Engineering 3(3) (2013).
[5] Shrusti Porwal, Yashi Chaudhary, Jitendra Joshi, Manish Jain, Data Compression Methodologies for Lossless Data and Comparison between Algorithms, International Journal of Engineering Science and Innovative Technology (IJESIT) 2(2) (2013).
[6] Nishad P.M., Manicka Chezian R., Enhanced LZW (Lempel-Ziv-Welch) Algorithm by Binary Search with Multiple Dictionary to Reduce Time Complexity for Dictionary Creation in Encoding and Decoding, International Journal of Advanced Research in Computer Science and Software Engineering 2(3) (2012).
[7] Paul G. Howard, Jeffrey Scott Vitter, Practical Implementations of Arithmetic Coding; a shortened version appears in the Proceedings of the International Conference on Advances in Communication and Control (1991).
[8] Haroon Altarawneh, Mohammad Altarawneh, Data Compression Techniques on Text Files: A Comparison Study, International Journal of Computer Applications 26(5) (2011).
[9] Aarti, Performance Analysis of Huffman Coding Algorithm, International Journal of Advanced Research in Computer Science and Software Engineering 3(50) (2013).