A Multi-Algorithm, High Reliability, Extensible Steganalyzer Using Services Oriented Architecture

Sacred Heart University From the SelectedWorks of Eman Abdelfattah August, 2011 A Multi-Algorithm, High Reliability, Extensible Steganalyzer Using Services Oriented Architecture Eman Abdelfattah, Sacred Heart University Available at: https://works.bepress.com/eman-abdelfattah/12/

A MULTI-ALGORITHM, HIGH RELIABILITY, EXTENSIBLE STEGANALYZER USING SERVICES ORIENTED ARCHITECTURE
Eman Abdelfattah
Under the Supervision of Dr. Ausif Mahmood
DISSERTATION SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE AND ENGINEERING
THE SCHOOL OF ENGINEERING
UNIVERSITY OF BRIDGEPORT
CONNECTICUT
December, 2011


Copyright by Eman Abdelfattah 2011

ABSTRACT
Network security has received increased attention in recent decades. Encryption has established itself as the traditional method for transmitting information in secrecy. Although strong encryption is a very secure approach, an observer can easily identify that the transmitted information is encrypted. Once the information is identified as encrypted, an intruder can block the transmission. In contrast, steganography is a viable option for hiding information in transmission without its presence being identified. It provides a blanket that hides encrypted information. Thus, it becomes essential to develop mechanisms that reveal whether communicated information carries any embedded data. Steganalysis is the art of detecting invisible communication and is a very challenging field because of the different types of media and embedding techniques involved. Existing research in steganalysis has focused on developing individual stego detection algorithms for a particular media type or a particular embedding technique. In this dissertation we propose a unified steganalyzer that not only works with different media types, such as images and audio, but is also capable of providing

improved accuracy in stego detection through the use of multiple algorithms. Our proposed system integrates different steganalysis techniques into a reliable steganalyzer by using a Services Oriented Architecture (SOA). The SOA not only allows concurrent processing to speed up the system, but also provides higher reliability than that reported in the existing literature because multiple stego detection algorithms are incorporated simultaneously. Furthermore, the extensible nature of the SOA implementation allows new steganalysis algorithms to be added easily to the system as services. The universal steganalysis technique proposed in this dissertation involves two processes: feature extraction and feature classification. An improved 2D Mel-Cepstrum implementation is used for feature extraction from wav files. The Intra-Blocks technique is used for feature extraction from jpeg images. The feature classification process is implemented using three different classifiers: a neural network classifier, a Support Vector Machines classifier, and an AdaBoost classifier. The unified steganalyzer is tested on jpeg images and wav audio files. The classification accuracy ranges from 90.0% to 99.9% depending on the object type and the feature extraction method. In particular, an enhancement of the 2D Mel-Cepstrum implementation is introduced that achieves an accuracy of 99.9%. This is a significantly better result than the average detection accuracy of 89.9% to 96.7% reported by Liu [1]. Finally, an extensible classifier is introduced that allows detection of new embedding techniques to be added to the currently supported ones, so that the framework maintains its reliability even when new embedding techniques are introduced.

ACKNOWLEDGEMENTS
My thanks are wholly devoted to God, who has helped me all the way to complete this work successfully. I owe a debt of gratitude to my family for their understanding and encouragement. I am honored that my work was supervised by Dr. Ausif Mahmood. Dr. Mahmood has provided unparalleled advice throughout the course of this dissertation. His countless hours of work have helped me to shape this dissertation into its final form. His support and encouragement were major factors in finishing this dissertation. I would also like to thank the committee members for their valuable comments and ideas, which helped me to improve the quality of the work in this dissertation. Last but not least, I would like to acknowledge the support of the Department of Computer Science and Engineering and its faculty for providing all the necessary help and support.

TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
CHAPTER 2: LITERATURE SURVEY
CHAPTER 3: A NEW FRAMEWORK FOR STEGANALYSIS
3.1 Initial Framework Components
3.1.1 Extension service
3.1.2 Complexity service
3.1.3 Mel-Cepstrum service
3.1.4 Markov Service
3.1.5 Intra-Blocks Service
CHAPTER 4: IMPROVED MEL-CEPSTRUM BASED-STEGANALYSIS
4.1 Test Methodology
4.1.1 Support Vector Machines
4.3.2 AdaBoost
4.3.3 Neural Networks
4.3.4 Testing Environment

CHAPTER 5: RESULTS AND ANALYSIS
5.1 Intra-Blocks Technique
5.2 Markov Technique
5.3 Improved 2D Mel-Cepstrum Implementation
5.3.1 Experimentation with Initial data set
5.3.2 Experimentation with large data set
5.4 Extensible classifier to add new embedding techniques
5.5 Enhancements to the initial framework for SOA based Steganalyzer
CHAPTER 6: CONCLUSIONS
6.1 Future work
REFERENCES
APPENDIX A: SAMPLES OF CONFUSION MATRICES
APPENDIX B: SAMPLES OF IMPLEMENTATION CODE
B.1 Matlab Code
B.2 Implementation Details of Services Oriented Architecture

LIST OF TABLES
Table 5.1 Testing Accuracy for Markov Technique
Table 5.2 Testing Accuracy for Improved 2D Mel-Cepstrum implementation using Initial Data Set
Table 5.3 Testing Accuracy for Improved 2D Mel-Cepstrum implementation using Large Data Set

LIST OF FIGURES
Figure 2.1 A general model for steganography
Figure 2.2 The effect of F5 embedding on the histogram of the DCT coefficient (2,1)
Figure 2.3 LSB Analysis option in StegAlyzerSS tool
Figure 3.1 SOA Layers
Figure 3.2 An Overview of the Initial Services Oriented Architecture Steganalyzer
Figure 3.3 Magnitude distribution versus sample number for two distinct signals with different approximate complexities
Figure 4.1 Four hidden layers network for the classification of 169 features
Figure 4.2 Overview of training and testing processes
Figure 4.3 Overview of a dynamic link library (dll) generation
Figure 5.1 Markov transition probability of the second order derivative (Cover wave)
Figure 5.2 Markov transition probability of the second order derivative (Stego wave by Steghide)
Figure 5.3 The difference between Markov transition probability between the cover and the stego
Figure 5.4 The overall structure of the extensible classifier
Figure 5.5 A modified architecture with the extensible classifier

CHAPTER 1: INTRODUCTION
1.1 Research Problem and Scope
Steganography is the art of invisible communication [2], while steganalysis is the art of discovering hidden data in cover objects [2]. The cover object is an object of any type that contains no hidden information, whereas the stego object is obtained by modifying the cover object using an embedding algorithm. Each steganographic system comprises both an embedding algorithm and an extraction algorithm. There are three kinds of steganography: pure steganography, private key steganography and public key steganography. In pure steganography, the technique for embedding the message is unknown to the warden and is shared as a secret between the sender and the receiver. This approach relies on the secrecy of the algorithm itself, which is not good practice because once the algorithm is known to the warden, this kind of steganography is no longer secure. In private key steganography, the sender and the receiver share a secret key that is used to embed the message. The warden has no knowledge of the secret key but is aware of the algorithm that could be employed for embedding messages. This kind of steganography depends on the secrecy of the key, which is chosen to be hard to break. Thus, private key steganography is more secure than pure steganography. In public key steganography, both the sender and the receiver have private-public key pairs and know each other's public key. The sender

encrypts the message using the receiver's public key. Thus, only the intended receiver can decrypt the message using his/her private key. The field of steganalysis has received increased attention in recent years. The main focus of steganalysis is only to detect the presence of a hidden message in a stego object. A stego object is obtained by modifying the cover object using an embedding technique. The object might be an image, text, audio or video. However, most of the techniques reported in the literature deal with images. Steganalysis techniques are classified into two categories: specific and universal steganalysis. Specific steganalysis techniques are designed for a targeted embedding technique. Thus, they yield very accurate decisions when they are used against that particular steganographic technique. In universal techniques, the dependency on the behavior of individual embedding techniques is removed by determining and collecting a set of distinguishing statistics that are sensitive to a wide variety of embedding operations. As an example, universal steganalyzers dealing with images are composed of two important components: feature extraction and feature classification. In feature extraction, a set of distinguishing statistics is obtained from a data set of images by observing general image features that exhibit a strong variation under embedding. Feature classification, in turn, uses the distinguishing statistics from both cover and stego images to train a classifier. The trained classifier is then used to classify an input image as either cover or stego. The purpose of steganalysis is to identify whether a carrier (image, text, audio or video) has been manipulated by embedding a secret message using different embedding

techniques. These two factors, different carrier types and different embedding techniques, introduce great complexity in designing a reliable steganalyzer.
1.2 Motivation behind the Research
Most of the steganalysis techniques reported in the literature handle specific embedding techniques. Other techniques that deal with several embedding techniques are designed to handle a specific data type. Our research focuses on developing a reliable, extensible and unified steganalyzer that can handle multiple data types. In this dissertation, we introduce a framework for a reliable steganalyzer using a Services Oriented Architecture that integrates multiple algorithms and can handle different data types.
1.3 Contributions of the Research
In this dissertation we introduce a steganalyzer using a Services Oriented Architecture (SOA). In our SOA design the system is broken down into independent services. SOA provides several advantages such as high flexibility, simplicity, outsourcing, maintainability and reusability. Furthermore, it provides platform independence and distributes the overall load of the process [3]. Communication among services is achieved through common standards. Extensible Markup Language (XML) is used to communicate all requests and responses. The proposed architecture includes different services as follows:
(1) An extension service that can identify different types of carriers such as:
a. Images
b. Audio

(2) Services that handle different steganalysis techniques.
The advantages of such an architecture are:
a. A reliable implementation of the complex steganalysis problem.
b. Extensibility of the system, where more services for new steganalysis techniques can be added later.
c. Flexibility, where a better service can replace an existing service when an improved steganalysis technique is developed for that specific service.
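The service layout described in Section 1.3 can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the XML element names (request, file, response, result), the in-memory registry, the extension-to-carrier mapping, and the placeholder handlers are all assumptions made for illustration. The sketch only shows how an extension service could route a file to the services registered for its carrier type, with requests and responses exchanged as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry mapping a carrier type to the analysis services
# that accept it. Service names below are illustrative only.
SERVICES = {}

def register(carrier_type, name, handler):
    """Add a steganalysis service for a carrier type (extensibility)."""
    SERVICES.setdefault(carrier_type, []).append((name, handler))

def extension_service(filename):
    """Identify the carrier type from the file extension (assumed mapping)."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return {"jpg": "image", "jpeg": "image", "wav": "audio"}.get(ext, "unknown")

def build_request(filename):
    """Serialize a request as XML; the schema is an assumption."""
    req = ET.Element("request")
    ET.SubElement(req, "file").text = filename
    return ET.tostring(req, encoding="unicode")

def handle_request(request_xml):
    """Route the request to every service registered for the carrier type."""
    req = ET.fromstring(request_xml)
    filename = req.findtext("file")
    carrier = extension_service(filename)
    resp = ET.Element("response", carrier=carrier)
    for name, handler in SERVICES.get(carrier, []):
        ET.SubElement(resp, "result", service=name).text = handler(filename)
    return ET.tostring(resp, encoding="unicode")

# Placeholder handlers: a real service would run feature extraction
# and classification instead of returning a fixed string.
register("audio", "mel-cepstrum", lambda f: "not-analyzed")
register("image", "intra-blocks", lambda f: "not-analyzed")
```

Calling handle_request(build_request("sample.wav")) returns an XML response that lists one result element per registered audio service. Adding a new technique then amounts to one more register call, which mirrors the extensibility and flexibility points in the list above.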

CHAPTER 2: LITERATURE SURVEY
Steganalysis is the art of discovering hidden data in cover objects [2]. The main focus of steganalysis is only to detect the presence of a hidden message. In contrast, steganography is the art of invisible communication. Figure 2.1 shows a general model for steganography [4].
Figure 2.1. A general model for steganography [4]
Each steganographic system has embedding and extraction algorithms. We consider a steganographic system to be secure if the set of stego-objects has the same statistical properties as the set of cover-objects. Thus, we cannot distinguish between cover-objects and stego-objects even with unlimited computing power. Meghanathan et al. review some of the existing steganalysis algorithms [5]. The three commonly used domains of steganography are image, audio and video. Image, audio and video steganography algorithms are analyzed to support better understanding and development of steganalysis algorithms.

Different steganalysis algorithms have been developed because of the diversity in the cover media. In this chapter we divide the steganography and steganalysis techniques into three main categories depending upon the carrier being used: techniques based on images, techniques based on multimedia (e.g., audio), and techniques using other carriers such as HTML files or torrent files. We present the existing state of research in these categories in the following sections.
2.1 Image Steganography and Steganalysis
With the wide availability of digital images, and the high degree of redundancy present in them despite compression, there has been an increased interest in using digital images as cover-objects for the purpose of steganography [6]. Martín et al. investigated the effect that embedding a secret message into a natural image has on the statistics of the image, in order to examine whether the presence of the secret message can be detected [7]. The three stego algorithms used in the experiments are Jsteg [8], MHPDM [9], and one of the algorithms in S-Tools [10]. The following five statistical models of natural images are used: Areas of Connected Components Model [11, 12], Adjacent Pixel Values Model [13, 14, 15], Laplacian Distribution Model [16], Wavelet Coefficients Model, and DCT Coefficients Model. Martín et al. concluded that the effect of the embedding operations on natural images is insignificant when the analysis is independent of the steganography algorithm. However, if prior knowledge of the embedding algorithm is available, a better classification can be obtained.

Avcibas et al. used Binary Similarity Measures (BSMs) to calculate three types of features (computed similarity differences, histogram and entropy related features, and a set of measures based on a neighborhood-weighting mask) by looking at the seventh and eighth bit planes of an image [17, 18]. The images are decompressed before being fed into the steganalyzer because this technique operates in the spatial domain. The authors conclude that their technique demonstrates results comparable to those obtained by Farid's scheme, which uses a wavelet-based decomposition to build higher-order statistical models [19]. Lyu et al. use Wavelet-Based Steganalysis (WBS) to build a model for natural images by using higher-order statistics and then show that images with messages embedded in them deviate from this model [20, 21]. Quadrature mirror filters (QMFs) are used to decompose the image into the wavelet domain, after which statistics such as mean, variance, skewness, and kurtosis are calculated for each sub-band. Additionally, the same statistics are calculated for the error obtained from a linear predictor of the coefficient magnitudes of each sub-band, as the second part of the feature set. The images are decompressed before being fed into the steganalyzer because this technique operates in the spatial domain. Lyu et al. conclude that their approach achieves reasonable accuracy. However, if a small message is embedded, detection performance is poor. Kharrazi et al. study the performance of three distinct blind steganalysis techniques against four different steganographic embedding techniques: Outguess [22], F5 [23], Model-Based [24] and perturbed quantization (PQ) [25]. The cover media used are jpeg images. The collected data set is categorized with respect to size, quality, and texture

to find out their impact on steganalysis performance. Blind steganalysis is composed of two components: feature extraction and feature classification. The three techniques used for feature extraction are binary similarity measures (BSMs), wavelet-based steganalysis (WBS) and feature-based steganalysis (FBS). A linear support vector machine (SVM) is employed to avoid the high computational cost of a nonlinear kernel SVM. Kharrazi et al. conclude that FBS achieves superior performance because the data set used consists of compressed jpeg images. Moreover, PQ is the best of the embedding techniques studied because it is the least detectable. As the quality factor of the images increases, the distinguishability between the cover and stego images decreases. Recompression of jpeg images makes it harder to distinguish between the cover and stego images, where the recompressed cover images are obtained by recompressing the original images using their estimated quality factor [26]. Ella conducted a survey of the methodology of information hiding and described some techniques used in steganography and steganalysis [27]. An experiment is conducted on a set of images from Wikipedia, downloaded using the program Wikix. The program StegAlyzerSS [28] is used to scan the images. The results show that some images were found to have appended information. However, this result is not sufficient evidence for the existence of stego images because the appended information can result from manipulation of the images by some programs or can be information left by cameras. The author concludes that no confirmed instances of steganography were found in the scans, which makes blind steganalysis a difficult problem. Goljan et al. present a method that calculates the features in the wavelet domain as higher-order absolute moments of the noise residual [29]. The advantage of calculating

the features from the noise residual is that it increases the features' sensitivity to embedding. Therefore, this method outperforms a previously proposed method by Holotyak et al. [30]. A classifier using the Fisher Linear Discriminant (FLD) is constructed, which is called the WAM classifier. The WAM classifier is also used to examine the security of three steganographic schemes: pseudo-random ±1 embedding using ternary matrix embedding, adaptive ternary ±1 embedding, and perturbed quantization. The authors conclude that perturbed quantization is the most secure steganography technique because it is the least detectable. Moreover, the adaptive ternary ±1 embedding scheme is more secure than the pseudo-random ±1 embedding scheme. Fridrich discusses Feature-Based Steganalysis (FBS), in which a jpeg image is decompressed and its spatial representation is cropped by four lines of pixels in both the horizontal and vertical directions to estimate the statistics of the original image before embedding [2]. The image is then recompressed with the original quantization table. The differences between the statistics obtained from the given jpeg image and from its estimated original version are computed through a set of functions that operate on both the spatial and DCT domains. According to Fridrich, techniques (such as FBS) that rely on DCT-based statistical features are expected to perform better than BSM and WBS.
2.2 Audio Steganography and Steganalysis
Tian et al. propose an m-sequence based steganography technique for Voice over IP [31]. The technique achieves good security, sufficient capacity and low latency by using the least-significant-bits (LSB) substitution method. Moreover, an m-sequence encryption approach is used to eliminate the correlation among secret messages so that statistical steganalysis algorithms can hardly detect the stego-speech. Also, a

synchronization mechanism is suggested to guarantee the accurate restoration of secret messages at the receiver side. A technique for the transmission of synchronization patterns (SPs) is proposed that allows online distribution of some important parameters by distributing the SPs among fields in the IP header that are available for steganography. Thus, it is possible to construct the covert communication in real time. Tian et al. introduce an adaptive steganography scheme for Voice over IP (VoIP) [32]. The proposed method is evaluated using five different steganography modes. Two modes are based on the traditional LSB substitution method and the other three modes are based on the suggested adaptive steganography scheme. They conclude that the adaptive steganography approach outperforms the traditional LSB substitution method since it enhances the embedding transparency by taking into account the similarity between the Least Significant Bits (LSBs) and the embedded messages. Liu et al. present two methods. In the first method, the statistics of the high-frequency spectrum and the Mel-Cepstrum coefficients of the second-order derivative are extracted for audio steganalysis. In the second method, a wavelet-based spectrum and Mel-Cepstrum coefficients are extracted for audio steganalysis [33]. A support vector machine is applied to the extracted features in both cases. A comparison among these two methods and the signal-based Mel-Cepstrum audio steganalysis method is conducted. Liu et al. conclude that the proposed methods outperform the signal-based Mel-Cepstrum approach. Moreover, the derivative-based approach outperforms the wavelet-based approach.
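To make the two ideas surveyed above concrete, the following Python sketch shows plain LSB substitution on integer audio samples together with a discrete second-order difference of the signal. It is an illustrative toy under stated assumptions, not the scheme of Tian et al. or the feature extractor of Liu et al.: the synthetic sine-wave cover, the function names, and the mean-absolute-value summary are all hypothetical choices, and no Mel-Cepstrum coefficients are computed.

```python
import math
import random

def embed_lsb(samples, bits):
    """Replace the least significant bit of each sample with a message bit."""
    stego = list(samples)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit
    return stego

def extract_lsb(samples, n_bits):
    """Read the message back from the least significant bits."""
    return [s & 1 for s in samples[:n_bits]]

def second_derivative(samples):
    """Discrete second-order difference: x[t+1] - 2*x[t] + x[t-1]."""
    return [samples[t + 1] - 2 * samples[t] + samples[t - 1]
            for t in range(1, len(samples) - 1)]

def mean_abs(values):
    """Simple summary statistic (an assumption, not a published feature)."""
    return sum(abs(v) for v in values) / len(values)

# Synthetic smooth cover signal (assumed for illustration only).
random.seed(0)
cover = [round(1000 * math.sin(2 * math.pi * t / 200)) for t in range(400)]
message = [random.randint(0, 1) for _ in range(len(cover))]
stego = embed_lsb(cover, message)

# For a smooth cover, the stego value is expected to be larger,
# although this is not guaranteed for every signal or message.
cover_stat = mean_abs(second_derivative(cover))
stego_stat = mean_abs(second_derivative(stego))
```

Each sample changes by at most one quantization step, which is why LSB substitution is hard to hear, yet the second-order difference amplifies such sample-to-sample perturbations. This gives some intuition for why derivative-based features can be more sensitive to embedding than features computed on the signal itself.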

Liu et al. suggest a stream data mining approach for audio steganalysis based on the second order derivative of audio streams, extracting Mel-Cepstrum coefficients and Markov transition features from the second order derivative [34]. Signal complexity is taken into consideration as an important parameter for evaluating the performance of audio steganalysis. A support vector machine is applied to the extracted features. Both techniques that apply the second order derivative improve the detection performance compared to signal-based Mel-Cepstrum audio steganalysis. Moreover, the Markov approach based on the second order derivative outperforms the Mel-Cepstrum approach based on the second order derivative at high signal complexity, as reported by the authors. Qiao et al. present an approach for detecting hidden information in MP3 audio streams [35]. The moment statistical features of the Generalized Gaussian Distribution (GGD) shape parameters of the Modified Discrete Cosine Transform (MDCT) sub-band coefficients, as well as the moment statistical features, neighboring joint densities, and Markov transition features of the second order derivatives, are merged. Support Vector Machines (SVMs) are applied to the extracted features for detection. A detection accuracy of 94.1% is achieved when the modification density is 16%. Moreover, the detection accuracy increases to 95.6% when the modification density is increased to 20%.
2.3 Other Steganography and Steganalysis used Media
In addition to the previously discussed carriers such as images and audio, some other digital entities can be used as cover media. For example, HTML (hypertext markup language) files have good potential for information hiding. While processing HTML files, the browser ignores spaces, tabs, certain characters and extra line breaks

which could be used as locations for hiding information. As another example, unused or reserved space on a disk can be used to hide information. Also, data can be hidden in unused space in file headers. Last but not least, network protocols such as TCP, UDP, and/or IP can be used to hide messages and transmit them through the network [36]. Li et al. suggest using torrent files, a crucial part of the BitTorrent P2P network, as host carriers for secret messages [37]. The authors used both Letter Case Change (LCC) and Field Reusage (FR) techniques to produce the stego-torrent files. LCC was suggested based on the knowledge that some fields in torrent files, such as the announce and announce-list fields, are case insensitive. This technique has the advantage of keeping the size of the stego-torrent file the same as that of the original torrent file, which provides a high level of transparency and security. The FR method is suggested because of the redundancy of some other fields such as comment and publisher. The advantage of this method is the ability to embed a large amount of data without arousing suspicion. Moreover, the FR method provides security by encrypting the stego-message with the Data Encryption Standard (DES). A detector is unlikely to become suspicious because a large portion of the torrent file already consists of Secure Hash Algorithm 1 (SHA1) hashed pieces. Both an embedding algorithm and an extraction algorithm are presented to embed and extract secret messages.
2.4 Existing Tools
There are many steganography tools reported in the literature. In this sub-section we present some steganography tools such as Outguess, F5, Data Stash, S-tools and wbstego4. Moreover, there are few tools reported in the literature that are used by

steganalysts. These tools are limited in their capabilities and target one or a few specific cover objects. In this sub-section we present some steganalysis tools such as Stegdetect and StegAlyzerSS.
(1) Outguess [22]: It identifies the redundant DCT coefficients that have minimal effect on the cover image, and based on this information it chooses the bits in which to embed the message. The Outguess program recompresses the image with a quality factor defined by the user, and then uses the obtained DCT coefficients to embed the message. The estimated quality factor of the image is communicated to the Outguess program in order to minimize recompression artifacts. When a message is embedded in a clean image, noise is introduced in the DCT coefficients, which increases the spatial discontinuities along the 8x8 jpeg blocks. Given a stego image, if a message is embedded in the image again, there is partial cancellation of the changes made to the LSBs of the DCT coefficients, and thus the increase in discontinuities will be smaller. This increase, or lack of increase, in the discontinuities is used to estimate the size of the message carried by a stego image.
(2) F5 [23]: It embeds messages by modifying the DCT coefficients. The main operation performed by F5 is matrix embedding, with the goal of minimizing the number of changes made to the DCT coefficients. The method takes n DCT coefficients and hashes them to k bits, where k and n are computed based on the original image as well as the secret message length. If the hash value equals the message bits, then the next n coefficients are chosen, and so on. Otherwise, one of the n coefficients is modified and the hash is recalculated. The modifications are constrained by the fact that the resulting n DCT coefficients should not have a Hamming distance of more than dmax from the