Detecting Hidden Information in Images: A Comparative Study

Yanming Di, Huan Liu, Avinash Ramineni, and Arunabha Sen
Department of Computer Science and Engineering
Arizona State University, Tempe, AZ 8587
{yanming.di, huan.liu, avinash.ramineni, arunabha.sen}@asu.edu

Abstract

When hiding information in a cover image, LSB-based steganographic techniques like JSteg change the statistical properties of the cover image. Accordingly, such information hiding techniques are vulnerable to statistical attack. Understanding steganalysis methods and their effects can help in designing methods and algorithms that preserve data privacy. In this paper, we compare three steganalysis methods for attacking LSB-based steganographic techniques: logistic regression, the tree-based method C4.5, and the popular program Stegdetect. Experimental results show that the first two methods, especially logistic regression, are able to detect hidden information with high accuracy. We also study the relationship between the number of attributes (the frequencies of quantized DCT coefficients) and the performance of a classifier.

1. Introduction

The last few years have seen a significant rise in interest in steganography among computer security researchers. Steganography [8, 9] is the art of hiding and transmitting data through apparently innocuous carriers in an effort to conceal the existence of the secret data. The term steganography in Greek means covered writing, whereas cryptography means secret writing. Steganography differs from cryptography in that the goal of a cryptographic system is to conceal the content of the messages, while the goal of information hiding or steganography is to conceal their existence. In essence, steganography camouflages a message so that it appears invisible, concealing the fact that a message is being carried at all.
Steganography provides plausible deniability for the secret communication, which cryptography does not provide. Covert information is not necessarily secure, and secure information is not necessarily covert. The goal of cryptography is the secure transfer of the secret message, whereas the goal of steganography is to make the transfer of a secret message undetectable. The understanding of steganalysis methods and their effects can help in designing methods and algorithms preserving data privacy.

(Yanming Di can also be reached at yanming di@brown.edu.)

With the increase of digital content (and the distribution of multimedia data) on the Internet, steganography has become a topic of growing interest. A number of programs for embedding hidden messages in images and audio files are available [18]. Most of these steganographic methods modify the redundant bits in the cover medium (carrier) to hide the secret messages. Redundant bits are those bits that can be modified without degrading the quality of the cover medium. Replacing these redundant bits with message bits creates a stego medium. The modification of the redundant bits can change the statistical properties of the cover medium. As a result, statistical analysis may reveal the presence of hidden content.

Detecting steganographic content is called steganalysis [7]. It is defined as the art and science of breaking the security of steganographic systems. As the goal of steganography is to conceal the existence of a secret message, a successful attack on a steganographic system consists of detecting that a certain file contains hidden information. Steganographic modifications in an image can be detected by testing its statistical properties. If the statistical properties deviate from a given norm, the image can be identified as a stego image.

In this paper we present two new methods for steganalysis. We attack the steganographic tool JSteg and attempt to break it with higher accuracy.
There have been many attempts at breaking JSteg, but their accuracy is not very high. In this paper we present two novel methods of breaking JSteg, which can further be extended to other tools like JPHide and OutGuess [18].

The rest of the paper is organized as follows. Section 2 introduces the JSteg steganographic method, the idea of the statistical attack, and some related work. Section 3 presents two new methods for attacking LSB-based steganographic methods. We also discuss how knowledge of the distribution of the DCT coefficients can help in building our models. Section 4 reports the results of experiments comparing the two new methods with the popular program Stegdetect [16]. Section 5 shows that careful model selection is needed to achieve high accuracy when using data mining and statistical methods for steganalysis. Section 6 concludes the paper.

2. Histogram Analysis

The JPEG image format [15] uses a discrete cosine transform (DCT) to transform each 8 x 8 block of source image pixels into 64 DCT coefficients. The DCT coefficients $F(u,v)$ of an 8 x 8 block of image pixels $f(x,y)$ are given by

$$F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16},$$

where $C(u) = 1/\sqrt{2}$ for $u = 0$ and $C(u) = 1$ for $u = 1, 2, \ldots, 7$ (and likewise for $C(v)$).

The coefficients are then quantized using a 64-element quantization table $Q(u,v)$ by the following operation:

$$F^{Q}(u,v) = \mathrm{round}\!\left(\frac{F(u,v)}{Q(u,v)}\right).$$

The least-significant bits (LSBs) of those quantized DCT coefficients can be used to embed hidden messages. JSteg hides data in JPEG image files by replacing the LSBs of the quantized DCT coefficients with secret message bits. Other similar steganography methods include JPHide and OutGuess.

Steganographic techniques like JSteg that modify the LSBs can be detected by analyzing the frequencies of the quantized DCT coefficient values. Modifying the least significant bit transforms one value into another that differs from it only by 1. These pairs of values are called PoVs in [19]. Westfeld and Pfitzmann [19] introduced a powerful statistical attack that can be applied to any steganographic technique in which a fixed set of PoVs are flipped into each other during the process of embedding message bits.
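As a rough illustration of how LSB replacement acts on pairs of values, the following Python sketch embeds random bits into synthetic coefficients and prints the histogram counts of two PoVs before and after embedding. It is not the JSteg implementation: the function names, the synthetic coefficient generator and its parameter are assumptions made for this example, and only the rule of skipping the values 0 and 1 is taken from the description of JSteg in this paper.

```python
import random
from collections import Counter


def jsteg_like_embed(coeffs, bits):
    """Overwrite the LSB of each usable coefficient with the next message bit.

    Coefficients equal to 0 or 1 are skipped (the rule described for JSteg).
    Embedding stops when the message bits run out.
    """
    out = list(coeffs)
    bit_iter = iter(bits)
    for i, c in enumerate(out):
        if c in (0, 1):
            continue
        bit = next(bit_iter, None)
        if bit is None:
            break
        # Python integers behave like two's complement under & and |, so
        # (-2, -1), (2, 3), (-4, -3), ... are the pairs of values (PoVs).
        out[i] = (c & ~1) | bit
    return out


def synthetic_coefficients(n, rng):
    """Hypothetical stand-in for quantized DCT coefficients: symmetric, peaked at 0."""
    return [rng.choice((-1, 1)) * int(rng.expovariate(0.4)) for _ in range(n)]


rng = random.Random(0)
cover = synthetic_coefficients(20000, rng)
message = [rng.randint(0, 1) for _ in range(len(cover))]
stego = jsteg_like_embed(cover, message)

h_cover, h_stego = Counter(cover), Counter(stego)
for a, b in [(-2, -1), (2, 3)]:
    print((a, b), "cover:", h_cover[a], h_cover[b], "stego:", h_stego[a], h_stego[b])
```

The total count within each pair is unchanged by embedding, while the two counts of a pair move toward each other; this is the change in the histogram that a statistical attack looks for.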
The insight behind this statistical attack is the observation that, if the message bits are equally distributed, modifying the least significant bits will reduce the difference in frequency between the two values of each PoV, which would otherwise be unequal with very high probability. This equalization can be detected by appropriate statistical tests. Based on this idea, Provos and Honeyman [16] carried out an extensive analysis of JPEG images using the steganalytic software Stegdetect. Stegdetect is based on the Chi-square test. It is able to detect messages hidden in a JPEG image by steganography software such as JSteg, JPHide or OutGuess. However, the Chi-square test used in [16] does not produce results with very high accuracy. In this paper, we demonstrate that the histogram information can be used more efficiently by logistic regression or by a decision tree method, C4.5 [17].

Similar work in this area is reported by Berg et al. [1], Farid [4], and Zhang and Ping [22]. Berg et al. [1] used statistical learning methods in steganalysis. The attributes used in their learning procedure are unconditional entropy, conditional entropies and transition probabilities. Farid [4] built higher-order statistical models of the images using a type of wavelet decomposition. Zhang and Ping [22] proposed an attack on the JSteg method based on a different Chi-square test. Fridrich and Goljan [5] give a very good survey of the state of the art in steganalysis.

3. Classification Methods

Distinguishing between images with and without hidden data can naturally be viewed as a classification problem. We refer to the classes as stego and normal. We use a set of images as the training data to construct classifiers. When a classification algorithm is run on a data set (stego and normal images), it needs to find a boundary between the two classes and create a model. Given a set of images, the learned model can be used to predict the class to which each image belongs.
Here we propose to apply two classification methods to detecting hidden messages in JPEG images: logistic regression and the tree-based method C4.5. We discuss next the attributes to be used in the statistical models.

3.1. Predicting Variables

As discussed earlier, LSB-based steganography techniques such as JSteg insert information into images by replacing the LSBs of the quantized DCT coefficients with the secret text. This changes the frequencies of the quantized DCT coefficient values. Therefore, the frequencies of quantized DCT coefficient values are natural candidates for predicting variables to be used in the statistical models. However, for JPEG images, the DCT coefficients can have a wide range of values. Using the frequencies of all these values to build a model is not practical. More importantly, using more variables than needed may introduce several problems. For a regression model, it may lead to ill-conditioned matrices and unstable estimates of the model parameters. In general, adding attributes that are not really important to a statistical model is like adding noise to the model, thereby degrading its performance. Therefore careful model selection is very important.

Research shows that, for normal images, the distribution of the DCT coefficients can be approximately modeled as a Laplacian or a generalized Gaussian distribution [3, 10, 14]. These models of the distribution of the DCT coefficients suggest that the frequencies of DCT coefficients with small values are more unevenly distributed. For example, the difference between the frequencies of two neighboring values close to zero is generally greater than that between two neighboring values farther from zero. So the DCT coefficient values with small magnitude are more sensitive to information inserted using a JSteg-like method. (See [16] for histograms of the DCT coefficients before and after messages are inserted into a JPEG image using a JSteg-like method.)

We use $h_v$ to denote the frequency of the DCT coefficient value $v$, i.e., the number of DCT coefficients that take the value $v$. In this study, we use the six central frequencies to build our models. We do not use the frequencies of the values 0 and 1, since the JSteg method does not modify DCT coefficients with these values. Our experiments demonstrate (results presented in Section 5) that the use of more variables does not improve the performance of the models.

3.2. Logistic Regression

Logistic models are widely used in statistical applications where binary responses (two classes) occur. In our case, an image is either stego or normal. We can assume that the probability of an image being stego is a function of some image characteristics (a vector $x$).
For example, $x$ can be the frequencies of the quantized DCT coefficient values. In the case of two classes, the logistic model has a very simple form. The probability function $p(x) = P(\text{stego} \mid x)$ is modeled by

$$\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$

where the $x_i$ are the components of the vector $x$. The model parameters $\beta_i$ are usually estimated by maximum likelihood. Note that the function $\log\frac{p}{1-p}$ is monotone, and its inverse function $\frac{e^{t}}{1+e^{t}}$ guarantees that the values of $p(x)$ are between 0 and 1. More details on logistic regression can be found in [6].

We use the S-PLUS [2] function glm [12] to estimate the probability function. As mentioned in the previous subsection, we use the six central frequencies as predicting variables. Adding more variables increases the complexity of the model but does not necessarily guarantee higher accuracy. In fact, we show in Section 5 that, in some instances, accuracy can actually deteriorate.

We include only linear terms of the variables in the model. This is based on the following argument. Take one pair of these frequencies as an example; a plot of such pairs is shown in Figure 1. The plot suggests that in a normal JPEG image, the frequency of one value of the pair is almost always higher than that of the other. This is true because the distribution of each individual DCT coefficient tends to have a mode in the center (see [3, 10, 14] for modeling of the DCT coefficients). If we have a very large image with a large number of DCT coefficients, this trend should also hold for other coefficient value pairs. However, for small images, occasionally this trend may not be seen.

[Figure 1. The frequencies of the quantized DCT coefficients for a pair of values. Black dots are for the normal JPEG images; the other symbols are for stego images with hidden messages of three different sizes.]
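The linear logistic model and its maximum-likelihood fit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper fits the model with the S-PLUS function glm, whereas the sketch below uses plain batch gradient ascent, and the function names, learning rate, and two-feature toy data (relative frequencies of one pair of values) are all hypothetical.

```python
import math


def sigmoid(t):
    """Inverse of the logit function; always returns a value in (0, 1)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict_proba(beta, x):
    """P(stego | x) under a linear logistic model; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def log_likelihood(beta, X, y):
    """Log-likelihood of labels y (1 = stego, 0 = normal) given features X."""
    total = 0.0
    for x, label in zip(X, y):
        p = predict_proba(beta, x)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Maximum-likelihood fit by batch gradient ascent (not the glm algorithm)."""
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, label in zip(X, y):
            err = label - predict_proba(beta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Hypothetical toy data: in normal images one frequency of the pair clearly
# dominates; in stego images the two frequencies are nearly equal.
X_normal = [(0.62, 0.38), (0.66, 0.34), (0.58, 0.42), (0.70, 0.30)]
X_stego = [(0.51, 0.49), (0.50, 0.50), (0.49, 0.51), (0.52, 0.48)]
X = X_normal + X_stego
y = [0] * len(X_normal) + [1] * len(X_stego)

beta = fit_logistic(X, y)
for x, label in zip(X, y):
    print(label, x, round(predict_proba(beta, x), 3))
```

Because the fitted score is linear in the two frequencies, the resulting decision boundary is a line that need not be parallel to either axis, which matches the argument for using linear terms only.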
As the images in our experiments were rather small, we chose the six central frequencies as predicting variables in the statistical models, because the trend seems to hold in this case. This important piece of extra information is not utilized by any known steganalysis method. An implication of this observation is that linear logistic models should work well in detecting JSteg-like steganographic techniques. The above discussion also suggests that the methods presented in this paper should work better for larger images.

Compared with a Chi-square test, logistic regression is more refined. Logistic regression has many advantages:

1. It is fast. Logistic regression is implemented efficiently in almost all professional statistical packages, such as S-PLUS and SAS.
2. It is easy to interpret. We can actually derive a closed-form expression for the probability function.
3. It is flexible. We can adjust one simple parameter in the model to meet different accuracy needs.

3.3. Tree-based Method

Tree-based methods can also be used. We use a tree-based model, C4.5 [17], to fit a tree structure to the training data. C4.5 builds classification models called decision trees from the training data. Each internal node in the tree specifies a binary test on a single attribute, using thresholds on numeric attributes. If, as a result of the tests conducted at internal nodes, an image ends up in a leaf node where the majority of the images are stego, it is classified as a stego image; otherwise it is classified as a normal image. The tree is constructed by the following procedure:

1. Choose the best attribute for splitting the data into groups at the root node.
2. Determine a splitting point by maximizing some specified criterion (say, information gain).
3. Recursively carry out the first two steps until the information gained by the process cannot be improved any further.

The information gained by splits can be used as the criterion for determining the attributes and the splitting points. Once the tree is constructed, it can be used for classifying the test data.

Tree-based methods can be more flexible than logistic regression. They make fewer assumptions about the data, so they can be generalized to other situations more easily. One disadvantage of the tree-based method is that the decision regions for classification are constrained to be hyper-rectangles with boundaries parallel to the input variable axes.
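The second step of the procedure above, choosing a threshold on one numeric attribute by maximizing information gain, can be sketched as follows. This is a simplified illustration rather than C4.5 itself (it omits C4.5's other refinements such as pruning), and the function names and toy values are assumptions made for the example.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_split(values, labels):
    """Return (threshold, gain) for the test `value <= threshold` with maximal gain."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical attribute values for three normal (0) and three stego (1) images.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

Each such test compares a single attribute with a constant, which is why the decision regions of the resulting tree are axis-parallel hyper-rectangles.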
As in the case of the logistic model, the training data in this case also consist of the six central frequencies of quantized DCT coefficient values of the images, used as attributes or predicting variables. We use the data mining tool WEKA [20] to run the C4.5 algorithm on the training data.

4. Experiments and Results

We have a data set of 80 normal JPEG images. The images were downloaded from the Internet. All the images have been cropped to a common size. We used the JSteg method to insert text messages of three different sizes into the images. According to the author of JSteg, the maximum size of the message that can be inserted in a cover image is approximately a fixed fraction of the size of the image file. For some image files, this limit is no more than a few hundred bytes. The secret message used in our experiments for insertion in cover images was taken from Gutenberg's Etext of Shakespeare's First Folio.

We use 10-fold validation to compare the three steganalysis methods: logistic regression, the tree-based method C4.5, and Stegdetect. In each of our experiments, we take the original 80 JPEG images and one group of 80 images with embedded text messages. The three message sizes were embedded in the cover images in experiments 1, 2 and 3 respectively, with the largest messages in experiment 1 and the smallest in experiment 3.

[Figure 2. Boxplots of the error rates of logistic regression, the tree-based method and Stegdetect. The left three bars are error rates for experiment 1, the middle three are for experiment 2, and the right three are for experiment 3. Smaller values are better.]

The 10-fold validation results are summarized in Tables 1, 2, 3 and Figure 2. From the results, we can see that the logistic regression method performs better than the other two methods in all three experiments. The tree-based method C4.5 performs better than the Stegdetect method in experiment 1. The performance of the latter two methods is similar in experiment 2.
In experiment 3, however, both of these methods fail to perform better than random guess.

The performance of the logistic regression method is noteworthy. Even when only the smallest messages are embedded in the images (experiment 3), the logistic regression method is able to perform better than random guess, whereas the other two methods perform no better than random guess.

Table 1. Error rates for the logistic regression, the tree-based method C4.5 and the Stegdetect method in experiment 1 (the stego images contain text messages embedded using JSteg). Mean and standard deviation of the error rates for each method are shown at the bottom of the table.

run     logistic    tree        stegdetect
1       0.000000    0.055556    0.94444
2       0.000000    0.055556    0.05556
3       0.07778     0.08        0.
4       0.000000    0.055556    0.50000
5       0.000000    0.000000    0.
6       0.000000    0.07778     0.94444
7       0.07778     0.94444     0.66667
8       0.000000    0.07778     0.
9       0.000000    0.07778     0.
10      0.000000    0.07778     0.77778
mean    0.005556    0.055556    0.94444
stdev   0.07        0.05990     0.07056

Table 2. Error rates for the logistic regression, the tree-based method C4.5 and the Stegdetect method in experiment 2 (the stego images contain text messages embedded using JSteg). Mean and standard deviation of the error rates for each method are shown at the bottom of the table.

run     logistic    tree        stegdetect
1       0.055556    0.7778      0.94444
2       0.08        0.0         0.50000
3       0.          0.0         0.66667
4       0.055556    0.8889      0.50000
5       0.055556    0.          0.94444
6       0.000000    0.94444     0.8889
7       0.08        0.444444    0.50000
8       0.055556    0.66667     0.
9       0.08        0.05556     0.50000
10      0.000000    0.08        0.
mean    0.058       0.8778      0.5000
stdev   0.0574      0.09894     0.054700

Table 3. Error rates for the logistic regression, the tree-based method C4.5 and the Stegdetect method in experiment 3 (the stego images contain text messages embedded using JSteg). Mean and standard deviation of the error rates for each method are shown at the bottom of the table.

run     logistic    tree        stegdetect
1       0.88889     0.57778     0.555556
2       0.05556     0.555556    0.57778
3       0.444444    0.555556    0.47
4       0.500000    0.6         0.46667
5       0.88889     0.57778     0.444444
6       0.88889     0.58        0.05556
7       0.          0.46667     0.68889
8       0.46667     0.58        0.555556
9       0.444444    0.507778    0.500000
10      0.          0.6         0.555556
mean    0.94444     0.548000    0.497
stdev   0.05970     0.057986    0.09008
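The 10-fold validation protocol behind the per-run error rates, means and standard deviations reported in the tables can be sketched as follows. The helper name, the shuffling seed, and the toy midpoint classifier are assumptions made for illustration; any pair of fit and predict functions (for example a logistic model or a decision tree) could be plugged in instead.

```python
import random
import statistics


def k_fold_error_rates(X, y, fit, predict, k=10, seed=0):
    """Return one test error rate per fold for the given fit/predict functions."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test sets
    rates = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors = sum(predict(model, X[i]) != y[i] for i in test_idx)
        rates.append(errors / len(test_idx))
    return rates


def fit_midpoint(X, y):
    """Toy classifier: threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, label in zip(X, y) if label == 0)
    mean1 = statistics.mean(x for x, label in zip(X, y) if label == 1)
    return (mean0 + mean1) / 2


def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0


# Hypothetical, well-separated one-dimensional data.
X = list(range(20)) + list(range(100, 120))
y = [0] * 20 + [1] * 20
rates = k_fold_error_rates(X, y, fit_midpoint, predict_midpoint)
print("mean:", statistics.mean(rates), "stdev:", statistics.stdev(rates))
```

Each image is used for testing exactly once, so the mean of the ten fold error rates estimates the error on unseen images and the standard deviation indicates how much that estimate varies between runs.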
In the tree-based method, the boundaries of the decision regions are constrained to be parallel to the input variable axes. However, as can be observed in Figure 1, the true boundary between the normal and the stego images in terms of the attributes is not parallel to the input variable axes. For this reason the performance of the tree-based method is not as good as that of logistic regression.

Our chosen methods do not rely on knowledge of the locations where the information is hidden. As such, they can be effectively utilized to break similar LSB-based methods that use random bit selection, e.g., OutGuess.

5. On the number of predicting variables

We indicated in Section 3 that the use of excessive variables may not lead to better results. We illustrate this phenomenon with the help of results from our experiments. In Figure 3, we present the results of our experiments where a varying number of predicting variables (2, 4, 6, ..., 20) were used instead of just the 6 central frequencies. The estimated error rates for the logistic regression method using 2, 4, 6, ..., 20 variables (10-fold cross validation) are summarized in Figure 3. The figure indicates that, using only the center two frequencies, it may not be possible to capture all the information in the data. Increasing the number of variables beyond two improves the accuracy. However, using more variables than the six we selected does not improve the accuracy but causes slightly larger variance in the estimated error rates. We observe similar trends in the tree-based methods. In our experiments, therefore, we use a model with 6 predicting variables.

[Figure 3. Comparison of logistic models with different numbers of predicting variables. Each panel plots error rates against the number of variables used; panels (a), (b) and (c) correspond to the three sizes of hidden messages in the stego images.]

More sophisticated feature selection algorithms can be found in the literature [6, 11, 13, 21]. We intend to explore whether applying these feature selection algorithms can lead to further performance improvement.

6. Conclusion

LSB-based steganographic techniques like JSteg change the statistical properties of the cover image when they embed a secret message in the image. Accordingly, such methods are vulnerable to statistical attack. Previous methods such as Stegdetect are based on the Chi-square test. The accuracy of Stegdetect can be improved. When the size of the hidden message is small, it performs no better than random guess. In this paper we have proposed two new steganalysis methods, based on logistic regression and the tree-based method C4.5, for attacking LSB-based steganographic techniques. We conducted experiments to evaluate the performance of the two data mining techniques and compared them with the performance of the well-known method Stegdetect. The experiments demonstrated that the performance of the logistic regression based technique is very impressive. When a large amount of information is hidden, it can detect it with very high accuracy. Even when the amount of hidden information is very small, it performs better than random guess.

The tree-based method C4.5 outperforms Stegdetect in the experiment where a relatively large amount of information is hidden. However, it does not perform well when the amount of hidden information is small. We suggest that one reason for C4.5 not performing as well as logistic regression is that it tends to produce boundaries that are parallel to the input variable axes, which in this case may not be appropriate.

We also pointed out that the number of attributes used in classification can be related to a classifier's performance. Many present steganalysis methods do not consider this a serious problem. Hence they tend to use all attributes that are related. However, using more variables than needed does not necessarily lead to good performance and may even significantly degrade the performance of a statistical learning model. When selecting the predicting variables for our model, we take the distribution of the DCT coefficients into consideration.

The understanding of steganalysis methods and their effects can help in designing methods and algorithms preserving data privacy. Our experiments were carried out to break methods like JSteg that are employed to hide information. Our methods do not rely on the placement of the hidden information. Therefore they can be used without any modification on LSB-based steganographic techniques that use random bit selection.

7. Acknowledgements

The authors would like to thank Sidi Goutam and Amit Mandvikar for their help in this project. The authors also wish to thank the reviewers for their helpful comments in the preparation of this manuscript.

References

[1] G. Berg, I. Davidson, M.-Y. Duan, and G. Paul. Searching for hidden messages: automatic detection of steganography. In 15th AAAI Innovative Applications of Artificial Intelligence (IAAI) Conference, 2003.
[2] J. M. Chambers and T. Hastie, editors. Statistical models in S. London: Chapman & Hall, 99.
[3] R. J. Clarke. Transform Coding of Images. London: Academic Press, 1985.
[4] H. Farid.
Detecting hidden messages using higher-order statistical models. In International Conference on Image Processing (ICIP), Rochester, NY, 2002.
[5] J. Fridrich and M. Goljan. Practical steganalysis of digital images - state of the art. In Proc. SPIE Photonics West, Vol. 4675, Electronic Imaging 2002, Security and Watermarking of Multimedia Contents, San Jose, California, January 2002, pp. 1-13.
[6] T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learning: data mining, inference, and prediction: with 200 full-color illustrations. New York: Springer-Verlag, 2001.
[7] N. F. Johnson and S. Jajodia. Steganalysis of images created using current steganography software. In D. Aucsmith, editor, Information Hiding: Second International Workshop, volume 1525 of Lecture Notes in Computer Science, pages 273-289. Springer-Verlag, Berlin, Germany, 1998.
[8] D. Kahn. The Codebreakers: The Story of Secret Writing. Scribner, New York, New York, U.S.A., 1996.
[9] D. Kahn. The history of steganography. In R. J. Anderson, editor, Information Hiding, First International Workshop, volume 1174 of Lecture Notes in Computer Science, pages 1-5. Springer-Verlag, Berlin, Germany, 1996.
[10] E. Y. Lam and J. W. Goodman. A mathematical analysis of the DCT coefficient distributions for images. IEEE Transactions on Image Processing, 9(10):1661-1666, 2000.
[11] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery & Data Mining. Boston: Kluwer Academic Publishers, 1998.
[12] P. McCullagh and J. A. Nelder. Generalized linear models (Second edition). London: Chapman & Hall, 1989.
[13] A. Miller. Subset Selection in Regression. Chapman & Hall/CRC, 2nd edition, 2002.
[14] F. Müller. Distribution shape of two-dimensional DCT coefficients of natural images. Electronics Letters, 29(22):1935-1936, 1993.
[15] W. B. Pennebaker and J. L. Mitchell. JPEG Still Image Data Compression Standard. Van Nostrand Reinhold, New York, NY, USA, 99.
[16] N. Provos and P. Honeyman. Detecting steganographic content on the Internet. Technical report, CITI, 2001.
[17] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[18] Steganography software. www.stegoarchive.com, 997-00.
[19] A. Westfeld and A. Pfitzmann. Attacks on steganographic systems, 1999.
[20] I. Witten and E. Frank. Data Mining - Practical Machine Learning Tools and Techniques with JAVA Implementations. Morgan Kaufmann Publishers, 2000.
[21] L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In T. Fawcett and N. Mishra, editors, Proceedings of the 20th International Conference on Machine Learning (ICML-03), August 21-24, 2003, pages 856-863, Washington, D.C., 2003. Morgan Kaufmann.
[22] T. Zhang and X. Ping. A fast and effective steganalytic technique against JSteg-like algorithms. In ACM Symposium on Applied Computing, March 9 to 12, 2003, Florida, USA, 2003.