BYTE FREQUENCY ANALYSIS DESCRIPTOR WITH SPATIAL INFORMATION FOR FILE FRAGMENT CLASSIFICATION

HongShen Xie 1, Azizi Abdullah 2, Rossilawati Sulaiman 3
School of Computer Science, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, MALAYSIA
1 xie2876@126.com 2 azizi@ftsm.ukm.my 3 rossilawati@gmail.com

Abstract

Digital forensics is generally concerned with recovering and investigating material found on digital devices such as PCs and mobile phones. Examining information and extracting evidence from such devices is not an easy task. In data recovery, for example, success depends heavily on how well a method understands the content of a document: the better the system understands the content, the more effective it is at recovering the desired documents. One of the challenging issues in recovering documents is determining the type of a file fragment from an incomplete document structure. One possible solution is statistical analysis, such as byte frequency analysis, for feature description. Byte frequency analysis computes a global descriptor that gives the statistical distribution of byte values in a file fragment. However, the resulting global histogram, used as the input vector for a machine learning classifier such as a support vector machine, is insensitive to small changes in the file fragment content. In addition, it does not include any spatial information and is liable to false positives, especially on large datasets. Therefore, byte frequency analysis with a circular representation is proposed, in which a set of file fragments is divided into several blocks using a fixed partitioning scheme. Then, for each block, the lower-level byte frequency analysis descriptor is used to represent the partition.
Finally, all features are combined into one large input vector for a machine learning classifier. We performed experiments on 10 file categories at three resolutions (level 0, level 1, and level 2) and on combinations of these resolutions. The results show that the proposed method slightly outperforms the single byte frequency analysis distribution.

Keywords: digital forensics, byte frequency analysis, support vector machine, spatial information, circular scheme

Organized by WorldConferences.net 249

1. Introduction

One of the important tasks in digital forensics is file type classification. File fragment type classification is significant in data carving, which contributes to data recovery before a recovery technique is applied. Approaches to classifying file fragments are termed extension-based, magic-bytes-based, and content-based methods (Amirani, Toorani, & Mihandoost, 2013). Content-based classification is the most challenging, because it applies when the metadata has been lost or corrupted. Even so, the fragment type can still be predicted, because different file types have different statistical feature distributions.

In previous work, statistical methods have been widely applied to file fragment classification, such as byte frequency analysis (McDaniel & Heydari, 2003), file entropy (Calhoun & Coles, 2008), and the standard deviation of the byte frequencies (Axelsson, 2010). Some researchers have obtained promising results. Recently, supervised (Li, Ong, Suganthan, & Thing, 2010) and unsupervised (Axelsson, 2010) machine learning techniques have been widely used to improve file fragment classification. These solutions typically select file fragment features such as BFA, entropy, and compression distance. Once the features are selected, classification becomes a standard machine learning problem. Research findings show that machine learning performs well (Fitzgerald, Mathews, Morris, & Zhulyn, 2012; Beebe, Maddox, Liu, & Sun, 2013). However, few techniques use spatial information to describe file fragments.

In this paper, we apply spatial information to enrich file fragment features. We propose a content-based type classification method that combines spatial information with a supervised support vector machine (SVM) for automatic feature extraction. The method applies statistical analysis of the byte frequencies of the file fragment, so its accuracy does not rely on metadata but on the data values themselves. The extracted features are then passed to a classifier for file fragment type classification. Tables 4, 5, and 6 show that the proposed method gives promising results on both binary and textual file fragments, when processing fragments of random size and using a multi-dimensional support vector machine classifier.

2. Related Works

Previous work in the literature has applied both machine learning and non-machine-learning techniques to the problem of file fragment classification. McDaniel and Heydari (2003) proposed three algorithms, byte frequency analysis (BFA), byte frequency cross-correlation (BFC), and file header/trailer (FHT), to construct characteristic fingerprints that identify different file types. The BFA algorithm calculates the frequency distribution of each file type by counting the number of occurrences of each byte value.
The BFC algorithm captures the relationship between byte value frequencies to strengthen file type identification; the correlation strength is determined by the average difference in their frequencies. The FHT algorithm focuses on the byte distribution of the file header and file footer. The experiment covered 30 file types, with 4 sample files per type. The classification accuracies were approximately 27.5%, 45.83%, and 95.83% for the BFA, BFC, and FHT algorithms, respectively. However, the approach achieves high accuracy only when it can rely on the header information contained in the fragment. Hence, it is not applicable in the many situations where a fragment does not include header information.

Karresand and Shahmehri (2006b) proposed a file type identification method named Oscar. They built centroids from the mean and standard deviation of the byte frequency distribution of different file types. A weighted quadratic distance metric was used to measure the distance between the centroids and the test data fragments, so as to identify JPEG fragments. In addition, the detection capability of Oscar was enhanced by taking into account that byte 0xFF is only allowed in combination with a few other bytes (i.e., 0x00 and 0xD0..0xD7) within the data part of JPEG files. Using a test data set of [...] KB blocks, the classification accuracy was 97.9%. The authors extended the Oscar method (Karresand & Shahmehri, 2006a) by incorporating byte ordering information, calculated as the rate of change of the data bytes. Using 72.2 MB of test data, the classification accuracy for JPEG fragments was 99% with no false positives. However, for Windows executables, the false positive rate increased tremendously and exceeded the detection rate. The detection rate for zip files was between 46% and 84%, with false positive rates in the range of 11% to 37%.
Calhoun and Coles (2008) applied the Fisher linear discriminant to a set of different statistics for four file types (JPG, BMP, GIF, and PDF). Specifically, the experiment compared JPG vs. PDF, JPG vs. GIF, and PDF vs. BMP. The test set contained 50 fragments for each file type. After data preprocessing, the combined ASCII-Entropy-Low-High-Modesfreq-Sdfreq statistical feature achieved the highest average accuracy, 88.3%. This result is partly explained by the fact that only four file fragment types were considered and that a one-vs-one model was used; there was no attempt at multi-type classification. Testing in this way gives less chance of misclassification, and the results should be interpreted together with the false positive or true negative rates. Finally, the authors noted that their methodology would need to be modified to avoid the situation where the method fails and all fragments are classified as one type, yet the reported accuracy is still high.

Veenman (2007) extracted three features from file content, including byte frequency and complexity. After these features were retrieved, the author applied linear discriminant analysis to classify file fragments. In his experiments, type-x and type-all correspond to binary and multi-class classification, respectively. The author used a large private data set consisting of between 3000 and [...] fragments per file type, for 11 file types. This method achieved an average classification accuracy of 45%. However, it performed poorly on compressed files, with only 18% accuracy and 80% false positives. The prediction accuracy decreased as the number of clusters increased. A possible reason is that compressed files are not sensitive to the increase in cluster content.

Li et al. (2010) applied a support vector machine (SVM) to the classification of high entropy file fragments based on byte frequency analysis. In this experiment, the training data covered five file types (DLL, MP3, EXE, JPEG, and PDF). For each type, 800 files were used for training and 80 files for testing. Each file was split into 4096-byte fragments, and the first and last fragments were discarded to ensure that header and footer metadata and any file padding were excluded. They achieved an average classification accuracy of 81.5%, which is an acceptable result. However, because the experiment covered only five file types, the results are not comprehensive; the dataset should be extended to more file fragment types that also have high entropy values. A further drawback is that the data set is private, which makes it difficult for other researchers to extend the experiment.

Fitzgerald et al. (2012) applied an SVM with techniques from natural language processing, using byte frequency bigram counts, entropy of bigram counts, Hamming weight, and compressed length as features.
The data set was drawn from a public dataset (GovDocs, 2009). They selected up to 4000 fragments and divided them in a ratio of approximately 9:1 into training and test sets for the SVM. The reported average prediction accuracy is 48.3%, and the prediction accuracy increased as the number of file fragments increased. The result is promising. However, the paper does not say which feature is the most powerful for classifying file fragments.

Sportiello and Zanero (2011) proposed a method that combines multiple features. These features include Mean Byte Value, Entropy (E), and Complexity (C), together with some specialized features such as the distribution of ASCII character codes, which characterizes text-based file types, and Rate of Change, which is a good classifier for JPEG files. The corpus consisted of nine file types downloaded from the internet, decomposed into a total of [...] blocks of 52 bytes for each file type. The experiment shows that the E-C-byte frequency combination is usually a powerful feature vector for achieving high accuracy. Feature vectors based on Complexity, and on Complexity combined with byte frequency, also achieve high classification accuracy. The true prediction rate ranges from 71.1% to 98.1%, and the false prediction rate ranges from 3.6% to 32.1%. The contribution of the experiment is that it clearly explores which combinations of features are useful for file block classification. However, no multi-class classification was attempted: for each file type, a separate SVM model was created to classify a fragment type against each of the other types individually. There is also no confusion matrix and no mention of false negative results; the table of results is arranged by fragment type, feature, and feature parameter (c and r). This makes it difficult to compare the results with previous research.
Gopal, Yang, Salomatin, and Carbonell (2011) considered several file corruption scenarios, of which Type 3 corresponds to file fragment classification. They evaluated the performance of several statistical classification methods (support vector machines with n-gram features, and k-nearest neighbors) and several commercial off-the-shelf solutions (including Libmagic, TrID, Outside-In, and DROID) in classifying files under these scenarios. They used a public dataset (GovDocs, 2009) consisting of over [...] files of 316 file types; the paper does not mention whether compressed or encrypted files were included. The results show that file fragments are more difficult to classify than complete files. Performance on file fragments was reported for the SVM only, which achieved 40% accuracy according to the Macro-F1 measure. However, the true file types were taken to be those reported by Libmagic through the Linux file command, without taking metadata such as the file header or footer into consideration, which may have biased the experimental results.

3. Proposed Method

The proposed scheme includes two phases: the training phase and the testing phase. During the training phase, the support vector machine is trained and the system is initialized using sample data with known types. The output is a fileprint for each file type. In the testing phase, the trained support vector machine and the fileprints are used for fragment type testing. We assume that each input file is automatically partitioned into blocks of 4096 bytes. After splitting, we remove the first fragment, because it includes header metadata, which normally affects classification accuracy. We also delete the last fragment, because it includes footer metadata. When a real-world file fragment is fed into the system, it is converted to 4 KB-based fragments in the same way. After data preprocessing and BFA extraction, the remaining fragments are divided into several blocks using the $2^i$ partitioning scheme. The circular scheme corresponds to level 0, level 1, and level 2 when $i$ equals 0, 1, and 2. In this study we focus on multi-class classification using a one-versus-one model. Figure 2 (a), (b), and (c) show the circular scheme constructed from 50 file fragments at levels 0, 1, and 2.

3.1 Byte Frequency Analysis (BFA)

In file fragment classification, the BFA histogram is one of the most important file fragment features (Veenman, 2007). BFA is an algorithm that calculates the byte frequency distribution over eight-bit (unigram count) or sixteen-bit (bigram count) numbers representing the numeric values in a file.
However, a drawback of the current BFA algorithm, whether it uses unigram or bigram counts, is that the relationship between different fragment blocks is missing, so the algorithm fails to capture enough fragment information. This is observed especially for high entropy file fragments, or when different file fragments have similar byte frequency distributions. Therefore, in this experiment we apply spatial information to enrich the feature vector. First, we count single bytes (unigram count) to extract the byte frequency distribution; each byte is an eight-bit number representing a value from 0 to 255 inclusive. By counting the number of occurrences of each byte value in a file, a frequency distribution is obtained. Different file types have consistent patterns in their frequency distributions (McDaniel & Heydari, 2003). Furthermore, a connection between fragments is constructed by extracting neighborhood fragment information. A brief description of the BFA algorithm is presented in Table 1.
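The unigram counting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the normalization to relative frequencies in `bfa_distribution` is an assumption, since the text only describes counting occurrences.

```python
from collections import Counter


def bfa_histogram(fragment: bytes) -> list[int]:
    """Return a 256-bin count of the byte values found in the fragment."""
    counts = Counter(fragment)  # iterating over bytes yields ints in 0..255
    return [counts.get(value, 0) for value in range(256)]


def bfa_distribution(fragment: bytes) -> list[float]:
    """Relative byte frequencies (assumed normalization); zeros if empty."""
    hist = bfa_histogram(fragment)
    total = len(fragment)
    return [count / total for count in hist] if total else [0.0] * 256


# Toy fragment: three 'A' bytes (65), one 'B' byte (66), one zero byte.
example = b"AAAB\x00"
hist = bfa_histogram(example)
dist = bfa_distribution(example)
```

A real fragment in this paper's setting would be a 4096-byte block rather than the five-byte toy input used here.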

Table 1: BFA algorithm definition

BFA algorithm: calculation of a fragment histogram
    Create an array histogram with $2^n$ elements (n = 8 bits)
    For all byte values i do
        histogram[i] = 0
    End for
    For all data bytes x do
        Increment histogram[f(x)] by 1
    End for

3.2 BFA with Circular Scheme

Figure 1: A 28 KB file fragment built from seven file fragments at level 1, with mirroring of the fragment at its border.

Figure 2: A circular scheme constructed from eight blocks at level 1.

In the proposed method, we build the file fragment representation by partitioning the file fragment. Partitioning is a process that captures the spatial information of file fragments. However, the difficulty with partitioning is obtaining an even distribution for calculating the neighborhood information of file fragments. Therefore, we introduce the circular scheme to obtain a holistic view of the neighborhood information. The idea is to borrow information from the partition on the opposite side of the file fragment. The algorithm that implements the circular scheme is presented in Table 2.

Table 2: Definition of the circular scheme under the mod algorithm

ALGORITHM: Circular scheme under mod algorithm
    Let M be the number of fragments, M = n*P + r, where P is the number of partitions
    (P = 1, 2, 4 when level i = 0, 1, 2) and r is the number of fragments left undivided (r < n).
    If M is an odd number then
        M = M + 1 (i = 1); M = M + 1, M = M + 2, or M = M + 3 (i = 2)
    Else if M is an even number then
        M = M (i = 1); M = M or M = M + 2 (i = 2)
    End if

3.3 BFA with Spatial Level Circular Scheme

We introduce the spatial level circular scheme to enrich the spatial byte frequency features by looking at several resolutions of partitions in the circular scheme representation, as shown in Figure 3 (a) to (c). Each file fragment is divided into a sequence of increasingly finer spatial partitions by repeatedly doubling the number of BFA distributions in the circular scheme.

Figure 3: Spatial level circular scheme representation, showing the circular scheme constructed at levels 0, 1, and 2. Different levels give different numbers of BFA distributions. (a) Level 0 uses a single BFA histogram distribution, (b) level 1 uses two BFA histogram distributions, and (c) level 2 uses four BFA histogram distributions.

Before spatial information is applied to extract features, the data must be pre-processed. First, we use the $2^i$ scheme to partition the file fragment, in order to construct a multi-resolution description of it. The number of partitions $P$ is

$$P = 2^{i}, \quad i = 0, 1, 2 \tag{1}$$
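As a rough sketch of how the circular scheme and Eq. (1) might fit together, the Python below pads a list of fragments up to a multiple of P = 2^i and computes one BFA histogram per partition. Several points are assumptions rather than details given in the text: the padding fragments are borrowed by wrapping around to the start of the list (one reading of "borrowing from the opposite side"), the fragment count is simply rounded up to the next multiple of P, and all function names are hypothetical.

```python
def bfa_histogram(data: bytes) -> list[int]:
    """Count occurrences of each byte value (0..255)."""
    hist = [0] * 256
    for value in data:
        hist[value] += 1
    return hist


def circular_pad(fragments: list[bytes], partitions: int) -> list[bytes]:
    """Extend the fragment list to a multiple of `partitions`.

    Assumption: missing fragments are borrowed by wrapping around to the
    start of the list, which is one interpretation of the circular scheme.
    """
    if not fragments:
        return []
    missing = (-len(fragments)) % partitions
    borrowed = [fragments[j % len(fragments)] for j in range(missing)]
    return fragments + borrowed


def level_histograms(fragments: list[bytes], level: int) -> list[list[int]]:
    """Return P = 2**level BFA histograms, one per contiguous partition."""
    partitions = 2 ** level
    padded = circular_pad(fragments, partitions)
    per_block = len(padded) // partitions
    return [
        bfa_histogram(b"".join(padded[p * per_block:(p + 1) * per_block]))
        for p in range(partitions)
    ]


# Seven toy fragments; fragment j contains only the byte value j.
fragments = [bytes([j]) * 4 for j in range(7)]
level1 = level_histograms(fragments, 1)  # 7 fragments padded to 8, 2 blocks
level2 = level_histograms(fragments, 2)  # 7 fragments padded to 8, 4 blocks
```

With seven input fragments, level 1 yields two histograms over four fragments each, and the last block contains the borrowed first fragment, mirroring the seven-to-eight construction described for Figures 1 and 2.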

In Eq. (1), $i$ denotes the level, from level 0 to level 2. The spatial information layout approach uses the fixed partitioning scheme ($2^i$) to construct multiple spatial resolution levels in the file fragment. The histogram in each partition captures spatial information about the fragment. A BFA vector is computed for each grid cell at each resolution level, and the final BFA descriptor for the fragment is the concatenation of all BFA vectors. In forming the multi-resolution BFA, the grid at level $l$ has $2^l$ cells along its single dimension. Consequently, level 0 is represented by a $k$-vector corresponding to the $k$ bins of the histogram, level 1 by a $2k$-vector, and so on, and the combined BFA descriptor of the file fragment is a vector with dimensionality

$$k \sum_{l=0}^{L} 2^{l} = k\,(2^{L+1} - 1) \tag{2}$$

For example, for levels up to $L = 1$ and $k = 256$ bins, the descriptor is a 768-vector. In this study we limit the number of levels to $L = 2$ to prevent overfitting.

4. Experimental Setup and Results

4.1 SVM Classifiers

The support vector machine (SVM) is a machine learning algorithm that is very useful for solving classification problems (Hsu, Chang, & Lin, 2003). In this study, we apply the radial basis function (RBF) kernel in developing the fileprint, and we use the one-vs-one approach to train the classifier and classify file fragments. Initially, all attributes in the training and testing sets are normalized to the interval [-1, +1] using

$$X = \frac{2\,(x - \min)}{\max - \min} - 1 \tag{3}$$

Normalization scales the data into the small interval [-1, 1] and is key to better classification performance. It avoids numerical difficulties during calculation and ensures that the largest values do not dominate the smaller ones.

For a C-SVM, a parameter $C$ is introduced. This parameter is intended to handle misclassification: it lessens the training error rate (penalty) while maximizing the margin between two classes.
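The scaling of Eq. (3) can be sketched as follows. This is a minimal illustration with hypothetical function names. Two choices here are assumptions that the text does not state: the minimum and maximum are taken from the training data and reused for the test data, and a constant feature (max equal to min) is mapped to 0 to avoid division by zero.

```python
def fit_range(train_column: list[float]) -> tuple[float, float]:
    """Return (min, max) of one feature, computed on the training data."""
    return min(train_column), max(train_column)


def scale(x: float, lo: float, hi: float) -> float:
    """Apply Eq. (3): map x from [lo, hi] to [-1, +1]."""
    if hi == lo:
        return 0.0  # assumption: a constant feature carries no information
    return 2 * (x - lo) / (hi - lo) - 1


# Toy example for a single histogram bin across three training fragments.
train_bin = [0.0, 5.0, 10.0]
lo, hi = fit_range(train_bin)
scaled_train = [scale(x, lo, hi) for x in train_bin]
scaled_test = scale(2.5, lo, hi)  # a test value scaled with the training range
```

Reusing the training range for the test data keeps both sets on the same scale; a test value outside the training range will fall slightly outside [-1, +1].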
Misclassification can occur because some positive samples may be biased toward the negative class, while some negative samples may be biased toward the positive class. To optimize classification performance, the kernel parameters are determined using the libsvm grid-search algorithm. Different values of $C$ and $\gamma$ are tried to find the best accuracy. We tried the values $\{2^{-5}, 2^{-3}, \ldots, 2^{15}\}$ and $\{2^{-15}, 2^{-13}, \ldots, 2^{3}\}$ for $C$ and $\gamma$, respectively, and selected the pair that gave the best accuracy on the training set. The training files are thus used both to build the classifier and to obtain the optimal learning parameters $C$ and $\gamma$.

4.2 Dataset

For the experimental evaluation, we downloaded two categories of data. The first category is binary files: 400 JPEG images, 400 MP3 music files, 400 PDF documents, 400 dynamic link library files (DLL), and 400 Microsoft Windows executables (EXE). The second category is textual files: 400 comma-separated values files (CSV), 400 Extensible Markup Language files (XML), 400 Hypertext Markup Language files (HTML), 400 log files (LOG), and 400 text files (TXT). All of the files were randomly downloaded from a public dataset, which is available at [...]. The specification of the data set is shown in Table 3.

Table 3: Specification of the dataset used in the experiments

File type    File size range    Number of files
DLL          18KB-13287KB       400
EXE          18KB-11287KB       400
MP3          18KB-14547KB       400
JPEG         18KB-3247KB        400
PDF          18KB-16297KB       400
CSV          18KB-4287KB        400
HTML         18KB-3288KB        400
LOG          18KB-2657KB        400
TXT          18KB-3998KB        400
XML          18KB-2979KB        400

4.3 Results

In this study, we focus on three types of experiment. The first is based on binary file fragments, the second on textual file fragments, and the last on the combination of binary and textual file fragments. In each experiment, we randomly downloaded 400 files for each type and used 200 files for training and 200 files for testing. Each experiment was repeated five times. We first used splitting software to divide each file into a set of 4096-byte partitions. We chose 4096 bytes because this size is widely used as the cluster size in typical file systems. We then removed the first and the last fragment of each file from the dataset, because the file header and footer contain important file information (for example, file type, file creation time, and file storage location) that would distort the classification results.

The results are presented in three parts. The first gives the average classification accuracy for binary fragments, the second for textual fragments, and the last for the combination of binary and textual fragments. The average classification accuracy for binary file fragments is (dll + exe + jpg + mp3 + pdf) / 5. (Note: Spatial 1 is level 0 + level 1; Spatial 2 is level 0 + level 1 + level 2.) The average accuracy for textual file fragments is (log + xml + html + csv + text) / 5, and the average classification accuracy for the combination of binary and textual file fragments is (dll + exe + jpg + mp3 + pdf + csv + html + log + txt + xml) / 10. All results were obtained by repeating the experiment five times.
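The fragmenting step described above (split each file into 4096-byte pieces, then drop the first and last piece) can be sketched in Python as follows. The function names are hypothetical, and the handling of a short final piece is an assumption: here the last piece is dropped whether or not it is a full 4096 bytes.

```python
FRAGMENT_SIZE = 4096  # bytes, the cluster size used in the experiments


def split_into_fragments(data: bytes, size: int = FRAGMENT_SIZE) -> list[bytes]:
    """Cut the file content into consecutive pieces of at most `size` bytes."""
    return [data[start:start + size] for start in range(0, len(data), size)]


def payload_fragments(data: bytes, size: int = FRAGMENT_SIZE) -> list[bytes]:
    """Drop the first (header) and last (footer) fragment of a file."""
    return split_into_fragments(data, size)[1:-1]


# Toy file content: four full fragments followed by a 4-byte tail.
file_content = bytes(range(256)) * 64 + b"tail"
all_fragments = split_into_fragments(file_content)
kept = payload_fragments(file_content)
```

Files with fewer than three fragments contribute nothing under this rule, which is consistent with the 18 KB minimum file size listed in Table 3.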
Table 4: Average classification accuracy and per-type results. Columns: Level 0, Level 1, Level 2, Spatial 1, Spatial 2. Rows (fragment types): DLL, EXE, JPEG, MP3, PDF, Average accuracy. [...]

Table 5: Average classification accuracy and per-type results. Columns: Level 0, Level 1, Level 2, Spatial 1, Spatial 2. Rows (fragment types): CSV, HTML, LOG, TXT, XML, Average accuracy. [...]

Table 6: Average classification accuracy and per-type results. Columns: Level 0, Level 1, Level 2, Spatial 1, Spatial 2. Rows (fragment types): DLL, EXE, JPEG, MP3, PDF, CSV, HTML, LOG, TXT, XML, Average accuracy. [...]

5. Discussions and Conclusion

Looking at the true positive rates, it is clear that no single level scheme is definitively best as more spatial information is added (that is, as the dimensionality of the vector increases). Our hypothesis was that adding more spatial information would allow better classification, because a richer spatial distribution should let the classifier better represent the characteristic features of each fragment type. In the best case, since the spatial information constructs the neighbourhood connection, we would expect the true positive rate to increase level by level. Only a small number of results actually show such behaviour: the DLL, JPEG, and MP3 file fragments in the binary experiment (Table 4), and the CSV and TXT file fragments in the textual experiment (Table 5). In the combined binary and textual experiment, the DLL, JPEG, MP3, CSV, TXT, and XML fragments show significant results (Table 6).

In stark contrast to our expectations, the results also show cases where classification actually deteriorated as the dimensionality of the vector increased. This can be seen to some extent for the EXE and PDF fragment types in the binary experiment; for the HTML, LOG, and XML fragment types in the textual experiment; and for the EXE, PDF, HTML, and LOG fragment types in the combined binary and textual experiment.
The goal of this research was to investigate whether techniques from spatial information layout could be applied successfully to file fragment classification. We found that this is indeed the case. However, the prediction accuracy was not as high as we expected, especially in the textual file fragment experiment. A possible reason is that the unigram count based on 8-bit bytes is already rich enough. The dataset selection should also be considered more comprehensively.

In this paper, we studied the problem of file type classification of digital forensic evidence in the absence of header, footer, and file system information. Although some research techniques have obtained promising results, using spatial information to connect fragments and so enrich discrimination remains a gap (Ahmed, Lhee, Shin, & Hong, 2009). Recently, there have been attempts to solve the problem with machine learning techniques such as support vector machines and k-nearest neighbors. Despite the improved performance over previous methods, the classification model becomes complex and inefficient. We proposed to use support vector machines, which are very powerful supervised learning algorithms that have been applied intensively in content-based classification. We employed a simple feature vector (the byte frequency distribution) combined with spatial information implemented through the circular scheme, trained the SVM with a large amount of data, and performed parameter optimization to achieve high accuracy. The results show that spatial information gives a slight improvement in average accuracy. A possible reason is that the unigram count based on 8-bit byte frequency is already sufficiently rich. One possible future direction is to consider bigram or trigram counts, combined with 16-bit byte frequency, for classification (Fitzgerald et al., 2012).

References

Amirani, Mehdi Chehel, Toorani, Mohsen, & Mihandoost, Sara. (2013). Feature-based Type Identification of File Fragments. Security and Communication Networks, 6(1), 115-128.
Axelsson, Stefan. (2010). The Normalised Compression Distance as a file fragment classifier. Digital Investigation, 7, S24-S31.
Beebe, N, Maddox, L, Liu, Lishu, & Sun, Minghe. (2013). Sceadan: Using Concatenated N-Gram Vectors for Improved File and Data Type Classification.
Calhoun, William C, & Coles, Drue. (2008). Predicting the types of file fragments. Digital Investigation, 5, S14-S20.
Fitzgerald, Simran, Mathews, George, Morris, Colin, & Zhulyn, Oles. (2012). Using NLP techniques for file fragment classification. Digital Investigation, 9, S44-S49.
Gopal, Siddharth, Yang, Yiming, Salomatin, Konstantin, & Carbonell, Jaime. (2011). Statistical learning for file-type identification. Paper presented at the International Conference on Machine Learning and Applications and Workshops (ICMLA).
Hsu, Chih-Wei, Chang, Chih-Chung, & Lin, Chih-Jen. (2003). A practical guide to support vector classification.
Karresand, Martin, & Shahmehri, Nahid. (2006a). File type identification of data fragments by their binary structure. Paper presented at the 2006 IEEE Information Assurance Workshop.
Karresand, Martin, & Shahmehri, Nahid. (2006b). Oscar: File type identification of binary data in disk clusters and RAM pages. In Security and Privacy in Dynamic Environments. Springer.
Li, Qiming, Ong, A, Suganthan, P, & Thing, V. (2010). A novel support vector machine approach to high entropy data fragment classification. Paper presented at the Proceedings of the South African Information Security Multi-Conference (SAISMC 2010).
McDaniel, Mason, & Heydari, Mohammad Hossain. (2003). Content based file type detection algorithms. Paper presented at the 36th Annual Hawaii International Conference on System Sciences.

Sportiello, Luigi, & Zanero, Stefano. (2011). File Block Classification by Support Vector Machine. Paper presented at the 2011 Sixth International Conference on Availability, Reliability and Security (ARES).
Veenman, Cor J. (2007). Statistical disk cluster classification for file carving. Paper presented at the Third International Symposium on Information Assurance and Security (IAS).
Ahmed, Irfan, Lhee, Kyung-suk, Shin, Hyunjung, & Hong, ManPyo. (2009). On improving the accuracy and performance of content-based file type identification. Paper presented at Information Security and Privacy.


More information

Classification of packet contents for malware detection

Classification of packet contents for malware detection J Comput Virol (211) 7:279 295 DOI 1.17/s11416-11-156-6 ORIGINAL PAPER Classification of packet contents for malware detection Irfan Ahmed Kyung-suk Lhee Received: 15 November 21 / Accepted: 5 October

More information

The Normalized Compression Distance as a File Fragment Classifier

The Normalized Compression Distance as a File Fragment Classifier DIGITAL FORENSIC RESEARCH CONFERENCE The Normalized Compression Distance as a File Fragment Classifier By Stefan Axelsson Presented At The Digital Forensic Research Conference DFRWS 2010 USA Portland,

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

Comparison of Classification Algorithms for File Type Detection A Digital Forensics Perspective

Comparison of Classification Algorithms for File Type Detection A Digital Forensics Perspective Comparison of Classification Algorithms for File Type Detection A Digital Forensics Perspective Konstantinos Karampidis, Ergina Kavallieratou, and Giorgos Papadourakis Abstract Digital forensics is a relatively

More information

Segmenting Lesions in Multiple Sclerosis Patients James Chen, Jason Su

Segmenting Lesions in Multiple Sclerosis Patients James Chen, Jason Su Segmenting Lesions in Multiple Sclerosis Patients James Chen, Jason Su Radiologists and researchers spend countless hours tediously segmenting white matter lesions to diagnose and study brain diseases.

More information

Beyond Bags of Features

Beyond Bags of Features : for Recognizing Natural Scene Categories Matching and Modeling Seminar Instructed by Prof. Haim J. Wolfson School of Computer Science Tel Aviv University December 9 th, 2015

More information

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped

More information

MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION

MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION Panca Mudjirahardjo, Rahmadwati, Nanang Sulistiyanto and R. Arief Setyawan Department of Electrical Engineering, Faculty of

More information

Video Inter-frame Forgery Identification Based on Optical Flow Consistency

Video Inter-frame Forgery Identification Based on Optical Flow Consistency Sensors & Transducers 24 by IFSA Publishing, S. L. http://www.sensorsportal.com Video Inter-frame Forgery Identification Based on Optical Flow Consistency Qi Wang, Zhaohong Li, Zhenzhen Zhang, Qinglong

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Supervised vs unsupervised clustering

Supervised vs unsupervised clustering Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

More information

Random Sampling with Sector Identification

Random Sampling with Sector Identification Random Sampling with Sector Identification Simson L. Garfinkel Associate Professor, Naval Postgraduate School February 11, 2010, 0900 http://domex.nps.edu/deep/ 1 NPS is the Navyʼs Research University.

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

Learning to Recognize Faces in Realistic Conditions

Learning to Recognize Faces in Realistic Conditions 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

CS 221: Object Recognition and Tracking

CS 221: Object Recognition and Tracking CS 221: Object Recognition and Tracking Sandeep Sripada(ssandeep), Venu Gopal Kasturi(venuk) & Gautam Kumar Parai(gkparai) 1 Introduction In this project, we implemented an object recognition and tracking

More information

File Type Identification - Computational Intelligence for Digital Forensics

File Type Identification - Computational Intelligence for Digital Forensics Journal of Digital Forensics, Security and Law Volume 12 Number 2 Article 6 6-30-2017 File Type Identification - Computational Intelligence for Digital Forensics Konstantinos Karampidis Technological Educational

More information

A Practical Guide to Support Vector Classification

A Practical Guide to Support Vector Classification A Practical Guide to Support Vector Classification Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin Department of Computer Science and Information Engineering National Taiwan University Taipei 106, Taiwan

More information

Content Based File Type Detection Algorithms

Content Based File Type Detection Algorithms Content Based File Type Detection Algorithms Mason McDaniel and M. Hossain Heydari 1 Computer Science Department James Madison University Harrisonburg, VA 22802 Abstract Identifying the true type of a

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

Automated Data Type Identification And Localization Using Statistical Analysis Data Identification

Automated Data Type Identification And Localization Using Statistical Analysis Data Identification Utah State University DigitalCommons@USU All Graduate Theses and Dissertations Graduate Studies 12-2008 Automated Data Type Identification And Localization Using Statistical Analysis Data Identification

More information

Introduction. Collecting, Searching and Sorting evidence. File Storage

Introduction. Collecting, Searching and Sorting evidence. File Storage Collecting, Searching and Sorting evidence Introduction Recovering data is the first step in analyzing an investigation s data Recent studies: big volume of data Each suspect in a criminal case: 5 hard

More information

Multi-version Data recovery for Cluster Identifier Forensics Filesystem with Identifier Integrity

Multi-version Data recovery for Cluster Identifier Forensics Filesystem with Identifier Integrity Multi-version Data recovery for Cluster Identifier Forensics Filesystem with Identifier Integrity Mohammed Alhussein, Duminda Wijesekera Department of Computer Science George Mason University Fairfax,

More information

CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR)

CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR) 63 CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR) 4.1 INTRODUCTION The Semantic Region Based Image Retrieval (SRBIR) system automatically segments the dominant foreground region and retrieves

More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Automatic Fatigue Detection System

Automatic Fatigue Detection System Automatic Fatigue Detection System T. Tinoco De Rubira, Stanford University December 11, 2009 1 Introduction Fatigue is the cause of a large number of car accidents in the United States. Studies done by

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

Distance Weighted Discrimination Method for Parkinson s for Automatic Classification of Rehabilitative Speech Treatment for Parkinson s Patients

Distance Weighted Discrimination Method for Parkinson s for Automatic Classification of Rehabilitative Speech Treatment for Parkinson s Patients Operations Research II Project Distance Weighted Discrimination Method for Parkinson s for Automatic Classification of Rehabilitative Speech Treatment for Parkinson s Patients Nicol Lo 1. Introduction

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Introduction to carving File fragmentation Object validation Carving methods Conclusion

Introduction to carving File fragmentation Object validation Carving methods Conclusion Simson L. Garfinkel Presented by Jevin Sweval Introduction to carving File fragmentation Object validation Carving methods Conclusion 1 Carving is the recovery of files from a raw dump of a storage device

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

TEXT CATEGORIZATION PROBLEM

TEXT CATEGORIZATION PROBLEM TEXT CATEGORIZATION PROBLEM Emrah Cem Department of Electrical and Computer Engineering Koç University Istanbul, TURKEY 34450 ecem@ku.edu.tr Abstract Document categorization problem gained a lot of importance

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

Spam Filtering Using Visual Features

Spam Filtering Using Visual Features Spam Filtering Using Visual Features Sirnam Swetha Computer Science Engineering sirnam.swetha@research.iiit.ac.in Sharvani Chandu Electronics and Communication Engineering sharvani.chandu@students.iiit.ac.in

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Ivans Lubenko & Andrew Ker

Ivans Lubenko & Andrew Ker Ivans Lubenko & Andrew Ker lubenko@ comlab.ox.ac.uk adk@comlab.ox.ac.uk Oxford University Computing Laboratory SPIE/IS&T Electronic Imaging, San Francisco 25 January 2011 Classification in steganalysis

More information

Fuzzy C-means Clustering with Temporal-based Membership Function

Fuzzy C-means Clustering with Temporal-based Membership Function Indian Journal of Science and Technology, Vol (S()), DOI:./ijst//viS/, December ISSN (Print) : - ISSN (Online) : - Fuzzy C-means Clustering with Temporal-based Membership Function Aseel Mousa * and Yuhanis

More information

Ensemble of Bayesian Filters for Loop Closure Detection

Ensemble of Bayesian Filters for Loop Closure Detection Ensemble of Bayesian Filters for Loop Closure Detection Mohammad Omar Salameh, Azizi Abdullah, Shahnorbanun Sahran Pattern Recognition Research Group Center for Artificial Intelligence Faculty of Information

More information

BRIEF Features for Texture Segmentation

BRIEF Features for Texture Segmentation BRIEF Features for Texture Segmentation Suraya Mohammad 1, Tim Morris 2 1 Communication Technology Section, Universiti Kuala Lumpur - British Malaysian Institute, Gombak, Selangor, Malaysia 2 School of

More information

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi Journal of Asian Scientific Research, 013, 3(1):68-74 Journal of Asian Scientific Research journal homepage: http://aessweb.com/journal-detail.php?id=5003 FEATURES COMPOSTON FOR PROFCENT AND REAL TME RETREVAL

More information

Automatic Colorization of Grayscale Images

Automatic Colorization of Grayscale Images Automatic Colorization of Grayscale Images Austin Sousa Rasoul Kabirzadeh Patrick Blaes Department of Electrical Engineering, Stanford University 1 Introduction ere exists a wealth of photographic images,

More information

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes 2009 10th International Conference on Document Analysis and Recognition Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes Alireza Alaei

More information

Figure 1: Workflow of object-based classification

Figure 1: Workflow of object-based classification Technical Specifications Object Analyst Object Analyst is an add-on package for Geomatica that provides tools for segmentation, classification, and feature extraction. Object Analyst includes an all-in-one

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN

Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN Presentation Overview - Background - Preprocessing - Data Mining Methods to Determine Outliers - Finding Outliers - Outlier Validation -Summary

More information

PARALLEL CLASSIFICATION ALGORITHMS

PARALLEL CLASSIFICATION ALGORITHMS PARALLEL CLASSIFICATION ALGORITHMS By: Faiz Quraishi Riti Sharma 9 th May, 2013 OVERVIEW Introduction Types of Classification Linear Classification Support Vector Machines Parallel SVM Approach Decision

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Chapter 9 Chapter 9 1 / 50 1 91 Maximal margin classifier 2 92 Support vector classifiers 3 93 Support vector machines 4 94 SVMs with more than two classes 5 95 Relationshiop to

More information

Supervised classification of law area in the legal domain

Supervised classification of law area in the legal domain AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

More information

SOFTWARE DEFECT PREDICTION USING IMPROVED SUPPORT VECTOR MACHINE CLASSIFIER

SOFTWARE DEFECT PREDICTION USING IMPROVED SUPPORT VECTOR MACHINE CLASSIFIER International Journal of Mechanical Engineering and Technology (IJMET) Volume 7, Issue 5, September October 2016, pp.417 421, Article ID: IJMET_07_05_041 Available online at http://www.iaeme.com/ijmet/issues.asp?jtype=ijmet&vtype=7&itype=5

More information

Video Aesthetic Quality Assessment by Temporal Integration of Photo- and Motion-Based Features. Wei-Ta Chu

Video Aesthetic Quality Assessment by Temporal Integration of Photo- and Motion-Based Features. Wei-Ta Chu 1 Video Aesthetic Quality Assessment by Temporal Integration of Photo- and Motion-Based Features Wei-Ta Chu H.-H. Yeh, C.-Y. Yang, M.-S. Lee, and C.-S. Chen, Video Aesthetic Quality Assessment by Temporal

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Advance Indexing. Limock July 3, 2014

Advance Indexing. Limock July 3, 2014 Advance Indexing Limock July 3, 2014 1 Papers 1) Gurajada, Sairam : "On-line index maintenance using horizontal partitioning." Proceedings of the 18th ACM conference on Information and knowledge management.

More information

Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation

Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation M. Blauth, E. Kraft, F. Hirschenberger, M. Böhm Fraunhofer Institute for Industrial Mathematics, Fraunhofer-Platz 1,

More information

Bagging and Boosting Algorithms for Support Vector Machine Classifiers

Bagging and Boosting Algorithms for Support Vector Machine Classifiers Bagging and Boosting Algorithms for Support Vector Machine Classifiers Noritaka SHIGEI and Hiromi MIYAJIMA Dept. of Electrical and Electronics Engineering, Kagoshima University 1-21-40, Korimoto, Kagoshima

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Improving the detection of excessive activation of ciliaris muscle by clustering thermal images

Improving the detection of excessive activation of ciliaris muscle by clustering thermal images 11 th International Conference on Quantitative InfraRed Thermography Improving the detection of excessive activation of ciliaris muscle by clustering thermal images *University of Debrecen, Faculty of

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

File Fragment Encoding Classification: An Empirical Approach

File Fragment Encoding Classification: An Empirical Approach DIGITAL FORENSIC RESEARCH CONFERENCE File Fragment Encoding Classification: An Empirical Approach By Vassil Roussev and Candice Quates From the proceedings of The Digital Forensic Research Conference DFRWS

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

Combining SVMs with Various Feature Selection Strategies

Combining SVMs with Various Feature Selection Strategies Combining SVMs with Various Feature Selection Strategies Yi-Wei Chen and Chih-Jen Lin Department of Computer Science, National Taiwan University, Taipei 106, Taiwan Summary. This article investigates the

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

Feature Ranking Using Linear SVM

Feature Ranking Using Linear SVM JMLR: Workshop and Conference Proceedings 3: 53-64 WCCI2008 workshop on causality Feature Ranking Using Linear SVM Yin-Wen Chang Chih-Jen Lin Department of Computer Science, National Taiwan University

More information

Scalable Coding of Image Collections with Embedded Descriptors

Scalable Coding of Image Collections with Embedded Descriptors Scalable Coding of Image Collections with Embedded Descriptors N. Adami, A. Boschetti, R. Leonardi, P. Migliorati Department of Electronic for Automation, University of Brescia Via Branze, 38, Brescia,

More information

BUAA AUDR at ImageCLEF 2012 Photo Annotation Task

BUAA AUDR at ImageCLEF 2012 Photo Annotation Task BUAA AUDR at ImageCLEF 2012 Photo Annotation Task Lei Huang, Yang Liu State Key Laboratory of Software Development Enviroment, Beihang University, 100191 Beijing, China huanglei@nlsde.buaa.edu.cn liuyang@nlsde.buaa.edu.cn

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

Web Usage Mining: A Research Area in Web Mining

Web Usage Mining: A Research Area in Web Mining Web Usage Mining: A Research Area in Web Mining Rajni Pamnani, Pramila Chawan Department of computer technology, VJTI University, Mumbai Abstract Web usage mining is a main research area in Web mining

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Audio Engineering Society. Conference Paper. Presented at the Conference on Audio Forensics 2017 June Arlington, VA, USA

Audio Engineering Society. Conference Paper. Presented at the Conference on Audio Forensics 2017 June Arlington, VA, USA Audio Engineering Society Conference Paper Presented at the Conference on Audio Forensics 2017 June 15 17 Arlington, VA, USA This paper was peer-reviewed as a complete manuscript for presentation at this

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing)

AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing) AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing) J.Nithya 1, P.Sathyasutha2 1,2 Assistant Professor,Gnanamani College of Engineering, Namakkal, Tamil Nadu, India ABSTRACT

More information

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Offer Sharabi, Yi Sun, Mark Robinson, Rod Adams, Rene te Boekhorst, Alistair G. Rust, Neil Davey University of

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

Classification of objects from Video Data (Group 30)

Classification of objects from Video Data (Group 30) Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets Konstantinos Sechidis School of Computer Science University of Manchester sechidik@cs.man.ac.uk Abstract

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information