Scripting DNA: Identifying the JavaScript Programmer


Scripting DNA: Identifying the JavaScript Programmer

Wilco Wisse, Delft University of Technology, Software Engineering Research Group, Mekelweg 5, 2628 CD Delft, Netherlands
Cor Veenman, Netherlands Forensic Institute, Knowledge and Expertise Centre for Intelligent Data Analysis, Laan van Ypenburg 6, 2497 GB, The Hague, Netherlands

Abstract

The attribution of authorship is required in diverse applications, ranging from historical works (Shakespeare's writings, the Federalist Papers) studied for historical interest to recent novels questioned for linguistic research or out of curiosity (Robert Galbraith alias J.K. Rowling). Extensive research on this problem has resulted in effective general purpose methods. Other kinds of language utterances can be questioned for their original author as well. In particular, we are interested in identifying the offender who produced malicious software on a website. So far, this problem has hardly been studied, and mainly general purpose methods from natural language authorship attribution have been applied to it. Moreover, no suitable reference dataset is available to allow for method evaluation and method development in a supervised machine learning approach. In this work we first obtain a reference dataset of substantial size and quality. Further, we propose to extract structural features from the abstract syntax tree (AST) to describe the coding style of an author. In the experiments, we show that the specifically designed features indeed improve the authorship attribution of scripting code to programmers, especially in addition to character n-gram features.

Keywords: Authorship identification, Authorship verification, Source code, JavaScript, Abstract Syntax Tree, Syntactic features

Corresponding author addresses: wilco.wisse@gmail.com (Wilco Wisse), c.veenman@nfi.minvenj.nl (Cor Veenman). Preprint submitted to Digital Investigation, June 7, 2015.

1. Introduction

Authorship identification is roughly defined as the task of identifying the true author of a document given samples with undisputed authorship from a

finite set of candidate authors [1]. The fundamental basis of authorship identification is that language is flexible enough in its expression to identify authors purely on the basis of their individual writing style. Traditionally, this task has mainly served historical interests. Arguably the most well-known and most influential work is the authorship identification of the Federalist Papers by Mosteller and Wallace [2]. The successes in authorship identification have led to the development of more advanced automated identification techniques, which play a crucial role in various applications, ranging from cases of academic dishonesty to forensic investigations [3]. With the rapid growth and popularity of the Internet, an increasing number of criminals employ the Web to illegal ends, such as sharing child pornography and committing cyber crimes. The ease of hiding one's real identity on the Web has heightened the need for effective identification techniques in recent years [4, 5, 6]. The use of Internet technology by criminals also provides additional opportunities to detect the author of fraudulent content. To that end we address the authorship identification of source code that is embedded within web pages. In particular, we focus on authorship identification of JavaScript code. JavaScript is an interpreted scripting language that is commonly embedded within web pages. Most source code authorship identification studies in the literature focus on languages that are encountered in compiled form. In contrast, JavaScript source code is not transformed into machine language instructions, so the source code is attainable in its original form. In addition, the focus on JavaScript as target language is of great interest considering the wide-spread use of JavaScript on the Web. The identification of a JavaScript developer may serve as a useful means in cybercrime investigations, such as tracing the authors of malicious websites.
The most straightforward authorship identification task is its closed-set form, where reference data is available for a finite set of candidate authors, which surely includes documents of the true author. More difficult is the open-set form, where the questioned document may be written by an unknown, previously unseen author. In this work we consider the closed-set form and a special case of the open-set identification problem in which the set of candidate authors is a singleton. The latter problem boils down to the question of whether a document has or has not been written by a given author, and is known as authorship verification [1]. Previous studies have shown that especially character n-gram based approaches are effective in source code authorship identification [7, 8]. Such general purpose methods consider a source file as a mere sequence of characters. In this study we propose a set of language specific features that express JavaScript language properties at a higher syntactic complexity, obtained by parsing the source into an Abstract Syntax Tree (AST). The AST lends itself in particular to the extraction of structural features. Our assumption is that such features capture different properties, and thus would either individually or in addition to general purpose n-gram features improve the authorship identification task. Moreover, structural features are robust against layout modification of the source code. As a result, the layout features can easily be omitted, such

that the classification will be less susceptible to pretty printing or minification of the source code. A key issue in authorship identification studies is a reliable dataset for validation purposes. Unfortunately, no labeled JavaScript source code datasets are readily available. In this paper we propose a way to obtain a reference dataset of substantial size and quality from the world's largest repository hosting service, GitHub. With this dataset we evaluate the language specific method and compare its performance with two general purpose identification techniques, which have been reported as effective in recent literature [9, 10]. The layout of the paper is as follows. In the next section, we describe related work in authorship attribution, with a focus on programmer identification. Then we describe our programmer identification approach. The main part of this section is concerned with AST feature extraction. The following section describes the corpus construction, which is further detailed in Appendix A. In the experiments the proposed method is tested on the described corpus. We finalize the paper with conclusions and a discussion of the obtained results.

2. Related Work

Below we describe previous research that has been done in relation to our work. We first elaborate on previous work in authorship identification with a focus on programmer identification. Then we describe research in which parse tree features were used in a similar context.

Authorship Identification

Authorship identification in general can be viewed as a text classification task where a set of reference samples is given for the candidate authors. One major subtask in authorship identification is the extraction of stylistic characteristics (known as stylometric features) that differ between documents written by different authors. For natural language authorship identification, Stamatatos [5] distinguished techniques by the syntactic complexity of the extracted stylometric features.
These range from lexical [11] and character features [12, 13] to syntactic [14] and semantic features [15]. Combinations of these feature types have also been implemented, for instance in [16]. We refer to this survey [5] for an overview of authorship attribution for natural language. In this work, the focus is on authorship attribution for program source code, i.e. programmer identification. Also for this authorship identification problem, features of various syntactic complexity have been used. Most early programmer identification studies attempted to capture programming characteristics at a high level of complexity by extracting a few dozen specific textual measurements and software metrics. Such features may be divided into layout features (which relate to typographic aspects such as indentation and spacing), style features (such as naming conventions and variable length) and structural features (which express the structural decomposition of the code) [17]. Since detailed text analysis is required to extract such features, we refer to these features

as language-specific features.

Figure 1: Profile based versus instance based authorship identification. Source file u is an unlabeled file which has to be attributed to one of the candidate authors. The profile based model has a data point for each author; the instance based approach has a data point for each individual sample.

Language specific features have been utilized in several language domains, including Pascal [18], Java [19], C [17] and C++ [20]. Next to language specific methods, source code features may be extracted at a low level of complexity by considering each source file as a mere sequence of characters or tokens. The application of character n-grams has been shown to be among the most effective identification methods in both natural language [21, 22] and source code [8]. Character n-grams implicitly capture lexical, syntactic and structural writing characteristics by representing each source file as relative frequencies of occurrence of character sequences. Since no deep linguistic analysis is required to extract character n-grams, such approaches are language independent. Different classification techniques have been utilized to attribute an unlabeled document to a candidate author. In general, profile based approaches produce a single representation (profile) for all the documents per author, and the authorship identification is based on a similarity function that quantifies the degree of shared information between the author profiles and the unlabeled document [5]. Instance based techniques, on the contrary, produce an individual representation for each document. Commonly, these approaches represent each source file by a vector in a multivariate space, and employ a machine learning algorithm for classification. Figure 1 illustrates the difference between the instance based and profile based approach.
It should be noted that similarity based classification models have been used in the instance based paradigm as well. The latter methods are sometimes referred to as nearest neighbor or information retrieval approaches, since the identification takes place by a similarity measure that is used to obtain the class labels of the reference data most similar to the unlabeled source file [23, 24].
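As an illustration of the character n-gram representation discussed above, a source file can be reduced to relative n-gram frequencies in a few lines. This is a minimal sketch in JavaScript (the function name is ours, not taken from the cited studies); real systems additionally restrict the representation to the most frequent n-grams over the whole corpus.

```javascript
// Map a source string to the relative frequency of each character
// n-gram it contains. The result is a Map from n-gram to frequency.
function charNgramProfile(source, n) {
  const counts = new Map();
  for (let i = 0; i + n <= source.length; i++) {
    const gram = source.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  const total = source.length - n + 1; // number of n-grams in the file
  const freqs = new Map();
  for (const [gram, c] of counts) freqs.set(gram, c / total);
  return freqs;
}
```

Because no parsing is involved, the same function applies unchanged to any programming language or to natural language text, which is exactly the language independence noted above.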

Parse Tree Features

In natural language, syntactic structures have proven to be good features for authorship identification [25]. A parse tree is a convenient way to determine the syntactic structure of sentences [14]. Baayen [26] was the first to extract rewrite-rule frequencies from the parse tree for the purpose of authorship identification. Rewrite-rules represent the combinations of a node and its immediate constituents in the tree. Later authorship identification studies have examined related tree characteristics such as the depth of the syntax trees [27], the frequency of parent-child node types [14], n-grams defined on node types [28], and frequent subtree patterns [25]. In source code, the Abstract Syntax Tree (AST) is a convenient way to represent the syntactic structure of a program. ASTs are usually employed for code analysis in compilers and developer tools, but also enable the selection of detailed structural features for programming style characterization. To this end, AST subtrees have been used in code clone and plagiarism detection applications to compute the structural similarity between programs [29, 30]. However, these plagiarism studies deal with the detection of approximate matches of larger chunks of code, which are the result of copy-paste modifications.

3. Approach

We now describe our approach for programmer identification of JavaScript source code. We deal with two programmer identification tasks: closed-set identification and programmer verification. For closed-set identification, reference data is available for a finite set of authors. The disputed JavaScript file has to be assigned to one of these candidate authors. For the verification task, reference data is only available for one candidate programmer. We approach these problems as machine learning problems in which a properly designed feature set is essential. In our proposal, the feature set consists of features derived from the parse tree.
In the next section, we describe how we obtain these features. After that, we work out our machine learning models for closed-set identification and programmer verification.

Feature Extraction

In this work we utilize the parse tree for the extraction of stylometric features. The parse trees are obtained by the Esprima [31] parser, which generates a Mozilla compatible JavaScript AST [32]. Figure 2 shows an example of such an AST. Every node has a corresponding node object, which indicates the node type and has a number of corresponding attributes (child nodes) which are either expression or statement nodes. In the AST, the names of the attributes are indicated as labels on the edges. For instance, the function call node in figure 2 is represented by a CallExpression object that implements the following interface [32]:

Figure 2: An AST corresponding to the JavaScript code foo.bar(a,1).

    interface CallExpression <: Expression {
        type: "CallExpression";
        callee: Expression;
        arguments: [ Expression ];
    }

Traversing the parse tree nodes allows us to extract detailed language specific features from the node objects, which are detailed in the remainder of this section.

Structural features

The AST lends itself in particular to the extraction of features related to the tree structure. First, we tracked the lengths of the node lists that are present in the AST nodes. The length of these lists reflects the number of children of a node, such as the number of arguments defined in a function declaration and the number of elements initialized in an array. Additionally, we tracked the number of descendant nodes of particular node types (i.e. the number of nodes having a common ancestor). This is depicted in figure 3a. The number of descendant nodes indicates the complexity of node attributes, e.g. whether identifiers or comprehensive objects are passed as arguments in a CallExpression. To capture how the nodes are structured in the AST, we maintain the frequency of the most frequent node n-grams for n = 1, 2 and 3. We define node n-grams as contiguous sequences of nodes in the AST, where each node in this sequence is a child of its preceding node (see figure 3b). The frequency of node uni-grams (i.e. n = 1) captures the frequency of individual node types. This may for instance indicate the preference for different loop types. Also, the appearance of NewExpression and MemberExpression nodes may be an indicator of an object-oriented programming style. Furthermore, with node 2- and 3-grams we aim to capture how nodes are connected to each other in the tree.
For example, different expression types may be used as callee in a CallExpression, such as a member expression or an identifier. The appearance of such features may reflect characteristic habits in program structure between different programmers.
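The node n-gram extraction described above can be sketched as a recursive traversal over an ESTree-shaped AST, the object format Esprima produces. The following is a minimal illustration under our own naming, not the paper's implementation; the hand-built example tree corresponds to foo.bar(a,1) from figure 2.

```javascript
// Count node n-grams: chains of n node types where each node in the
// chain is a child of its predecessor. ESTree nodes are plain objects
// whose `type` field names the node and whose other fields hold child
// nodes, arrays of child nodes, or scalar attributes.
function nodeNgrams(node, n, chain = [], counts = new Map()) {
  if (!node || typeof node.type !== "string") return counts;
  const path = [...chain, node.type].slice(-n); // n-gram ending here
  if (path.length === n) {
    const key = path.join(">");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  for (const value of Object.values(node)) {
    const children = Array.isArray(value) ? value : [value];
    for (const child of children) {
      if (child && typeof child === "object") {
        nodeNgrams(child, n, [...chain, node.type], counts);
      }
    }
  }
  return counts;
}

// Hand-built AST for `foo.bar(a, 1)`, as in figure 2:
const ast = {
  type: "CallExpression",
  callee: {
    type: "MemberExpression",
    object: { type: "Identifier", name: "foo" },
    property: { type: "Identifier", name: "bar" },
  },
  arguments: [
    { type: "Identifier", name: "a" },
    { type: "Literal", value: 1 },
  ],
};
```

With n = 1 this reduces to counting individual node types; with n = 2 it yields pairs such as CallExpression>MemberExpression, matching the callee example in the text.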

Figure 3: Examples of language specific properties related to the code of figure 2: (a) number of descendant nodes of an argument; (b) node 2-gram; (c) adding layout nodes to the AST.

Layout features

Program layout features deal specifically with the layout of the program, such as the use of indentation and spacing. In the AST, this information is disregarded, since it is irrelevant for program analysis during code compilation or interpretation. Since the layout may be an important marker of coding style, we added layout information to the AST by introducing additional nodes of the types Layout, BlockComment and LineComment. The process is clarified in figure 3c. Adding the layout nodes was done by inspecting the raw source code while traversing the parse tree. The BlockComment and LineComment nodes contain the raw comments as attribute, while the Layout nodes contain the spacing that is used at the given position within the source code, e.g. spaces before and after brackets. For every layout position in each node type, we recorded the number of times that zero, one, two or many spaces, tabs and line breaks were used. A number of layout positions with roughly the same meaning were grouped, such as the layout slots before closing parentheses and the layout slots before different operators in binary expressions.

Style features

A disadvantage of layout features is that layout may easily be obfuscated by pretty printers and source code formatters. Style features concern stylistic characteristics that are less susceptible to automatic change. In this feature category, we recorded style information of comments, naming conventions

and data types. A number of nodes in the AST contain textual information that can be used to this end. For comments we maintain the length of block and line comments, the ratio between line and block comments, and the parent node type of comments. The latter should reflect where comments are placed in the source code. Next, naming conventions and the use of literal data types are extracted by applying regular expressions to identifier names and literal values. We measured the length of literal values and identifier names, the use of capital letters in identifier names and the use of different literal data types (see table 1).

Feature representation

The discussed features are used to represent each JavaScript project as a feature vector in multivariate space. Table 1 presents the number of dimensions of the feature space that correspond to the discussed layout, style and structural features. For each feature we maintain a histogram distribution, which records the frequency of observed values related to the feature. This approach is comparable to the approach of Lange [33]. For instance, to express the length of identifiers in the source code, the x-axis of the histogram corresponds to every recorded identifier length, while the y-axis represents the number of times an identifier of that length occurred in the source code. After generating the raw histogram distributions for each project, we normalize the bin values. This enables us to compare distributions of features between different programmers.
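A minimal sketch of how such style features can be recorded as normalized histograms, using naming conventions as the example: identifiers are classified by regular expressions and tallied, and the histogram is then normalized in the two ways described in the text. The convention categories and function names here are illustrative assumptions, not the paper's exact measurement list.

```javascript
// Illustrative naming-convention categories, tried in order.
const CONVENTIONS = [
  ["camelCase", /^[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+$/],
  ["snake_case", /^[a-z][a-z0-9]*(?:_[a-z0-9]+)+$/],
  ["UPPER_CASE", /^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*$/],
  ["PascalCase", /^(?:[A-Z][a-z0-9]*){2,}$/],
];

function namingStyle(name) {
  for (const [style, re] of CONVENTIONS) if (re.test(name)) return style;
  return "other";
}

// Histogram of naming styles over a project's identifiers.
function styleHistogram(identifiers) {
  const h = new Map();
  for (const id of identifiers) {
    const s = namingStyle(id);
    h.set(s, (h.get(s) || 0) + 1);
  }
  return h;
}

// Binary normalization: a bin becomes 1 if observed at least once.
function normalizeBinary(h) {
  return new Map([...h].map(([k, v]) => [k, v > 0 ? 1 : 0]));
}

// Feature-wise normalization: bins of one measure sum to 1.
function normalizeSum(h) {
  const total = [...h.values()].reduce((a, b) => a + b, 0);
  return new Map([...h].map(([k, v]) => [k, v / total]));
}
```

The same histogram-and-normalize pattern applies to the other measures in table 1, such as identifier lengths or comment lengths; only the value being binned changes.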
We evaluated both binary normalization (such that each histogram value becomes 1 if it is observed at least once and 0 otherwise) and feature wise normalization, in which the value of each bin is divided by the sum of the bin values of the same measure (such that the sum of occurrences in each histogram becomes 1).

Classification in Closed-set Programmer Identification

In closed-set programmer identification, reference data is available for each candidate programmer and the questioned source file surely belongs to one of the candidate programmers. The described multivariate representation of the available source files enables us to adopt a machine learning approach for classification. We modeled the source code programmer identification problem as a multi-class classification problem, where the reference data of the candidate programmers acts as training data. The trained multi-class classifier predicts the authorship of previously unseen documents.

Classification in Programmer Verification

Next to closed-set programmer identification we address programmer verification. The goal of programmer verification is to determine whether or not a source file is written by a certain target programmer. Consequently, the questioned documents may be written by a previously unseen programmer for which no training data is provided. The absence of these data raises the question of how to train the classifier. In the literature two approaches have commonly been used, namely intrinsic and extrinsic classification models [1]. In the intrinsic classification model only the target class (corresponding to the target programmer) is

Table 1: Number of defined measurements on the AST.

Structural
  Node 1-grams (expressions): 19
  Node 1-grams (statements): 17
  Node 2-grams: 433
  Node 3-grams: 546
  Descendant count of nodes: 110
  Length of lists defined in nodes: 126

Style
  String patterns on identifiers (naming conventions): 16
  Type of comments (block/line): 2
  Type of parent node of comments: 33
  Length of comments: 18
  Literal data types (string with double/single quotes, null value, number, boolean or regex)

Layout
  Number of tabs at a position: 121
  Number of spaces at a position: 121
  Number of returns at a position

modeled by the available training data. In this case the problem is handled as a one-class classification problem. In this work we adopt the extrinsic classification model, where the problem is modeled as a two-class classification problem. In this case both the target and the outlier classes are modeled. The outlier class represents source code developed by programmers different from the target programmer. Since the outlier class is very heterogeneous, there should be enough representative source code samples available as reference data. An unlabeled document is then attributed to one of the two classes when the confidence exceeds a certain threshold. An ROC curve is usually used to characterize the performance of the classifier as this threshold is varied.

4. Corpus Construction

To evaluate the performance of the proposed method, a labeled set of JavaScript source code samples is needed. Unfortunately, no such datasets are readily available. A naive approach would be to collect JavaScript code manually from websites. A problem, however, is that it is often unclear who the true programmer of the source code is, especially when multiple programmers collaborated on the same code base. Therefore, we propose an automated approach by collecting JavaScript projects from GitHub. GitHub is the world's largest online repository hosting service and includes many JavaScript repositories [34].
The collected GitHub repositories can be used as reference data to model the programming style of each of the programmers. The history information in the Git repository allows us to determine the author of each line of code in a project.

Table 2: Statistics about the validation corpus after cleaning the dataset.

                        Min.   Max.   Mean   Median   Std.
  Author size (repos)
  Author size (kb)
  Author size (LOC)
  Repo size (files)
  Repo size (kb)
  Repo size (LOC)

In this paper we define a source code sample as a single repository. Also, we only use projects that have been developed by a single programmer, since the unit of recognition is a project. In addition, we need ground truth on the true original programmer per project. To enable a machine learning approach, multiple source code samples per programmer are required to develop an accurate identification model. This means that programmers should be selected who own a large number of repositories exclusively developed by themselves. To be able to find such GitHub users, we used the database of GitHub metadata provided by the GHTorrent project [34]. The published GHTorrent relational database includes the information necessary to select appropriate GitHub projects with the corresponding collaborators. A more detailed description of the corpus construction is found in Appendix A. Table 2 details statistics about the authors and projects that were included in the corpus. After cloning the repositories we removed the files that were irrelevant for our experiments. These are the JavaScript files that are not parsable. Also, many repositories included source code of JavaScript libraries such as jQuery in their code base. Since such libraries are developed by other programmers, they do not reflect the coding style of the owner of the repository and need to be excluded. We first attempted to remove libraries by a predefined blacklist of file names of well-known libraries. However, we found that this was not sufficient, as there was much variation in the names of the library files. Therefore we adopted the following strategy to eliminate libraries from the code base.
First, we observed that in general, libraries were much larger than other JavaScript files, and removed all files that did not have a size between 1 and 100 kB. Secondly, the entire content of a JavaScript library was usually committed in a single commit. Therefore, we determined whether all lines in a file were committed in one single commit, and removed the file if this was the case. We were confident that this approach was satisfactory after manually inspecting the remaining code. Figure 4 presents the cumulative density function of the number of files per repository and the size in kb of repositories in the constructed dataset, before and after the cleaning phase.
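The cleaning heuristic above can be sketched as a simple per-file predicate. The metadata fields used here (file size and the commit id of each line, as obtainable from e.g. git blame) are assumed precomputed; the shape of the file object and the function name are ours.

```javascript
// Keep a file only if it falls in the 1-100 kB size window and its
// lines come from more than one commit. A file whose every line was
// introduced in a single commit is treated as a vendored library.
function keepFile(file) {
  const sizeOk = file.sizeBytes >= 1024 && file.sizeBytes <= 100 * 1024;
  const singleCommit = new Set(file.lineCommits).size === 1;
  return sizeOk && !singleCommit;
}
```

The single-commit test is deliberately coarse: it also discards a genuinely authored file that was never touched after its first commit, which the manual inspection step is meant to catch.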

Figure 4: The cumulative density function of the size of the collected source code projects, in number of files and size (kb). The dashed lines correspond to the size of the repositories before cleaning the dataset.

5. Experiments

In the following we describe the experiments we performed to evaluate the proposed programming style features on the constructed dataset. We differentiate between closed-set recognition and programmer verification experiments. To be able to focus on the contribution of the proposed features and to make the experiments computationally feasible, we did initial tests on which classifier to use for the elaborate experiments and to tune the hyperparameters and other model parameters. We tested standard classifiers like linear discriminant analysis (LDA) and logistic regression, and state-of-the-art classifiers like AdaBoost, support vector machines with various kernels and regularisations, random forests, and regularized logistic regression. In these initial experiments we also tested the best normalisation of the feature vectors, either binarisation or division by the sum of all feature frequencies. There were several methods with comparably good performance. It turned out that the support vector machine with linear kernel and L2 regularisation gave robust, competitive results. This is in line with various other text mining and authorship attribution research, e.g. [4, 35]. We used the LibSVM implementation [36] for all experiments below. As normalisation, binarisation turned out to give the best results.

Closed-set Programmer Identification

We first address closed-set programmer identification. We test the performance for an increasing number of candidate authors in the identification task. This was done by repeatedly selecting a random subset of programmers from the whole dataset. For each programmer the samples are randomly split into 25 training samples and the remaining samples for validation. To accurately

estimate the accuracy, 10 iterations are performed using a different set of programmers.

Figure 5: Validation of the programmer verification task as a binary classification problem. The target class is modeled with projects of one programmer. The outlier class is constructed from projects of different programmers divided into 5 random folds. The red areas in this figure indicate the code that is used for testing, while the remaining code is used for training the classifier.

We defined samples in our corpus to be entire Git repositories. This ensures that source code of the same repository is not used both for training and validation. This enables us to validate how well the identification algorithms generalize beyond the training set to source code of previously unseen software projects. As a result, the classification results will be more representative for varying coding style between projects and application domains. One of the proposed ways to quantify the coding style of the programmers is by node n-grams defined on the AST. However, the JavaScript abstract syntax defines a large number of expression and statement types, which makes the number of node n-grams to be tracked extremely large. A number of infrequent node n-grams are therefore removed from the feature space to keep the problem computationally feasible. To eliminate infrequent n-grams, we empirically determined the most frequent node n-grams in the JavaScript source code dataset. To keep this selection process computationally feasible, the process was carried out recursively. First, we started with node 1-grams, of which the most frequently observed n-grams were extended to node 2-grams. Similarly, the most frequently observed node 2-grams were extended to node 3-grams. We ended up with 36 node 1-grams, 433 node 2-grams and 582 node 3-grams. The accuracy of the proposed technique is compared to two existing character n-gram based approaches.
First, we evaluate the performance of the Scap [9] approach. This is a profile based method that creates for each programmer a profile containing the most frequently observed n-grams in the programmer's training data. Classification takes place by a similarity measure that quantifies the degree of shared information between the program profile (which characterizes the questioned source file) and the author profiles (which characterize the source code of the programmers). The adopted similarity measure is the SPI measure, which is the cardinality of the intersection between the program profile and the author profiles. In our study, we set the profile size L to the 1500 most frequent n-grams, as was originally proposed by the authors [9]. Next to

Scap, we compare our method to an n-gram based approach that sticks to the instance based paradigm, as described in [10]. In this method, each document is represented individually in the same dimensional vector space, where each dimension corresponds to the relative frequency of occurrence of a particular n-gram. Also for the character n-gram feature representation, the classification is performed by a Support Vector Machine (SVM) with linear kernel [36]. The document representation in multivariate space may be sparse, since not all selected n-grams are necessarily observed in each document. Therefore we chose a larger number of n-grams than was done in the Scap method, and defined the feature space by the 7000 most frequent n-grams found in the whole corpus. Since the machine learning classification techniques are both feature vector based, the feature spaces of the n-gram based and language specific approaches can be combined into a single higher dimensional representation. The two feature spaces may express different programmer information, since they are extracted from different language representations. For instance, domain specific features may be better able to describe the structural aspects, while the n-grams may be better able to describe layout related aspects. As a consequence, the two feature spaces may complement each other, resulting in a more expressive characterization of coding style. Figure 6a presents the effectiveness of the techniques. The presented results are obtained by using binary feature normalization. Overall, the instance based approach that combines the domain specific and n-gram features achieved the highest accuracy.
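The Scap profiles and SPI similarity used in this comparison can be sketched as follows: a profile is the set of the L most frequent n-grams, and SPI is simply the size of the intersection of two profiles. A minimal illustration with our own function names, not the reference implementation.

```javascript
// Build a profile: the set of the L most frequent n-grams from a
// Map of n-gram frequencies (e.g. combined over an author's files).
function topL(freqMap, L) {
  return new Set(
    [...freqMap.entries()]
      .sort((a, b) => b[1] - a[1]) // most frequent first
      .slice(0, L)
      .map(([gram]) => gram)
  );
}

// SPI similarity: cardinality of the intersection of two profiles.
function spi(profileA, profileB) {
  let shared = 0;
  for (const gram of profileA) if (profileB.has(gram)) shared++;
  return shared;
}
```

Attribution then assigns the questioned file to the author whose profile yields the highest SPI score against the file's own profile.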
The figure shows that in a classification problem with 10 programmers, the combined method achieves an accuracy of 0.914, while the n-gram based method achieves an accuracy of 0.898 and the domain specific and Scap methods achieve lower accuracies. The difference in accuracy between the combined method and the n-gram based method is small, but grows as the number of candidate authors increases: with 34 candidate authors the n-gram based method achieves an accuracy of 0.851, while the combined method scores higher. When considering the feature sets separately, the n-gram based technique achieved the best performance.

Additionally, we were interested in the contribution of the individual language specific feature types of the proposed approach. This is especially important if some feature types are suspected of being changed by external tools. For instance, layout and naming style may be imposed by editors or pretty printers, so that they do not reflect the particular coding style of a programmer. Figure 6b shows the accuracy of the language specific approach when including one feature type and removing the other ones from the feature space. The feature types on the x-axis correspond to the features presented in table 1. As the results show, the structural features achieved a relatively high accuracy, which supports our hypothesis that structures in the AST are effective in distinguishing developer styles. However, the accuracy of using only layout features is higher than the accuracy of all the remaining features together (i.e. excluding the layout features from the feature space). The importance of layout features is in line with earlier results indicating that layout plays a crucial role in programmer identification of source code [19, 37].

Figure 6: Accuracy of the techniques in closed-set programmer identification. Figure 6a presents the results by training with 25 samples per programmer. Figure 6b presents the accuracy achieved by the individual feature types, which can be found in table 1.
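To make the node n-gram feature type concrete, the sketch below extracts n-grams of node types along parent-child paths in an Esprima-style AST (plain objects with a `type` field, following the Parser API format [31, 32]). The path-based definition is one plausible reading of node n-grams; the exact extraction used in the evaluated system may differ.

```python
def child_nodes(node):
    """Yield the AST child nodes of a Parser API style dict (nodes carry a 'type' field)."""
    for value in node.values():
        if isinstance(value, dict) and "type" in value:
            yield value
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict) and "type" in item:
                    yield item

def node_ngrams(node, n=2, path=()):
    """Collect n-grams of node types along root-to-leaf parent-child paths."""
    path = path + (node["type"],)
    grams = [path[-n:]] if len(path) >= n else []
    for child in child_nodes(node):
        grams += node_ngrams(child, n, path)
    return grams
```

For `var x = 1;` this yields 2-grams such as (Program, VariableDeclaration) and (VariableDeclarator, Literal); counting such tuples per document gives the node 2-gram and 3-gram features.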

Figure 7: Validation results in programmer verification. Figure 7a shows the ROC curves when trained with 7 samples per programmer. The AUC values for various numbers of training samples are presented in figure 7b.

Programmer Verification

For programmer verification we used the same dataset as for programmer identification. In the experiments we repeatedly selected one programmer as the target programmer. The projects of this programmer are randomly split into training and validation samples. The outlier class is modeled by the projects of the other programmers in the dataset. By applying 5-fold cross validation on the outlier class we examine the performance of the classification process, i.e. the projects in the outlier class are randomly partitioned into five equally sized subsets, of which one is used for validation and the others for training. This is done in such a way that the source code in the different subsets is written by different programmers. In this way we modeled the case that the programmer of the negative test samples had not been observed before, which is the most realistic situation in practice. Figure 5 illustrates one fold in this process. In the experiments the 5-fold cross validation process is repeated multiple times, such that each programmer in the corpus is used once as target. The resulting performance figures are averaged.

Figure 7a shows the ROC curves of the verification techniques. We ran the test with 7 samples for the target programmer in order to be able to differentiate between the methods; with more than 25 samples per target programmer all methods perform almost equally well. The ROC curve characterizes the trade-off between the false positive rate (FPR) and the true positive rate (TPR). Often the Area Under the Curve (AUC) value is taken as a scalar for comparison. The AUC value can be interpreted as the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [38]. Figure 7b compares the AUC values of the verification techniques for a varying number of training samples.
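The quantities above can be computed directly from classifier scores. The following sketch (with hypothetical score lists as input) implements the TPR/FPR trade-off at a decision threshold and the rank-probability interpretation of the AUC [38].

```python
def roc_point(pos_scores, neg_scores, threshold):
    """TPR and FPR when samples scoring >= threshold are accepted as the target author."""
    tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr

def auc(pos_scores, neg_scores):
    """AUC as P(random positive ranked above random negative); ties count half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Sweeping the threshold over all observed scores and collecting the resulting (FPR, TPR) pairs traces out the ROC curve; the pairwise formulation of the AUC is exactly the probabilistic interpretation cited above.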
With 7 samples used to model the target author, the combined technique achieves an AUC value of 0.962; the n-gram technique and the language specific technique score lower. The results indicate, as in closed-set programmer identification, that feature vector based classification with combined character n-gram and domain specific features was the most successful technique, while the machine learning approach with character n-grams alone was the most effective when considering the feature sets separately. In the authorship verification task we did not include the Scap method, since it was designed for a closed-set authorship identification setting.

6. Conclusions

In this work, we proposed and evaluated a language specific programmer identification technique for JavaScript source code. The first contribution of this work is the use of features directly extracted from the AST. We showed that structures in the AST, especially node 2- and 3-grams, are effective markers of a programmer's coding style. The proposed technique was compared to two techniques based on character n-grams. The results show that features that exploit language structure through the AST improve

the programmer identification tasks, both closed-set programmer identification and programmer verification. Accordingly, the best performance is obtained by fusing the AST features with the character n-gram features. For closed-set identification the achieved accuracy is 0.91 for 35 authors with 25 samples per author. For the programmer verification task the combined features achieve AUC = 0.96 with 7 training samples, while with more than 25 samples all methods show similarly good performance.

Similar to our study, a recent (at the time of writing unpublished) study also proposed to employ structures in the AST to quantify coding style [39]. Instead of adding character features to the AST features, they enrich the AST features with lexical features. The node n-grams, which play a central role in our study, were not considered. Further, their focus is on C++ source code, which is distributed in binary form after compilation. In contrast, our method has broader applicability since it focuses on JavaScript, which is commonly encountered in its source code form on webpages.

The second contribution of this research is the construction of a labeled JavaScript code dataset of substantial size, utilizing source code from the repository hosting service GitHub. The proposed method is generic and flexible enough to be used with different selection criteria, such as different programming languages and projects developed by multiple programmers. For the current research, it was important to have projects developed by single programmers. This was established by collecting source files committed by the same GitHub user. Although GitHub provides specific organization accounts for teams, we cannot be sure that ordinary user accounts have not been used by multiple programmers. If this were the case, the results could even have been better, since the identification task would then be more difficult than with true single users.
The research presented in this paper has raised several questions that provide a basis for further research. It would be interesting to see how well the method would work with short scripts such as those embedded in web pages. With the constructed corpus we could not test this, because no such small projects were available. Taking snippets from the available projects instead was not possible, because these were generally not parsable. We consider JavaScript the most suitable programming language for this problem, as it can be found as source code on web pages. Still, the method can easily be used for other programming languages as well. Indeed, the proposed structural features open up the possibility of cross language programmer identification. That is, even when the syntax definitions of the languages differ, the underlying programming structures may remain similar per programmer across programming languages. Finally, while the current study shows the effectiveness of features based on structures in the AST, the selection of more sophisticated structural features could be investigated, such as frequent subtree patterns in the AST [25].

References

[1] N. Potha, E. Stamatatos, A profile-based method for authorship verification, in: Artificial Intelligence: Methods and Applications, Springer, 2014.
[2] F. Mosteller, D. L. Wallace, Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist papers, Journal of the American Statistical Association 58 (302) (1963).
[3] S. Burrows, Source code authorship attribution, Ph.D. thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Australia (2010).
[4] R. Zheng, J. Li, H. Chen, Z. Huang, A framework for authorship identification of online messages: Writing-style features and classification techniques, Journal of the American Society for Information Science and Technology 57 (3) (2006).
[5] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology 60 (3) (2009).
[6] M. Lambers, C. J. Veenman, Forensic authorship attribution using compression distances to prototypes, in: Computational Forensics, Springer, 2009.
[7] S. Burrows, A. L. Uitdenbogerd, A. Turpin, Comparing techniques for authorship attribution of source code, Software: Practice and Experience 44 (1) (2014).
[8] M. F. Tennyson, On improving authorship attribution of source code, in: Digital Forensics and Cyber Crime, Springer, 2013.
[9] G. Frantzeskou, E. Stamatatos, S. Gritzalis, C. E. Chaski, B. S. Howald, Identifying authorship by byte-level n-grams: The source code author profile (SCAP) method, International Journal of Digital Evidence 6 (1) (2007).
[10] E. Chatzicharalampous, G. Frantzeskou, E. Stamatatos, Author identification in imbalanced sets of source code samples, in: Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on, Vol. 1, IEEE, 2012.
[11] F. Iqbal, R. Hadjidj, B. C. Fung, M. Debbabi, A novel approach of mining write-prints for authorship attribution in e-mail forensics, Digital Investigation 5, Supplement (2008) S42-S51, the Proceedings of the Eighth Annual DFRWS Conference.
[12] G. Mikros, K. Perifanos, Authorship attribution in Greek tweets using author's multilevel n-gram profiles, in: AAAI Spring Symposium Series.
[13] M. Jankowska, E. Milios, V. Keselj, Author verification using common n-gram profiles of text documents, in: Proceedings of COLING 2014.
[14] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, D. Song, On the feasibility of internet-scale author identification, in: Security and Privacy (SP), 2012 IEEE Symposium on, IEEE, 2012.
[15] P. M. McCarthy, G. A. Lewis, D. F. Dufty, D. S. McNamara, Analyzing writing styles with Coh-Metrix, in: G. Sutcliffe, R. Goebel (Eds.), FLAIRS Conference, AAAI Press, 2006.
[16] A. Abbasi, H. Chen, Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace, ACM Trans. Inf. Syst. 26 (2) (2008) 7:1-7:29.
[17] I. Krsul, E. H. Spafford, Authorship analysis: Identifying the author of a program, Computers & Security 16 (3) (1997).
[18] P. W. Oman, C. R. Cook, Programming style authorship analysis, in: Proceedings of the 17th ACM Annual Computer Science Conference, ACM, 1989.
[19] H. Ding, M. H. Samadzadeh, Extraction of Java program fingerprints for software authorship identification, Journal of Systems and Software 72 (1) (2004).
[20] S. G. MacDonell, A. R. Gray, G. MacLennan, P. J. Sallis, Software forensics for discriminating between program authors using case-based reasoning, feed-forward neural networks and multiple discriminant analysis, in: Neural Information Processing, Proceedings, ICONIP '99, 6th International Conference on, Vol. 1, IEEE, 1999.
[21] E. Stamatatos, On the robustness of authorship attribution based on character n-gram features.
[22] J. Houvardas, E. Stamatatos, N-gram feature selection for authorship identification, in: Artificial Intelligence: Methodology, Systems, and Applications, Springer, 2006.
[23] S. Burrows, A. L. Uitdenbogerd, A. Turpin, Application of information retrieval techniques for source code authorship attribution, in: Database Systems for Advanced Applications, Springer, 2009.
[24] M. Koppel, J. Schler, S. Argamon, E. Messeri, Authorship attribution with thousands of candidate authors, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006.
[25] S. Kim, H. Kim, T. Weninger, J. Han, H. D. Kim, Authorship classification: a discriminative syntactic tree mining approach, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2011.
[26] H. Baayen, H. Van Halteren, F. Tweedie, Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution, Literary and Linguistic Computing 11 (3) (1996).
[27] A. Kaster, S. Siersdorfer, G. Weikum, Combining text and linguistic document representations for authorship attribution, in: SIGIR Workshop: Stylistic Analysis of Text for Information Access.
[28] M. Tschuggnall, G. Specht, Enhancing authorship attribution by utilizing syntax tree profiles, EACL 2014 (2014) 195.
[29] M. Chilowicz, E. Duris, G. Roussel, Syntax tree fingerprinting: a foundation for source code similarity detection, Université Paris-Est.
[30] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, L. Bier, Clone detection using abstract syntax trees, in: Software Maintenance, Proceedings, International Conference on, IEEE, 1998.
[31] A. Hidayat, Esprima: ECMAScript parsing infrastructure for multipurpose analysis, accessed 2015.
[32] Mozilla Developer Network and individual contributors, Parser API, SpiderMonkey/Parser_API, accessed 2015.
[33] R. C. Lange, S. Mancoridis, Using code metric histograms and genetic algorithms to perform author identification for software forensics, in: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, ACM, 2007.
[34] G. Gousios, B. Vasilescu, A. Serebrenik, A. Zaidman, Lean GHTorrent: GitHub data on demand, in: Proceedings of the 11th Working Conference on Mining Software Repositories, ACM, 2014.
[35] G. Sidorov, F. Velasquez, E. Stamatatos, A. F. Gelbukh, L. Chanona-Hernández, Syntactic n-grams as machine learning features for natural language processing, Expert Syst. Appl. 41 (3) (2014).
[36] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1-27:27, software available at libsvm.
[37] G. Frantzeskou, S. MacDonell, E. Stamatatos, S. Gritzalis, Examining the significance of high-level programming features in source code author classification, Journal of Systems and Software 81 (3) (2008).
[38] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006).
[39] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, R. Greenstadt, De-anonymizing programmers via code stylometry. In review.

Appendix A. Approach to Construct the Corpus

In this appendix we describe the process followed to construct a dataset of JavaScript repositories from GitHub. The result is a collection of source code repositories, each labeled with the appropriate GitHub user name. To prevent topic biases during validation, we defined each code sample in the corpus to be a complete JavaScript repository. In this way the training and validation data are taken from different JavaScript projects. As a result, classification is less likely to be based on characteristics of a project, such as specific variable names and project specific style conventions. In the remainder of this appendix we first describe how we selected appropriate GitHub users. Then we detail how we cloned and included suitable repositories in the corpus.

Mining appropriate GitHub users using GHTorrent

In this work we limit ourselves to projects that have been developed by a single programmer. To select GitHub users we use the GitHub metadata provided by the GHTorrent project. We used the relational repository metadata that is offered for download as a MySQL database. This database contains the information necessary to select appropriate GitHub projects with the corresponding collaborators. Figure A.8 shows the database tables that are relevant in our context. The database table project members links users to the projects they have commit access to. We removed all projects from the dataset that are not JavaScript projects (as our target language is JavaScript). To enable a machine learning approach, many samples per author are required. Consequently, we need to find users who own a large number of projects that are exclusively developed by themselves. To find such users we imported the GHTorrent database into the network analysis tool Gephi. In this tool we represented users and projects in a directed graph, where a directed edge (u, v) represents a user u with commit access to repository v.
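The same selection can be performed without a graph tool. The sketch below assumes commit-access grants given as (user, repository) pairs and reproduces the two steps described here: keep only repositories with a single user (in-degree 1), then rank users by the number of such repositories (out-degree).

```python
from collections import Counter

def single_owner_users(edges):
    """Rank users by the number of repositories only they have commit access to.

    edges: iterable of (user, repo) pairs, one per commit-access grant.
    """
    # In-degree of each repository node: how many users can commit to it.
    repo_indegree = Counter(repo for _, repo in edges)
    # Keep only repositories with exactly one user having commit access.
    solo = [(user, repo) for user, repo in edges if repo_indegree[repo] == 1]
    # Out-degree of each user node over the remaining edges, largest first.
    out_degree = Counter(user for user, _ in solo)
    return [user for user, _ in out_degree.most_common()]
```

The head of the returned list corresponds to the users whose repositories are downloaded in the next step.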
Because we only include samples in our corpus that were developed by a single programmer, or single user in GitHub terminology, we removed all repository nodes v with in-degree larger than 1. For team work GitHub has special organization accounts, so account sharing is not to be expected. We then listed the user nodes u in descending order of their out-degree. The result is a list of GitHub users sorted by the number of repositories they own. This list is used in the next step, where we download the repositories of the users with the largest out-degree.

Downloading repositories

Unfortunately, we cannot be sure that the repositories selected in Gephi were developed by a single author. Besides the collaborators (which have direct


More information

Fraud Detection using Machine Learning

Fraud Detection using Machine Learning Fraud Detection using Machine Learning Aditya Oza - aditya19@stanford.edu Abstract Recent research has shown that machine learning techniques have been applied very effectively to the problem of payments

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

Rearranging the Order of Program Statements for Code Clone Detection

Rearranging the Order of Program Statements for Code Clone Detection Rearranging the Order of Program Statements for Code Clone Detection Yusuke Sabi, Yoshiki Higo, Shinji Kusumoto Graduate School of Information Science and Technology, Osaka University, Japan Email: {y-sabi,higo,kusumoto@ist.osaka-u.ac.jp

More information

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Hidden Loop Recovery for Handwriting Recognition

Hidden Loop Recovery for Handwriting Recognition Hidden Loop Recovery for Handwriting Recognition David Doermann Institute of Advanced Computer Studies, University of Maryland, College Park, USA E-mail: doermann@cfar.umd.edu Nathan Intrator School of

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Spring 2010 Handout Decaf Language Tuesday, Feb 2 The project for the course is to write a compiler

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

An Empirical Analysis of Communities in Real-World Networks

An Empirical Analysis of Communities in Real-World Networks An Empirical Analysis of Communities in Real-World Networks Chuan Sheng Foo Computer Science Department Stanford University csfoo@cs.stanford.edu ABSTRACT Little work has been done on the characterization

More information

Associating Terms with Text Categories

Associating Terms with Text Categories Associating Terms with Text Categories Osmar R. Zaïane Department of Computing Science University of Alberta Edmonton, AB, Canada zaiane@cs.ualberta.ca Maria-Luiza Antonie Department of Computing Science

More information

Repeating Segment Detection in Songs using Audio Fingerprint Matching

Repeating Segment Detection in Songs using Audio Fingerprint Matching Repeating Segment Detection in Songs using Audio Fingerprint Matching Regunathan Radhakrishnan and Wenyu Jiang Dolby Laboratories Inc, San Francisco, USA E-mail: regu.r@dolby.com Institute for Infocomm

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

CS 6353 Compiler Construction Project Assignments

CS 6353 Compiler Construction Project Assignments CS 6353 Compiler Construction Project Assignments In this project, you need to implement a compiler for a language defined in this handout. The programming language you need to use is C or C++ (and the

More information

Toward Part-based Document Image Decoding

Toward Part-based Document Image Decoding 2012 10th IAPR International Workshop on Document Analysis Systems Toward Part-based Document Image Decoding Wang Song, Seiichi Uchida Kyushu University, Fukuoka, Japan wangsong@human.ait.kyushu-u.ac.jp,

More information

On the Use of Discretized Source Code Metrics for Author Identification

On the Use of Discretized Source Code Metrics for Author Identification On the Use of Discretized Source Code Metrics for Author Identification Maxim Shevertalov, Jay Kothari, Edward Stehle, and Spiros Mancoridis Department of Computer Science College of Engineering Drexel

More information

AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE

AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE sbsridevi89@gmail.com 287 ABSTRACT Fingerprint identification is the most prominent method of biometric

More information

Predicting connection quality in peer-to-peer real-time video streaming systems

Predicting connection quality in peer-to-peer real-time video streaming systems Predicting connection quality in peer-to-peer real-time video streaming systems Alex Giladi Jeonghun Noh Information Systems Laboratory, Department of Electrical Engineering Stanford University, Stanford,

More information

Beijing University of Posts and Telecommunications, Beijing, , China

Beijing University of Posts and Telecommunications, Beijing, , China CAR:Dictionary based Software Forensics Method 12 Beijing University of Posts and Telecommunications, Beijing, 100876, China E-mail: yangxycl@bupt.edu.cn Hewei Yu National Computer Network and Information

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072

More information

Filtering Bug Reports for Fix-Time Analysis

Filtering Bug Reports for Fix-Time Analysis Filtering Bug Reports for Fix-Time Analysis Ahmed Lamkanfi, Serge Demeyer LORE - Lab On Reengineering University of Antwerp, Belgium Abstract Several studies have experimented with data mining algorithms

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Fall 2005 Handout 6 Decaf Language Wednesday, September 7 The project for the course is to write a

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Timothy Glennan, Christopher Leckie, Sarah M. Erfani Department of Computing and Information Systems,

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Math Search with Equivalence Detection Using Parse-tree Normalization

Math Search with Equivalence Detection Using Parse-tree Normalization Math Search with Equivalence Detection Using Parse-tree Normalization Abdou Youssef Department of Computer Science The George Washington University Washington, DC 20052 Phone: +1(202)994.6569 ayoussef@gwu.edu

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

Automatic Labeling of Issues on Github A Machine learning Approach

Automatic Labeling of Issues on Github A Machine learning Approach Automatic Labeling of Issues on Github A Machine learning Approach Arun Kalyanasundaram December 15, 2014 ABSTRACT Companies spend hundreds of billions in software maintenance every year. Managing and

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts

More information

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction International Journal of Computer Trends and Technology (IJCTT) volume 7 number 3 Jan 2014 Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction A. Shanthini 1,

More information

Similarities in Source Codes

Similarities in Source Codes Similarities in Source Codes Marek ROŠTÁR* Slovak University of Technology in Bratislava Faculty of Informatics and Information Technologies Ilkovičova 2, 842 16 Bratislava, Slovakia rostarmarek@gmail.com

More information

Object Purpose Based Grasping

Object Purpose Based Grasping Object Purpose Based Grasping Song Cao, Jijie Zhao Abstract Objects often have multiple purposes, and the way humans grasp a certain object may vary based on the different intended purposes. To enable

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Indoor Object Recognition of 3D Kinect Dataset with RNNs

Indoor Object Recognition of 3D Kinect Dataset with RNNs Indoor Object Recognition of 3D Kinect Dataset with RNNs Thiraphat Charoensripongsa, Yue Chen, Brian Cheng 1. Introduction Recent work at Stanford in the area of scene understanding has involved using

More information

Examining the significance of high-level programming features in source code author classification

Examining the significance of high-level programming features in source code author classification Available online at www.sciencedirect.com The Journal of Systems and Software 81 (2008) 447 460 www.elsevier.com/locate/jss Examining the significance of high-level programming features in source code

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

A Framework for Securing Databases from Intrusion Threats

A Framework for Securing Databases from Intrusion Threats A Framework for Securing Databases from Intrusion Threats R. Prince Jeyaseelan James Department of Computer Applications, Valliammai Engineering College Affiliated to Anna University, Chennai, India Email:

More information

A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail

A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail Khongbantabam Susila Devi #1, Dr. R. Ravi *2 1 Research Scholar, Department of Information & Communication

More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments WHITE PAPER Application Performance Management The Case for Adaptive Instrumentation in J2EE Environments Why Adaptive Instrumentation?... 3 Discovering Performance Problems... 3 The adaptive approach...

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Cost-sensitive Boosting for Concept Drift

Cost-sensitive Boosting for Concept Drift Cost-sensitive Boosting for Concept Drift Ashok Venkatesan, Narayanan C. Krishnan, Sethuraman Panchanathan Center for Cognitive Ubiquitous Computing, School of Computing, Informatics and Decision Systems

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report. Team Member Names: Xi Yang, Yi Wen, Xue Zhang

ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report. Team Member Names: Xi Yang, Yi Wen, Xue Zhang ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report Team Member Names: Xi Yang, Yi Wen, Xue Zhang Project Title: Improve Room Utilization Introduction Problem

More information

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Offer Sharabi, Yi Sun, Mark Robinson, Rod Adams, Rene te Boekhorst, Alistair G. Rust, Neil Davey University of

More information

Multimodal Information Spaces for Content-based Image Retrieval

Multimodal Information Spaces for Content-based Image Retrieval Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due

More information

Category Theory in Ontology Research: Concrete Gain from an Abstract Approach

Category Theory in Ontology Research: Concrete Gain from an Abstract Approach Category Theory in Ontology Research: Concrete Gain from an Abstract Approach Markus Krötzsch Pascal Hitzler Marc Ehrig York Sure Institute AIFB, University of Karlsruhe, Germany; {mak,hitzler,ehrig,sure}@aifb.uni-karlsruhe.de

More information