Scripting DNA: Identifying the JavaScript Programmer


Scripting DNA: Identifying the JavaScript Programmer

Wilco Wisse, Delft University of Technology, Software Engineering Research Group, Mekelweg 5, 2628 CD Delft, Netherlands
Cor Veenman, Netherlands Forensic Institute, Knowledge and Expertise Centre for Intelligent Data Analysis, Laan van Ypenburg 6, 2497 GB, The Hague, Netherlands

Abstract

The attribution of authorship is required in diverse applications, ranging from historical works (Shakespeare's writings, the Federalist Papers) studied for historical interest to recent novels questioned for linguistic research or out of curiosity (Robert Galbraith alias J.K. Rowling). Extensive research on this problem has resulted in effective general purpose methods. Other kinds of language utterances can be questioned for their original author as well. In particular, we are interested in identifying the offender who produced malicious software on a website. So far, this problem has hardly been studied, and mainly general purpose methods from natural language authorship attribution have been applied to it. Moreover, no suitable reference dataset is available to allow for method evaluation and method development in a supervised machine learning approach. In this work we first obtain a reference dataset of substantial size and quality. Further, we propose to extract structural features from the abstract syntax tree (AST) to describe the coding style of an author. In the experiments, we show that the specifically designed features indeed improve the authorship attribution of scripting code to programmers, especially in addition to character n-gram features.

Keywords: Authorship identification, Authorship verification, Source code, JavaScript, Abstract Syntax Tree, Syntactic features

Corresponding author addresses: wilco.wisse@gmail.com (Wilco Wisse), c.veenman@nfi.minvenj.nl (Cor Veenman). Preprint submitted to Digital Investigation, June 7, 2015.

1. Introduction

Authorship identification is roughly defined as the task of identifying the true author of a document given samples with undisputed authorship from a

finite set of candidate authors [1]. The fundamental basis of authorship identification is that language is flexible enough in its expression to identify authors purely on the basis of their individual writing style. Traditionally, this task has mainly served historical interests. Arguably the most well-known and most influential work is the authorship identification of the Federalist Papers by Mosteller and Wallace [2]. The successes in authorship identification have led to the development of more advanced automated identification techniques, which play a crucial role in various applications, ranging from cases of academic dishonesty to forensic investigations [3]. With the rapid growth and popularity of the Internet, an increasing number of criminals employ the Web to illegal ends, such as sharing child pornography and committing cyber crimes. The ease of hiding one's real identity on the Web has heightened the need for effective identification techniques in recent years [4, 5, 6]. The use of Internet technology by criminals also provides additional opportunities to detect the author of fraudulent content. To that end we address the authorship identification of source code that is embedded within web pages. In particular, we focus on authorship identification of JavaScript code. JavaScript is an interpreted scripting language that is commonly embedded within web pages. Most source code authorship identification studies in the literature focus on languages that are encountered in compiled form. In contrast, JavaScript source code is not transformed into machine language instructions, so the source code is attainable in its original form. In addition, the focus on JavaScript as target language is of great interest considering the wide-spread use of JavaScript on the Web. The identification of a JavaScript developer may serve as a useful means in cybercrime investigations, such as tracing the authors of malicious websites.
The most straightforward authorship identification task is its closed-set form, where reference data is available for a finite set of candidate authors, which surely includes documents of the true author. More difficult is the open-set form, where the questioned document may be written by an unknown, previously unseen author. In this work we consider the closed-set form and a special case of the open-set identification problem in which the set of candidate authors is a singleton. The latter problem boils down to the question of whether a document has or has not been written by a given author, and is known as authorship verification [1]. Previous studies have shown that especially character n-gram based approaches are effective in source code authorship identification [7, 8]. Such general purpose methods consider a source file as a mere sequence of characters. In this study we propose a set of language specific features that express JavaScript language properties at a higher syntactic complexity, obtained by parsing the source into an Abstract Syntax Tree (AST). The AST lends itself in particular to the extraction of structural features. Our assumption is that such features capture different properties, and thus would either individually or in addition to general purpose n-gram features improve the authorship identification task. Moreover, structural features are robust against layout modification of the source code. As a result, the layout features can easily be omitted, such

that the classification will be less susceptible to pretty printing or minification of the source code. A key issue in authorship identification studies is a reliable dataset for validation purposes. Unfortunately, no labeled JavaScript source code datasets are readily available. In this paper we propose a way to obtain a reference dataset of substantial size and quality from the world's largest repository hosting service, GitHub. With this dataset we evaluate the language specific method and compare its performance with two general purpose identification techniques, which have been reported as effective in recent literature [9, 10]. The layout of the paper is as follows. In the next section, we describe related work in authorship attribution, with a focus on programmer identification. Then we describe our programmer identification approach. The main part of this section is concerned with AST feature extraction. The following section describes the corpus construction, which is further detailed in Appendix A. In the experiments the proposed method is tested on the described corpus. We finalize the paper with conclusions and a discussion of the obtained results.

2. Related Work

Below we describe previous research that has been done in relation to our work. We first elaborate on previous work in authorship identification with a focus on programmer identification. Then we describe research in which parse tree features were used in a similar context.

Authorship Identification

Authorship identification in general can be viewed as a text classification task where a set of reference samples is given for the candidate authors. One major subtask in authorship identification is the extraction of stylistic characteristics (known as stylometric features) that differ between documents written by different authors. For natural language authorship identification, Stamatatos [5] distinguished techniques by the syntactic complexity of the extracted stylometric features.
These range from lexical [11] and character features [12, 13] to syntactic [14] and semantic features [15]. Combinations of these feature types have also been implemented, for instance in [16]. We refer to this survey [5] for an overview of authorship attribution for natural language. In this work, the focus is on authorship attribution for program source code, i.e. programmer identification. Also for this authorship identification problem, features of various syntactic complexity have been used. Most early programmer identification studies attempted to capture programming characteristics at a high level of complexity by extracting a few dozen specific textual measurements and software metrics. Such features may be divided into layout features (which relate to typographic aspects such as indentation and spacing), style features (such as naming conventions and variable length) and structural features (which express the structural decomposition of the code) [17]. Since detailed text analysis is required to extract such features, we refer to these features

as language-specific features.

Figure 1: Profile based versus instance based authorship identification. Source file u is an unlabeled file which has to be attributed to one of the candidate authors. The profile based model has a data point for each author; the instance based approach has a data point for each individual sample.

Language specific features have been utilized in several language domains, including Pascal [18], Java [19], C [17] and C++ [20]. Next to language specific methods, source code features may be extracted at a low level of complexity by considering each source file as a mere sequence of characters or tokens. The application of character n-grams has been shown to be among the most effective identification methods in both natural language [21, 22] and source code [8]. Character n-grams implicitly capture lexical, syntactic and structural writing characteristics by representing each source file as relative frequencies of occurrence of character sequences. Since no deep linguistic analysis is required to extract character n-grams, such approaches are language independent. Different classification techniques have been utilized to attribute an unlabeled document to a candidate author. In general, profile based approaches produce a single representation (profile) for all the documents per author, and the authorship identification is based on a similarity function that quantifies the degree of shared information between the author profiles and the unlabeled document [5]. Instance based techniques, on the contrary, produce an individual representation for each document. Commonly, these approaches represent each source file by a vector in a multivariate space, and employ a machine learning algorithm for classification. Figure 1 illustrates the difference between the instance based and profile based approach.
It should be noted that similarity based classification models have been used in the instance based paradigm as well. The latter methods are sometimes referred to as nearest neighbor or information retrieval approaches, since the identification takes place by a similarity measure that is used to obtain the class labels of the reference data most similar to the unlabeled source file [23, 24].
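As an illustration of the character n-gram representation discussed above, a source file can be reduced to relative n-gram frequencies in a few lines. This is a minimal sketch in JavaScript (the function name is ours, not taken from the cited studies); real systems additionally restrict the representation to the most frequent n-grams over the whole corpus.

```javascript
// Map a source string to the relative frequency of each character
// n-gram it contains. The result is a Map from n-gram to frequency.
function charNgramProfile(source, n) {
  const counts = new Map();
  for (let i = 0; i + n <= source.length; i++) {
    const gram = source.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  const total = source.length - n + 1; // number of n-grams in the file
  const freqs = new Map();
  for (const [gram, c] of counts) freqs.set(gram, c / total);
  return freqs;
}
```

Because no parsing is involved, the same function applies unchanged to any programming language or to natural language text, which is exactly the language independence noted above.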

Parse Tree Features

In natural language, syntactic structures have proven to be good features for authorship identification [25]. A parse tree is a convenient way to determine the syntactic structure of sentences [14]. Baayen [26] was the first to extract rewrite-rule frequencies from the parse tree for the purpose of authorship identification. Rewrite-rules represent the combinations of a node and its immediate constituents in the tree. Later authorship identification studies have examined related tree characteristics such as the depth of the syntax trees [27], the frequency of parent-child node types [14], n-grams defined on node types [28], and frequent subtree patterns [25]. In source code, the Abstract Syntax Tree (AST) is a convenient way to represent the syntactic structure of a program. ASTs are usually employed for code analysis in compilers and developer tools, but also enable the selection of detailed structural features for programming style characterization. To this end, AST subtrees have been used in code clone and plagiarism detection applications to compute the structural similarity between programs [29, 30]. However, these plagiarism studies deal with the detection of approximate matches of larger chunks of code, which are the result of copy-paste modifications.

3. Approach

We now describe our approach for programmer identification of JavaScript source code. We deal with two programmer identification tasks: closed-set identification and programmer verification. For closed-set identification, reference data is available for a finite set of authors. The disputed JavaScript file has to be assigned to one of these candidate authors. For the verification task, reference data is only available for one candidate programmer. We approach these problems as machine learning problems in which a properly designed feature set is essential. In our proposal, the feature set consists of features derived from the parse tree.
In the next section, we describe how we obtain these features. After that, we work out our machine learning models for closed-set identification and programmer verification.

Feature Extraction

In this work we utilize the parse tree for the extraction of stylometric features. The parse trees are obtained by the Esprima [31] parser, which generates a Mozilla compatible JavaScript AST [32]. Figure 2 shows an example of such an AST. Every node has a corresponding node object, which indicates the node type and has a number of corresponding attributes (child nodes) which are either expression or statement nodes. In the AST, the names of the attributes are indicated as labels on the edges. For instance, the function call node in figure 2 is represented by a CallExpression object that implements the following interface [32]:

Figure 2: An AST corresponding to the JavaScript code foo.bar(a,1).

    interface CallExpression <: Expression {
        type: "CallExpression";
        callee: Expression;
        arguments: [ Expression ];
    }

Traversing the parse tree nodes allows us to extract detailed language specific features from the node objects, which are detailed in the remainder of this section.

Structural features

The AST lends itself in particular to the extraction of features related to the tree structure. First, we tracked the lengths of the node lists that are present in the AST nodes. The length of these lists reflects the number of children of a node, such as the number of arguments defined in a function declaration and the number of elements initialized in an array. Additionally, we tracked the number of descendant nodes of particular node types (i.e. the number of nodes having a common ancestor). This is depicted in figure 3a. The number of descendant nodes indicates the complexity of node attributes, e.g. whether identifiers or comprehensive objects are passed as arguments in a CallExpression. To capture how the nodes are structured in the AST, we maintain the frequency of the most frequent node n-grams for n = 1, 2 and 3. We define node n-grams as contiguous sequences of nodes in the AST, where each node in this sequence is a child of its preceding node (see figure 3b). The frequency of node uni-grams (i.e. n = 1) captures the frequency of individual node types. This may for instance indicate the preference for different loop types. Also, the appearance of NewExpression and MemberExpression nodes may be an indicator of an object-oriented programming style. Furthermore, with node 2- and 3-grams we aim to capture how nodes are connected to each other in the tree.
For example, different expression types may be used as callee in a CallExpression, such as a member expression or an identifier. The appearance of such features may reflect characteristic habits in program structure between different programmers.
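The node n-gram extraction described above can be sketched as a recursive traversal over an ESTree-shaped AST, the object format Esprima produces. The following is a minimal illustration under our own naming, not the paper's implementation; the hand-built example tree corresponds to foo.bar(a,1) from figure 2.

```javascript
// Count node n-grams: chains of n node types where each node in the
// chain is a child of its predecessor. ESTree nodes are plain objects
// whose `type` field names the node and whose other fields hold child
// nodes, arrays of child nodes, or scalar attributes.
function nodeNgrams(node, n, chain = [], counts = new Map()) {
  if (!node || typeof node.type !== "string") return counts;
  const path = [...chain, node.type].slice(-n); // n-gram ending here
  if (path.length === n) {
    const key = path.join(">");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  for (const value of Object.values(node)) {
    const children = Array.isArray(value) ? value : [value];
    for (const child of children) {
      if (child && typeof child === "object") {
        nodeNgrams(child, n, [...chain, node.type], counts);
      }
    }
  }
  return counts;
}

// Hand-built AST for `foo.bar(a, 1)`, as in figure 2:
const ast = {
  type: "CallExpression",
  callee: {
    type: "MemberExpression",
    object: { type: "Identifier", name: "foo" },
    property: { type: "Identifier", name: "bar" },
  },
  arguments: [
    { type: "Identifier", name: "a" },
    { type: "Literal", value: 1 },
  ],
};
```

With n = 1 this reduces to counting individual node types; with n = 2 it yields pairs such as CallExpression>MemberExpression, matching the callee example in the text.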

Figure 3: Examples of language specific properties related to the code of figure 2: (a) number of descendant nodes of an argument; (b) node 2-gram; (c) adding layout nodes to the AST.

Layout features

Program layout features deal specifically with the layout of the program, such as the use of indentation and spacing. In the AST, this information is disregarded, since it is irrelevant for program analysis during code compilation or interpretation. Since the layout may be an important marker of coding style, we added layout information to the AST by introducing additional nodes of the types Layout, BlockComment and LineComment. The process is clarified in figure 3c. Adding the layout nodes was done by inspecting the raw source code while traversing the parse tree. The BlockComment and LineComment nodes contain the raw comments as attribute, while the Layout nodes contain the spacing that is used at the given position within the source code, e.g. spaces before and after brackets. For every layout position in each node type, we recorded the number of times that zero, one, two or many spaces, tabs and line breaks were used. A number of layout positions with roughly the same meaning were grouped, such as the layout slots before closing parentheses and the layout slots before different operators in binary expressions.

Style features

A disadvantage of layout features is that layout may easily be obfuscated by pretty printers and source code formatters. Style features concern stylistic characteristics that are less susceptible to automatic change. In this feature category, we recorded style information of comments, naming conventions

and data types. A number of nodes in the AST contain textual information that can be used to this end. For comments we maintain the length of block and line comments, the ratio between line and block comments, and the parent node type of comments. The latter should reflect where comments are placed in the source code. Next, naming conventions and the use of literal data types are extracted by applying regular expressions to identifier names and literal values. We measured the length of literal values and identifier names, the use of capital letters in identifier names and the use of different literal data types (see table 1).

Feature representation

The discussed features are used to represent each JavaScript project as a feature vector in multivariate space. Table 1 presents the number of dimensions of the feature space that correspond to the discussed layout, style and structural features. For each feature we maintain a histogram distribution, which records the frequency of observed values related to the feature. This approach is comparable to the approach of Lange [33]. For instance, to express the length of identifiers in the source code, the x-axis of the histogram corresponds to every recorded identifier length, while the y-axis represents the number of times an identifier of that length occurred in the source code. After generating the raw histogram distributions for each project, we normalize the bin values. This enables us to compare distributions of features between different programmers.
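A minimal sketch of how such style features can be recorded as normalized histograms, using naming conventions as the example: identifiers are classified by regular expressions and tallied, and the histogram is then normalized in the two ways described in the text. The convention categories and function names here are illustrative assumptions, not the paper's exact measurement list.

```javascript
// Illustrative naming-convention categories, tried in order.
const CONVENTIONS = [
  ["camelCase", /^[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+$/],
  ["snake_case", /^[a-z][a-z0-9]*(?:_[a-z0-9]+)+$/],
  ["UPPER_CASE", /^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*$/],
  ["PascalCase", /^(?:[A-Z][a-z0-9]*){2,}$/],
];

function namingStyle(name) {
  for (const [style, re] of CONVENTIONS) if (re.test(name)) return style;
  return "other";
}

// Histogram of naming styles over a project's identifiers.
function styleHistogram(identifiers) {
  const h = new Map();
  for (const id of identifiers) {
    const s = namingStyle(id);
    h.set(s, (h.get(s) || 0) + 1);
  }
  return h;
}

// Binary normalization: a bin becomes 1 if observed at least once.
function normalizeBinary(h) {
  return new Map([...h].map(([k, v]) => [k, v > 0 ? 1 : 0]));
}

// Feature-wise normalization: bins of one measure sum to 1.
function normalizeSum(h) {
  const total = [...h.values()].reduce((a, b) => a + b, 0);
  return new Map([...h].map(([k, v]) => [k, v / total]));
}
```

The same histogram-and-normalize pattern applies to the other measures in table 1, such as identifier lengths or comment lengths; only the value being binned changes.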
We evaluated both binary normalization (such that each histogram value becomes 1 if it is observed at least once and 0 otherwise) and feature wise normalization, in which the value of each bin is divided by the sum of the bin values of the same measure (such that the sum of occurrences in each histogram becomes 1).

Classification in Closed-set Programmer Identification

In closed-set programmer identification, reference data is available for each candidate programmer and the questioned source file surely belongs to one of the candidate programmers. The described multivariate representation of the available source files enables us to adopt a machine learning approach for classification. We modeled the source code programmer identification problem as a multi-class classification problem, where the reference data of the candidate programmers acts as training data. The trained multi-class classifier predicts the authorship of previously unseen documents.

Classification in Programmer Verification

Next to closed-set programmer identification we address programmer verification. The goal of programmer verification is to determine whether or not a source file is written by a certain target programmer. Consequently, the questioned documents may be written by a previously unseen programmer for which no training data is provided. The absence of these data raises the question of how to train the classifier. In the literature two approaches have commonly been used, namely intrinsic and extrinsic classification models [1]. In the intrinsic classification model only the target class (corresponding to the target programmer) is

Table 1: Number of defined measurements on the AST.

Structural
  Node 1-grams (expressions): 19
  Node 1-grams (statements): 17
  Node 2-grams: 433
  Node 3-grams: 546
  Descendant count of nodes: 110
  Length of lists defined in nodes: 126

Style
  String patterns on identifiers (naming conventions): 16
  Type of comments (block/line): 2
  Type of parent node of comments: 33
  Length of comments: 18
  Literal data types (string with double/single quotes, null value, number, boolean or regex)

Layout
  Number of tabs at a position: 121
  Number of spaces at a position: 121
  Number of returns at a position

modeled by the available training data. In this case the problem is handled as a one-class classification problem. In this work we adopt the extrinsic classification model, where the problem is modeled as a two-class classification problem. In this case both the target and the outlier classes are modeled. The outlier class represents source code developed by programmers different from the target programmer. Since the outlier class is very heterogeneous, there should be enough representative source code samples available as reference data. An unlabeled document is then attributed to one of the two classes when the confidence exceeds a certain threshold. An ROC curve is usually used to characterize the performance of the classifier as this threshold is varied.

4. Corpus Construction

To evaluate the performance of the proposed method, a labeled set of JavaScript source code samples is needed. Unfortunately, no such datasets are readily available. A naive approach would be to collect JavaScript code manually from websites. A problem, however, is that it is often unclear who the true programmer of the source code is, especially when multiple programmers collaborated on the same code base. Therefore, we propose an automated approach by collecting JavaScript projects from GitHub. GitHub is the world's largest online repository hosting service and includes many JavaScript repositories [34].
The collected GitHub repositories can be used as reference data to model the programming style of each of the programmers. The history information in the Git repository allows us to determine the author of each line of code in a project.

Table 2: Statistics about the validation corpus after cleaning the dataset.

                        Min.   Max.   Mean   Median   Std.
  Author size (repos)
  Author size (kb)
  Author size (LOC)
  Repo size (files)
  Repo size (kb)
  Repo size (LOC)

In this paper we define a source code sample as a single repository. Also, we only use projects that have been developed by a single programmer, since the unit of recognition is a project. In addition, we need ground truth on the true original programmer per project. To enable a machine learning approach, multiple source code samples per programmer are required to develop an accurate identification model. This means that programmers should be selected who own a large number of repositories exclusively developed by themselves. To be able to find such GitHub users, we used the database of GitHub metadata provided by the GHTorrent project [34]. The published GHTorrent relational database includes the information necessary to select appropriate GitHub projects with the corresponding collaborators. A more detailed description of the corpus construction is found in Appendix A. Table 2 details statistics about the authors and projects that were included in the corpus. After cloning the repositories we removed the files that were irrelevant for our experiments. These are the JavaScript files that are not parsable. Also, many repositories included source code of JavaScript libraries such as jQuery in their code base. Since such libraries are developed by other programmers, they do not reflect the coding style of the owner of the repository and need to be excluded. We first attempted to remove libraries by a predefined blacklist of file names of well-known libraries. However, we found that this was not sufficient, as there was much variation in the names of the library files. Therefore we adopted the following strategy to eliminate libraries from the code base.
First, we observed that in general, libraries were much larger than other JavaScript files, and removed all files that did not have a size between 1 and 100 kB. Secondly, the entire content of a JavaScript library was usually committed in a single commit. Therefore, we determined whether all lines in a file were committed in one single commit, and removed the file if this was the case. We were confident that this approach was satisfactory after manually inspecting the remaining code. Figure 4 presents the cumulative density function of the number of files per repository and the size in kb of repositories in the constructed dataset, before and after the cleaning phase.
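The cleaning heuristic above can be sketched as a simple per-file predicate. The metadata fields used here (file size and the commit id of each line, as obtainable from e.g. git blame) are assumed precomputed; the shape of the file object and the function name are ours.

```javascript
// Keep a file only if it falls in the 1-100 kB size window and its
// lines come from more than one commit. A file whose every line was
// introduced in a single commit is treated as a vendored library.
function keepFile(file) {
  const sizeOk = file.sizeBytes >= 1024 && file.sizeBytes <= 100 * 1024;
  const singleCommit = new Set(file.lineCommits).size === 1;
  return sizeOk && !singleCommit;
}
```

The single-commit test is deliberately coarse: it also discards a genuinely authored file that was never touched after its first commit, which the manual inspection step is meant to catch.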

Figure 4: The cumulative density function of the size of the collected source code projects, in number of files and size (kb). The dashed lines correspond to the size of the repositories before cleaning the dataset.

5. Experiments

In the following we describe the experiments we performed to evaluate the proposed programming style features on the constructed dataset. We differentiate between closed-set recognition and programmer verification experiments. To be able to focus on the contribution of the proposed features and to make the experiments computationally feasible, we did initial tests on which classifier to use for the elaborate experiments and to tune the hyperparameters and other model parameters. We tested standard classifiers like linear discriminant analysis (LDA) and logistic regression, and state-of-the-art classifiers like AdaBoost, support vector machines with various kernels and regularisations, random forests, and regularized logistic regression. In these initial experiments we also tested the best normalisation of the feature vectors, either binarisation or division by the sum of all feature frequencies. There were several methods with comparably good performance. It turned out that the support vector machine with linear kernel and L2 regularisation gave robust, competitive results. This is in line with various other text mining and authorship attribution research, e.g. [4, 35]. We used the LibSVM implementation [36] for all experiments below. As normalisation, binarisation turned out to give the best results.

Closed-set Programmer Identification

We first address closed-set programmer identification. We test the performance for an increasing number of candidate authors in the identification task. This was done by repeatedly selecting a random subset of programmers from the whole dataset. For each programmer the samples are randomly split into 25 training samples and the remaining samples for validation. To accurately

estimate the accuracy, 10 iterations are performed using a different set of programmers.

Figure 5: Validation of the programmer verification task as a binary classification problem. The target class is modeled with projects of one programmer. The outlier class is constructed from projects of different programmers divided into 5 random folds. The red areas in this figure indicate the code that is used for testing, while the remaining code is used for training the classifier.

We defined samples in our corpus to be entire Git repositories. This ensures that source code of the same repository is not used both for training and validation. This enables us to validate how well the identification algorithms generalize beyond the training set to source code of previously unseen software projects. As a result, the classification results will be more representative for varying coding style between projects and application domains. One of the proposed ways to quantify the coding style of the programmers is by node n-grams defined on the AST. However, the JavaScript abstract syntax defines a large number of expression and statement types, which makes the number of node n-grams to be tracked extremely large. A number of infrequent node n-grams are therefore removed from the feature space to keep the problem computationally feasible. To eliminate infrequent n-grams, we empirically determined the most frequent node n-grams in the JavaScript source code dataset. To keep this selection process computationally feasible, the process was carried out recursively. First, we started with node 1-grams, of which the most frequently observed n-grams were extended to node 2-grams. Similarly, the most frequently observed node 2-grams were extended to node 3-grams. We ended up with 36 node 1-grams, 433 node 2-grams and 582 node 3-grams. The accuracy of the proposed technique is compared to two existing character n-gram based approaches.
First, we evaluate the performance of the Scap [9] approach. This is a profile based method that creates for each programmer a profile containing the most frequently observed n-grams in the programmer's training data. Classification takes place by a similarity measure that quantifies the degree of shared information between the program profile (which characterizes the questioned source file) and the author profiles (which characterize the source code of the programmers). The adopted similarity measure is the SPI measure, which is the cardinality of the intersection between the program profile and the author profiles. In our study, we set the profile size L to the 1500 most frequent n-grams, as was originally proposed by the authors [9]. Next to

Scap, we compare our method to an n-gram based approach that sticks to the instance based paradigm, as described in [10]. In this method, each document is represented individually in the same dimensional vector space, where each dimension corresponds to the relative frequency of occurrence of a particular n-gram. Also for the character n-gram feature representation, the classification is performed by a Support Vector Machine (SVM) with linear kernel [36]. The document representation in multivariate space may be sparse, since not all selected n-grams are necessarily observed in each document. Therefore we chose a larger number of n-grams than was done in the Scap method, and defined the feature space by the 7000 most frequent n-grams found in the whole corpus. Since the machine learning classification techniques are both feature vector based, the feature spaces of the n-gram based and language specific approaches can be combined into a single higher dimensional representation. The two feature spaces may express different programmer information, since they are extracted from different language representations. For instance, domain specific features may be better able to describe the structural aspects, while the n-grams may be better able to describe layout related aspects. As a consequence, the two feature spaces may complement each other, resulting in a more expressive characterization of coding style. Figure 6a presents the effectiveness of the techniques. The presented results are obtained by using binary feature normalization. Overall, the instance based approach that combines the domain specific and n-gram features achieved the highest accuracy.
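The Scap profiles and SPI similarity used in this comparison can be sketched as follows: a profile is the set of the L most frequent n-grams, and SPI is simply the size of the intersection of two profiles. A minimal illustration with our own function names, not the reference implementation.

```javascript
// Build a profile: the set of the L most frequent n-grams from a
// Map of n-gram frequencies (e.g. combined over an author's files).
function topL(freqMap, L) {
  return new Set(
    [...freqMap.entries()]
      .sort((a, b) => b[1] - a[1]) // most frequent first
      .slice(0, L)
      .map(([gram]) => gram)
  );
}

// SPI similarity: cardinality of the intersection of two profiles.
function spi(profileA, profileB) {
  let shared = 0;
  for (const gram of profileA) if (profileB.has(gram)) shared++;
  return shared;
}
```

Attribution then assigns the questioned file to the author whose profile yields the highest SPI score against the file's own profile.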
The figure shows that in a classification problem with 10 programmers, the combined method achieves an accuracy of 0.914, while the n-gram based method achieves an accuracy of 0.898 and the domain specific and Scap methods achieve lower accuracies. The difference in accuracy between the combined method and the n-gram based method is small, but grows as the number of candidate authors increases: with 34 candidate authors the n-gram based method achieves an accuracy of 0.851, while the combined method scores higher. When considering the feature sets separately, the n-gram based technique achieved the best performance.

Additionally, we were interested in the contribution of the individual language specific feature types of the proposed approach. This is especially important if some feature types are suspected of being changed by external tools. For instance, layout and naming style may be imposed by editors or pretty printers, so that they do not reflect the particular coding style of a programmer. Figure 6b shows the accuracy of the language specific approach when including one feature type and removing the other ones from the feature space. The feature types on the x-axis correspond to the features presented in table 1. As the results show, the structural features achieved a relatively high accuracy, which supports our hypothesis that structures in the AST are effective in distinguishing developer styles. However, the accuracy of using only layout features is higher than the accuracy of all the remaining features together (i.e. excluding the layout features from the feature space). The importance of layout features is in line with earlier results indicating that layout plays a crucial role in programmer identification of source code [19, 37].

Figure 6: Accuracy of the techniques in closed-set programmer identification. Figure 6a presents the results by training with 25 samples per programmer. Figure 6b presents the accuracy achieved by the individual feature types, which can be found in table 1.
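To make the node n-gram feature type concrete, the sketch below extracts n-grams of node types along parent-child paths in an Esprima-style AST (plain objects with a `type` field, following the Parser API format [31, 32]). The path-based definition is one plausible reading of node n-grams; the exact extraction used in the evaluated system may differ.

```python
def child_nodes(node):
    """Yield the AST child nodes of a Parser API style dict (nodes carry a 'type' field)."""
    for value in node.values():
        if isinstance(value, dict) and "type" in value:
            yield value
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict) and "type" in item:
                    yield item

def node_ngrams(node, n=2, path=()):
    """Collect n-grams of node types along root-to-leaf parent-child paths."""
    path = path + (node["type"],)
    grams = [path[-n:]] if len(path) >= n else []
    for child in child_nodes(node):
        grams += node_ngrams(child, n, path)
    return grams
```

For `var x = 1;` this yields 2-grams such as (Program, VariableDeclaration) and (VariableDeclarator, Literal); counting such tuples per document gives the node 2-gram and 3-gram features.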

Figure 7: Validation results in programmer verification. Figure 7a shows the ROC curves when trained with 7 samples per programmer. The AUC values for various numbers of training samples are presented in figure 7b.

Programmer Verification

For programmer verification we used the same dataset as for programmer identification. In the experiments we repeatedly selected one programmer as the target programmer. The projects of this programmer are randomly split into training and validation samples. The outlier class is modeled by the projects of the other programmers in the dataset. By applying 5-fold cross validation on the outlier class we examine the performance of the classification process, i.e. the projects in the outlier class are randomly partitioned into five equally sized subsets, of which one is used for validation and the others for training. This is done in such a way that the source code in the different subsets is written by different programmers. In this way we modeled the case that the programmer of the negative test samples had not been observed before, which is the most realistic situation in practice. Figure 5 illustrates one fold in this process. In the experiments the 5-fold cross validation process is repeated multiple times, such that each programmer in the corpus is used once as target. The resulting performance figures are averaged.

Figure 7a shows the ROC curves of the verification techniques. We ran the test with 7 samples for the target programmer in order to be able to differentiate between the methods; with more than 25 samples per target programmer all methods perform almost equally well. The ROC curve characterizes the trade-off between the false positive rate (FPR) and the true positive rate (TPR). Often the Area Under the Curve (AUC) value is taken as a scalar for comparison. The AUC value can be interpreted as the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [38]. Figure 7b compares the AUC values of the verification techniques for a varying number of training samples.
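The quantities above can be computed directly from classifier scores. The following sketch (with hypothetical score lists as input) implements the TPR/FPR trade-off at a decision threshold and the rank-probability interpretation of the AUC [38].

```python
def roc_point(pos_scores, neg_scores, threshold):
    """TPR and FPR when samples scoring >= threshold are accepted as the target author."""
    tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr

def auc(pos_scores, neg_scores):
    """AUC as P(random positive ranked above random negative); ties count half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Sweeping the threshold over all observed scores and collecting the resulting (FPR, TPR) pairs traces out the ROC curve; the pairwise formulation of the AUC is exactly the probabilistic interpretation cited above.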
With 7 samples used to model the target author, the combined technique achieves an AUC value of 0.962; the n-gram technique and the language specific technique score lower. The results indicate, as in closed-set programmer identification, that feature vector based classification with combined character n-gram and domain specific features was the most successful technique, while the machine learning approach with character n-grams alone was the most effective when considering the feature sets separately. In the authorship verification task we did not include the Scap method, since it was designed for a closed-set authorship identification setting.

6. Conclusions

In this work, we proposed and evaluated a language specific programmer identification technique for JavaScript source code. The first contribution of this work is the use of features directly extracted from the AST. We showed that structures in the AST, especially node 2- and 3-grams, are effective markers of a programmer's coding style. The proposed technique was compared to two techniques based on character n-grams. The results show that features that exploit language structure through the AST improve

the programmer identification tasks, both closed-set programmer identification and programmer verification. Accordingly, the best performance is obtained by fusing the AST features with the character n-gram features. For closed-set identification the achieved accuracy is 0.91 for 35 authors with 25 samples per author. For the programmer verification task the combined features achieve AUC = 0.96 with 7 training samples, while with more than 25 samples all methods show similarly good performance.

Similar to our study, a recent (at the time of writing unpublished) study also proposed to employ structures in the AST to quantify coding style [39]. Instead of adding character features to the AST features, they enrich the AST features with lexical features. The node n-grams, which play a central role in our study, were not considered. Further, their focus is on C++ source code, which is distributed in binary form after compilation. In contrast, our method has broader applicability since it focuses on JavaScript, which is commonly encountered in its source code form on webpages.

The second contribution of this research is the construction of a labeled JavaScript code dataset of substantial size, utilizing source code from the repository hosting service GitHub. The proposed method is generic and flexible enough to be used with different selection criteria, such as different programming languages and projects developed by multiple programmers. For the current research, it was important to have projects developed by single programmers. This was established by collecting source files committed by the same GitHub user. Although GitHub provides specific organization accounts for teams, we cannot be sure that ordinary user accounts have not been used by multiple programmers. If this were the case, the results could even have been better, since the identification task would then be more difficult than with true single users.
The research presented in this paper has raised several questions that provide a basis for further research. It would be interesting to see how well the method would work with short scripts such as those embedded in web pages. With the constructed corpus we could not test this, because no such small projects were available. Taking snippets from the available projects instead was not possible, because these were generally not parsable. We consider JavaScript the most suitable programming language for this problem, as it can be found as source code on web pages. Still, the method can easily be used for other programming languages as well. Indeed, the proposed structural features open up the possibility of cross language programmer identification. That is, even when the syntax definitions of the languages differ, the underlying programming structures may remain similar per programmer across programming languages. Finally, while the current study shows the effectiveness of features based on structures in the AST, the selection of more sophisticated structural features could be investigated, such as frequent subtree patterns in the AST [25].

References

[1] N. Potha, E. Stamatatos, A profile-based method for authorship verification, in: Artificial Intelligence: Methods and Applications, Springer, 2014.
[2] F. Mosteller, D. L. Wallace, Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist papers, Journal of the American Statistical Association 58 (302) (1963).
[3] S. Burrows, Source code authorship attribution, Ph.D. thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Australia (2010).
[4] R. Zheng, J. Li, H. Chen, Z. Huang, A framework for authorship identification of online messages: Writing-style features and classification techniques, Journal of the American Society for Information Science and Technology 57 (3) (2006).
[5] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology 60 (3) (2009).
[6] M. Lambers, C. J. Veenman, Forensic authorship attribution using compression distances to prototypes, in: Computational Forensics, Springer, 2009.
[7] S. Burrows, A. L. Uitdenbogerd, A. Turpin, Comparing techniques for authorship attribution of source code, Software: Practice and Experience 44 (1) (2014).
[8] M. F. Tennyson, On improving authorship attribution of source code, in: Digital Forensics and Cyber Crime, Springer, 2013.
[9] G. Frantzeskou, E. Stamatatos, S. Gritzalis, C. E. Chaski, B. S. Howald, Identifying authorship by byte-level n-grams: The source code author profile (SCAP) method, International Journal of Digital Evidence 6 (1) (2007).
[10] E. Chatzicharalampous, G. Frantzeskou, E. Stamatatos, Author identification in imbalanced sets of source code samples, in: Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on, Vol. 1, IEEE, 2012.
[11] F. Iqbal, R. Hadjidj, B. C. Fung, M. Debbabi, A novel approach of mining write-prints for authorship attribution in e-mail forensics, Digital Investigation 5, Supplement (2008) S42-S51, the Proceedings of the Eighth Annual DFRWS Conference.
[12] G. Mikros, K. Perifanos, Authorship attribution in Greek tweets using author's multilevel n-gram profiles, in: AAAI Spring Symposium Series.
[13] M. Jankowska, E. Milios, V. Keselj, Author verification using common n-gram profiles of text documents, in: Proceedings of COLING 2014.
[14] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, D. Song, On the feasibility of internet-scale author identification, in: Security and Privacy (SP), 2012 IEEE Symposium on, IEEE, 2012.
[15] P. M. McCarthy, G. A. Lewis, D. F. Dufty, D. S. McNamara, Analyzing writing styles with Coh-Metrix, in: G. Sutcliffe, R. Goebel (Eds.), FLAIRS Conference, AAAI Press, 2006.
[16] A. Abbasi, H. Chen, Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace, ACM Trans. Inf. Syst. 26 (2) (2008) 7:1-7:29.
[17] I. Krsul, E. H. Spafford, Authorship analysis: Identifying the author of a program, Computers & Security 16 (3) (1997).
[18] P. W. Oman, C. R. Cook, Programming style authorship analysis, in: Proceedings of the 17th ACM Annual Computer Science Conference, ACM, 1989.
[19] H. Ding, M. H. Samadzadeh, Extraction of Java program fingerprints for software authorship identification, Journal of Systems and Software 72 (1) (2004).
[20] S. G. MacDonell, A. R. Gray, G. MacLennan, P. J. Sallis, Software forensics for discriminating between program authors using case-based reasoning, feed-forward neural networks and multiple discriminant analysis, in: Neural Information Processing, Proceedings, ICONIP '99, 6th International Conference on, Vol. 1, IEEE, 1999.
[21] E. Stamatatos, On the robustness of authorship attribution based on character n-gram features.
[22] J. Houvardas, E. Stamatatos, N-gram feature selection for authorship identification, in: Artificial Intelligence: Methodology, Systems, and Applications, Springer, 2006.
[23] S. Burrows, A. L. Uitdenbogerd, A. Turpin, Application of information retrieval techniques for source code authorship attribution, in: Database Systems for Advanced Applications, Springer, 2009.
[24] M. Koppel, J. Schler, S. Argamon, E. Messeri, Authorship attribution with thousands of candidate authors, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006.
[25] S. Kim, H. Kim, T. Weninger, J. Han, H. D. Kim, Authorship classification: a discriminative syntactic tree mining approach, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2011.
[26] H. Baayen, H. Van Halteren, F. Tweedie, Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution, Literary and Linguistic Computing 11 (3) (1996).
[27] A. Kaster, S. Siersdorfer, G. Weikum, Combining text and linguistic document representations for authorship attribution, in: SIGIR Workshop: Stylistic Analysis of Text for Information Access.
[28] M. Tschuggnall, G. Specht, Enhancing authorship attribution by utilizing syntax tree profiles, EACL 2014 (2014) 195.
[29] M. Chilowicz, E. Duris, G. Roussel, Syntax tree fingerprinting: a foundation for source code similarity detection, Université Paris-Est.
[30] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, L. Bier, Clone detection using abstract syntax trees, in: Software Maintenance, Proceedings, International Conference on, IEEE, 1998.
[31] A. Hidayat, Esprima: ECMAScript parsing infrastructure for multipurpose analysis, accessed 2015.
[32] Mozilla Developer Network and individual contributors, Parser API, SpiderMonkey/Parser_API, accessed 2015.
[33] R. C. Lange, S. Mancoridis, Using code metric histograms and genetic algorithms to perform author identification for software forensics, in: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, ACM, 2007.
[34] G. Gousios, B. Vasilescu, A. Serebrenik, A. Zaidman, Lean GHTorrent: GitHub data on demand, in: Proceedings of the 11th Working Conference on Mining Software Repositories, ACM, 2014.
[35] G. Sidorov, F. Velasquez, E. Stamatatos, A. F. Gelbukh, L. Chanona-Hernández, Syntactic n-grams as machine learning features for natural language processing, Expert Syst. Appl. 41 (3) (2014).
[36] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1-27:27, software available at libsvm.
[37] G. Frantzeskou, S. MacDonell, E. Stamatatos, S. Gritzalis, Examining the significance of high-level programming features in source code author classification, Journal of Systems and Software 81 (3) (2008).
[38] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006).
[39] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, R. Greenstadt, De-anonymizing programmers via code stylometry. In review.

Appendix A. Approach to Construct the Corpus

In this appendix we describe the process followed to construct a dataset of JavaScript repositories from GitHub. The result is a collection of source code repositories, each labeled with the appropriate GitHub user name. To prevent topic biases during validation, we defined each code sample in the corpus to be a complete JavaScript repository. In this way the training and validation data are taken from different JavaScript projects. As a result, classification is less likely to be based on characteristics of a project, such as specific variable names and project specific style conventions. In the remainder of this appendix we first describe how we selected appropriate GitHub users. Then we detail how we cloned and included suitable repositories in the corpus.

Mining appropriate GitHub users using GHTorrent

In this work we limit ourselves to projects that have been developed by a single programmer. To select GitHub users we use the GitHub metadata provided by the GHTorrent project. We used the relational repository metadata that is offered for download as a MySQL database. This database contains the information necessary to select appropriate GitHub projects with the corresponding collaborators. Figure A.8 shows the database tables that are relevant in our context. The database table project members links users to the projects they have commit access to. We removed all projects from the dataset that are not JavaScript projects (as our target language is JavaScript). To enable a machine learning approach, many samples per author are required. Consequently, we need to find users who own a large number of projects that are exclusively developed by themselves. To find such users we imported the GHTorrent database into the network analysis tool Gephi. In this tool we represented users and projects in a directed graph, where a directed edge (u, v) represents a user u with commit access to repository v.
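The same selection can be performed without a graph tool. The sketch below assumes commit-access grants given as (user, repository) pairs and reproduces the two steps described here: keep only repositories with a single user (in-degree 1), then rank users by the number of such repositories (out-degree).

```python
from collections import Counter

def single_owner_users(edges):
    """Rank users by the number of repositories only they have commit access to.

    edges: iterable of (user, repo) pairs, one per commit-access grant.
    """
    # In-degree of each repository node: how many users can commit to it.
    repo_indegree = Counter(repo for _, repo in edges)
    # Keep only repositories with exactly one user having commit access.
    solo = [(user, repo) for user, repo in edges if repo_indegree[repo] == 1]
    # Out-degree of each user node over the remaining edges, largest first.
    out_degree = Counter(user for user, _ in solo)
    return [user for user, _ in out_degree.most_common()]
```

The head of the returned list corresponds to the users whose repositories are downloaded in the next step.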
Because we only include samples in our corpus that were developed by a single programmer, or single user in GitHub terminology, we removed all repository nodes v with in-degree larger than 1. For team work GitHub has special organization accounts, so account sharing is not to be expected. We then listed the user nodes u in descending order of their out-degree. The result is a list of GitHub users sorted by the number of repositories they own. This list is used in the next step, where we download the repositories of the users with the largest out-degree.

Downloading repositories

Unfortunately, we cannot be sure that the repositories selected in Gephi were developed by a single author. Besides the collaborators (which have direct


More information

Fraud Detection using Machine Learning

Fraud Detection using Machine Learning Fraud Detection using Machine Learning Aditya Oza - aditya19@stanford.edu Abstract Recent research has shown that machine learning techniques have been applied very effectively to the problem of payments

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

Rearranging the Order of Program Statements for Code Clone Detection

Rearranging the Order of Program Statements for Code Clone Detection Rearranging the Order of Program Statements for Code Clone Detection Yusuke Sabi, Yoshiki Higo, Shinji Kusumoto Graduate School of Information Science and Technology, Osaka University, Japan Email: {y-sabi,higo,kusumoto@ist.osaka-u.ac.jp

More information

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Hidden Loop Recovery for Handwriting Recognition

Hidden Loop Recovery for Handwriting Recognition Hidden Loop Recovery for Handwriting Recognition David Doermann Institute of Advanced Computer Studies, University of Maryland, College Park, USA E-mail: doermann@cfar.umd.edu Nathan Intrator School of

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Spring 2010 Handout Decaf Language Tuesday, Feb 2 The project for the course is to write a compiler

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

An Empirical Analysis of Communities in Real-World Networks

An Empirical Analysis of Communities in Real-World Networks An Empirical Analysis of Communities in Real-World Networks Chuan Sheng Foo Computer Science Department Stanford University csfoo@cs.stanford.edu ABSTRACT Little work has been done on the characterization

More information

Associating Terms with Text Categories

Associating Terms with Text Categories Associating Terms with Text Categories Osmar R. Zaïane Department of Computing Science University of Alberta Edmonton, AB, Canada zaiane@cs.ualberta.ca Maria-Luiza Antonie Department of Computing Science

More information

Repeating Segment Detection in Songs using Audio Fingerprint Matching

Repeating Segment Detection in Songs using Audio Fingerprint Matching Repeating Segment Detection in Songs using Audio Fingerprint Matching Regunathan Radhakrishnan and Wenyu Jiang Dolby Laboratories Inc, San Francisco, USA E-mail: regu.r@dolby.com Institute for Infocomm

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

CS 6353 Compiler Construction Project Assignments

CS 6353 Compiler Construction Project Assignments CS 6353 Compiler Construction Project Assignments In this project, you need to implement a compiler for a language defined in this handout. The programming language you need to use is C or C++ (and the

More information

Toward Part-based Document Image Decoding

Toward Part-based Document Image Decoding 2012 10th IAPR International Workshop on Document Analysis Systems Toward Part-based Document Image Decoding Wang Song, Seiichi Uchida Kyushu University, Fukuoka, Japan wangsong@human.ait.kyushu-u.ac.jp,

More information

On the Use of Discretized Source Code Metrics for Author Identification

On the Use of Discretized Source Code Metrics for Author Identification On the Use of Discretized Source Code Metrics for Author Identification Maxim Shevertalov, Jay Kothari, Edward Stehle, and Spiros Mancoridis Department of Computer Science College of Engineering Drexel

More information

AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE

AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE sbsridevi89@gmail.com 287 ABSTRACT Fingerprint identification is the most prominent method of biometric

More information

Predicting connection quality in peer-to-peer real-time video streaming systems

Predicting connection quality in peer-to-peer real-time video streaming systems Predicting connection quality in peer-to-peer real-time video streaming systems Alex Giladi Jeonghun Noh Information Systems Laboratory, Department of Electrical Engineering Stanford University, Stanford,

More information

Beijing University of Posts and Telecommunications, Beijing, , China

Beijing University of Posts and Telecommunications, Beijing, , China CAR:Dictionary based Software Forensics Method 12 Beijing University of Posts and Telecommunications, Beijing, 100876, China E-mail: yangxycl@bupt.edu.cn Hewei Yu National Computer Network and Information

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072

More information

Filtering Bug Reports for Fix-Time Analysis

Filtering Bug Reports for Fix-Time Analysis Filtering Bug Reports for Fix-Time Analysis Ahmed Lamkanfi, Serge Demeyer LORE - Lab On Reengineering University of Antwerp, Belgium Abstract Several studies have experimented with data mining algorithms

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Fall 2005 Handout 6 Decaf Language Wednesday, September 7 The project for the course is to write a

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Timothy Glennan, Christopher Leckie, Sarah M. Erfani Department of Computing and Information Systems,

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Math Search with Equivalence Detection Using Parse-tree Normalization

Math Search with Equivalence Detection Using Parse-tree Normalization Math Search with Equivalence Detection Using Parse-tree Normalization Abdou Youssef Department of Computer Science The George Washington University Washington, DC 20052 Phone: +1(202)994.6569 ayoussef@gwu.edu

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

Automatic Labeling of Issues on Github A Machine learning Approach

Automatic Labeling of Issues on Github A Machine learning Approach Automatic Labeling of Issues on Github A Machine learning Approach Arun Kalyanasundaram December 15, 2014 ABSTRACT Companies spend hundreds of billions in software maintenance every year. Managing and

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts

More information

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction International Journal of Computer Trends and Technology (IJCTT) volume 7 number 3 Jan 2014 Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction A. Shanthini 1,

More information

Similarities in Source Codes

Similarities in Source Codes Similarities in Source Codes Marek ROŠTÁR* Slovak University of Technology in Bratislava Faculty of Informatics and Information Technologies Ilkovičova 2, 842 16 Bratislava, Slovakia rostarmarek@gmail.com

More information

Object Purpose Based Grasping

Object Purpose Based Grasping Object Purpose Based Grasping Song Cao, Jijie Zhao Abstract Objects often have multiple purposes, and the way humans grasp a certain object may vary based on the different intended purposes. To enable

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Indoor Object Recognition of 3D Kinect Dataset with RNNs

Indoor Object Recognition of 3D Kinect Dataset with RNNs Indoor Object Recognition of 3D Kinect Dataset with RNNs Thiraphat Charoensripongsa, Yue Chen, Brian Cheng 1. Introduction Recent work at Stanford in the area of scene understanding has involved using

More information

Examining the significance of high-level programming features in source code author classification

Examining the significance of high-level programming features in source code author classification Available online at www.sciencedirect.com The Journal of Systems and Software 81 (2008) 447 460 www.elsevier.com/locate/jss Examining the significance of high-level programming features in source code

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

A Framework for Securing Databases from Intrusion Threats

A Framework for Securing Databases from Intrusion Threats A Framework for Securing Databases from Intrusion Threats R. Prince Jeyaseelan James Department of Computer Applications, Valliammai Engineering College Affiliated to Anna University, Chennai, India Email:

More information

A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail

A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail Khongbantabam Susila Devi #1, Dr. R. Ravi *2 1 Research Scholar, Department of Information & Communication

More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments WHITE PAPER Application Performance Management The Case for Adaptive Instrumentation in J2EE Environments Why Adaptive Instrumentation?... 3 Discovering Performance Problems... 3 The adaptive approach...

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Cost-sensitive Boosting for Concept Drift

Cost-sensitive Boosting for Concept Drift Cost-sensitive Boosting for Concept Drift Ashok Venkatesan, Narayanan C. Krishnan, Sethuraman Panchanathan Center for Cognitive Ubiquitous Computing, School of Computing, Informatics and Decision Systems

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report. Team Member Names: Xi Yang, Yi Wen, Xue Zhang

ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report. Team Member Names: Xi Yang, Yi Wen, Xue Zhang ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report Team Member Names: Xi Yang, Yi Wen, Xue Zhang Project Title: Improve Room Utilization Introduction Problem

More information

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Offer Sharabi, Yi Sun, Mark Robinson, Rod Adams, Rene te Boekhorst, Alistair G. Rust, Neil Davey University of

More information

Multimodal Information Spaces for Content-based Image Retrieval

Multimodal Information Spaces for Content-based Image Retrieval Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due

More information

Category Theory in Ontology Research: Concrete Gain from an Abstract Approach

Category Theory in Ontology Research: Concrete Gain from an Abstract Approach Category Theory in Ontology Research: Concrete Gain from an Abstract Approach Markus Krötzsch Pascal Hitzler Marc Ehrig York Sure Institute AIFB, University of Karlsruhe, Germany; {mak,hitzler,ehrig,sure}@aifb.uni-karlsruhe.de

More information