One-Class Training for Masquerade Detection

Ke Wang and Salvatore J. Stolfo
Computer Science Department, Columbia University
500 West 120th Street, New York, NY 10027
{kewang, sal}@cs.columbia.edu

Abstract

We extend prior research on masquerade detection using UNIX commands issued by users as the audit source. Previous studies using multi-class training require gathering data from multiple users to train specific profiles of self and non-self for each user. One-class training uses data representative of only one user. We apply one-class Naïve Bayes using both the multi-variate Bernoulli model and the multinomial model, as well as the one-class SVM algorithm. The results show that one-class training for this task works as well as multi-class training, with the great practical advantages of collecting much less data and more efficient training. One-class SVM using binary features performs best among the one-class training algorithms.

1. Introduction

The masquerade attack may be one of the most serious security problems. It commonly appears as spoofing, where an intruder impersonates another person and uses that person's identity, for example by stealing their passwords or forging their email address. Masqueraders can be insiders or outsiders. As an outsider, the masquerader may try to gain superuser access from a remote location and can cause considerable damage or theft. A simpler insider attack can be executed against an unattended machine within a trusted domain. From the system's point of view, all of the operations executed by an insider masquerader may be technically legal and hence not detected by existing access control or authentication schemes. To catch such a masquerader, the only useful evidence is the operations he executes, i.e., his behavior. Thus, we can compare a user's recent behavior against their profile of typical behavior and recognize a security breach if the recent behavior departs sufficiently from the profiled behavior, indicating a possible masquerader.

The insider problem in computer security is shifting the attention of the research and commercial community away from intrusion detection at the perimeter of network systems. Research and development is ongoing in the area of modeling user behaviors in order to detect anomalous misbehaviors of importance to security; for example, the behavior of user-issued OS commands, as represented in this paper, and of email communications [17]. Considerable work is ongoing in certain communities to detect not only impersonation, but also author identification. For example, Sedelow [16] and de Vel [18] are two examples bracketing the length of time this topic has existed in the literature.

The masquerade problem is a challenging one. If the masquerader can mimic the user's behavior successfully, he won't be detected. In addition, if the user himself is behaving much differently than his trained profile, the detector will misclassify him as a masquerader, which may cause annoying false alarms. There have been several attempts to solve this problem using command line sequences [14, 9]. The best results reported so far are 60-70% accuracy with a false positive rate as low as 1-2%. The profiles were computed using supervised machine learning algorithms that classify training data acquired from multiple users. These approaches treated training user profiles as a multi-class supervised learning task, where the data gathered on a user is treated as an example of one class, i.e., a distinct user. In this paper, we consider a different approach with a substantial practical advantage. We examine the task of profiling a user by modeling his data exclusively, without using examples from other users, while achieving good detection performance and minimal false positive rates.
We also consider alternative machine learning algorithms that may be employed for this one-class training approach. One-class training means that we use only the user's own legitimate examples of commands they issue to build the user's self profile. Previous work uses both positive and negative examples to build both self and non-self profiles, except for Maxion [9], who considers the problem of determining how vulnerable a user's behavior may be to mimicry attack. Here we extend this technique using one-class SVM. This is important in many contexts, especially when the only information available is the history of the user's activities. If a one-class training algorithm can achieve performance similar to that exhibited by a multi-class approach, we may provide a significant benefit in real security applications: much less data is required, and training can proceed independently of any other user. The study reported in this paper indicates that one-class training algorithms do indeed perform as well as two-class training approaches.

This self-profile idea is similar to the anomaly detection techniques widely used in intrusion detection systems [e.g., 2, 3]. For example, the anomaly detector of IDES [8] uses established normal usage profiles, which represent the expected behavior, to identify any large usage deviation as a possible attack. Several methods have been used to model the normal data, for example decision trees [7], neural networks [4], sparse Markov transducers [2], and Markov chains [19]. In this paper, we applied one-class Naïve Bayes and one-class SVM algorithms to the masquerade dataset of UNIX command sequences.

In previous work, we believe there were several methodological flaws in the manner in which data was acquired and used. The Schonlau dataset from [14] presents each user's command line data with a varying number of artificially created masquerade command blocks, ranging from 0 to 24, out of a total of 100 command blocks to be classified. The previous work only considered the average performance of a given method when applied to all of the 50*100 blocks of commands issued by the 50 users. However, since the masquerade blocks are randomly inserted into each user's data using some other user's command blocks, each user's data has a different number of masquerade blocks, and the contents of these masquerade blocks all differ. This data is not a good baseline for comparing the effectiveness of alternative detection methods, because one method might be better at detecting certain forms of masquerade attack while others are not. Unfortunately, because of the way such masquerade blocks are distributed in the dataset, some algorithms appear to perform better than others while, in practice or in other contexts, this finding may not be true. To better compare the alternative methods proposed in this work, we follow the exhaustive 1v49 evaluation methodology from [9], which will be described in detail in the section about the experimental methodology and results. The ROC score [5] is used to compare several one-class training methods under different false positive rate restrictions. This approach provides a better view of which algorithm is better for most users. Even though one machine learning algorithm may perform better than another in terms of detection and false positive rates, our most important finding is that one-class training can indeed perform as well as multi-class training. Even so, the overall performance of both approaches indicates that there is much room for improvement.

The rest of the paper is organized as follows. Section 2 describes the dataset we used in our research and provides a review of recent work, and Section 3 discusses the one-class Naïve Bayes and one-class SVM methods. Section 4 describes the experimental setting and our results, and Section 5 concludes the paper with our analysis and findings.

2. Recent Work

Schonlau et al. [14] and Maxion [9] provide a dataset collected from keyboard commands on a UNIX platform. The dataset is publicly available at http://www.schonlau.net. Unfortunately, there are neither timestamps nor other detailed command line arguments available in the Schonlau dataset. Hence, the dataset in question consists simply of sequential truncated commands issued by each of 70 users. Each user is represented by 15,000 commands, collected over time spans from a few days to a few months. They randomly chose 50 users to serve as intrusion targets, and the remaining 20 users as masqueraders.

The first 5,000 commands of each user are clean data (legitimately issued by the user), and the next 10,000 commands of the target users were randomly injected with commands issued by the 20 masquerade users, apparently none of whom are any of the 50 users. The commands are grouped into blocks, with 100 commands per block. The commands in one block are either all clean or all masquerade attacks; the latter are called dirty blocks.
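As a concrete illustration of this setup, here is a minimal sketch of reading and blocking the data, assuming the per-user files distributed at http://www.schonlau.net contain one truncated command per line (the file name "User1" and the helper names are illustrative assumptions, not part of the original description):

```python
# Minimal sketch: read one user's command stream and group it into
# 100-command blocks; the first 50 blocks (5,000 commands) are clean
# training data, the remaining 100 blocks are the test data.
BLOCK_SIZE = 100
TRAIN_BLOCKS = 5000 // BLOCK_SIZE

def load_blocks(path):
    """Split a one-command-per-line file into fixed-size blocks."""
    with open(path) as f:
        commands = [line.strip() for line in f if line.strip()]
    return [commands[i:i + BLOCK_SIZE]
            for i in range(0, len(commands), BLOCK_SIZE)]

blocks = load_blocks("User1")         # 150 blocks for a 15,000-command user
train_blocks = blocks[:TRAIN_BLOCKS]  # the user's clean self data
test_blocks = blocks[TRAIN_BLOCKS:]   # the blocks to be classified
```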
The task studied in [14] is to accurately classify the dirty blocks as masquerade commands, while classifying the clean blocks as belonging to the user. From this point of view, the problem is very similar to a text classification problem, where the commands may be regarded as words or tokens in a document that is to be classified.

Their work identifies six masquerade-detection schemes: Bayes 1-Step Markov, Hybrid Multi-Step Markov, Incremental Probabilistic Action Modeling (IPAM), Uniqueness, Sequence-Match, and Compression. These masquerade detection methods were applied directly to this dataset; the first 5,000 commands were used for training, and the remaining 10,000 commands, with the inserted masquerade blocks, were used for testing, each processed in blocks of 100 commands. Testing and training were performed on a per-block basis. The paper by Maxion [9] presents a clear and concise review of these six methods. The strategies behind the methods are quite different, but the results, examined using cluster analysis and ROC curves, show that no method completely dominates another.

Maxion [9] improved upon Schonlau's results by applying the Naïve Bayes classification algorithm using "bag of words" features. Naïve Bayes has long been used in text classification and proved to be very efficient in this context as well. Naïve Bayes was also used in earlier work [15] on classifying malicious code attachments in email messages. Maxion presents a detailed analysis of the origins of the classification errors, revealing why some users are good masqueraders and others are not. That paper designed another experiment, called 1v49, to perform this error analysis. We also use the 1v49 experimental setting in our work, but here it is used to compare the performance of different classifiers when applied to multiple classes.

The results for these reviewed methods are displayed in Table 1 and serve as a baseline for comparison.

Table 1. Results of the previously reviewed methods.

Method                 Hits    False Positives
N. Bayes (updating)    61.5%   1.5%
N. Bayes (no upd.)     66.2%   4.6%
Uniqueness             39.4%   1.4%
Hybrid Markov          49.3%   3.2%
1-step Markov          69.3%   6.7%
IPAM                   41.4%   2.7%
Sequence Matching      36.8%   3.7%
Compression            34.2%   5.0%

3. Machine learning methods

3.1. Learning task

For this masquerade detection problem, the learning task is to build a classifier that accurately detects the masquerade commands while not misclassifying the user's legitimate commands as a masquerade. Using the Schonlau dataset, which is organized as a set of blocks of 100 commands, the learning task is to compute a binary classifier whose input is a block of 100 commands and whose output is a classification of that block as either generated by a masquerader or not. The target classification is to detect the masquerader's command blocks. Hence, the masquerader's data are positive examples, while the user's legitimate data are treated as negative examples. Thus, a true positive outcome is a correctly identified masquerade block of 100 commands, while a false positive outcome is a block of commands legitimately issued by the user but misclassified as a masquerade. In the following description, we call the masquerade blocks positive examples, and the legitimate blocks, those issued by the user himself, negative examples. One-class training means that a classifier is computed using only negative examples of the user himself as training data; it will then be used to classify both positive and negative data. Thus, the task is to positively identify masqueraders, not to positively identify a particular user.

3.2. One-class or two-class

Previous work considered the problem as a multi-class supervised training exercise. The dataset contains data for 50 users. For each user, i.e., a specific class, the first 5,000 commands are treated as negative examples, while the data from the other 49 users are treated as positive examples. It is reasonable to assume that the negative examples, which belong to the same user, are consistent, while the positive examples used in training belong to other users. For the masquerade problem, it is probably impossible and unreasonable to estimate how an attacker would behave. Thus, treating sets of other users' data as positive examples introduces a substantive bias (toward the behavior of those users, who probably were not behaving maliciously). We next present the means of implementing one-class training for the Naïve Bayes classifier and for SVM, using only data from a single user when training a classifier to profile that user.

3.3. Naïve Bayes Classifier

The Naïve Bayes classifier [12] is a simple and efficient supervised learning algorithm which has proved to be very effective in text classification and many other applications. It is based on Bayes' rule,

$$p(u \mid d) = \frac{p(u)\, p(d \mid u)}{p(d)},$$

which calculates the probability of a class given an example. Applied to the masquerade problem, it calculates the likelihood that a command block belongs to a masquerader (non-self) or to a legitimate user. The different commands $c_i$, which are used as features here, are assumed independent of each other; this is the "naïve" part of the method. There are two common models used in the Naïve Bayes classifier: one is the multi-variate Bernoulli model, and the other is the multinomial model [11]. In the multi-variate Bernoulli event model, a vector of binary attributes is used to represent a document (in our case, a block of 100 commands), indicating whether each command occurs or does not occur in the document.
The multinomial model uses the number of command occurrences to represent a document, the so-called bag-of-words approach, capturing the word frequency information in documents. According to the results of McCallum and Nigam [11], the multi-variate Bernoulli model performs better at small vocabulary sizes, while the multinomial model usually performs better at larger vocabulary sizes. Because the vocabulary size (the number of distinct commands) of this masquerade problem is 856, which is moderate in size, we want to compare both of these models for this problem.

Multi-variate Bernoulli model

Using the multi-variate Bernoulli model, a command block $d$ is represented as a binary vector $d = (b_1(d), b_2(d), \ldots, b_m(d))$, with $b_i(d)$ set to 1 if the command $c_i$ occurs at least once in the block. Here $m$ is the total number of features, i.e., the number of distinct commands. Given $p(c_i \mid u)$, the probability estimated for command $c_i$ for user $u$ from the training data, we can compute $p(d \mid u)$ of a test block $d$ as

$$p(d \mid u) = \prod_{i=1}^{m} \bigl( b_i(d)\, p(c_i \mid u) + (1 - b_i(d))\,(1 - p(c_i \mid u)) \bigr), \quad (1)$$

where $p(c_i \mid u)$ is estimated with a Laplacean prior:

$$p(c_i \mid u) = \frac{1 + N(c_i, u)}{2 + N(u)}. \quad (2)$$

$N(u)$ is the number of training examples for user $u$, while $N(c_i, u)$ is the number of documents containing the command $c_i$ for user $u$.

Multinomial model

Using the standard bag-of-words approach, each command block is represented by a feature vector $d = (n_1(d), n_2(d), \ldots, n_m(d))$, where $n_i(d)$ is the number of times command $c_i$ appears in the command block $d$. Similarly, given $p(c_i \mid u)$, the frequency computed for command $c_i$ for user $u$ from the training data, we can compute $p(d \mid u)$ of a test block $d$ as

$$p(d \mid u) = \prod_{i=1}^{m} p(c_i \mid u)^{n_i(d)}, \quad (3)$$

where $p(c_i \mid u)$ is derived from

$$p(c_i \mid u) = \frac{\sum_{j=1}^{N(u)} n_i(d_j) + \alpha}{\sum_{j=1}^{N(u)} \sum_{k=1}^{m} n_k(d_j) + \alpha\, m}. \quad (4)$$

Here $\alpha$ is used for smoothing, which controls the sensitivity to previously unseen commands. (This implies there is a non-zero probability that any command may be issued by any user.) We set it to 0.01 following [9].

One-class Naïve Bayes

Adapting the above algorithm to one-class Naïve Bayes, which trains using only examples from the user himself, is very simple. We compute $p(c_i \mid u)$ only for user $u$'s self profile. For the non-self profile, we can assume each command has equal probability $1/m$, which is essentially random. Thus, given a test block $d$, we can compare $p(d \mid \text{self})$ with $p(d \mid \text{non-self})$. The larger the ratio of $p(d \mid \text{self})$ to $p(d \mid \text{non-self})$, the more likely the command block $d$ is from user $u$. Applying the one-class Naïve Bayes algorithm to our specific dataset is also quite simple. Since each test document (a block of commands) contains a fixed number of 100 commands, the probability of non-self is the same for all tested blocks. We therefore do not have to compute the probability of non-self; we may simply compare the probability of being self to a threshold in order to decide whether the block is a masquerade block or not. Furthermore, we can easily adjust the threshold to control the false positive and detection rates.
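The scoring above is straightforward to implement. The following is a minimal sketch of equations (1)-(4), assuming blocks are lists of command strings and `vocab` is the set of all distinct commands in the dataset; computing in log space is our own choice for numerical stability, and the helper names are illustrative rather than from the paper:

```python
import math
from collections import Counter

ALPHA = 0.01  # multinomial smoothing parameter, following [9]

def train_bernoulli(train_blocks, vocab):
    """Eq. (2): Laplacean-prior estimate of p(c_i|u) from document counts."""
    n = len(train_blocks)                                  # N(u)
    df = Counter(c for b in train_blocks for c in set(b))  # N(c_i, u)
    return {c: (1 + df[c]) / (2 + n) for c in vocab}

def log_score_bernoulli(block, p):
    """Eq. (1) in log space: log p(d|u) under the Bernoulli model."""
    present = set(block)
    return sum(math.log(p[c] if c in present else 1 - p[c]) for c in p)

def train_multinomial(train_blocks, vocab):
    """Eq. (4): smoothed command frequencies over all training blocks."""
    counts = Counter(c for b in train_blocks for c in b)
    total = sum(counts.values())
    return {c: (counts[c] + ALPHA) / (total + ALPHA * len(vocab))
            for c in vocab}

def log_score_multinomial(block, p):
    """Eq. (3) in log space; assumes test commands appear in vocab."""
    return sum(n * math.log(p[c]) for c, n in Counter(block).items())

# One-class decision: every block has exactly 100 commands, so the
# non-self probability is constant and we simply threshold the self score.
def is_masquerade(block, p, log_score, threshold):
    return log_score(block, p) < threshold
```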
3.4. One-class support vector machine

Support Vector Machines (SVMs) have been shown to be highly effective in text classification as well [6], among other important learning tasks. They are maximal-margin classifiers, rather than probabilistic as is Naïve Bayes. In the two-class formulation, the basic idea is to map feature vectors to a high-dimensional space and to compute a hyperplane that not only separates the training vectors of different classes, but also maximizes this separation by making the margin as large as possible. Schölkopf et al. [13] proposed a method to adapt the SVM algorithm to one-class SVM, which uses only examples from one class, instead of multiple classes, for training. The one-class SVM algorithm first maps the input data into a high-dimensional feature space via a kernel function and treats the origin as the only example from other classes. It then iteratively finds the maximal-margin hyperplane that best separates the training data from the origin.

Considering that our training data are $x_1, x_2, \ldots, x_l \in X$, and $\Phi: X \to F$ is the feature mapping to a high-dimensional space, we can define the kernel function as

$$k(x, y) = (\Phi(x) \cdot \Phi(y)).$$

Using kernel functions, the feature vectors need not be computed explicitly, greatly improving computational efficiency, since we can directly compute the kernel values and operate on their images. Some common kernels are the linear, polynomial, and radial basis function (RBF) kernels:

Linear kernel: $k(x, y) = (x \cdot y)$
p-th order polynomial kernel: $k(x, y) = (x \cdot y + 1)^p$
RBF kernel: $k(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$
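Written out directly, the three kernels are simple functions of the input vectors (an illustrative sketch; the parameter defaults are arbitrary):

```python
import numpy as np

def linear_kernel(x, y):
    return float(np.dot(x, y))

def polynomial_kernel(x, y, p=2):
    return float(np.dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2)))
```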

Now, solving the one-class SVM problem is equivalent to solving the dual quadratic programming (QP) problem:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu l}, \;\; \sum_i \alpha_i = 1,$$

where $\alpha_i$ is a Lagrange multiplier, which can be thought of as a weight on example $x_i$, and $\nu$ is a parameter that controls the trade-off between maximizing the number of data points contained by the hyperplane and the distance of the hyperplane from the origin. After solving for the $\alpha_i$, we can use a decision function to classify data. The decision function is

$$f(x) = \operatorname{sgn}\Bigl(\sum_i \alpha_i k(x_i, x) - \rho\Bigr),$$

where the offset $\rho$ can be recovered by $\rho = \sum_j \alpha_j k(x_j, x_i)$ for any example $x_i$ whose multiplier $\alpha_i$ lies strictly within its bounds.

In our work, we used LIBSVM 2.4 [1], available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, for our experiments. LIBSVM is an integrated tool for support vector classification and regression that implements Schölkopf's algorithm for one-class SVM. We used the default RBF kernel and the default values of the parameters for one-class SVM.

Another problem to consider for one-class SVM is how to represent the features. We used both a word-count representation and a binary representation, which are equivalent to the multinomial model and the multi-variate Bernoulli model of the Naïve Bayes algorithm, respectively. The vectors are normalized to length 1.
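As a sketch of this setup in code, the snippet below uses scikit-learn's OneClassSVM (itself built on LIBSVM) in place of the LIBSVM 2.4 tools used in the paper; the binary feature construction and unit-length normalization follow the description above, while the helper names and the reuse of `train_blocks` and `test_blocks` from the earlier sketch are our own assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def binary_vector(block, index):
    """Binary (Bernoulli-style) features, normalized to unit length."""
    v = np.zeros(len(index))
    for c in set(block):
        if c in index:
            v[index[c]] = 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

vocab = sorted({c for b in train_blocks + test_blocks for c in b})
index = {c: i for i, c in enumerate(vocab)}
X_train = np.array([binary_vector(b, index) for b in train_blocks])
X_test = np.array([binary_vector(b, index) for b in test_blocks])

clf = OneClassSVM(kernel="rbf", nu=0.5)   # RBF kernel, default parameters
clf.fit(X_train)                          # train on the user's own blocks only
labels = clf.predict(X_test)              # +1 = self, -1 = possible masquerade
scores = -clf.decision_function(X_test)   # higher = more masquerade-like
```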

4. Evaluation

We conducted two sets of experiments. The first experiment repeats the experimental methodology of [14]. We show that the performance of one-class training is almost the same as the performance of multi-class training; this is a significant finding on its own. The second experiment aims to compare the performance of the one-class training algorithms when applied to multiple users. Following [9], we will call the first the SEA experiment, after the authors' names in [14], Schonlau et al. The second experiment is called 1v49, because we train using only one user's data and test on all other 49 users' data.

4.1. SEA Experiment

Recall that in this experiment, the first 5,000 commands of a user serve as negative examples, and the first 5,000 commands of all the other 49 users serve as positive examples. The resultant classifier is tested on the remaining 10,000 commands of the user, which have dirty command blocks inserted, under a probability distribution, from users other than the 50 users whose commands were trained on. For our one-class algorithms, the test data remains the same, but we only use one user's training data (the first 5,000 commands) as negative examples for training; there are no positive examples used at all to compute the classifier.

First we compare the Naïve Bayes classifier using the multi-variate Bernoulli model and the multinomial model. Then we compare the one-class training algorithms against the multi-class training algorithms. None of the classification algorithms we tested updates any parameters during the test, because we want to compare their pure performance. Maxion [9] used an empirically pre-fixed threshold to update the classifier during the experiment, which is not an intrinsic characteristic of the classification algorithm.

[Figure 1]

Figure 1 displays plots comparing the multi-variate Bernoulli model and the multinomial model of the Naïve Bayes classifier. When using multi-class training, the multinomial model is obviously better than the Bernoulli model. But the difference is not so obvious in one-class training, especially when the false positive rate is low. We thus compare both models in the following 1v49 experiment.

To compare the performance of the one-class training algorithms against the multi-class training algorithm on the same test data, we plot the ROC curves displayed in Figure 2. For the multi-class training algorithm, we use only the multinomial-model Naïve Bayes algorithm as the baseline for comparison, which is better than the Bernoulli model and has been shown to be the best among the variety of methods described in [9]. For the one-class SVM, we compare both the binary and word-count representations. From Figure 2, we can see that only one-class SVM using the word-count representation is a little bit worse than the other three methods. One-class SVM using the binary representation and one-class Naïve Bayes achieved almost the same performance as the two-class Naïve Bayes algorithm.

We also compare, in Figure 3, the performance of all the previous algorithms from Table 1 to the one-class SVM algorithm using binary features, which is the best among the one-class training algorithms. One-class SVM-binary is better than most of the previous algorithms except the two-class multinomial Naïve Bayes algorithm with updating. This experiment confirmed our conjecture that, for masquerade detection, one-class training is as effective as two-class training.

[Figures 2 and 3]

4.2. 1v49 Experiment

As we have pointed out, since the dataset used had randomly inserted masquerade blocks in each user's test commands (the 10,000 commands following the first 5,000), each user has a different number of dirty blocks, and the origins of these dirty blocks also differ. So the result of the SEA experiment may not illustrate the real performance of a classification algorithm. (There are too many unfixed parameters.) To better evaluate the performance of a classification algorithm, we can treat these 50 users as our selected sample of common users. If we can show that algorithm A is better than algorithm B for most of the 50 users, we can infer that A is better than B in a general sense. To meet this requirement, we follow the 1v49 experiment, but for a different purpose. We use one user's first 5,000 commands as negative training data to compute a classifier without any positive training data. For test data, we use the non-masquerade blocks from the 10,000 additional commands of the same user as negative test data, and the other 49 users' first 5,000 commands as positive test data. This data is also organized in blocks of 100 commands; a sketch of this evaluation loop appears below.

As we mentioned before, the same algorithm might perform quite differently for different users. Figure 4 illustrates the difference: it shows the ROC curves for users 2, 20, and 40 using one-class SVM with the binary feature representation. Such a difference occurs no matter which algorithm is used; the difference is determined by the characteristics of each user.
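The 1v49 evaluation loop can be sketched as follows, assuming the `load_blocks` helper from earlier, a label array marking which of a user's test blocks are masquerades, and generic `train_profile`/`score` functions standing in for any of the one-class methods (all names are illustrative):

```python
def one_v_49(user_file, other_files, masquerade_labels):
    """Score one user's self test blocks (negatives) against the other
    49 users' training blocks (positives) using the user's own profile."""
    blocks = load_blocks(user_file)
    profile = train_profile(blocks[:50])      # first 5,000 commands
    scored = []
    # Negatives: the user's own non-masquerade test blocks.
    for block, dirty in zip(blocks[50:], masquerade_labels):
        if not dirty:
            scored.append((score(profile, block), 0))
    # Positives: each other user's first 5,000 commands, as 100-command blocks.
    for other in other_files:
        for block in load_blocks(other)[:50]:
            scored.append((score(profile, block), 1))
    return scored   # (score, label) pairs, ready for ROC analysis
```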

[Figure 4]

To compare the different methods for multiple users, we compute the ROC score for each user. In general, a ROC score is the fraction of the area that falls under the ROC curve; the larger, the better. A ROC score of 1 means perfect detection without any false positives. Figure 5 below shows the ROC scores for users 20 and 40 using the one-class SVM-binary algorithm.

[Figure 5]

For the masquerade problem, we are more interested in the region of the ROC curve with a low false positive rate; otherwise, the annoyance level of false alarms would render the detector useless in practice. Therefore, we restrict the ROC scores to the parts of the curves with false positive rates lower than P%, which we call the ROC-P score. For example, if we want to restrict the false positives to be lower than 5% of all command blocks, we can compute ROC-5. Similar to the general ROC score, the ROC-P score is the fraction of the area under the ROC curve where the false positive rate is lower than P%. Figure 7 displays an example of ROC-10, based on the ROC curves of users 20 and 40. Only part of each ROC curve is drawn there, to highlight the plots.
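The ROC-P score can be computed by clipping the ROC curve at the false positive cutoff and normalizing the remaining area. Below is a small sketch, assuming scikit-learn (our choice, not the paper's tooling) and (score, label) pairs like those produced by the 1v49 loop above; the normalization by the cutoff, so that 1.0 means perfect detection within the false positive budget, is our reading of the definition:

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_p_score(labels, scores, p):
    """Area under the ROC curve restricted to FPR <= p%, normalized so
    that 1.0 means perfect detection within that false positive budget."""
    fpr, tpr, _ = roc_curve(labels, scores)
    cutoff = p / 100.0
    grid = np.linspace(0.0, cutoff, 200)   # FPR values up to the cutoff
    tpr_at = np.interp(grid, fpr, tpr)     # interpolated ROC curve
    return float(np.trapz(tpr_at, grid) / cutoff)

# Example usage with the 1v49 sketch:
#   pairs = one_v_49(...); s, y = zip(*pairs)
#   roc5, roc1 = roc_p_score(y, s, 5), roc_p_score(y, s, 1)
```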

Figure 6 illustrates the performance of several one-class training algorithms as measured by ROC scores; the figure includes results for all 50 users. From Figure 6, we can see that one-class SVM using word-count features is the worst among the four algorithms. In the high-ROC-score region, with ROC scores higher than 0.8 (which is what we prefer), one-class SVM using binary features performs best of all. There is no big difference between Naïve Bayes using the multinomial model and the multi-variate Bernoulli model.

[Figures 6 and 7]

Since one-class SVM using the binary features is generally better than one-class SVM using the word-count features, as depicted in Figure 6, we compare only the one-class SVM using the binary representation with the multinomial-model and Bernoulli-model Naïve Bayes in the following ROC-P comparison. Figure 8 plots the comparison for ROC-5 and ROC-1, which means false positive rates below 5% and 1%, respectively. From these two plots, we can determine that one-class SVM using the binary features is almost always better than the other two one-class Naïve Bayes methods.

[Figure 8]

To compare the performance of different algorithms on an individual-user basis, we compare the ROC-P scores user by user. Figure 9 shows a user-by-user comparison of one-class SVM using the binary feature representation and one-class Naïve Bayes using the multinomial model, when the false positive rate is lower than 1%. Again we can see that, for most of the 50 users, one-class SVM with binary features is better than one-class Naïve Bayes using the multinomial model. However, there are still some users whose data exhibit better performance under one-class Naïve Bayes. This suggests that we could choose the best algorithm for each individual user to improve the whole system's performance.

[Figure 9]

5. Discussion

From our work we can see that one-class SVM using binary features performs better than one-class Naïve Bayes and one-class SVM using word-count features. Even so, masquerade detection is a very hard problem, and none of the three algorithms achieved very high accuracy with near-zero false positive rates for every user. This is partly caused by the inherent nature of the data available and the difficulty of the problem. We would like to reapply these methods using a richer set of data, as described by Maxion [10], incorporating command arguments. We also believe that temporal data associated with each user's sequential commands would provide considerable value in improving performance. Another problem to consider for the practical utility of these approaches is resiliency to direct attack; i.e., how could we protect the computed models from, for example, a mimicry attack by the masquerader?

In the experiments performed, we did not evaluate feature selection. We tested one-class SVM using the 100, 200, and 300 most frequently used UNIX commands. Each of the results is worse than had we used all of the available UNIX commands, whose total number is around 870. We also conjectured that 2-gram features (adjacent pairs of commands) would perform better than individual commands (1-grams) as features. However, we found that the results were worse when we used all of the 2-grams. In further work, we would evaluate feature selection methods to improve performance. For example, we believe a selection of features using both 1-grams and 2-grams may improve the quality of the user profiles, and thus the accuracy of the detector.
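The 2-gram variant we tested can be reproduced by simply augmenting each block's tokens with adjacent command pairs before building the feature vectors; a small illustrative sketch (the helper name is our own):

```python
def with_bigrams(block):
    """Extend a block's 1-gram commands with adjacent command pairs."""
    bigrams = [f"{a} {b}" for a, b in zip(block, block[1:])]
    return list(block) + bigrams  # feed to the same vectorizers as before
```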
A system to detect masqueraders as described in this paper should not be viewed as a single detector, but rather as evidence to be correlated with other sensors and other detectors. Thus, although the detectors described herein and in prior work are seemingly not accurate enough when one wishes to limit false positives, it may be wise to relax the threshold to generate higher true positive rates. If the output of the detector were combined with other evidence (for example, file system access anomaly detection, or other sensors), it may be possible to raise substantially the bar in protecting hosts from malicious abuse.

6. Conclusion

In this paper, to solve the masquerade detection problem, we use one-class training algorithms which train only on a user's clean data. It has been demonstrated that one-class training algorithms can achieve performance similar to multi-class methods, but require much less effort in data collection and centralized management. Besides masquerade detection, we believe one-class training is also well suited to other intrusion detection problems where sample intrusion data are hard to get or too variable to cluster. We also give a detailed comparison of the performance of the different one-class algorithms as applied to multiple users. The results show that for most users one-class SVM using the binary feature representation is better than one-class Naïve Bayes and one-class SVM using the word-count representation, especially when we want to restrict the false positive rate to a relatively low level.

In our future work, we plan to include command arguments, not only truncated commands, as features to improve the accuracy of masquerade detection. As the number of features increases, we also plan to perform feature selection to find the most informative features and to discard those that have no value for the target task.

Acknowledgments

This work was partially supported by DARPA contract No. F30602-02-2-0209. We also thank Prof. Tony Jebara for helpful suggestions and valuable comments.

References

[1] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[2] Eleazar Eskin, Wenke Lee, and Salvatore J. Stolfo, Modeling System Calls for Intrusion Detection with Dynamic Window Sizes, Proceedings of DISCEX II, June 2001.
[3] Stephanie Forrest, Steven A. Hofmeyr, Anil Somayaji, and Thomas A. Longstaff, A Sense of Self for UNIX Processes, Proceedings of the IEEE Symposium on Security and Privacy, 1996.
[4] Anup K. Ghosh and Aaron Schwartzbard, A Study in Using Neural Networks for Anomaly and Misuse Detection, Proceedings of the USENIX Security Symposium, 1999.
[5] M. Gribskov and N. L. Robinson, Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching, Computers and Chemistry, 20(1):25-33, 1996.
[6] Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proc. of the European Conference on Machine Learning (ECML), pp. 137-142, 1998.
[7] W. Lee and S. J. Stolfo, Data Mining Approaches for Intrusion Detection, Proceedings of the USENIX Security Symposium, 1998.
[8] T. Lunt, A. Tamaru, F. Gilham, R. Jagannathan, C. Jalali, H. S. Javitz, A. Valdes, and P. G. Neumann, A Real-Time Intrusion Detection Expert System, SRI CSL Technical Report SRI-CSL-90-05, June 1990.
[9] Roy A. Maxion and Tahlia N. Townsend, Masquerade Detection Using Truncated Command Lines, International Conference on Dependable Systems and Networks (DSN-02), pp. 219-228, Washington, D.C., 23-26 June 2002.
[10] Roy A. Maxion, Masquerade Detection Using Enriched Command Lines, International Conference on Dependable Systems & Networks (DSN-03), pp. 5-14, San Francisco, California, 22-25 June 2003. IEEE Computer Society Press, Los Alamitos, California, 2003.
[11] A. McCallum and K. Nigam, A Comparison of Event Models for Naive Bayes Text Classification, AAAI-98 Workshop on Learning for Text Categorization, 1998.
[12] T. M. Mitchell, Bayesian Learning, Chapter 6 in Machine Learning, pp. 154-200, McGraw-Hill, 1997.
[13] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, Estimating the Support of a High-Dimensional Distribution, Technical Report MSR-TR-99-87, Microsoft Research, 1999.

[14] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, Computer Intrusion: Detecting Masquerades, Statistical Science, 16(1):58-74, February 2001.
[15] Matthew G. Schultz, Eleazar Eskin, and Salvatore J. Stolfo, Malicious Email Filter - A UNIX Mail Filter that Detects Malicious Windows Executables, Proceedings of the USENIX Annual Technical Conference - FREENIX Track, Boston, MA, June 2001.
[16] S. Y. Sedelow, The Computer in the Humanities and Fine Arts, ACM Computing Surveys, 2(2):89-110, 1970.
[17] Salvatore J. Stolfo, Shlomo Hershkop, Ke Wang, Olivier Nimeskern, and Chia-Wei Hu, Behavior Profiling of Email, 1st NSF/NIJ Symposium on Intelligence & Security Informatics (ISI 2003), June 2-3, 2003, Tucson, Arizona.
[18] O. de Vel, A. Anderson, M. Corney, and G. Mohay, Mining Email Content for Author Identification Forensics, SIGMOD Record: Special Section on Data Mining for Intrusion Detection and Threat Analysis, December 2001.
[19] Nong Ye, A Markov Chain Model of Temporal Behavior for Anomaly Detection, Proceedings of the IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, 2000.