Detecting Outliers in High-Dimensional Datasets with Mixed Attributes

A. Koufakou 1, M. Georgiopoulos 1, and G.C. Anagnostopoulos 2
1 School of EECS, University of Central Florida, Orlando, FL, USA
2 Dept. of ECE, Florida Institute of Technology, Melbourne, FL, USA

Abstract - Outlier Detection has attracted substantial attention in many applications and research areas. Examples include detection of network intrusions or credit card fraud. Many of the existing approaches are based on pair-wise distances among all points in the dataset. These approaches cannot easily extend to current datasets that usually contain a mix of categorical and continuous attributes, and may be scattered over large geographical areas. In addition, current datasets usually have a large number of dimensions. These datasets tend to be sparse, and traditional concepts such as Euclidean distance or nearest neighbor become unsuitable. We propose ODMAD, a fast outlier detection strategy intended for datasets containing mixed attributes. ODMAD takes into consideration the sparseness of the dataset, and is experimentally shown to be highly scalable with the number of points and number of attributes in the dataset.

Keywords: Outlier Detection, Mixed Attribute Datasets, High Dimensional Data, Large Datasets.

1 Introduction

Detecting outliers in data is a research field with many applications, such as credit card fraud detection [1], or discovering criminal activities in electronic commerce. Outlier detection approaches focus on detecting patterns that occur infrequently in the dataset, versus traditional data mining strategies that attempt to find regular or frequent patterns. One of the most widely accepted definitions of an outlier pattern is provided by Hawkins [2]: "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism."

Most of the existing research efforts in outlier detection have focused on datasets with a specific attribute type, and assume that attributes are only numerical and/or ordinal, or only categorical. In the case of data with categorical attributes, techniques which assume numerical data need to first map the categorical values to numerical values, a task which is not a straightforward process (e.g., the mapping of a marital status attribute (married or single) to a numerical attribute). In the case of continuous attributes, algorithms designed for categorical data might use discretization techniques to map intervals of continuous space into discrete values, which can lead to loss of information.

A second issue is that many applications for mining outliers require the mining of very large datasets (e.g., terabyte-scale data). This leads to the need for outlier detection algorithms that scale well with the size and dimensionality of the dataset. Data may also be scattered across various geographical areas, which implies that transferring data to a central location and then detecting outliers is impractical, due to the size of the data, as well as data ownership and control issues. Thus, the algorithms designed to detect outliers must minimize the number of data scans, as well as the need for excessive communication and required synchronization.

A third issue is the high dimensionality of currently available data. Due to the large number of dimensions, the dataset becomes sparse, and in this type of setting, traditional concepts such as Euclidean distance between points, and nearest neighbor, become irrelevant [3]. Employing similarity measures that can handle sparse data becomes imperative.
Also, inspecting several smaller views of the data can help uncover outliers which would otherwise be masked by other outliers if one were to look at the entire dataset at once. In this paper, we extend our work in [4] and propose an outlier detection approach for datasets that contain both categorical and continuous attributes. Our method, Outlier Detection for Mixed Attribute Datasets (ODMAD), uses an anomaly score based on the categorical values of each data point. ODMAD then uses this score to find similarities among the points in the sparse continuous space. ODMAD is fast, efficiently handles sparse data, relies on minimal data scans, and lends itself to large and geographically distributed data.

The organization of this paper is as follows: Section 2 contains an overview of previous research in outlier detection. In Section 3, we present our outlier detection approach, ODMAD. Section 4 includes our experimental results, followed by our conclusions in Section 5.

2 Previous Work

The existing outlier detection work can be categorized as follows. Statistical-model based methods assume that a specific model describes the distribution of the data [5], which has the problem of obtaining a suitable model for each particular dataset and application [6]. Distance-based approaches (e.g. [7]) essentially compute distances among data points, and thus quickly become impractical for large datasets (e.g., a nearest neighbor method has quadratic complexity with respect to the number of dataset points). Bay and Schwabacher [8] propose a distance-based method based on randomization and pruning, and claim its complexity is close to linear in practice. Distance-based methods require data to be in the same location, or large amounts of data to be transferred from different locations, which makes them impractical for distributed data. Clustering techniques can also be employed to first cluster the data, so that points that do not belong to the formed clusters are designated as outliers. However, these methods are focused on optimizing clustering rather than finding outliers [7]. Density-based methods estimate the density distribution of the data and identify outliers as those lying in relatively low-density regions (e.g. [9]). Although these methods are able to detect outliers not discovered by the distance-based methods, they become challenging for sparse high-dimensional data [10]. Other outlier detection efforts rely on Support Vector methods [11], Replicator Neural Networks [12], or use a relative degree of density with respect only to a few fixed reference points [13]. Most of the aforementioned techniques are geared towards numerical data, and thus are more appropriate for numerical datasets or ordinal data that can be easily mapped to numerical values [14]. Another limitation of previous methods is the lack of scalability with respect to the number of points and/or the dimensionality of the dataset.

Outlier detection techniques for categorical datasets have recently appeared in the literature (e.g. [15]). In [4], we experimented with a number of representative outlier detection approaches for categorical data, and proposed AVF (Attribute Value Frequency), a simple, fast, and scalable method for categorical sets. Otey et al. [6] presented a distributed and dynamic outlier detection method for mixed attribute datasets that has linear runtime with respect to the number of data points; however, its runtime is exponential in the number of categorical attributes and quadratic in the number of numerical attributes. Regarding sparseness in high-dimensional data, Ertoz et al. [3] use the cosine function for document clustering. They construct a shared nearest neighbor (SNN) graph, and then cluster together high-dimensional points based on their shared nearest neighbors.

In this paper, we extend our previous work in [4] and propose Outlier Detection for Mixed Attribute Datasets (ODMAD), an outlier detection approach for sparse data with both categorical and continuous attributes. ODMAD exhibits very good accuracy and performance, is highly scalable with the number of points and the dimensionality of the dataset, and can be easily applied to distributed data. We compare ODMAD with the technique in [6], which is the existing outlier detection approach for distributed, mixed attribute datasets.

3 ODMAD Algorithm

The outlier detection approach proposed in this paper, ODMAD, detects outliers based on the assumption that outliers are points with highly irregular or infrequent values. In [4], we showed how this idea could be used to effectively detect outliers in categorical data.
ODMAD extends the work in [4] to explore outliers in the categorical and in the continuous space of attributes. In this mixed attribute space, an outlier can have irregular categorical values only (type a), irregular continuous values only (type b), or both (type c). The algorithmic steps of ODMAD are as follows. In the first step, we inspect the categorical space in order to detect data points with irregular categorical values. This enables us to detect outliers of type a and type c. In the second step, we set aside the points found to be irregular in the first step, and focus on the remaining points, in an attempt to detect the rest of the outliers (type b). Based on the categorical values of the remaining points, we concentrate on subsets extracted from the data, and work only on these subsets, one at a time. These subsets are considered so that we can identify outliers that would have otherwise been missed (masked) by more irregular outliers.

To illustrate our point, consider the scenario in Figure 1. Outlier point O_2 is irregular with respect to the rest of the data points, while the second outlier, O_1, is closer to the normal points. In this case, outlier point O_2 masks the other outlier point, O_1. One solution to this problem could be to sequentially remove outliers. This implies several data scans, which is impractical for large or distributed data. In Section 3.3, we explain in more detail how we address this issue by considering subsets of the data.

Figure 1: Masking Effect - Outlier O_2 is more irregular than the normal points and outlier O_1; therefore O_2 will likely mask O_1.

3.1 Categorical Score

As shown in [4], the ideal outlier in a categorical dataset is one for which each and every one of its values is extremely irregular (or infrequent). The infrequent-ness of an attribute value can be measured by computing the number of times this value is assumed by the corresponding attribute in the dataset. In [4] we assigned a score to each data point in the dataset that reflects the frequency with which each attribute value of the point occurs. In this paper, we extend this notion of outlierness to cover the likely scenario where none of the single values in an outlier point is infrequent, but the co-occurrence of two or more of its attribute values is infrequent.

We consider a dataset D with n data points, x_i, i = 1..n. If each point x_i has m_c categorical attributes, we write x_i = [x_i1, ..., x_il, ..., x_im_c], where x_il is the value of the l-th attribute of x_i. Our anomaly score for each point makes use of the idea of an itemset (or set) from the frequent itemset mining literature [16]. Let I be the set of all possible combinations of attributes and their values in dataset D. Let S be the set of all sets d such that an attribute occurs only once in each set d:

S = {d : d ∈ powerset(I) and for all l, k ∈ d, l ≠ k}

where l and k represent attributes whose values appear in set d. We also define the length of d, |d|, as the number of attribute values in d, and the frequency or support of set d as f(d), which is the number of points x_i in dataset D which contain set d. Following the reasoning stated earlier, a point is likely to be an outlier if it contains single values or sets of values that are infrequent. We say that a value or a set of values is infrequent if it appears less than minsup times in our data, where minsup is a user threshold. Therefore, a good indicator to decide if x_i is an outlier in the categorical attribute space is the score value, Score_1, defined below:

Score_1(x_i) = Σ_{d ⊆ x_i, |d| ≤ MAXLEN, f(d) < minsup} 1 / (f(d) · |d|)    (1)

Essentially, we assign an anomaly score to each data point that depends on the infrequent subsets contained in this point. As shown in [6], we obtain a good outlier detection accuracy by only considering sets of length up to a user-entered MAXLEN. For example, let point x_i = [a b c] and MAXLEN = 3; the possible subsets of x_i are: a, b, c, ab, ac, bc, and abc. If subset d of x_i is infrequent, i.e. f(d) < minsup, we increase the score of x_i by the inverse of f(d) times the length of d. In our example, if f(ab) = 3 and minsup = 5, ab is an infrequent subset of x_i, and Score_1 will increase by 1/(3 · 2) = 1/6. A higher score implies that it is more likely that the point is an outlier. If a point does not contain any infrequent subsets, its score will be zero. Score_1 is inversely proportional to the frequency, as well as to the length, of each set d that belongs to x_i. Therefore, a point that has very infrequent single values will get a very high score; a point with moderately infrequent single values will get a moderately high score; and a point whose single values are all frequent and has a few infrequent subsets will get a moderately low score.
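To make Eq. (1) concrete, the following is a minimal Python sketch of the categorical score (the paper's implementation is in C++; the data layout here, a sequence of values per point and a dictionary of subset supports, is an assumption for illustration):

from itertools import combinations

def score1(point, freq, minsup, maxlen):
    """Eq. (1): sum 1/(f(d)*|d|) over the subsets d of the point's
    categorical values with |d| <= MAXLEN and f(d) < minsup.
    `point` is a sequence of categorical values (one per attribute);
    `freq` maps frozensets of values to their support f(d)."""
    score = 0.0
    for length in range(1, maxlen + 1):
        for combo in combinations(point, length):
            d = frozenset(combo)
            f_d = freq.get(d, 0)
            if 0 < f_d < minsup:   # infrequent subset of x_i
                score += 1.0 / (f_d * length)
    return score

# The text's example: with f(ab) = 3 and minsup = 5, subset ab adds
# 1/(3*2) = 1/6 to the score of a point containing both a and b.

In ODMAD itself, as Section 3.4 explains, the enumeration is restricted to infrequent single values plus the pruned candidate sets, rather than all subsets as in this direct transcription of Eq. (1).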
We note that Score_1 is similar to the score in [6]; however, the latter does not make any distinction between sets of different frequency. We use the frequency of the sets to further distinguish between points that contain the same number of infrequent values. The benefit of our score becomes pronounced with larger datasets: for example, consider a dataset with a million data points and a minsup of 10%. Also assume two categorical values: a, which appears only once in the dataset, and b, which appears in the dataset slightly less than a hundred thousand times. Using our score, a data point containing value a (very infrequent) will have a much higher score than a point with value b. Using the score of [6], the two values would add the same amount to the score. Therefore, our score better reflects the amount of irregularity in the data.

3.2 Continuous Score

Many existing outlier detection methods are based on distances between points in the entire dataset. In addition to the fact that this can be inefficient, especially for large or distributed data, it is very likely that in doing so the algorithm might miss points which are not globally obvious outliers, but are easier to spot if we focus on a subset of our dataset. Furthermore, the notion of a nearest neighbor does not hold as well in high-dimensional spaces, because the distance between any two data points becomes almost the same [17]. In our case of mixed attribute data, it is reasonable to believe that data points that share a categorical value should also share similar continuous values. Therefore, we can restrict our search space by focusing on points that share a categorical value, and then rank these points based on similarity to each other.

One issue that arises is how to identify similarities between points in high-dimensional data. The most prevalent similarity or distance metric is the Euclidean distance, or the L2-norm. Even though the Euclidean distance is valuable in relatively small dimensionalities, its usefulness decreases as the dimensionality grows. Let us consider the four points below, taken from the KDDCup 1999 dataset (described in more detail in Section 4.1): the first two points are normal and the second two points are outliers (we removed the columns that had identical values for all four points). Using Euclidean distance we find correctly that point 1 is closest to point 2 and vice versa, but for points 3 and 4 we find that each is closest in Euclidean distance to point 1, i.e., the two outliers are more similar to a normal point than to each other.

Point 1 (normal):  0 0 0.002 0.002 0 0 0 0 0 0 0.004 0.004 0 0 0 0 0
Point 2 (normal):  0 8.2E-6 8.5E-6 0.02 0.02 0 0 0 0 0.9 0.2 0.2 0.9 0.9 0.0 0.04 0 0 0 0
Point 3 (outlier): 0 0 0.002 0.002 0 0 0 0 0.004 0 0.5 0 0
Point 4 (outlier): 0 0 0.002 0.002 0 0 0 0 0.004 0 0.99 0 0

This is mainly because the Euclidean distance assigns equal importance to attributes with zero values as to attributes with non-zero values. In higher dimensionalities, the presence of an attribute is typically a lot more important than the absence of an attribute [3], as the data points in high dimensionalities are often sparse vectors. Cosine similarity is a commonly used similarity metric for clustering in very high-dimensional datasets, e.g. it is used for document clustering in [3]. The cosine similarity between two vectors is equal to the dot product of the two vectors divided by the product of the individual vector norms. Assuming non-negative values, the minimum cosine similarity is 0 (non-similar vectors) and the maximum is 1 (identical vectors). In our example with the four points above, the cosine function assigns the highest similarity between points 1 and 2, and between points 3 and 4, so it correctly identifies similarity between normal points (points 1 and 2) and between outlier points (points 3 and 4). In this paper, we use the cosine function to define similarities in the continuous space.

Consider a data point x_i containing m_c categorical values and m_q continuous values. The categorical and continuous parts of x_i are denoted by x_i^c and x_i^q respectively. Let a be one of the categorical values of x_i^c which occurs with frequency f(a). We identify a subset of the data that includes the continuous vectors corresponding to all points that share value a: {x_i^q : a ∈ x_i^c, i = 1..n}, which contains f(a) vectors. The cosine similarity between the mean vector of this set, μ_a, and x_i^q is:

cos(x_i^q, μ_a) = ( Σ_{j=1}^{m_q} x_ij^q · μ_aj ) / ( ‖x_i^q‖ · ‖μ_a‖ )    (2)

where ‖x‖ is the L2-norm of vector x. Finally, we assign the following score to each x_i, over all categorical values a in x_i^c:

Score_2(x_i) = ( Σ_{a ∈ x_i^c} cos(x_i^q, μ_a) ) / |x_i^c|    (3)

which is the summation of the cosine similarities for all categorical values a, divided by the total number of values in the categorical part x_i^c. As the minimum cosine similarity is 0 and the maximum is 1, the points with similarity close to 0 are more likely to be outliers. Even though using the cosine similarity helps us better assess distances in a high-dimensional space, its use alone will not vastly improve our outlier detection accuracy in a large dataset with many different types of outliers. As we noted earlier in this section, we focus on specific subsets of the continuous space so as to identify outliers in smaller settings. In the next sections, we address the issue of having more than one outlier in a subset, and we outline which categorical values we use for Score_2 in Eq. (3).

3.3 Improving Accuracy

Many methods (e.g. [7]) assume that outliers are the data points that are irregular in comparison to the rest of the dataset, and that they can be globally detected. However, in many real datasets there are multiple outliers with different characteristics, and their irregularity and detection depend on the rest of the outliers against which they are compared. This way, there could be outliers in our dataset that are masked by other, more irregular outliers (see Figure 1). The solution that we propose is to further use the knowledge that we obtain from the categorical scores to help alleviate this issue. Based on Eq. (1), data points with highly infrequent categorical values will have a very high Score_1. We can exclude these points with high Score_1 from the computation of our continuous score in Equations (2)-(3).
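As a concrete sketch of Equations (2)-(3) combined with the exclusion idea just described, the Python below computes the per-value means μ_a and Score_2. The data layout (a list of (categorical values, continuous vector) pairs) and helper names are assumptions for illustration, not the paper's implementation; it also divides each mean by the number of points actually included, and omits the special cases of Section 3.4 (Score_2 forced to 1 or 0):

import numpy as np

def cos_sim(x, y):
    """Cosine similarity of Eq. (2); 0 for degenerate zero vectors."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0

def continuous_means(points, value_freq, low_sup):
    """Mean vector mu_a per categorical value a, skipping points that
    contain a highly infrequent value (f <= low_sup, Section 3.3) so
    that likely outliers do not distort the means."""
    sums, counts = {}, {}
    for cat_vals, x_q in points:
        if any(value_freq[v] <= low_sup for v in cat_vals):
            continue   # likely outlier: leave it out of every mean
        for v in cat_vals:
            sums[v] = sums.get(v, 0.0) + x_q
            counts[v] = counts.get(v, 0) + 1
    return {v: sums[v] / counts[v] for v in sums}

def score2(cat_vals, x_q, means, value_freq, low_sup, upper_sup):
    """Eq. (3): cosine similarities to the means of the point's values
    with frequency in (low_sup, upper_sup], divided by the number of
    categorical values in the point."""
    sims = [cos_sim(x_q, means[v]) for v in cat_vals
            if low_sup < value_freq[v] <= upper_sup and v in means]
    return sum(sims) / len(cat_vals) if sims else 0.0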
The exclusion of these outlier points can be done in the following manner: as we compute the frequencies and means for each categorical value in our dataset, we identify highly infrequent categorical values. Based on this information, we can update the means for the rest of the categorical values that co-occur with the highly infrequent values. The details on how we select the values to exclude from the continuous subsets are given in the following sections.

3.4 Algorithm

ODMAD consists of two phases: the first phase calculates the necessary quantities for the algorithm (categorical values, frequencies, sets, and means); the second phase goes over each point in the dataset and decides whether each point is an outlier or not, based on the scores described in Sections 3.1 and 3.2. The pseudocode for the two phases is given in Figures 2 and 3, respectively. As shown in Figure 2, for the score calculation of Eq. (1), we only gather the frequencies of certain sets: the pruned candidates. Pruned candidates are those infrequent sets such that all their subsets are frequent. These are the sets that are pruned at each phase of a Frequent Itemset Mining algorithm such as Apriori [16]. The reason behind this is that, as mentioned in Section 3.1, we are interested in either single categorical values that are infrequent, or infrequent sets containing single values that are frequent on their own. This makes ODMAD faster, as shown in the following example.

Input: D - dataset (n points, m_c and m_q attributes); minsup; MAXLEN
Output: G - pruned candidates and their frequencies; A - categorical values, means, and frequencies

foreach point x_i (i = 1..n):
    add the categorical values of x_i, their frequencies, and their means to A
foreach len = 2..MAXLEN:
    create candidate sets and get the frequent itemsets
    add pruned sets and their frequencies to G

Figure 2: First Phase of our Outlier Detection Approach ODMAD

Input: D - dataset (n points, m_c and m_q attributes); G; A; minsup; MAXLEN; window; δ_score1; δ_score2; low_sup; upper_sup
Output: outliers

foreach point x_i (i = 1..n):
    foreach categorical value a in x_i^c:
        if f(a) < minsup: Score_1(x_i) += 1 / f(a)
        if low_sup < f(a) ≤ upper_sup: Score_2(x_i) += cos(x_i^q, μ_a)
    foreach pruned set d in G found in x_i^c:
        Score_1(x_i) += 1 / (f(d) · |d|)
    if Score_1(x_i) > δ_score1 · (average Score_1 in window) or Score_2(x_i) < (1/δ_score2) · (average Score_2 in window):
        flag(x_i) = outlier
    else:
        x_i is normal; add Score_1, Score_2 to the window scores

Figure 3: Second Phase of our Outlier Detection Approach ODMAD

Example. Assume we have two points, each with three categorical attributes: x_1 = [a b c] and x_2 = [a b d]. If only the single values a and c are infrequent, with frequency equal to 5, the scores are as follows:

Score_1(x_1) = 1/f(a) + 1/f(c) = 2/5 = 0.4,
Score_1(x_2) = 1/f(a) = 1/5 = 0.2.

Since a and c are infrequent, we do not check any of their combinations with other values, because they will also be infrequent. The sets we will not check are: ab, ac, ad, bc, and cd. However, bd consists of frequent values, b and d, so we check its frequency. Assuming bd is infrequent, and f(bd) = 4, we increase the score of x_2:

Score_1(x_2) = 0.2 + 1/(f(bd) · |bd|) = 0.2 + 1/(4 · 2) = 0.325.

Note that at this point we stop increasing the scores of both x_1 and x_2, because there are no more frequent sets. Therefore, in this scenario, we only need to check the sets a, c, and bd, instead of all possible sets of length 1 to 3 contained in x_1 and x_2.

As we identify categorical values and sets, we also update the corresponding mean vectors as discussed in Section 3.3. We use a user-entered frequency threshold, called low_sup, to indicate what values we consider highly infrequent; categorical values with frequency ≤ low_sup are marked as highly infrequent. As we described in Section 3.3, we exclude points that contain these highly infrequent values from the mean in Eq. (2) of all other categorical values they co-occur with. For example, assume point x contains categorical value a, with f(a) ≤ low_sup, and value b, with f(b) > low_sup. We exclude point x as follows:

μ_b = (1 / f(b)) · Σ_{i = 1..n : b ∈ x_i^c, a ∉ x_i^c} x_i^q

In the second phase in Figure 3, we first find all categorical values in each point and update Score_1 in Eq. (1) accordingly. We do the same for all the pruned sets contained in the current point. Also, for each categorical value, we compute Score_2 using the updated mean we computed in the first phase. The continuous vectors we use are those that correspond to categorical values with frequency in (low_sup, upper_sup]. If a point has a value with frequency less than low_sup, its Score_2 will be 1, as it contains a highly infrequent categorical value. If a point has no values with frequency in (low_sup, upper_sup], it will have a Score_2 of 0. By applying a lower bound to the frequency range we exclude values with very infrequent categorical values, and by applying an upper bound we limit the amount of data points to which we assign a score in the continuous domain.
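For concreteness, here is a sketch of the first phase's candidate generation (the "create candidate sets ... add pruned sets" step of Figure 2): an Apriori-style pass that keeps exactly the infrequent sets whose subsets are all frequent, as in the example above. The in-memory data layout is an assumption for illustration; the paper's implementation streams the data and computes these counts during its scans:

from itertools import combinations

def pruned_candidates(cat_points, value_freq, minsup, maxlen):
    """Level-by-level candidate generation; sets that turn out
    infrequent (but have only frequent subsets) are the 'pruned
    candidates' whose frequencies Score_1 later uses."""
    frequent = {frozenset([v]) for v, f in value_freq.items()
                if f >= minsup}
    pruned = {}
    for length in range(2, maxlen + 1):
        # join step: unions of frequent sets that reach this length,
        # kept only if every (length-1)-subset is itself frequent
        cands = {a | b for a in frequent for b in frequent
                 if len(a | b) == length}
        cands = {c for c in cands
                 if all(frozenset(s) in frequent
                        for s in combinations(c, length - 1))}
        counts = dict.fromkeys(cands, 0)
        for vals in cat_points:          # one scan per level
            s = set(vals)
            for c in cands:
                if c <= s:
                    counts[c] += 1
        frequent = {c for c, n in counts.items() if n >= minsup}
        pruned.update((c, n) for c, n in counts.items()
                      if 0 < n < minsup)  # infrequent: prune and keep
    return pruned

On the two points of the example, with a and c infrequent, the only length-2 candidate built from frequent singletons that actually occurs is bd, matching the sets the text says need checking.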
Finally, as we scan and score the data points, we maintain a window of categorical and continuous scores. We also employ a delta value for the detection of abnormal scores: δ_score1 for the categorical scores and δ_score2 for the continuous scores. As we go over the points in the second phase, if a point has a score larger (smaller in the case of Score_2) than the average score of the previous window of points by the corresponding δ value, it is flagged as an outlier. Otherwise, the point is normal, and its non-zero scores are added to the window we maintain. If each of the m_c categorical attributes has an average of v distinct values, the complexity upper bound is:

T ≤ n · Σ_{j=1}^{MAXLEN} C(m_c, j) · v^j + n · v · m_q = O( n · C(m_c, MAXLEN) · v^MAXLEN + n · v · m_q )
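The windowed decision rule of the second phase can be sketched as below. The exact comparison operators are a reconstruction (the formula in the transcription is garbled); the sketch assumes multiplicative thresholds, which is consistent with the parameter values reported in Section 4.2 and with the observation there that larger δ values yield fewer flagged outliers:

from collections import deque

def flag_outliers(scored_points, d1, d2, window=40):
    """Flag x_i if Score_1 exceeds d1 times the window average of past
    Score_1 values, or Score_2 falls below the window average of past
    Score_2 values divided by d2.  Only normal points feed their
    non-zero scores back into the window, as in the text."""
    w1, w2 = deque(maxlen=window), deque(maxlen=window)
    flags = []
    for s1, s2 in scored_points:
        avg1 = sum(w1) / len(w1) if w1 else 0.0
        avg2 = sum(w2) / len(w2) if w2 else 0.0
        is_outlier = bool((w1 and s1 > d1 * avg1) or
                          (w2 and s2 < avg2 / d2))
        flags.append(is_outlier)
        if not is_outlier:               # normal: update the window
            if s1 > 0: w1.append(s1)
            if s2 > 0: w2.append(s2)
    return flags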

TABLE. DETECTION RATE ON THE KDDCUP 999 TRAINING DATASETS (0% Training Set and Entire Training Set) 0% training set Entire Training set Attak Type ODMAD Otey s ODMAD Otey s Bak 50 50 75 00 Buffer overflow 9 36 9 9 FTP Write 75 75 00 00 Guess password 00 00 00 00 Imap 00 00 50 50 IP Sweep 60 30 92 76 Land 83 00 00 62 Load Module 00 40 00 80 Multihop 00 75 75 75 Neptune 00 67 00 90 Nmap 00 00 88 63 Perl 00 67 00 67 Phf 0 75 0 0 Pod 75 50 87 53 Port Sweep 00 67 00 64 Root Kit 60 40 80 40 Satan 00 67 00 50 Smurf 57 43 88 63 Spy 00 00 00 00 Teardrop 00 44 00 Warez lient 2 4 9 36 Warez master 00 33 67 00 TABLE 2. EXECUTION TIME (SECONDS) FOR ODMAD VERSUS OTEY S APPROACH ON THE KDDCUP 999 TRAINING DATASETS ODMAD Otey s Approah 0% Training Set 3.7 67.4 Entire Training Set 38 604.9 Therefore ODMAD sales linearly with the number of data points, n, and with the number of ontinuous attributes, m, but seems to be saling exponentially with the number of ategorial attributes m. In pratie our algorithm runs faster beause we are using only the pruned andidates for the ategorial value-based sore. Otey s method in [6] has exponential time with respet to ategorial attributes, and uadrati with the number of ontinuous. Moreover, the method in [6] reuires a ovariane matrix for eah possible itemset in the dataset, while our method only reuires a vetor of length m (the mean vetor) for eah ategorial value. 4 Experiments 4. Experimental Setup We implemented our approah and Otey s approah [6] using C++. We ran our experiments on a workstation with a Pentium 4.99 GHz proessor and GB of RAM. We used the KDDCup 999 intrusion detetion dataset [8] from the UCI repository [9]. This dataset ontains reords that represent onnetions to a military omputer network and multiple intrusions and attaks by unauthorized users. The raw binary TCP data were proessed into features suh as onnetion duration, protool type, number of failed logins, et. The KDD dataset ontains a training set with 4,898,430 data points and a dataset with 0% training data points. There are 33 ontinuous attributes and 8 ategorial attributes. Due to the large number of attaks in these datasets, we preproess them suh that attak points are around 2% of the dataset, and we preserve the proportions of the various attaks. We follow the same onept as in [6]: sine network traffi pakets t to our in bursts for some intrusions, we look at bursts of pakets in the data set. Our proessed dataset based on the entire training set ontains 983,550 instanes with 0,769 attak instanes, and similarly for the 0% training dataset. We ompare our method with the one proposed in [6] as it is the only existing distributed outlier detetion approah for mixed attribute datasets that sales well with the number of data points. We evaluate both algorithms based on two measures: outlier detetion auray, or the outliers orretly identified by the approah as outliers, and the false positive rate, refleting the number of normal points erroneously identified as outliers. We also ompare the exeution time of the two algorithms using the same data. 4.2 Results The outlier detetion auray or detetion rate reflets how many points we detet orretly as outliers. In the KDDCup set, if we detet one point in a burst of pakets as an outlier we mark all points in a burst as outliers, as in [6]. The false positive rate is how many normal points we inorretly detet as outliers versus total number of normal points. 
In Table, we depit the detetion rate ahieved from ODMAD versus the approah in [6] (better rates are in bold). In Table 2 we show the exeution time in seonds for the two approahes. We used window = 40 for all experiments. We experimented with several values for the Otey s approah parameters, and in Table we present the best results (we used: δ = 35; minsup = 50% for the 0% set, and 0% for the entire training set; sore = 2). For our approah we used: upper_sup = minsup = 0%; low_sup = 2%; sore = 0, sore =.27 (0% set); and sore = 0, sore =.8 (entire training set). As an be seen in Table, ODMAD has eual or better detetion rate than Otey s approah for all but two of the attaks on the 0% training set, and all but three of the attaks for the entire training set. Moreover, the detetion rates in Table for the 0% dataset were ahieved with a false positive rate of 4.32% for ODMAD and 6.99% for Otey s, while the detetion rates for the entire training set were ahieved with a false positive rate of 7.09% for ODMAD, and 3.32% for Otey s. Exeution time for our approah is signifiantly faster as well, e.g. ODMAD proessed the KDDCup 0% dataset in 38 seonds while it took Otey s approah 00 minutes for the same task. We attribute this mainly to the fat that the method in [6] reates

and checks a covariance matrix for each and every possible set of categorical values, while ODMAD looks at single categorical values and the mean of their continuous counterparts. We observed similar accuracy and performance for the KDD Test set, and we also conducted experiments to explore how ODMAD's performance varies with respect to the parameters (results not shown here due to space). Detection and false positive rates decrease as δ_score increases, as it reflects the magnitude of difference between scores in the data. The larger δ_score is, the higher the score difference needs to be for a point to be an outlier, and ODMAD will return fewer and fewer outliers. Also, the overall results indicate that good values for upper_sup are close to the value of minsup, and good values for low_sup are close to 1-3%, depending on the dataset size.

5 Conclusions

We proposed Outlier Detection for Mixed Attribute Datasets (ODMAD), a fast outlier detection algorithm for mixed attribute data that responds well to sparse high-dimensional data. ODMAD identifies outliers based on the categorical attributes first, and then focuses on subsets of data in the continuous space by utilizing information from the categorical attribute space. We experimented with the KDDCup 1999 dataset, a benchmark outlier detection dataset, in order to demonstrate the performance of our approach. We found that ODMAD in most instances exhibits higher outlier detection rates (accuracy) and lower false positive rates, compared to the existing work in the literature [6]. Furthermore, ODMAD relies on two data scans and is considerably faster than the competing work in [6]. Extending our work for distributed data is the focus of our future work.

Acknowledgements: This work was supported in part by NSF grants 03460, 064708, 077674, 077680, 064720, 05254209, 0203446.

6 References

[1] Bolton, R.J., Hand, D.J., "Statistical fraud detection: A review," Statistical Science, 17, pp. 235-255, 2002.
[2] Hawkins, D., Identification of Outliers, Chapman and Hall, 1980.
[3] Ertoz, L., Steinbach, M., Kumar, V., "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data," Proc. of ACM Int'l Conf. on Knowledge Discovery and Data Mining (KDD), 2002.
[4] Koufakou, A., Ortiz, E., Georgiopoulos, M., Anagnostopoulos, G., Reynolds, K., "A Scalable and Efficient Outlier Detection Strategy for Categorical Data," Int'l Conf. on Tools with Artificial Intelligence (ICTAI), October 2007.
[5] Barnett, V., Lewis, T., Outliers in Statistical Data, John Wiley, 1994.
[6] Otey, M.E., Ghoting, A., Parthasarathy, S., "Fast Distributed Outlier Detection in Mixed-Attribute Data Sets," Data Mining and Knowledge Discovery, 2006.
[7] Knorr, E., Ng, R., Tucakov, V., "Distance-based outliers: Algorithms and applications," Very Large Databases Journal, 2000.
[8] Bay, S.D., Schwabacher, M., "Mining distance-based outliers in near linear time with randomization and a simple pruning rule," Proc. of ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2003.
[9] Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J., "LOF: Identifying density-based local outliers," Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, 2000.
[10] Wei, L., Qian, W., Zhou, A., Jin, W., "HOT: Hypergraph-based Outlier Test for Categorical Data," Proc. of 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 399-410, 2003.
[11] Tax, D., Duin, R., "Support Vector Data Description," Machine Learning, pp. 45-66, 2004.
[12] Hawkins, S., He, H., Williams, G., Baxter, R., "Outlier Detection Using Replicator Neural Networks," Data Warehousing and Knowledge Discovery, pp. 170-180, 2002.
[13] Pei, Y., Zaiane, O., Gao, Y., "An Efficient Reference-based Approach to Outlier Detection in Large Datasets," IEEE Int'l Conference on Data Mining, 2006.
[14] Hodge, V., Austin, J., "A Survey of Outlier Detection Methodologies," Artificial Intelligence Review, pp. 85-126, 2004.
[15] He, Z., Deng, S., Xu, X., "A Fast Greedy Algorithm for Outlier Mining," Proceedings of PAKDD, 2006.
[16] Agrawal, R., Srikant, R., "Fast algorithms for mining association rules," Int'l Conf. on Very Large Data Bases, pp. 487-499, 1994.
[17] Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U., "When is nearest neighbor meaningful?," Int'l Conf. on Database Theory, pp. 217-235, 1999.
[18] Hettich, S., Bay, S., The UCI KDD Archive, KDDCup 1999 dataset, 1999.
[19] Blake, C., Merz, C., UCI Machine Learning Repository, 1998.