A Self-Learning Network Anomaly Detection System using Majority Voting

A Self-Learnng Network Anomaly Detecton System usng Majorty Votng D.Hock and M.Kappes Department of Computer Scence and Engneerng, Unversty of Appled Scences, Frankfurt am Man, Germany e-mal: {dehock kappes}@fb2.fh-frankfurt.de Abstract Network traffc s constantly changng. However, many current Intruson Detecton Systems requre somewhat statc condtons n order to work properly. In ths paper, we propose ongong tranng and updatng procedures and ntroduce a self-learnng Anomaly Detecton System based on majorty votng that can adapt to network changes by steadly exchangng small parts of tranng data. We evaluate the performance of dfferent replacement strateges for ths process. Of the evaluated replacement strateges, the replace oldest strategy archved the best results. Keywords Anomaly Detecton, Self-Learnng, Majorty Votng, Replacement Strateges 1. Introducton As shown n recent surveys, the number of threats has contnuously ncreased, along wth ther sophstcaton (Rchardson 2011). The preventon of network ntrusons and attacks s an ncreasngly mportant ssue as more and more crtcal processes and senstve data depend on network securty. One way to detect attacks early on are Intruson Detecton Systems (IDS). Krügel et al. (2005) specfed ntruson detecton as "the process of dentfyng and respondng to malcous actvtes targeted at computng and network resources". Whle tradtonal, sgnature-based IDS often lag behnd todays attacks, heurstc methods are a promsng alternatve. Heurstc technques are often clustered nto two categores, namely Msuse Detecton and Anomaly Detecton (Dokas et al. 2002). Msuse detecton recognzes threats by comparng them aganst a set of known ntrusons. Anomaly Detecton dscovers devatons from normal behavour. A major advantage of anomaly detecton s ts ablty to detect even unknown and new attacks. Msuse detecton can, lke tradtonal systems, only detect attacks whch are smlar to attacks already known (e.g., been prevously learned by the system). The common objectve of each Anomaly Detecton System s to report ntrusve actvty. However, not all anomalous actvtes are also ntrusve. A report of an anomalous, but not ntrusve actvty s called false postve. The dsadvantage of heurstc procedures, especally for anomaly detecton, are hgh false postve rates. False postves occur n partcular f new types of legtmate network traffc arse such as, e.g., the nstallaton of new dstrbuted programs, the use of new network devces or rregular backup and remote mantenance. 59

Proceedngs of the Tenth Internatonal Network Conference (INC2014) Actvty Descrpton Not ntrusve and not anomalous True Negatve (TN): there s no ntrusve actvty and the system does not report alerts. Intrusve and anomalous True Postve (TP): the actvty s ntrusve and s reported by the system. Intrusve but not anomalous False Negatve (FN): The system fals to detect an ntrusve actvty, because t s to smlar to the expected actvty. Not ntrusve but anomalous False Postve (FP): The actvty s not ntrusve, but dfferent from the usual actvty and reported by the system. Table 1: Defnton of possble outputs For some IDS, up to 99% of alerts can be classfed as false postves and are unrelated to network securty ssues (Petraszek 2004). Whle the false negatve rate can be mproved by combnng multple IDS, t s not possble to reduce a large number of false postves by such means. It s a tme-consumng task to dstngush between relevant and rrelevant alerts, whch means a hgh workload for securty analysts. A further problem of Anomaly Detecton s the so called concept drft (Gates and Taylor 2006), the phenomenon of unexpected change n network traffc over tme. Heurstc approaches usually have a learnng phase and a productve phase. Anomaly Detecton Systems learn the normal behavor of the network n the learnng phase, that means they derve a statstcal model, e.g., by collectng statstcs of network traffc over a perod of tme, whch can be compared later to the statstcs of the productve network traffc. Here, we refer to ths learned model as "sample". A hgh rate of false postves can have varous causes, but many can be traced back to samples of low qualty. The process of learnng the typcal network behavour can be a challengng task: If the s tranng carred out at a bad tme, the system mght not capture the complete regular behavour; f there s an attack n the sample, the system mght not be able to detect ths attack later; f the productve network s not accessble for the developer, he mght not be able to create accurate samples. However, even when the current behavour was captured perfectly, t s very lkely that the sample wll get outdated soon due to the contnuous change of network software, devces and usage. Gates et al. (2006) crtcally examned current Anomaly Detecton paradgms and dentfed these problems (summarzed as problem doman, operatonal usablty and tranng-data) as man ssues concernng todays Anomaly Detecton Systems. A possble soluton for some of these problems are IDS whch automatcally adjust wth the help of a dynamc update process to the changng network, so-called Self-Learnng Systems. Thus, the detecton model s updated so that t reflects the most recent network traffc. The contnuous manual updatng and testng by an operator would be a tme-consumng task and cannot be outsourced to a central nstance (lke for vrus scanners) as IDS updates need to be talored to the envronment they operate n and, n general, requre detaled customzaton. Thus, there s a need for methods to be able to optmze and update systems automatcally and locally. However, automatcally updatng such a system may open up new vulnerabltes f attacks are erroneously adjusted for, contamnatng the Anomaly Detecton System and preventng the detecton of such attacks (Kloft and Laskov 2007). In ths paper, we propose to solve these ssues by combnng a self-learnng approach wth majorty votng: Instead of checkng aganst a sngle model to 60

examne the network traffc, our approach compares the traffc to several models whch are contnuously replaced by mproved and better fttng models. At the same tme, the redundancy of our tranng data represents an advantage over nosy data and helps to stepwse replace obsolete data. The remander of the paper s organsed as follows. In the next secton, we present our method and ts features. We also outlne some detals of our update and replacement mechansms. Last but not least, we present the evaluatons of the ntroduced approach. The results nclude a comparson of performance usng dfferent replacement strateges. 2. Related Work Snce Dorothy Dennng publshed her ntal dea about Anomaly Detecton (Dennng 1987), exstng detecton mechansms have been contnuously expanded and mproved. Resultng methods nclude Outler Detecton (Dokas et al. 2002), Pattern Analyss (Km et al. 2004) and Prncpal Component Classfer (Munz and Carle 2007). However, only few of these approaches have been adopted n productve systems. Of partcular relevance to our work are methods that use multple testng and subsequent combnaton of the results and methods whch have ganed nterest especally n crowd-based mage classfcaton (Bachrach et al. 2012). Km and Bentley (2002) tested the effect of varous parameters (Sample lfe span, threshold) wth self-updatng clusterng algorthms. Cretu-Cocarle et al. (2009) used a Self-Learnng System together wth n-gram analyss and a smple votng algorthm. Our approach s smlar, but dffers n the detecton algorthm and features. Furthermore, we focused on the evaluaton of replacement strateges for automatc update mechansms. Therefore, we show more advanced strateges than full replacement. 3. Method In the followng, we wll brefly outlne the ratonale behnd our approach. Our method s based on classfyng network-traffc by assgnng a value from the range [0,1] ndcatng ts abnormalty to t. Dependng on the value and a threshold, the traffc s then classfed as normal or abnormal. Thus, f bengn and malgnant network traffc receve the same values durng tranng, the result s an area n whch we cannot dstngush whether the traffc s normal or ntrusve. In practce, not every value s equally often represented, for smplcty we assume for now that the values are normally dstrbuted for suffcently large data. Yet, the normal dstrbuton s presented merely as an llustraton, the actual dstrbuton depends of course on the feature, the user behavor and network characterstcs such as the Number of hosts and Operatng Systems (e.g., Mah (1997) found that the Packet Szes for HTTP traffc a heavy-taled dstrbuton). 61

Proceedngs of the Tenth Internatonal Network Conference (INC2014) Fgure 1: Relaton between overlappng values and ROC-curve The left part of Fgure 1 shows such grey areas wth the overlappng values wth the correspondng Recever Operatng Characterstc (ROC). The red bell-curve (left) shows values representng normal traffc, the blue bell-curve shows values representng ntrusve traffc. The p-value stands for the proporton of unambguous values (the proporton of non-overlappng areas), that means p s the probablty for a correct classfcaton. The rght sde of Fgure 1 shows the rato of the ROC curve (.e. true and false postve rate ) n relaton to the sze of the grey area (.e. grey area = 1-p) for all possble thresholds n [0,1]. For a mnmal threshold, all traffc s classfed as normal resultng n a pont (0,0) whereas for maxmal threshold all traffc s classfed as abnormal resultng n pont (1,1). The larger the area under the ROC curve (and thus the smaller the grey area), the better s the recognton rate. The ROC curve s a measure for the performance of a classfcaton algorthm (n ths case we classfy traffc as normal or anomalous) and shows the possble combnatons of true postve rate TP/(TP+ FN ) and false postve rate TN /(TN + FP). Dependng on the user preferences t s possble to adjust an Anomaly Detecton System a) more senstve and report even slght devatons from the normal behavour or b) more specfc to report only devatons that are defntely ntrusve. Whle a) leads to more false postves, b) leads to more undetected ntrusve actvtes. Fgure 2: Threshold 62

Fgure 2 shows the effect of the threshold. Every value on the left sde of the threshold s classfed as not ntrusve whle every value on the rght sde s classfed as ntrusve the ambguous values n the overlappng area ncluded. The ROC-Cuve shows the correspondng True and False Postve Rates for ths Threshold poston. A ROC curve close to the dagonal lne ndcates a random detecton rate. values close to the dagonal represent a same true and false postve rate, whch corresponds to the expected detecton rate of a random process. The deal ROC curve frst rses vertcally to true postve rate close to 1, whle false postve error rate ntally remans close to 0 before the false postve rate ncreases. We am to reduce the grey area by usng several samples smultaneously. Thus, nose s smoothed and the nfluence of outlers s reduced. The followng part of ths paper ntroduces a smple algorthm derved from Packet Header Anomaly Detecton (PHAD) (Mahoney and Chan 2001) and some features ntroduced n our earler papers whch focused on a correlaton analyss and outler detecton (Hock and Kappes 2013). Our method and features may be also a vald approach detect a wde range of anomales n practse, but our ratonale for ths approach s to evaluate the practcal use of majorty votng and Self-Learnng. 3.1. The Anomaly Detecton Method In our system, traffc s classfed takng nto account the followng features: 1. IP.src.Entropy and Dst.Port.Entropy show the probablty for the observed dstrbuton of IP Source Addresses and Destnaton Ports. For example, a Port scan wth nmap, that calls 50.000 dfferent ports a sngle tme, generates very small entropy values. Number of Packets to port value = Number of Packets n Entropy= p( value ) log p ( value ) =1 2. Mean.Packet.Sze s the mean-value of all packet szes over a certan tme. It can dstngush certan IP Protocols and detect some attacks wth unusual small or large packet sze. Mean. Packet. Sze= 1 n n Packet. Sze =1 3. Utlzaton s the number of packets over tme. It can be another evdence for Denal of Servce, because many DoS depend on resource exhauston and therefore generate a lot of network traffc. Number of Packets Utlzaton= Tme Usng these features, PHAD determnes the probablty of an anomaly, called anomaly score, based on the number and last tme of prevous occurrences of values. The tradtonal PHAD algorthm looks at the values of Packet Header felds to assgn an anomaly score to each Packet. We look at the values of custom features 63

Proceedngs of the Tenth Internatonal Network Conference (INC2014) and assgn an anomaly score to each Tme Wndow nstead. PHAD calculates a probablty for the observaton of each value and then sums all values to an Anomaly Score. We further use a logarthm to adjust the score to a range between 0 and 1. After the anomaly score was calculated for each sample s appled majorty votng. Majorty votng can mprove the probablty of a correct result when there s an ntal probablty hgher than 50% (.e. random). The new probablty majorty votng assumes all examples are equally good. Let p denote the probablty for a correct decson for a sample and assume that p s equal for all samples. The decson accuracy of majorty votng method wth n completely ndependent samples s gven n = Number of tmes the value was seen n tranng r = Unque values n tranng t = Tme snce last occurence t n feature = r score = 0.1log 10 by the bnomal formula: P Total = ((n 1 )/2) = 0 feature n!!(n )! p n (1 p) P Total s the probablty for the correct classfcaton usng majorty votng Fgure 3 shows the total probablty for a correct decson n relaton to the qualty and the number of samples. 64 Fgure 3: Theoretcal mprovement of qualty through ncrease of samples In summary, our productve phase can be broken down nto three smple steps. At frst, we derve our features for the current network traffc. Then we calculate the anomaly score usng each of the learned samples. Each anomaly score hgher then a

threshold votes for anomalous and each score smaller then the threshold votes for normal. The majorty of these votes reveals whether the fnal decson for ths tme wndow s anomalous or normal n form of a value from range [0,1]. FnalDecs on = AnomalousVote AnomalousVote+ NormalVote As seen n Fgure 3, the theoretcal mprovement of the qualty, calculated wth the bnomal formula, dmnshes wth the number of samples used. For practcal purposes, we therefore, we use ths amount of samples and f there are more than 15 models, the one wth most false postves s deleted to keep resources and tme consumpton of our method acceptable. 3.2. Update Anomaly Detecton Systems need to adept to the envronment to deal wth the contnuously changng network traffc, the so-called concept drft. The manual defnton and evaluaton of adopted samples would be tme consumng tasks. There s the need for an automatc update mechansm. However, a full replacement of the learned sample mght lead to several problems such as an decrease of the detecton performance due to outler, nose or the unnotced learnng of attacks. Therefore, our approach evaluates dfferent replacement strateges to update the Anomaly Detecton System step by step. The Self-Learnng Process frst adds a new sample generated from the current traffc and then selects a deletes one of the sample to keep the orgnal number of samples. The selecton parameter s defned by the replacement strategy. Some common replacement strateges are: Strategy Descrpton Parameter Replace worst Deletes the sample that performed worst Performance over an evaluaton perod. Replace random Replaces a random sample, f the new Performance sample fulfls a mnmum performance. Replace most smlar Calculates a smlarty measure and replaces Smlarty one of the two most smlar. Replace oldest Replace the sample added frst. chronologcal order Table 2: Overvew of Replacement Strateges Dependng on the replacement strategy, we need to calculate the performance of one or several samples or the smlarty between two samples. We consult the correlaton as smlarty measure: corr n 2 x,y = x x y y / x x y y = 1 n = 1 n = 1 2 65

Proceedngs of the Tenth Internatonal Network Conference (INC2014) The performance of a sample s approxmated wth the proporton of values ncluded n the sample. Therefore, there s the need to perform an evaluaton phase and compare the values from the evaluaton phase wth the values ncluded n the sample: Number of values captured by sample Performance= Number of total values 4. Results In order to test the replacement strateges, we set up an experment n whch network traffc was montored over a total perod of 5 hours. The features, as lsted n secton 2.1, were calculated for traffc representng a 10 second tme wndow. Samples were created usng the features from 15 tme wndows. The normal network traffc has been created usng a Wndows PC n a test network wth an Internet connecton. In addton to the generated traffc from Wndows (e.g. ARP, DHCP,..) we executed typcal user actvtes such as web rado, vdeo streamng, browsng and e-mal. Every 100 seconds an IPv6 Router Advertsement Flood was carred out as proxy for denal of servce attacks. A new sample for replacement strateges was offered once Fgure 4: Anomaly Scores over Tme 15 new tme wndows wth Anomaly score < 0.5. were avalable. 66

Fgure 4 shows two exemplary chosen replacement strateges over the course of tme. The x-axs shows the course of tme n the form of the score numbers (the traffc s rated n tme slots of 10 seconds), the y-axs ndcates the probablty for a anomaly (gven by the majorty votng). The upper part of the Fgure, s the plot wthout the exchange of samples, and the lower part shows the same experment when the oldest sample s replaced. Whle the black crcles show the score for normal traffc (should always be assessed wth 0), the red trangles show attacks (should always be assessed wth 1). The recognton rate n the upper dagram abruptly decreases after about half the tme (the black cloud of ponts s suddenly more dense n the upper part of the plot). Ths behavour usually occurs when the network behavour changes and the samples get deprecated. The lower dagram wth the replace oldest strategy shows a perod of adjustment where the deprecated samples get replaced by more fttng samples. After ths adaptaton phase, the detecton rate s agan better. The detaled performance of all strateges s shown as a ROC Curve (see Fgure 5). As already descrbed n secton 2, the result of classfcaton methods (.e. anomaly detecton as well), s always a compromse of detecton rate and error rate. Ths compromse can be generally adjusted by a threshold value. A ROC curve shows all possble combnatons of recognton rate and error rate and thereby provdes a better vew as a sngle value. We use the area under curve (AUC) as a numercal qualty measure (see Table 2). An AUC of 1 represents perfect detecton whle an AUC of 0.5 represents completely random results. The experment wthout replacement acheved, as expected, the worst results. "replace oldest" archved the best results. The replace worst strategy mght perform better wth another measure for the sample performance. Fgure 5: Expermental Results for Dfferent Replacement Strateges Ths could be archved by an addtonal evaluaton of each new sample over a perod of tme or feedback by humans. The nfluence of addtonal parameters such as sample sze, replacement frequency or recognton method has not been tested here. However, t s evdent that the replacement strategy can have, ndependent from other parameter, sgnfcant mpact on the qualty of the anomaly detecton. 67

Proceedngs of the Tenth Internatonal Network Conference (INC2014) Replacement Strategy AUC No Replacement 0.692 Oldest 0.825 Worst 0.797 Random 0.735 Most Smlar 0.731 Table 3: Area under Curve 5. Concluson and Future Work In ths paper, we evaluated Anomaly Detecton n combnaton wth majorty votng by usng several, dynamcally updated traffc samples. Our results strongly ndcate that multple samples reduce nose and outlers. Furthermore, majorty votng decreases false postves through adaptng the system by self-learnng. Our long-term goal s to develop and feld-test a modern Anomaly Detecton System based on the methods outlned n ths paper. We focused on the evaluaton of replacement strateges to for automatc update mechansms and showed that the replacement strategy has mpact on the qualty and can prevent concept drft. The replace oldest strategy archved the best results n our experments. Ths work was supported n part by the German Federal Mnstry of Educaton and Research n scope of grant number 03FH005PA2. Responsble for the content are the authors. 6. References Bachrach, Y., Graepel, T., Kasnec, G., Kosnsk, M. and Gael, J. (2012), "Crowd IQ - Aggregatng Opnons to Boost Performance", In Proc. of the 11th Internatonal Conference on Autonomous Agents and Multagent Systems, Vol. 1, pp535-542. Cretu-Cocarle, G.F., Stavrou, A., Locasto, M.E. and Stolfo, S.J. (2009), "Adaptve Anomaly Detecton va Self-Calbraton and Dynamc Updatng", In Proc. Recent Advances n Intruson Detecton, Sprnger, pp41-60. Dennng, D.E. (1987), "An ntruson-detecton model", IEEE Transactons on Software Engneerng, pp222-232. Dokas, P., Ertoz, L., Kumar, V., Lazarevc, A., Srvastava, J. and Tan, P. (2002), "Data mnng for network ntruson detecton", In Proc. NSF Workshop on Next Generaton Data Mnng, pp21-30. Gates, C. and Taylor, C. (2006), "Challengng the anomaly detecton paradgm: a provocatve dscusson", In Proc. of the 2006 workshop on New securty paradgms, ACM, pp21-29. Hock, D. and Kappes, M. (2013), "Usng R for anomaly detecton n network traffc", In Proc. ITA 13, Ffth Internatonal Conference on Internet Technologes and Applcatons. Km, J. and Bentley, P.J. (2002), "Towards an Artfcal Immune System for Network Intruson Detecton - An Investgaton of dynamc Clonal Selecton", In Proc. of the 2002 Evolutonary Computaton CEC'02, IEEE, Vol. 2, pp1015-1020. 68

Km, M., Kong, H., Hong, S., Chung, S. and Hong, J.W. (2004), "A flow-based method for abnormal network traffc detecton", In Proc. NOMS 2004, A Network Operatons and Management Symposum, IEEE/I-FIP, Vol. 1, pp599-612. Kloft, M. and Laskov, P. (2007), "A posonng attack aganst onlne anomaly detecton", In Proc. NIPS Workshop on Machne Learnng n Adversaral Envronments for Computer Securty. Kruegel, C., Valeur, F. and Vgna, G. (2005), "Intruson detecton and correlaton: challenges and solutons", Sprnger, Vol. 14. Mah, B.A. (1997), "An emprcal model of HTTP network traffc", In Proc. INFOCOM'97, IEEE, Vol. 2, pp592-600. Mahoney, M. and Chan, P.K. (2001), "PHAD: Packet header anomaly detecton for dentfyng hostle network traffc", Florda Insttute of Technology techncal report, pp1-17. Munz, G. and Carle, G. (2007), "Real-tme analyss of flow data for network attack detecton", In Proc. IM'07, 10th IFIP/IEEE Internatonal Symposum on Integrated Network Management, IEEE, pp100-108. Petraszek, T. (2004), "Usng adaptve alert classfcaton to reduce false postves n ntruson detecton", In Proc. Recent Advances n Intruson Detecton, Sprnger, pp102-124. Rchardson, R. (2011), "CSI computer crme and securty survey" 69