How to Predict Viruses Under Uncertainty

How to Predict Email Viruses Under Uncertainty InSeon Yoo and Ulrich Ultes-Nitsche Department of Informatics, University of Fribourg, Chemin du Musee 3, Fribourg, CH-1700, Switzerland. phone: +41 (0)26 300 8468/9149 {in-seon.yoo, uun}@unifr.ch Abstract. We try to answer these questions in this paper; how to detect email viruses without signatures?, how much probability can tell us whether the mail is abnormal?, and how can we detect virus patterns in an infected file?. In order to find out relations between email viruses and detectable knowledge, we analysed propagation of email viruses and characteristics of email viruses, studied infected files structure and applied Bayesian networks and Self-Organizing Maps to adaptive detection against email viruses. KEY WORDS Email Viruses, Network Security, Adaptive Detection 1 INTRODUCTION Many commercial systems are used in an attempt to detect and prevent email virus attacks. The most popular approach to defend against malicious program is through anti-virus scanners such as Symantec [1] and McAfee [2], as well as server-based filters that filters email with executable attachments or embedded macros in documents. However, the weakest point is that new virus or even modified virus from known-virus is not able to detect by the anti-virus scanners. Email viruses are spread via SMTP, they misuse Internet protocols. Considering this, we analysed email viruses with statistics in previous our work [3]. We will address this feature with a Bayesian network [4] in Section 2. After analysing email viruses feature, we found several evidences. However, we also need to predict how much probability is being malicious given this evidence later, that is why we have chosen the Bayesian network. When we use Bayesian networks, using prior and likelihood probabilities from several evidences, we can get posterior information of being maliciousness. In other words, we can get predictable probabilities of being malicious given several evidences based on prior characteristic analysis. When we analysed infected files structure, we found that there are virus position in each file format. Before email viruses infect systems, if we detect this peculiar patterns in the email attachment, we expect to detect either unknown or zero-day attack viruses. We have chosen Self-Organizing Maps (SOMs) [5] because of infected files structure. We will address details about this in Section 3. We believe that this study of adaptive detection will be useful to reduce risks by email viruses to other secure systems such as firewalls, IDSs, and anti-viruses.

2 PROBABILISTIC INDUCTION FROM EMAIL VIRUS PACKETS To make a statistical relation between interesting events among incomplete data, we have chosen Bayesian networks [4], or probabilistic graphical models. The Naive Bayesian classifier [6] among several Bayesian network models, we have currently built, can classify packets of SMTP protocol. We analysed certain packet characteristics [3] that allow us to attach to packets probabilities of their maliciousness. The analysed file characteristics are used for the parameters of the Naive Bayesian classifier. 2.1 Naive Bayesian Classifier for Email Packets Melissa and VBS/LoveLetter were among the first viruses to illustrate the problem with email attachments and trust. They made use of the trust that exists between friends or colleagues; they received an attachment from a friend who asks to open it. This is what happens with Win32/SirCam and other similar email viruses as well. Furthermore, emails with malformed MIME headers are one of methods to attack email systems. According to our analysis of Internet viruses [3], about 80% of Windows file worms are transferred via email, and approx. 61% have an.exe file extension. In addition, average 2% of all emails per month contain virus data, 80% of which are in an executable file format. In executable attachment formats, there are.exe (64%),.SCR (22%),.COM (6%),.PIF (22%),.BAT (3%),.VBS (2%), and Others (12%). Attachments are probably still the number one threat in email systems. Fig. 1. A Naive Bayesian classifier for detecting abnormal emails. According to this prior information, we can select and examine four aspects of SMTP packets: a header part, a sender part, a recipient part, and an attachment part. Here, we use four variables which are in correspondence with the four factors above; a header part (H), a sender part (Fr), a recipient part (To), and an attachment part (EF). Each part of an SMTP packet is examined in a checking process. In this process, each part is assigned as TRUE or FALSE. After checking process, each part s values are assigned to the Naive Bayesian classifier in Fig.1. This Naive Bayesian classifier represents the probabilities of being malicious given several evidences such as malformed H,

Fr, To, and EF. We assigned the prior probabilities of a malicious mail and the likelihood probabilities of a malicious executable attachment according to our results obtained from statistics data [3]. The other prior probabilities are assigned as subjective values based on the prior probability of a malicious mail. As a result, each node has four likelihood probabilities; two (TRUE or FALSE) different probabilities for each node given two cases (TRUE or FALSE) for the potaintial maliciousness of a mail packet, which is denoted by M. We built the Naive Bayesian classifier like in Fig.1, then we used an exact inference algorithm for Bayesian inference using MATLAB 5.3 [7]. The results of posterior probabilities from the Naive Bayesian classifier are in [Table 1]. Table 1. Results of posterior probabilites from the Naive Bayesian Network Probability of Malicious Mail (M = T) given evidences Results P (M = T H = F ) 0.4494 P (M = T H = F F r = F ) 0.9662 P (M = T H = F F r = F T o = F ) 0.9992 P (M = T H = F F r = F T o = F EF = T ) 0.9998 P (M = T F r = F ) 0.4167 P (M = T F r = F T o = F ) 0.9698 P (M = T F r = F T o = F EF = T ) 0.9923 P (M = T T o = F ) 0.4787 P (M = T T o = F EF = T ) 0.7860 P (M = T EF = T ) 0.0755 Through this result, we may predict the probability of being malicious if certain evidence is given. For example, if only the header part is FALSE (meaning that it is malformed), the maliciousness of current SMTP packet is 0.4494. If the sender part and the recipient part are FALSE, the maliciousness is 0.9698. If the sender part is FALSE, and the mail includes an executable attachment, the probability of being malicious is 0.7860. However, note that this executable attachment is not recognized as a malicious executable file yet in this moment. This Bayesian network just tells us because this is an executable file, the probability of being malicious is like this value (0.7860). The detailed process for recognizing this executable file is the next step (we address this problem in the following Sections). Nevertheless, the value of the probability to be malicious is quite sensible. 3 LEARNING INFECTED FILES PATTERNS We address infected files format patterns rather than the virus itself in this section. A virus is a piece of code, however it infects several files and changes their forms as an effect of the infection. Once the virus program infects several files in one system, existing files contain a piece of virus code and are spread as another infected virus

program. The following will examine the virus-infected programs structure and reflection of these virus-infected files by Self-Organizing Maps (SOMs) [5]. After examining virus-infected files structure and virus features, we believe there is a recognizable area like a signature. When we experimented with virus-infected files using a SOM, the result increased our confidence in virus detection using SOMs significantly. We aim to design the SOM in a way that neurons will flag the presence of peculiar patterns in data packets and that the position of the active neurons will reflect the position of potentially malicious content in the packet. We show the results of various virus-infected format in this section. For example, the Jerusalem virus of DOS EXE format and the Win95 CIH virus of Windows New EXE format is in Fig.3 and Macro Word97.Mbug virus is in Fig.4. 3.1 Parasitic Viruses : EXE format Parasitic Viruses are all the file viruses which have to change the contents of target files while transferring copies of themselves, but the files themselves remain to be completely or partly usable. The most common method of virus incorporation into a file is by appending the virus to the end of file. In this process the virus changes the top of file in such way that the virus code is executed first. This kind of appending is simple and usually effective. The virus writer does not need to know anything about the program to which the virus will append and the appended program simply serves as a carrier for the virus [8]. Fig. 2. Virus positions in DOS EXE/New EXE file. In the DOS EXE file header, the starting address is changed, and the length of the executable module is changed. Or less often the stack pointer registers are changed, then the virus goes to change the CRC(Cyclic Redundancy Checksum) part of the file and so on. In the Windows and OS/2 executables (newexe - NE, PE, LE, LX), the fields in the NewEXE header are changed. The structure of this header is much more complicated than that of a conventional DOS EXE file, so there are more fields to be changed; the starting address, the number of sections in the file, properties of the sections and so on.

In addition to that, before infection, the size of the file may increase to a multiple of one paragraph (16 bytes) in DOS or to a section in Windows and OS/2. The size of the section depends on the properties of the EXE file header. Fig.2 shows this case. Fig. 3. Self-Organizing Maps of DOS EXE/NEW EXE files infected by Jerusalem viruses, and Win95.CIH viruses. The Jerusalem virus is one of the oldest viruses. There are numerous variants of it. The Jerusalem virus infects both.exe and.com executable files except COM- MAND.COM [9]. Although each variant s SOM has different degree of weight, we found lower weighted figure of the SOM is in similar location; the lower weighted data of DOS EXE format are in middle-left side. (see Fig.3). The Win95.CIH (aka. Chernobyl) is a Windows95/98 specific parasitic virus infecting Windows PE (Portable Executable) files, about 1Kbyte of length. There are three original virus versions (1.2, 1.3 and 1.4) known, which are very closely related and only differ in few parts of their code. They have different lengths, texts inside the virus code and trigger date. Fig.3 shows the trained SOMs of Win95.CIH 1.2, 1.3 and 1.4 infected New EXE files. Each Win95.CIH has very similar location of lower weighted data, although other weighted data are distributed differently. 3.2 Macro Viruses Macro viruses are in fact programs written in macro languages, built into some data processing systems such like Microsoft Word, and Microsoft Excel spreadsheet. To propagate, such viruses use the capabilities of macro languages and with their help transfer themselves from one infected file, e.g. document or spreadsheet, to another. Macro viruses for Microsoft Word, Microsoft Excel and Office97 are most common. Fig.4 shows the case. File worms are able to infect Word documents, and Excel viruses can infect Excel spreadsheets. The same is true for Office97[10].

Fig. 4. Macro virus position in an infected document, and SOMs of Word97 file infected by Macro viruses. The Macro.Word97.Mbug virus is a class macro virus for Word97 documents and templates. The infected file is an MS Word, document file. Variants of the Macro.Word97.Mbug virus are presented in Fig.4 after training by the SOM. The lower weighted data distribution of Macro.Word97.Mbug is very clear, big portion of upper-middle side. We have to mention that Microsoft Word Version 6 and 7 allows to encrypt macros in documents [10]. Therefore some word viruses are present inside the infected documents in an encrypted and execute only form. 4 ADAPTIVE DETECTION OF EMAIL VIRUSES We have examined how to get the probability of being malicious when several evidences are given, how the SOM can train several format s infected files. In this section, we discuss how to apply these two factors to adaptive detection against email viruses. Fig.5 represents the process flow of detecting email viruses combined with a probability of being malicious. Decoded SMTP packets flow into individual check processes e.g. check header, check sender, check recipient, and check attachment. In the middle-centre of Fig.5, one of the processes to get the probability of being malicious is displayed. Solid boxes represent set TRUE. Each row represents H, Fr, To, and EF which are in correspondence with check header, check sender, check recipient, and check attachment. P(A) denotes only EF = T (meaning that there are executable attachments in this mail.), P(B) will be calculated by combined probabilities of H, Fr, and To. P(C) denotes the whole combined probability of the four factors. There are two different process flows. One is based on attachment existence, and the other is without an email attachment. In the first case, P(A), P(B) and P(C) are calculated, if we only consider an attachment in an email packet, i.e. we follow the first process flow. If the attachment is not executable, and if P(B) is over 0.4, this hints

Fig. 5. Process flow of detection email viruses combined with probability of being malicious. towards the fact that this mail could be a document or compressed file. We then need to check for macro viruses by SOMs if this is a document file. The Macro.Word97.Mbug and Melissa virus are covered by this category. If the attachment is executable, but it has no multiple extensions, and if only P(C) exists, it means that this executable file is likely to be infected. Therefore it needs to be checked for infected files by SOMs. The Win95.CIH (Chernobyl) is in this category. However, if not only P(C) but also P(A) exist, and if P(A) is over 0.7, it means that this mail probably contains a virus executable attachment. Nimda worm, W32/Goner and W32/Gibe are covered by this case. As Fig.5 shows, if the attachment is executable and has multiple extensions, it already means it contains with a very high probability abnormal and/or malicious code. In addition to this, if P(C) is over 0.8, we are almost certain that this attachment is a virus file. W32/SirCam, W32/MyParty, VBS/LoveLetter, and W32/BadTrans are in this category. Apart from an attachment part, if the mail does not include any attachment, we can only deal with P(B). If P(B) is over 0.41, this mail is very suspicious. The decision of dropping this mail depends on the security policy. Furthermore, if P(B) is over 0.9, it means this mail is almost 100% abnormal. Subsequently this probability based decision like in Fig.5 combined with determination of macro viruses or infected files based on SOMs will help to detect new viruses and variants of known viruses.

5 CONCLUSION AND FUTURE WORK Using the Naive Bayesian classifier, we analysed certain packet characteristics that allow us to attach probabilities of their maliciousness to data packets. The Bayesian network we have currently experimented with can classify packets of the SMTP protocol. To define the parameters of the Bayesian network we have exhaustively analysed file characteristics of virus attachments to emails. Moreover, we investigated the infected files patterns based on each file format using SOMs. Initial experiments show that this approach appears to be successful. Finally we proposed an adaptive detection combined with probabilities of being malicious and suggested necessity of macro viruses or infected files using SOMs in two situations. Although this study is very experimental approach, it addresses many relations between the SMTP protocol and the probability of being malicious, and between file formats and the SOMs of infected files. In addition, using this relations, we may predict email viruses. For future work, we will apply other protocols to detect other attacks using Bayesian networks, and find out infected pattern s point to distinguish between normal and infected files using SOMs. References 1. Symantec, Symantec worldwide homepage, 2002. Online Publication. 2. McAfee, Mcafee home page, 2002. Online Publication. 3. IS.Yoo and U.Ultes-Nitsche, intelligent firewall: Packet-based recognition against internetscale virus attacks., Conference on Communications and Computer Networks (CCN 2002), November 2002. 4. J. Pearl, Probabilistic Reasoning In Intelligent Systems: Networks of Plausible Inference. San Mateo, California: Morgan Kaufmann Publishers, Inc., 1988. 5. T.Kohonen, Self-Organizing Maps. Berlin, Heidelberg: Springer, 1995. 6. P. Langley and S. Sage, induction of selective bayesian classifiers, Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1994. 7. MATHWORKS, The mathworks, inc., 2003. MATLAB. 8. C. P.Pfleeger, Security in Computing. Prentice-Hall International, Inc., 1997. International Edition, Second Edition. 9. N. A. Inc, Virus information library : Jerusalem, 1987. Online Publication. 10. E. Kaspersky, Virus analysis texts - macro viruses, 2000. Online Publication.