Usage Guide to Handling of Bayesian Class Data

CAMELOT Security 2005

1. Basics

Classification of textual data has become much more important in recent years. The reason is the strong increase in unwanted mass email (SPAM). Regular text filters search the message text for known words and phrases. This is often no longer effective enough. Mass mailer applications now know how text filters work. They change their phrasing all the time to find backdoors and bypass the filter system. Words like VIAGRA become V*I*A*G*R*A or V1AGRA. Simple text filters fail at this point, because the known text patterns no longer appear in their original form.

Text data classification determines probabilities for different cases in order to find the best matching category. The calculation is based not only on the data itself but also on the co-occurrence of different data, e.g. the combination of different words or phrases in the message text.

The classification process is based on the Bayesian Theorem for conditional probability. This method is quite easy to use, but it needs a well structured set of test data. The quality of the classification results ultimately depends on the existing test data.

The end result of a Bayesian classification is never 100% certain; it only states the probability that an event applies. The result of a text pattern search, by contrast, is always 100% clear, but only when the procedure is used with known text patterns. The benefit of a Bayesian classification system in identifying spam is that the analysis returns usable results even with unknown data sources.
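The weakness of a literal text filter against rewritten words such as V1AGRA can be illustrated with a minimal sketch. The pattern list and the function name pattern_filter_hits are hypothetical and not part of CAMELOT; the sketch only shows the general effect.

```python
import re

# Hypothetical blocklist for a simple pattern filter; not taken from CAMELOT.
KNOWN_PATTERNS = ["viagra"]

def pattern_filter_hits(text: str) -> list[str]:
    """Return the known patterns that occur literally in the text."""
    lowered = text.lower()
    return [p for p in KNOWN_PATTERNS if re.search(re.escape(p), lowered)]

samples = ["Buy VIAGRA now", "Buy V1AGRA now", "Buy V*I*A*G*R*A now"]
results = {s: pattern_filter_hits(s) for s in samples}

for text, hits in results.items():
    # Only the unmodified spelling produces a hit.
    print(f"{text!r}: {'blocked' if hits else 'passed'}")
```

Only the first sample is caught. The two rewritten spellings pass the literal filter, which is the gap a probabilistic classifier is meant to close.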

2. The Bayesian Theorem for Conditional Probability (Naive Bayes)

The probability calculation is based on the Naive Bayes Method. Naive Bayes is a procedure that classifies data based on the Bayesian Theorem for conditional probability. The classification system is first trained with data from different classes. The attributes in the training data are used to calculate relative probabilities, and these relative probabilities are assigned to the corresponding classes. The results are then stored in a special database that serves as the foundation for classifying all new data.

Since the probability of one event is conditional on the probability of a previous one, the predefined classes can be used to classify all new data that is received. New data is classified based on the stored test data. The classifier calculates the probabilities for each single word or phrase. Each probability results from the frequency of the class data and the frequency of the words in the message text. The class with the highest probability determines the end result.

It is very important to note that a Naive Bayes system is unusable in its original, untrained state, since the classification process and its success rate depend on the available data. The system therefore first needs to learn what is considered spam and what is not. The pool of training data must contain enough recognizable attributes to allow the Bayesian classifier to work properly. This is also the actual benefit of the method: existing data can be expanded by well directed training and adapted to your personal needs.

The definition of Naive Bayes requires the independence of all given attributes. This may not always be the case; however, the procedure has historically provided excellent results, with data records not assigned to just a single class but given a probability of belonging to each of several classes. So a Bayesian classification system returns usable results even with unknown data sources.
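The training and classification steps described in this chapter can be sketched as follows. This is a minimal illustration of a word-count Naive Bayes classifier, not CAMELOT's implementation. The class name NaiveBayesSketch, the whitespace tokenization, and the Laplace smoothing are assumptions made for the example.

```python
import math
from collections import Counter

class NaiveBayesSketch:
    """Minimal word-count Naive Bayes; illustrative only, not CAMELOT's code."""

    def __init__(self) -> None:
        self.word_counts = {"VALID": Counter(), "INVALID": Counter()}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        # Store the word frequencies per class.
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text: str) -> dict[str, float]:
        if any(self.doc_counts[label] == 0 for label in self.word_counts):
            raise ValueError("train both classes before classifying")
        vocab = set(self.word_counts["VALID"]) | set(self.word_counts["INVALID"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Class prior from the share of training messages.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                # Laplace smoothing (an assumption) keeps unseen words
                # from forcing the whole product to zero.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Convert log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}

clf = NaiveBayesSketch()
clf.train("cheap pills buy now", "INVALID")
clf.train("meeting agenda for monday", "VALID")
result = clf.classify("buy cheap pills")
print(result)
```

The result is a probability for each class rather than a yes/no answer, and the class with the highest probability wins. The sketch refuses to classify until both classes have been trained, which mirrors the requirement that the system must first learn both spam and non-spam.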

3. The CAMELOT Bayesian Text Classifier

CAMELOT uses the Naive Bayes Method for classification of textual data. Together with the regular text pattern analyzer, SPAM can be reduced to a minimum. The text pattern analyzer runs first, detects known text patterns, and continues with the corresponding action for the respective message. All content it does not identify is then classified by the Bayesian Text Classifier.

The required test data is stored in special data stores. The system differentiates between global and user dependent data. These data stores can be expanded or reduced by several training methods, usually with messages declared as SPAM or NOT SPAM.

To create an optimized basic dataset, the quality of the training data itself is very important. The administrator needs to review the data to find the right messages to train. Only well structured training data yields a functional data store. The required data store must be created and maintained by the administrator. User dependent data can be managed by the users themselves, but the administrator should always keep an eye on it to guarantee a clean environment.

Badly structured Bayesian class data can cause unwanted results. If clean messages are wrongly trained as INVALID, all messages with similar contents will also be classified as SPAM. This can cause a loss of messages.
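The order of the two stages, pattern analyzer first and Bayesian classifier second, can be sketched as below. The function names, the pattern list, and the stand-in classifier are hypothetical; how CAMELOT connects the two components internally is not described here.

```python
from typing import Callable

# Hypothetical list of known text patterns.
KNOWN_SPAM_PATTERNS = ("viagra", "cheap pills")

def filter_message(text: str, bayes_classify: Callable[[str], str]) -> str:
    """Run the pattern stage first and fall back to the Bayesian stage."""
    lowered = text.lower()
    # Stage 1: a known pattern gives an immediate, definite result.
    if any(pattern in lowered for pattern in KNOWN_SPAM_PATTERNS):
        return "SPAM"
    # Stage 2: everything else is handed to the probabilistic classifier.
    return bayes_classify(text)

def toy_classifier(text: str) -> str:
    # Stand-in for a trained Bayesian classifier (assumption for the demo).
    spam_words = {"winner", "prize", "free"}
    words = text.lower().split()
    hits = sum(word in spam_words for word in words)
    return "SPAM" if words and hits / len(words) >= 0.5 else "NOT SPAM"

print(filter_message("Buy Viagra today", toy_classifier))
print(filter_message("free prize winner", toy_classifier))
print(filter_message("Meeting at noon", toy_classifier))
```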

4. The CAMELOT Bayesian Data Manager

The CAMELOT Bayesian Data Manager is a management tool for Bayesian Class Data. The administrator can use this application to train and optimize Bayesian data stores. The Bayesian Data Manager returns important information about the consistency of the data and contains a classification feature that simulates classification processes in the same way as the security service works. The handling of Bayesian Class Data and the usage of the Bayesian Data Manager are explained in the following chapters.

4.1 Preliminaries

The configuration of Bayesian Class Data needs a few preparations first. The first is an appropriate data store, which must be created. The second is a well structured pool of training data. This is usually a pool of messages with known content, declared as SPAM and NOT SPAM.

The usage of Bayesian classification requires the following basic rules:

The class data for valid and invalid contents should always be trained approximately equally. That means that the portion of one class should not be significantly bigger than the other class. A well known mistake is to train only SPAM messages. Some vendors of anti spam software deliver their products only with SPAM data. The result is that all messages will be classified as SPAM. Useful classification results can only be achieved when the classification feature knows the difference between SPAM and NOT SPAM. This is only possible when the database contains both valid and invalid data.

The second important rule is that you train the data store yourself. Even valid data differs widely from one organization to another. Messages of companies in the medical sector consist primarily of medical terms. Messages of a travel agency contain information about traveling. So the class data of a travel agency would never work well for a medical company. Software products with pre-defined class data fail more often than not for this reason.

4.1.1 Creating a Data Store with Bayesian Class Data

The CAMELOT setup wizard creates two Bayesian data stores automatically during the installation process. Both have the name Bayesian Class Data, but the type is different. One of them is for global data and the other one is used for user dependent data. For your first steps, you should use the global profile.

If you have no Bayesian Class Data profile in your environment, you can create one in a few minutes. Start the CAMELOT Configuration Utility and select the Data Stores tab. Click on Add to create a new profile. The data store wizard comes up. On the second wizard page, you need to select the type of data to use with the store. Select Bayesian Classifier Definition data and click on Next. On the following wizard page, you need to select whether it should be a global or a user dependent data store. Please select Global Data. The values for Name and Description on the next page are arbitrary.

On the following pages, you need to select a data source. It is strongly recommended to use a database at this point. Bayesian Class Data is very memory intensive, so fast data access is very important. The required database tables are created automatically by the data store wizard.

4.1.2 Assembling the Training Data

For the first startup, a number of text files or email messages is needed. These files must be well sorted with regard to their contents. Create two directories on the hard disk for this purpose, named CLEAN and SPAM. These directories will be used later to store the files with valid (CLEAN) and invalid (SPAM) data. Now copy the files with the appropriate contents to these directories.

The invalid data directory should contain only files with invalid contents. Use only files whose contents you consider to be SPAM. You can use regular text files or RFC822 compatible message files, as you can find them in the CAMELOT directories. The Quarantine directory may contain useful data.

The directory for valid data should contain only files with clean data. These are email messages from regular email transactions. These files should contain nothing that looks like SPAM.

This initial selection is very important, because this data is the base for later classifications. For your first tests, about 20 or 30 files in each directory should be enough. The test data should be kept balanced at all times. An excess of valid or invalid data can bias the classification process toward one side.
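The directory setup and the balance requirement can be checked with a small script before training. This is a sketch under assumptions: the helper names and the 20% tolerance are hypothetical choices for the example, and the guide itself only asks for approximately equal pools.

```python
from pathlib import Path

def count_training_files(base: Path) -> dict[str, int]:
    """Create CLEAN and SPAM under base if missing and count the files in each."""
    counts = {}
    for name in ("CLEAN", "SPAM"):
        folder = base / name
        folder.mkdir(parents=True, exist_ok=True)
        counts[name] = sum(1 for entry in folder.iterdir() if entry.is_file())
    return counts

def is_balanced(counts: dict[str, int], tolerance: float = 0.2) -> bool:
    """True if the smaller pool is within `tolerance` of the larger one.

    The default tolerance of 20% is an assumption, not a CAMELOT setting.
    """
    larger = max(counts.values())
    if larger == 0:
        return True
    return (larger - min(counts.values())) / larger <= tolerance

# Example with counts that might result from a first collection of files.
print(is_balanced({"CLEAN": 25, "SPAM": 30}))
print(is_balanced({"CLEAN": 5, "SPAM": 30}))
```

To check real directories, call count_training_files with the folder that should hold CLEAN and SPAM and pass the result to is_balanced.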

4.2 Starting the Bayesian Data Manager

Start the Bayesian Data Manager from the CAMELOT group of the Windows start menu. The main window comes up. Click on Open to open a data store. The store list shows all existing data stores that are suitable for Bayesian class data. If you want to select a user dependent data store, you also need to enter a user email address to initialize the store.

The data store is opened in active mode by default. In this case, the Passive mode option is not set. In active mode, a database cursor is created but no data is loaded. The required data is loaded only when it is needed. This mode is very useful when your database is very big.

Enable the Passive mode if you want to load the whole database into memory. In this case, all operations run much faster. Your changes are also not saved immediately to the database. The data is saved when you click on Save. This option is not recommended for big databases, because the load process can take a very long time.

For your first test, please open an empty data store. You can use the store created by the CAMELOT setup wizard or the one you've created yourself. In both cases, you should open a global data store. The Passive mode is also recommended at this point. This allows you to experiment with all the features without destroying any data in the database.

After you've opened the data store, all list boxes on the left pane stay empty. The statistics on the right hand side do not show any information either, because there is no data in the database.

4.3 Building the Data Store by Training with Existing Data

Select the Training button from the toolbar. On the right pane, you will see the training feature. This feature allows you to expand or reduce your data store with existing data. There are basically two classes, VALID and INVALID. All data trained in this way receives its properties for later classification.

Click on the Add File button and select all files located in your valid data directory. Then click on Train Valid to expand the data store with valid class data. All data trained in this way receives the property VALID.

Now click on Clear List and select all the files from your invalid data directory, using the Add File button. Train these files by clicking on the Train Invalid button. All selected files now receive the INVALID property.

The lists on the left hand side now contain the trained data. There are two lists, one for invalid and one for valid data. The probability column shows the probability for each single word. This value depends on the frequency of the word itself as well as on the number of words in the data store. If the frequency of only one word changes, the probability of all words in the store changes.
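Why one changed frequency shifts every probability in the store can be seen in a small worked example. It assumes the simplest possible definition, a word's count divided by the total count of all words; the exact formula used by CAMELOT is not given in this guide.

```python
from collections import Counter

def word_probabilities(counts: Counter) -> dict[str, float]:
    """Relative frequency of each word (assumed definition: count / total)."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

store = Counter({"offer": 2, "free": 1, "click": 1})
before = word_probabilities(store)

# Training one more occurrence of "offer" changes the total word count...
store["offer"] += 1
after = word_probabilities(store)

# ...so the probabilities of "free" and "click" drop although their
# own counts are unchanged.
for word in store:
    print(f"{word}: {before[word]:.2f} -> {after[word]:.2f}")
```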

The training statistics in the lower pane show two graphs. The right one shows how much data was added to the database and how much was modified. Data is modified when the word already exists in the database. The word is not added a second time; instead, the probability of the existing record is increased. If the Untrain Valid or Untrain Invalid feature is used, the graph shows the number of records deleted from the database.

The left graph shows the part of the original data used for training. Because of a special Stopwords List, not all data is used for training. The textual data is normalized first by removing all words like AND, OR, THE, etc. from the original text. Since such words occur very often, their basic probability would be much higher than the probability of other words. The classification would be dominated by these words and the result would be unusable. The stopwords list is also shown on the left hand side. You can modify this list by adding or deleting words. A detailed stopwords list is very important for good classification results.

The No duplicates option on the right pane prevents duplicated training data. If one file is trained multiple times, its data is counted multiple times, so the basic probability would also be much higher. This would push the tendency of the data store in the wrong direction. The option can only be used if your data store works with a CRC Table. This table is used to register all processed text contents. If no CRC Table is used, the system is not able to check for duplicates.
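Both mechanisms on this page, stopword normalization and duplicate detection by checksum, can be sketched in a few lines. The stopword set, the tokenization, and the use of CRC-32 from Python's zlib module are assumptions; the guide does not state which CRC variant or table layout CAMELOT uses.

```python
import re
import zlib

# Hypothetical and deliberately short stopword list.
STOPWORDS = {"and", "or", "the", "a", "of", "to"}

def normalize(text: str) -> tuple[list[str], float]:
    """Drop stopwords; return the kept words and the share of words kept."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    kept = [word for word in words if word not in STOPWORDS]
    share = len(kept) / len(words) if words else 0.0
    return kept, share

# Stand-in for the CRC Table: checksums of all texts processed so far.
seen_crcs: set[int] = set()

def is_duplicate(text: str) -> bool:
    """Register the text's CRC-32 and report whether it was seen before."""
    crc = zlib.crc32(text.encode("utf-8"))
    if crc in seen_crcs:
        return True
    seen_crcs.add(crc)
    return False

kept, share = normalize("The price of the offer and the prize")
print(kept, f"{share:.1%} of the words used for training")

first = is_duplicate("The price of the offer and the prize")
second = is_duplicate("The price of the offer and the prize")
print(first, second)
```

The share returned by normalize corresponds to the idea behind the left graph: the part of the original text that remains after the stopwords are removed. A checksum can in principle collide for two different texts, so a production system might pair it with a second check.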

4.4 The Pool Statistics

The Statistics button on the toolbar opens the statistics information tab for the current data store in the right window pane. This area did not contain any data when the application was started for the first time. After successful training with valid and invalid data, the data store has changed its properties. The table now contains the corresponding values for VALID and INVALID pool data. You can see the size of the data store and its consistency.

Two properties are very important here: the relationship between valid and invalid data, and the portion of single word records in the database.

The relationship between valid and invalid data is defined by the probability of both classes. This is not the same as the number of words in the data store. The basic probability of the data store is the sum of the probabilities of each single word. It is calculated from the count of each single word and the count of all words in the data store. So a large number of database records does not indicate anything about the tendency of the class. Classes with a low number of words can be rated much higher than classes with more words, because single words carry stronger weights.

The basic rating of the data store is essentially equal to the tendency of the store. This is shown in the left graph. If one of the classes is too heavy, the tendency of the store points in the direction of this class. If the valid class is rated much higher than the invalid class, the data store tends to the valid side. In this case, the hit rate for spam detection would be very low. If the invalid class is rated too high, the spam detection works fine, but the number of false positive classifications grows. In this case, valid messages are classified as SPAM. It is very important to keep your data store well balanced. Large tendencies to one side must be prevented.
Minor tendencies do not harm the classification results, provided that the minimum score in your policy profile is not too low.

The second graph shows the portion of single word records in the data store. The value is interesting for big databases. An overloaded data store with Bayesian class data will return improper results. If the amount of test data is too big, the dataset is almost equal to the training data, but far away from reality. For this reason, the data store must be optimized from time to time. This process is called Pruning. The term is also used in landscaping, when trees get their branches pruned. In this process, it is always the small branches that are pruned. The same thing happens when the data store gets optimized. Single word records appear only one time in the database. These records do not have a high probability value, but the classification can change its direction when many single words exist. Therefore, these records must be pruned. The right graph shows how big the portion of single word records is. The data store should be pruned when this value is too high.

Note: The pruning process is only reasonable for large datasets. Smaller datasets naturally consist of many single word records. This is not a problem. It can become a problem when the amount of data grows quickly. In this case, the connection to real life gets lost. Such datasets can be optimized by pruning single word records.
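The two pool statistics discussed in this chapter can be approximated with the following sketch. It assumes that the weight of a class is the sum of its word occurrence counts and that a single word record is a record with a count of 1. Both definitions are assumptions for the illustration; the exact rating formula of the Bayesian Data Manager is not documented here.

```python
from collections import Counter

def pool_statistics(valid: Counter, invalid: Counter) -> dict[str, float]:
    """Class balance and share of single word records (assumed definitions)."""
    valid_weight = sum(valid.values())
    invalid_weight = sum(invalid.values())
    total_weight = valid_weight + invalid_weight
    # One record per distinct word and class.
    records = list(valid.values()) + list(invalid.values())
    singles = sum(1 for count in records if count == 1)
    return {
        "valid_share": valid_weight / total_weight if total_weight else 0.0,
        "invalid_share": invalid_weight / total_weight if total_weight else 0.0,
        "single_word_share": singles / len(records) if records else 0.0,
    }

valid_pool = Counter({"meeting": 4, "agenda": 1, "report": 3})
invalid_pool = Counter({"offer": 5, "free": 2, "winner": 1})
stats = pool_statistics(valid_pool, invalid_pool)

for name, value in stats.items():
    print(f"{name}: {value:.1%}")
```

In this example both classes carry the same weight, so the store has no tendency to either side, and one third of the records are single word records.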

4.5 Data Classification

The Classification feature is used to simulate classifications of textual data. This feature works in the same way as the CAMELOT Security service does. This makes it possible to check the data store against specific text data. The text can be entered directly into the text box on the Text data classification tab or loaded from a file with the Import button. The File data classification tab is used for classification of multiple files. In this case, the textual data is loaded directly from file.

The classification feature is well suited to checking the previously trained data. For your first test, please click on the Import button and select a file from your valid data directory. The text appears in the text box. The HTML Mode option is not needed in this case. You should use that option when the text contains HTML code. The import feature decodes the HTML parts automatically when the file is loaded.

Click on Classify to start the classification process. All single hits are now tagged in green and red, where green means valid and red means invalid hits. Please note that the balance of the colors does not necessarily match the classification result. The frequency of the words is only one factor. The other is the basic probability of every single word.

The lower pane again shows two graphs. The left one is similar to the one in the training feature. It shows the part of the data that was processed. The classification process also normalizes the text data first, based on the stopwords list. You can test this process yourself with the Normalize button. The text box will show the reduced version of the original text. Another classification would now return a 100% rate of processed data.

The right graph shows the classification result. This should be almost 100% Valid in your case, because you've selected a file from your valid data directory. You can repeat the same process with a file from your invalid data directory. In this case, your result will be almost 100% Invalid. This test can be repeated with other files from your training directories. The result will always look similar.

The test becomes more interesting when you classify independent files, that is, files that do not originate from your training directories. The result will be somewhere between 100% Valid and 100% Invalid. This is a typical case of classification of unknown data, as it happens every day in real life.

The results of a Bayesian classification are given as a value between 0 and 100%. The decision whether a message is VALID or INVALID is based on the minimum score of the policy profile. This value should not be too close to the 50% mark.

The File data classification feature is basically identical to the operations above. However, this feature can process multiple files at once. In this case, the result is the total result of all files.
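How the minimum score turns a probability into a decision can be sketched as follows. The function name decide and the rule "INVALID when the invalid probability reaches the minimum score" are assumptions; the precise semantics of the minimum score in a CAMELOT policy profile are not spelled out in this guide.

```python
def decide(invalid_probability: float, minimum_score: float) -> str:
    """Return INVALID only if the spam probability reaches the minimum score."""
    if not 0.0 <= invalid_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "INVALID" if invalid_probability >= minimum_score else "VALID"

# Hypothetical classification results for three unknown messages.
scores = [0.48, 0.55, 0.93]

near_half = [decide(score, 0.5) for score in scores]
stricter = [decide(score, 0.8) for score in scores]

print(near_half)
print(stricter)
```

With a minimum score right at 50%, the borderline message at 0.55 is already treated as INVALID. With a stricter value, only the clear case at 0.93 is affected, which illustrates why a value too close to 50% is risky.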

4.6 Optimizing Bayesian Class Data (Pruning)

The optimization of Bayesian class data is very important, especially for large datasets. Using an overloaded data store can return very improper results. The store data is then almost equal to the training data and out of touch with reality. The imprecision is caused mostly by so-called single word records. These words appear only one time in the database. Their basic probability is rather low, but a high number of such records can change the result. Many of these words can be kept out by a suitable stopwords list, but sometimes a database optimization cannot be avoided.

The optimization process is called Pruning. Landscape gardeners know that term from their yearly pruning process, when they cut the thin branches from the trees. They always cut the thin branches so that the tree does not run to seed. The pruning process for a Bayesian data store works in the same way.

The table in the Pruning tab contains the occurrence of single word records. The graph below shows the distribution of word counts. This curve always tends downwards. That means that the words with the lowest count appear most often. In your test dataset, this must be the count of 1.

With the help of the ruler on the right hand side, you can scale the horizontal and vertical area of the graph, so you need to find the right settings for your dataset. The axis range from 1 to 5 is probably the interesting one, because your dataset is quite small. The option Logarithmic scaling is useful to make the jumps stand out.

To test this feature, please set the pruning level to 1 and click on Start Pruning. The single word records are now removed from the database. After the process is finished, the graph changes. The lowest value is no longer 1; it is 2 now. The curve is also flatter now. This indicates a better consistency of the data store.
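The effect of pruning at level 1 can be reproduced with a small sketch. It assumes that pruning removes every record whose occurrence count is at or below the pruning level; the store contents and function names are made up for the illustration.

```python
from collections import Counter

def occurrence_histogram(store: Counter) -> dict[int, int]:
    """Map each occurrence count to the number of records that have it."""
    return dict(sorted(Counter(store.values()).items()))

def prune(store: Counter, level: int) -> Counter:
    """Return a copy without records whose count is at or below `level`."""
    return Counter({word: n for word, n in store.items() if n > level})

# Hypothetical data store: word -> occurrence count.
store = Counter({
    "offer": 5, "free": 3, "click": 2, "deal": 2,
    "xqzt": 1, "promo77": 1, "zzz": 1,
})

hist_before = occurrence_histogram(store)
pruned = prune(store, level=1)
hist_after = occurrence_histogram(pruned)

print("before:", hist_before)
print("after: ", hist_after)
```

Before pruning, the count of 1 is the most frequent entry in the histogram. Afterwards the lowest remaining count is 2, matching the change described for the graph.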

Note: This process is only reasonable for large datasets. Your data store has just been optimized, but in fact important records were lost. For this reason, pruning a small dataset is not a real optimization. Important data was deleted, and this data was a fundamental part of the store's substance.