Weka
Valeria Guevara
Thompson Rivers University


Author Note: This is a final project for COMP 4910 of the Bachelor of Computing Science at Thompson Rivers University, supervised by Mila Kwiatkowska.

Abstract

This project focuses on document classification using text mining, through a classification model generated with the open-source software WEKA. This software is a repository of machine learning algorithms for knowledge discovery. Weka makes it easy to preprocess the training documents and to compare different algorithm configurations. The accuracy of the generated predictive model is measured with a confusion matrix. This project illustrates text-mining preprocessing and classification using WEKA. The result is the development of a tool that generates the ARFF input data files, and of a video tutorial on document classification in Weka in English and Spanish.

Keywords: Weka, document classification, ARFF, stopwords, tokenizer, pruning, C4.5 decision tree, word vector, text mining, F-measure, machine learning, text classification, stemming, knowledge society.

Weka

The weka is a native New Zealand bird that does not fly but has a penchant for shiny objects [30] Newzealand.com (2015). Old legends from New Zealand tell that these birds steal shiny items. The University of Waikato in New Zealand started the development of a tool with that name because it would contain algorithms for data analysis. Currently the WEKA package is a collection of machine learning algorithms for data mining tasks. The Waikato Environment for Knowledge Analysis package contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization [31] Hall, M., Frank, E., Geoffrey H., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The software analyzes large amounts of data and decides which parts are the most important. It aims to make automatic predictions that support decision making.

Weka vs. Other Machine Learning Tools

There are other tools for data mining, such as RapidMiner, IBM Cognos Business Intelligence, Microsoft SharePoint and Pentaho. IBM Cognos Business Intelligence provides a display that is not very user-friendly. Microsoft SharePoint creates predictive models for business mining, but mining is not its main objective. RapidMiner offers an excellent display of results, but datasets load more slowly than in Weka. Pentaho's graphical interface does not describe its options as clearly as Weka does. Weka implements machine learning techniques in easy-to-learn Java under a GNU General Public License. WEKA provides three ways to be used: through its graphical interface, through command line interfaces, and from application code through its Java API

interface. Although WEKA has not been used primarily for prediction problems in business, it helps in the construction of new algorithms. It therefore turns out to be a very suitable tool for initial data analysis, classification, clustering, and research. In this project the Weka tool is used to create a predictive model using text classification algorithms from machine learning.

Installation

Weka can be downloaded at: (this document refers to the latest version of Weka). The same URL provides instructions for installation on different platforms. In Windows, Weka is located in the program launcher in a folder named after the downloaded version; in this case the latest version is weka-3-6. Weka's default directory is the same directory from which the file is loaded. In Linux, open a terminal and type: java -jar /installation/directory/weka.jar. It is common to find an insufficient-memory error, which is resolved by specifying the heap size in the setup files, for example "-Xmx2048m" for 2 GB. Further information can be found at weka.wikispaces.com/outofmemoryexception. The memory can be set with the -Xms and -Xmx parameters, indicating the minimum and maximum RAM respectively. In Windows, you can edit the RunWeka.bat or RunWeka.ini file in the installation directory and change the maxheap line, for example from maxheap=128m to maxheap=1024m. You cannot assign more than about 1.4 GB to the JVM. You can also assign memory to the virtual machine with the command: java -Xms<minimum-memory>m

-Xmx<maximum-memory>m -jar weka.jar [32] Garcia, D., (2006). In Linux the -XmxMemorySizem option is used, replacing MemorySize with the required size in megabytes, for instance: java -Xmx512m -jar /installation/directory/weka.jar.

Running Weka

The first screen shows a chooser of interfaces called "Applications", where in this version the Explorer, Experimenter, KnowledgeFlow and Simple CLI tools are offered. Explorer is responsible for conducting exploration operations on a data set. Experimenter performs experiments and statistical tests to compare different algorithms on different data in an automated manner. KnowledgeFlow shows Weka's operations graphically on a work panel. Simple CLI is a simple client that provides the command line interface for entering commands. The main user interface, "Explorer", consists of six panels. Preprocess is the first window that opens in this interface; in this window the data are loaded. Weka accepts data sets loaded from a URL, a database, CSV files or ARFF files. The ARFF file is the primary format for any classification task in WEKA.

Input data. As previously described, three data inputs are considered in data mining: concepts, instances and attributes. An Attribute-Relation File Format (ARFF) file is a file that describes a list of instances of a concept, with their respective attributes. These files are used by Weka for text classification and clustering applications.

ARFF files. These files have two parts: the header information and the data information. The first section contains the name of the relation together with the attributes (name and type). The relation name is defined in the first line of the ARFF file, where the relation name is a string, in the following format: @relation <relation-name>. The next section contains the attribute declarations. This is an ordered sequence of statements, one for each attribute of the instances. These statements uniquely define an attribute's name and its data type. The order in which the attributes are declared indicates their position in the instances: for example, the attribute declared in first position is expected to hold the first value stated in every instance. The format of the declaration is: @attribute <attribute-name> <data-type>. Weka supports several data types: i) NUMERIC: all real numbers, where the separator between the decimal and integer parts is a point, not a comma. ii) INTEGER: treated as numeric. iii) NOMINAL: provides a list of possible values, for example {good, bad}. These express the possible values the attribute can take, declared as: @attribute <attribute-name> {<nominal1>, <nominal2>, <nominal3>, ...}. iv) STRING: a sequence of text values; these attributes are declared as: @attribute <attribute-name> string. v) DATE: dates and times, declared as: @attribute <name> date [<date-format>].

Here <name> is the name of the attribute and <date-format> is an optional string, consisting of characters, hyphens, spaces and time units, that specifies how the date values should be parsed. The default format accepts the ISO-8601 combination yyyy-MM-dd'T'HH:mm:ss, for example: @attribute timestamp date "yyyy-MM-dd HH:mm:ss". vi) RELATIONAL: attributes that hold data for multiple instances, declared in the following form: @attribute <name> relational <attribute definitions> @end <name>. There are rules on the attribute declarations: a) The names of relations and attributes, as strings, must be enclosed in double quotes if they include spaces. b) Neither attribute nor relation names can start with a character below \u0021 in ASCII, or with '{', '}', ',', or '%'. c) Values that contain spaces must be quoted. d) The keywords numeric, real, integer, string and date are case insensitive. e) Relational data must be enclosed in double quotes. The second section is the data declaration, declared on one line as @data. Each line below it represents an instance, with the attribute values separated by commas. The attribute values must appear in the same order in which the attributes were declared in the attribute section. Missing values are represented with a question mark "?". String values and nominal attributes are

distinguished between upper and lower case. Any value that contains a space must be quoted. Comments are delimited by the character "%" and run to the end of the line. In text classification, ARFF files represent the entire document as a single text attribute of type string. The second attribute to consider is the class attribute, which defines the class the instance belongs to. This attribute can be of type string or nominal. An example of the resulting file, with the document text of type string and a nominal class of two values, is:

@attribute documentText string
@attribute class {english, spanish}
@data
'texto a clasificar aquí...', spanish
'Classify text here...', english

Data preprocessing. In this window, data are loaded and may be edited. Data can be modified manually with the editor or with filters. Filters are learning-technique methods that modify the data set. Weka has a variety of filters structured hierarchically into supervised and unsupervised, where the root is weka. These filters are divided into two categories, attribute and instance, according to the way they operate on the data. As pointed out earlier, these techniques are classified depending on the input data relationships. Unsupervised learning techniques, as descriptive inductive models, do not know the correct classification; this means the instances do not require an attribute that declares the class. Inductive techniques of predictive supervised learning depend on the class values: the instances contain a class attribute stating the class to which they belong.
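The two-attribute ARFF layout for document classification described earlier can also be produced programmatically. The following is a minimal sketch, not the project's actual C# tool; the relation, attribute and category names are invented for illustration.

```java
// Minimal sketch: build the two-attribute ARFF layout used for document
// classification (one string attribute for the text, one nominal class).
public class ArffSketch {

    static String buildArff(String relation, String[] classes, String[][] rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute documentText string\n");
        sb.append("@attribute class {").append(String.join(",", classes)).append("}\n\n");
        sb.append("@data\n");
        for (String[] row : rows) {
            // String values containing spaces must be quoted; escape embedded quotes.
            sb.append("'").append(row[0].replace("'", "\\'")).append("',")
              .append(row[1]).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String arff = buildArff("documents",
                new String[]{"english", "spanish"},
                new String[][]{
                        {"Classify text here", "english"},
                        {"texto a clasificar aqui", "spanish"}});
        System.out.println(arff);
    }
}
```

Each row becomes one quoted instance line under @data, in the attribute order declared in the header.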

In the Current relation module, the loaded dataset is described by its name and number of instances. Attributes allows selecting attributes using the options All, None and Invert, and it further provides the option to enter a regular expression. The Selected attribute part displays information about the selected attribute. At the bottom, a histogram of the attributes selected in Attributes is illustrated.

Preprocessing for classifying documents

In Weka it is possible to create models that classify documents into previously analyzed categories. Documents in Weka usually need to be converted into "text vectors" before applying machine learning techniques. The easiest way to represent text for this purpose is as a bag of words, or word vector [34] Namee, B. (2012). The StringToWordVector filter performs the process of converting a string attribute into a set of attributes that represent the occurrence of words in the full text. The document is represented as a text string in a single attribute of type string.

StringToWordVector Filter

This is the fundamental text analysis filter in WEKA. The class offers abundant natural language processing choices, including the use of stemming for the corpus at hand, custom tokenizers, and various stopword lists. At the same time, it can calculate term-frequency and TF.IDF weights. StringToWordVector places the class attribute at the top of the list of attributes; to change the order, the Reorder filter can be used. This filter can be configured with all the linguistic natural-language-processing techniques applicable to attributes. The StringToWordVector filter can be applied in batch mode from the command line as follows:

java -cp /aplicaciones/weka-3-6-2/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i datos_entrenamiento.arff -o vector_datos_entrenamiento.arff -r datos_prueva.arff -s vector_datos_prueva.arff

Here datos_entrenamiento is the training set, vector_datos_entrenamiento is the training set vector, datos_prueva is the test set and vector_datos_prueva is the test set vector. The -cp option puts the Weka jar in the class path, -b indicates batch mode, -i specifies the training data file, -o the output file after processing the first file, -r the test file and -s the output file for the processed test file. The options can be modified in the user interface by clicking on the filter name beside the Choose button, having previously selected the filter with the Choose button. The weka.filters.unsupervised.attribute.StringToWordVector window then shows the following options, to be modified according to the needs of the documents to be classified:

IDFTransform
TFTransform
attributeIndices
attributeNamePrefix
doNotOperateOnPerClassBasis
invertSelection
lowerCaseTokens
minTermFreq
normalizeDocLength
outputWordCounts
periodicPruning
stemmer
stopwords
tokenizer
useStoplist
wordsToKeep

Weka.sourcearchive.com [39] provides a mental map of these Weka options, shown in the following illustration:

wordsToKeep: Defines the limit N of words to keep per class, if there is a class attribute. Only the N most common terms among all the values of the string attribute will remain. Higher values mean lower efficiency, because learning the model will take more time.

doNotOperateOnPerClassBasis: Flag to keep all relevant words for all classes. It is set to true when the maximum number of words and the minimum term frequency should not apply per class value of an attribute, but instead be based on all classes together.

TFTransform: Term frequency (TF) transformation. When this flag is set to true, the filter executes the term-frequency transformation used to represent textual data in a vector space. TF is a numerical measure of the relevance of the words of a text; it considers not only the relevance of a single term in itself, but also its relevance in the entire collection of documents. Mathematically it is represented as the function TF(t, d), which for the term t in the document d is: log(1 + frequency of word t in the instance, or document, d). The inverse document frequency (IDF) is based on the number of documents in which the term t appears; it relates words across documents in terms of log(1 + f_ij), where f_ij is the frequency of word t in document (instance) j.

IDFTransform: Inverse document frequency (IDF) transformation. Setting this flag to "true" defines the use of the following equation:

writing the frequency of word t in instance d as f_td, the result is: f_td * log(number of documents / number of documents with word t). This is explained by considering the set D that includes all documents in the collection, represented as D = {d1, d2, ..., dn}. It finds the documents most relevant relative to the others as f_ij * log(nº of documents / nº of documents with word i), where f_ij is the frequency of word i in document j. Multiplying TF by IDF assigns more weight to the terms with greater frequency in a document that are at the same time relatively rare in the collection of documents [33] Salton, G., Wong, A., & Yang, C. (1975).

outputWordCounts: Counts the occurrences of words in the string; the default setting reports only presence or absence as 0/1. The result is a vector where each dimension is a different word, and the value in that dimension is a binary 0 or 1, that is, whether or not the word is in that document. The frequency of the word in the document is represented as an integer with the options IDFTransform and TFTransform set to "False" and outputWordCounts set to "True"; this enables an explicit word count. It is set to "false" when only the presence of a term matters, not its frequency. To calculate TF*IDF, IDFTransform must be set to True, TFTransform to False and outputWordCounts to True. To achieve log(1 + TF) * IDF, TFTransform must also be set to True.

normalizeDocLength: Set to true to determine whether the word frequencies in an instance should be normalized. Normalization is calculated as: actual value * average document length / document length.
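The three transformations just described (TF, IDF, and document-length normalization) can be sketched in a few lines. This is an illustrative sketch of the formulas above under the stated definitions, not Weka's implementation.

```java
public class TfIdfSketch {
    // TFTransform as described above: log(1 + frequency of term t in document d)
    static double tf(int freq) {
        return Math.log(1 + freq);
    }

    // IDFTransform as described above:
    // f_td * log(number of documents / number of documents containing t)
    static double idf(int freq, int numDocs, int docsWithTerm) {
        return freq * Math.log((double) numDocs / docsWithTerm);
    }

    // normalizeDocLength: actual value * average document length / document length
    static double normalize(double value, double avgDocLength, double docLength) {
        return value * avgDocLength / docLength;
    }

    public static void main(String[] args) {
        // a term appearing twice in a document, present in 2 of 10 documents
        System.out.println(tf(2));          // log(3)
        System.out.println(idf(2, 10, 2));  // 2 * log(5)
    }
}
```

A term that appears in every document gets an IDF weight of zero, which is exactly the behavior that pushes down ubiquitous, uninformative words.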

The normalizeDocLength option has three sub-options. The first is "No normalization". The second is "Normalize all data", which brings all the measures taken in the various documents to a common scale. The third is "Normalize test data only". A word ends up with a real value, the TF-IDF result of the word in that document, with the settings as follows: IDFTransform and TFTransform set to "True" and normalizeDocLength set to "Normalize all data".

stemmer: Selects the stemming algorithm to use on the words. Weka supports four stemming algorithms by default: the LovinsStemmer algorithm, its iterated version, the null stemmer, and Snowball stemmers. IteratedLovinsStemmer is a version of LovinsStemmer, which is a set of transformation rules for changing word endings such as present participles, irregular plurals, and other English morphology. The NullStemmer algorithm performs no stemming at all. The SnowballStemmer algorithm comes with standard vocabularies of words and their equivalent roots. Weka can easily add new stemmer algorithms, such as the Snowball stemmer for Spanish, because it contains a wrapper class for Snowball stemmers. Weka contains all the Snowball algorithms, or they can be easily included at the location of the weka.core.stemmers.SnowballStemmer class. Snowball is a string processing language designed for creating stemmers. There are three ways to get these algorithms: the first is to install the unofficial package; the second is to add the pre-compiled snowball jar to the class location; the third is to compile the latest stemmers yourself from the Snowball zip. The algorithms are at snowball.tartarus.org, which has a stemmer in Spanish. Examples can be seen and this stemmer downloaded at the following link:

The Snowball Spanish stemming algorithm comes from Snowball.tartarus.org. It defines the usual R1 and R2 regions. Furthermore, RV is defined as the region after the next vowel if the second letter is a consonant; or as the region after the next consonant if the first two letters are vowels; or as the region after the third letter otherwise. If none of these options exist, RV is the end of the word. Step 0: Search for the longest pronoun among the suffixes "me se sela selo selas selos la le lo las les los nos" and remove it if it comes after one of the gerund or infinitive endings such as iéndo, ándo, ár, ér, ír, ando, iendo, ar, er, ir. Step 1: Look for the longest common suffix and delete it. Step 2: If no suffix was removed in step 1, seek to eliminate other (verb) suffixes. Step 3: Find the longest among the residual suffixes (os, a, o, á, í, ó, e, é) in RV and eliminate it. Step 4: Remove acute accents [36]. For more information about the suffixes in steps 1 and 2, go to the Snowball page. The previous algorithm is added to Weka by applying the following command in Windows: java -classpath "weka.jar;snowball.jar" weka.gui.GUIChooser. For Linux: java -classpath "weka.jar:snowball.jar" weka.gui.GUIChooser [37] Weka.wikispaces.com (2015).

The snowball jar must be previously compiled and stored in the location of the Weka application on the computer. This can be confirmed with the command: java weka.core.SystemInfo, as shown in the figure below.

stopwords: These are terms that are widespread, appear very frequently, and do not provide information about a text. This option determines whether a substring in the text is a stopword. The stopword terms come from a predefined list. This option converts all words to lowercase before term removal. Stopword removal is pertinent to eliminate meaningless words within the text and to keep frequent but useless words out of decision trees. Weka's default stopwords are based on the Rainbow lists, which are found at the following link:

Rainbow is a program that performs statistical text classification; it is based on the Bow library [38] Cs.cmu.edu, (2015). The format of these lists is one word per line, and comment lines must start with '#' to be omitted. WEKA is configured with a list of English stopwords, but different stopword lists can be set. This list can be changed from the user interface by clicking on the stopwords option; Weka by default uses the weka-3-6 list, but any location pointing to a desired list can be chosen. Rainbow has separate lists for English and Spanish; in order to handle both languages, the "ES-stopwords" list adds both lists from Rainbow.

useStoplist: Flag to use stopwords. If it is set to "True", the filter ignores the words that are in the predefined stopword list from the previous option.

tokenizer: Chooses the unit of measure used to separate the text of each string attribute of the ARFF. This has three sub-options. The first is AlphabeticTokenizer, where tokens are continuous sequences of alphabetic symbols that cannot be edited; when tokenizing, it only considers the English alphabet. The second is WordTokenizer, which establishes a list of delimiters. As referenced previously, punctuation in Spanish is "; : . ? ! ¿ ¡ - — ( ) [ ] ' « »"; Spanish, unlike English, uses an opening sign as well as a closing sign in questions and exclamations. The third is NGramTokenizer, which divides the original text string into subsets of consecutive words that form patterns with unique meaning. Its parameters are the "delimiters" to use, which by default are ' \r\n\t.,;:'"()?!'; NGramMaxSize, which is the maximum size of the N-gram, with a default value of 3; and NGramMinSize, the minimum size of the N-gram, with a

default value of 1. N-grams can help uncover patterns between words that represent a meaningful context.

minTermFreq: Sets the minimum frequency each word or term must have to be considered as an attribute; the default is 1. If there is a class attribute and the flag doNotOperateOnPerClassBasis has not been set to true, the text of the whole string for a particular class value is tokenized together, and the frequency of each token is calculated based on its frequency within that class. In contrast, if there is no class, the filter calculates a single dictionary, and the frequency is calculated over the entire string value of the chosen attribute, not only the parts related to a particular class value.

periodicPruning: Eliminates low-frequency words. It uses a numerical value, as a percentage of the size of the input data set, that sets how often to prune the dictionary. The default value is -1, meaning no periodic pruning. The periodic pruning rate is specified as a percentage of the data set: for example, a value of 15 specifies that after each 15% of the input data set, the dictionary is pruned. This avoids first creating a comprehensive dictionary, for which there may not be enough memory.

attributeNamePrefix: Sets the prefix for the names of the created attributes; by default it is "". This simply provides a prefix to be added to the names of the attributes that the StringToWordVector filter creates when the document is fragmented.
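The tokenization, lowercasing, stopword-removal and minimum-frequency steps described above can be combined into a toy pipeline. This is an illustrative sketch, not Weka's implementation; the delimiter set mirrors the default list quoted above, and the stopword list is an invented example.

```java
import java.util.*;

public class WordVectorSketch {
    // Toy version of the StringToWordVector preprocessing chain:
    // tokenize on delimiters, lowercase, drop stopwords, then drop terms
    // whose corpus frequency is below minTermFreq.
    static Map<String, Integer> wordVector(List<String> docs,
                                           Set<String> stopwords, int minTermFreq) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String doc : docs)
            for (String token : doc.split("[ \\r\\n\\t.,;:'\"()?!]+")) {
                String term = token.toLowerCase(Locale.ROOT);      // lowerCaseTokens
                if (!term.isEmpty() && !stopwords.contains(term))  // useStoplist
                    freqs.merge(term, 1, Integer::sum);
            }
        freqs.values().removeIf(c -> c < minTermFreq);             // minTermFreq
        return freqs;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "of"));
        List<String> docs = Arrays.asList("The Weka bird.", "Weka: a data mining tool.");
        // only "weka" reaches the minimum frequency of 2
        System.out.println(wordVector(docs, stop, 2));
    }
}
```

The surviving dictionary entries would become the attributes of the generated word vector, one dimension per term.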

lowerCaseTokens: When this flag is set to "True", all words in the document are converted into lowercase before being added to the dictionary. Setting it to true eliminates the distinction between tokens that begin with uppercase and their lowercase forms. Acronyms can be preserved when this option is set to "False".

attributeIndices: Sets the range of attributes to act on. The default is first-last, which ensures that all attributes are treated as if they formed a single chain from first to last. The range is written as a comma-separated list of ranges.

invertSelection: Flag to work with the attributes outside the selected range. It is set to "True" to work only with the unselected attributes. The default value is "False", which means working with the selected attributes.

After cleaning the data on the "Preprocess" tab, the vector attributes are analyzed on the "Classify" tab to obtain the desired knowledge.

Classification

The second panel of Explorer is "Classify", where a classification model is generated by machine learning from the training data. These models serve as an explicit explanation of the structure found in the analyzed information. Weka especially features the J48 decision tree model, the most popular for text classification. J48 is the Java implementation of the algorithm C4.5, previously described as the algorithm in which each branch represents one of the possible choices, in the if-then format that the tree offers, with a result in each leaf. One can summarize the

C4.5 algorithm as one that measures the amount of information contained in a data set and groups attributes by importance, giving an idea of the importance of a given attribute in the dataset. J48 recursively prints the tree structure into a variable of type string by accessing the information stored in the attribute nodes. To create a classification, you must first choose the classifier algorithm with the Choose button located in the upper left side of the window. This button displays a tree where the root is weka and a sub-folder is "classifiers". Within the sub-folder tree, located at weka.classifiers.trees, tree models such as J48 and REPTree are found. REPTree is a fast decision tree learner that uses reduced-error pruning. The classifier's options are accessed by double-clicking the name of the selected classifier.

"Test Options". The classification has four main modes for managing the training data. These are found in the section "Test Options": a) Use training set: trains the method with all available data and applies the results on the same dataset collection. b) Supplied test set: selects a test data set from a file or URL. This set must be compatible with the initial data and is selected by pressing the "Set" button. c) Cross-validation: performs a cross-validation depending on the number of "Folds" selected. Cross-validation specifies a number of partitions to determine how many temporary models will be created (folds). One part is selected, then a classifier is built from all the parts except the selected one, which remains for testing [32] Garcia, D., (2006). d) Percentage Split: defines the percentage of the total input from which the classifier model is built; the remaining part is used for testing.
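The partition sizes implied by the percentage-split and cross-validation modes can be sketched with simple arithmetic. This is an illustrative sketch of the partitioning idea, not Weka's shuffling logic.

```java
public class SplitSketch {
    // Percentage Split: size of the training portion for a given percentage.
    static int trainSize(int total, int percent) {
        return (int) Math.round(total * percent / 100.0);
    }

    // Cross-validation: number of instances held out for testing in fold k
    // when total instances are dealt into numFolds nearly equal parts.
    static int foldTestSize(int total, int numFolds, int k) {
        return total / numFolds + (k < total % numFolds ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(trainSize(71, 66));           // training instances at 66%
        for (int k = 0; k < 10; k++)                     // test-set size of each fold
            System.out.print(foldTestSize(71, 10, k) + " ");
    }
}
```

In each cross-validation round, one fold is the test set and the remaining folds together form the training set, so every instance is tested exactly once.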

Weka allows us to select several further options for defining the test method with the "More Options" button. These are:

Output model: displays the classifier model in the output window.
Output per-class stats: displays statistics for each class.
Output entropy evaluation measures: displays entropy measurement information in the results.
Output confusion matrix: displays the confusion matrix resulting from the classifier.
Store predictions for visualization: Weka keeps the classifier model's predictions on the test data; when this option is used with the J48 classifier, the tree errors are shown.
Output predictions: shows a table of the real and predicted values for each instance of the test data, stating the relation between the classifier and each instance.
Output additional attributes: set to display the values of other attributes, not those of the class; a range is specified to be included alongside the actual and predicted values of the class.
Cost-sensitive evaluation: produces additional information in the evaluation output, the total cost and average cost of misclassification.
Random seed for XVal / % Split: specifies the random seed used when the data are divided for evaluation purposes.
Preserve order for % Split: retains the order of the data in the percentage split instead of shuffling randomly first; the default seed value is 1.
Output source code: generates the Java source code of the model produced by the classifier.
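The per-class statistics reported by these options (precision, recall, F-measure) are derived from the confusion matrix. A minimal sketch of the formulas, under the usual definitions of true/false positives and false negatives for one class:

```java
public class ConfusionSketch {
    // tp = instances correctly assigned to the class,
    // fp = instances wrongly assigned to it,
    // fn = instances of the class assigned elsewhere.
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    // F-measure: harmonic mean of precision and recall.
    static double fMeasure(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        System.out.println(fMeasure(8, 2, 2)); // balanced example: 0.8
    }
}
```

These are exactly the per-class quantities a confusion matrix makes visible: each row of the matrix holds the true class, each column the predicted class.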

In the event that an independent evaluation data set is not available, it is necessary to choose the correct option to obtain a reasonably accurate idea of the generated model. For classifying documents, it is recommended to select at least 10 "Folds" for the cross-validation assessment approach. It is also recommended to allocate a large percentage to "Percentage Split". Below these "Test Options" there is a menu listing all attributes. This allows you to select the attribute that acts as the result of the classification; in the case of document classification, it will be the class to which the instance belongs. The classification starts by pressing the "Start" button. The image of the Weka bird found at the bottom right will dance until the classifier completes. WEKA creates a graphical representation of the J48 classification tree. This tree can be viewed by right-clicking on the last entry in the "Result list" and selecting the "Visualize tree" option. The window size can be adjusted by right-clicking and selecting Fit to screen.

J48 classifier for classifying documents

The J48 model uses the decision tree algorithm C4.5 to build a model from the selected training data. This algorithm is found in weka.classifiers.trees. The J48 classifier has different parameters that can be edited by double-clicking on the name of the selected classifier. J48 employs two pruning methods (it does not perform error-based pruning). The main objectives of pruning are to make the tree easier to understand and to reduce the risk of overfitting the training data in the attempt to classify it almost perfectly, where the tree learns the specific properties of the training data rather than the underlying concept.

The first J48 pruning method is known as subtree replacement: the nodes in a decision tree can be replaced with a leaf, reducing the number of nodes in a branch. This process starts from the fully formed leaves and works up towards the root. The second is subtree raising: a node is moved towards the root of the tree and replaces other nodes in its branch. Normally this process is not negligible in cost, and it is wise to turn it off when the induction process takes a long time. Clicking on the name of the J48 classifier, located right next to the "Choose" button, displays a window with the following editable options:

confidenceFactor: sets the amount of pruning; lower values produce more pruning. Reducing this value may reduce the size of the trees and also helps in removing irrelevant nodes that generate misclassification [40] Drazin, S., & Montag, M. (2015).

minNumObj: sets the minimum number of instances per leaf, useful in the case of trees with many branches.

unpruned: flag controlling pruning. When set to true, the tree is not pruned. The default is "False", which means that pruning is carried out.

reducedErrorPruning: flag to use reduced-error pruning instead of C4.5 pruning. This method prunes using an estimation of the errors on a hold-out set. When it is used, neither subtree raising nor the confidence level is applied for pruning.

seed: seed used to shuffle the data randomly for reduced-error pruning. It is considered only when the reducedErrorPruning flag is set to "True". The default seed is 1.

numFolds: sets the number of folds retained for reduced-error pruning, with one set used for pruning and the rest for training. To use these folds, the reducedErrorPruning flag must be set to "True".

binarySplits: when this flag is set to "True", only two branches are created for nominal attributes with multiple values, instead of one branch per value. When the nominal attribute is binary there is no difference, except in how the attribute is shown in the output. The default is "False".

saveInstanceData: flag set to "True" to store the training data for visualization. The default is "False".

subtreeRaising: flag to perform pruning with the subtree raising method, which moves a node towards the root of the tree, replacing other nodes. When "True", Weka considers subtree raising during pruning.

useLaplace: flag to smooth the counts at the leaves. Set to "True", Weka computes the class probability estimates at the leaves using the well-known Laplace correction.

debug: flag to add information to the console. When "True", the classifier adds additional information to the console.

It is possible to reach 100% correct classification on the training data by turning off pruning and setting the minimum number of instances per leaf to 1.
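The Laplace correction mentioned under useLaplace can be written out explicitly. This is a sketch of the standard Laplace estimate, under the usual definition, not code taken from J48.

```java
public class LaplaceSketch {
    // Laplace-corrected probability of a class at a leaf:
    // (instances of the class at the leaf + 1) / (instances at the leaf + number of classes)
    static double laplace(int classCount, int leafTotal, int numClasses) {
        return (classCount + 1.0) / (leafTotal + numClasses);
    }

    public static void main(String[] args) {
        System.out.println(laplace(5, 5, 2)); // pure leaf of 5: 6/7, never exactly 1
        System.out.println(laplace(0, 0, 2)); // empty leaf: 0.5 instead of undefined
    }
}
```

The correction keeps leaf probability estimates away from the extremes 0 and 1, which matters for small leaves such as those produced with minNumObj set to 1.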

Weka document classification

The Weka tool was selected in order to generate a model that classifies specialized documents from two different corpora (English and Spanish). The WEKA package is a collection of machine learning algorithms for data mining tasks; text mining uses these algorithms to learn from examples, the "training set", so that new texts can be classified into the analyzed categories. WEKA stands for Waikato Environment for Knowledge Analysis.

Installing WEKA

Weka can be downloaded from the project website; this tutorial uses the Weka version for Windows. On Windows, WEKA is situated in the program launcher, inside a weka folder, and the Weka default directory is the same directory from which the file is loaded. On Linux, open a terminal and type: java -jar /installation/directory/weka.jar

Based on the text mining methodology, Weka is represented in a framework with four stages: data acquisition, document preprocessing, information extraction and evaluation.

Data Acquisition

ARFF files are the primary format for any classification task in WEKA. These files describe the basic input data (concepts, instances and attributes) for data mining. An Attribute-Relation File Format file describes a list of instances of a concept with their respective attributes. The documents selected for the training data set were found in the Thompson Rivers University library. Seventy-one medical academic articles in English and Spanish were randomly selected; these documents are stored in Portable Document Format (PDF). Based on the TRU library, the documents fall into six recognized categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes. The documents are stored in directories named after their categories, within a main folder called Medicine, as shown in the figure below. In order to build the ARFF file, an application was created in Microsoft Visual Studio Professional C# 2012 that generates the ARFF from a directory containing a collection of

documents organized by their category name. This application was made possible by a library called iTextSharp, used for text extraction from Portable Document Format files. In Documents Directory to ARFF the user can specify the name of the relation to define, the location of the home directory that contains all the documents subdivided into categorical directories, and any required comments. The user also specifies the name of the generated file, with the arff extension, and its location. At the bottom of the application there are two buttons, one to exit and another to generate the ARFF file with the information described. The tool can be downloaded under current projects for Text Mining. The resulting ARFF contains a string attribute called "textodocumento", holding all the text found in the document, and the nominal attribute "docclass", defining the class to which the document belongs. As a note, in recent versions of Weka, as in this case, the class attribute can never be named "class".
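The core logic of such a generator can be sketched as follows. This is an illustrative Python version, not the author's C# tool: it reads plain-text files, whereas the original first extracts text from PDFs with iTextSharp, and the function name and default relation name are assumptions.

```python
import os

def directory_to_arff(root_dir, relation="medicina"):
    """Build an ARFF string from a folder whose subfolders are class names.

    Each subfolder of root_dir is taken as one class; each file inside it
    becomes one instance with the file's text and the class label.
    Assumes plain-text files (the original tool extracted PDF text first)."""
    classes = sorted(d for d in os.listdir(root_dir)
                     if os.path.isdir(os.path.join(root_dir, d)))
    lines = ["@relation " + relation, "",
             "@attribute textodocumento string",
             "@attribute docclass {" + ",".join(classes) + "}", "",
             "@data"]
    for cls in classes:
        folder = os.path.join(root_dir, cls)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as fh:
                # Escape quotes and flatten newlines so each instance is one row.
                text = fh.read().replace("'", r"\'").replace("\n", " ")
            lines.append("'%s',%s" % (text, cls))
    return "\n".join(lines)
```

Writing the returned string to a file with the .arff extension yields a dataset that Weka's Explorer can open directly.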

The file will be generated as follows:

% Weka tutorial for document classification
@relation medicina
@attribute textodocumento string
@attribute docclass {Hemodialysis, Nutrition, Cancer, Obesity, Diet, Diabetes}
@data
"texto ", Hemodialysis
"texto ", Nutrition
"texto.", Cancer
"texto ", Obesity
"texto ", Diet
"texto ", Diabetes

Document Preprocessing

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. "Applications" is the first screen in Weka, where the desired sub-tool is selected; here "Explorer" is selected. The Explorer consists of six panels: Preprocess, Classify, Cluster, Associate, Select attributes and Visualize.

Preprocess

This panel performs the preprocessing for the classification of documents. To load the generated ARFF, click on the "Open file..." button at the top and select the created file "medicinaweka.arff". Under "Current Relation" the loaded dataset is described: the relation name medicina, the number of instances (71) and the total number of attributes (2). Below, under the "Attributes" section, the attributes are listed; this frame allows attributes to be selected, in this case "textodocumento" and "docclass". When "docclass" is selected, the "Selected attribute" panel describes the nominal attribute with its 6 labels and their instance counts: 11 instances for Hemodialysis and 12

instances for each of the others: Nutrition, Cancer, Obesity, Diet and Diabetes. At the bottom of this section a histogram of the "docclass" labels is illustrated; hovering over the graph displays the attribute name, as the following figure illustrates.

Weka uses the StringToWordVector filter to convert the "textodocumento" attribute into a set of attributes that represent the occurrence of words in the full text. This filter is an unsupervised learning technique: such inductive techniques are designed to detect clusters and label entries from a set of observations without knowing the correct classification. The filters are found by clicking the "Choose" button under the "Filter" section. This button opens a window with root weka; from there select filters, then the unsupervised folder, then attribute, and finally StringToWordVector.
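The effect of the filter can be illustrated with a small sketch (illustrative Python, not Weka's implementation): each document string becomes a vector over the collection's vocabulary, here with 0/1 word presence, which is the filter's behaviour before word counts are enabled.

```python
def string_to_word_vector(docs):
    """Turn raw document strings into word-presence vectors.

    A simplified sketch of the idea behind StringToWordVector:
    build the vocabulary over all documents, then emit one 0/1
    vector per document (1 = word occurs in that document)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        words = set(d.lower().split())
        vectors.append([1 if w in words else 0 for w in vocab])
    return vocab, vectors
```

Each position in a vector corresponds to one generated attribute, which is why the filter replaces the single string attribute with hundreds of numeric ones.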

The StringToWordVector filter's attributes can be configured with language processing techniques. To edit this filter it is only necessary to click on the filter name; a window opens showing the following options. A set of optimal options was generated from different combinations applied to the same training data; for each resulting model the F-measure, which summarizes the proportion of erroneously predicted instances, was calculated. The options that generated the greatest number of correctly predicted instances are as follows:

a) wordsToKeep: left at 1000, since it defines the limit of words to keep per class. The doNotOperateOnPerClassBasis flag is left as "False", so that the limit is applied per class rather than over all classes together.

b) TFTransform set to "True", IDFTransform set to "True", outputWordCounts set to "True" and normalizeDocLength set to "No normalization". With these settings the values are not normalized, so the filter finds documents that are more interrelated, and it counts how often a word appears in a document rather than only considering whether the term is present. outputWordCounts is the flag that switches from recording whether a word exists in the document to recording its actual count, and with no normalization each word keeps its actual tf-idf value in the document, no matter how short or long the document is.

c) lowerCaseTokens: set to "True" to convert all words to lowercase before they are added to the dictionary, so that the same word is not analyzed separately in lowercase and uppercase.
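The two transforms can be sketched numerically. This is an illustrative Python sketch under the formulas documented for StringToWordVector: the TF transform maps a raw count f to log(1 + f), and the IDF transform multiplies by log(N / df), where N is the number of documents and df the number of documents containing the word; how Weka chains the two when both flags are set is assumed here to be their straightforward composition.

```python
import math

def tf_idf(counts):
    """Apply TF and IDF transforms to raw word counts.

    counts: list of documents, each a dict mapping word -> raw count.
    Returns the same structure with log(1 + f) * log(N / df) values."""
    n_docs = len(counts)
    df = {}                       # document frequency of each word
    for doc in counts:
        for word in doc:
            df[word] = df.get(word, 0) + 1
    out = []
    for doc in counts:
        out.append({w: math.log(1 + f) * math.log(n_docs / df[w])
                    for w, f in doc.items()})
    return out
```

Note that a word appearing in every document gets weight 0 (log(N/N) = 0), which is exactly why common terms stop discriminating between classes.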

d) stemmer: selects the algorithm that eliminates morphemes in a given language in order to reduce each word to its root. No stemmer is selected here, because the classification of texts is multilingual and stemming would only apply to one language. NullStemmer is configured by clicking on the "Choose" button; a menu is deployed and "NullStemmer" is selected. Weka ships with a standard algorithm for English from snowball.tartarus.org. Snowball is a string processing language designed for creating stemmers, and it features a stemming algorithm for Spanish. To use the Spanish algorithm, download the snowball jar and store it in the location where the Weka application resides. Finally, the algorithm is added by launching Weka from the command line as follows.

For Windows: java -classpath "weka.jar;snowball.jar" weka.gui.GUIChooser
For Linux: java -classpath "weka.jar:snowball.jar" weka.gui.GUIChooser

The setup can be confirmed by verifying the java.class.path parameter with the command java weka.core.SystemInfo, as shown in the following figure:
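What stemming does can be shown with a toy sketch. This is an illustrative suffix stripper only, not the Snowball Spanish algorithm, which applies ordered rule sets with region checks; the suffix list here is a made-up sample.

```python
# A made-up sample of Spanish suffixes, longest tried first.
SPANISH_SUFFIXES = ["aciones", "acion", "mente", "idad",
                    "ar", "er", "ir", "os", "as", "es", "a", "o"]

def naive_stem(word, suffixes=SPANISH_SUFFIXES):
    """Strip the longest matching suffix, keeping a stem of >= 3 chars.

    Toy illustration of reducing a word to its root; real stemmers
    (e.g. Snowball) use much more careful, language-specific rules."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```

The point for classification is that inflected forms collapse onto one attribute, so "clasificacion" and "clasificaciones" no longer count as two unrelated words.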

Having set up the SnowballStemmer, select it by clicking the "Choose" button; this displays a menu from which weka > core > stemmers is navigated and SnowballStemmer is chosen. Click on the stemmer name and a window that can set the language will appear: for Spanish, in the field labeled "stemmer", type "spanish" in place of "porter" and click "OK".

e) stopwords: determines whether a substring in a text is a word that does not provide information about the text. These words come from a predefined Rainbow list, where the default is the Weka-3-6 list. Rainbow is a program that performs statistical text classification based on the Bow library. Rainbow has separate lists for English and Spanish; in order to handle both languages, the "ES-stopwords" file, which contains both Rainbow lists, is used. The "ES-stopwords" list is available for download. To change the list, click on Weka-3-6, next to the stopwords label, and choose the previously downloaded "ES-stopwords" file. Set the useStoplist option to

"True" to ignore the words that are on the "ES-stopwords" list.

f) tokenizer: option that chooses the unit used to separate the "textodocumento" attribute. Clicking the "Choose" button displays a menu from which "WordTokenizer" is selected. Set the "delimiters" for English and Spanish by clicking on the tokenizer name; the following window will appear. The delimiters used for Spanish and English are .,;:'"()¿?¡!-[]<> which include the opening characters of the Spanish interrogation and exclamation marks, as shown in the figure below.

Another option is to choose NGramTokenizer, which divides the original text string into subsets of consecutive words that form a pattern with unique meaning, using the default "delimiters" ' \r\n\t.,;:'"()?!'. This is useful to help uncover patterns of consecutive words that represent a meaningful context.

g) minTermFreq: the default is 1, the minimum frequency a word must possess to be considered as an attribute; for this to apply per class, the doNotOperateOnPerClassBasis flag should be "False".

h) periodicPruning: left as -1 for no pruning, so low-frequency words are not periodically removed.
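Tokenizing on delimiters and dropping stopwords can be sketched together. This is an illustrative Python sketch, not Weka's WordTokenizer; the delimiter set mirrors the bilingual one above and the tiny stopword set stands in for the much larger "ES-stopwords" list.

```python
import re

# Bilingual delimiter set (as a regex character class) and a tiny
# stand-in stopword list; both are illustrative samples only.
DELIMS = r"[ \r\n\t.,;:'\"()¿?¡!\-\[\]<>]+"
STOPWORDS = {"the", "of", "and", "de", "la", "el", "y"}

def tokenize(text):
    """Split on delimiters, lowercase each token, and drop stopwords."""
    return [t.lower() for t in re.split(DELIMS, text)
            if t and t.lower() not in STOPWORDS]
```

Only the surviving tokens go on to become attributes, which is why the stopword list directly shrinks the generated word vector.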

i) attributeNamePrefix: left empty so that no prefix is added to the generated attributes.

j) attributeIndices: left as first-last to ensure that all attributes are treated as a single range from the first to the last.

k) invertSelection: kept as "False" to work with the selected attributes.

At the end, the options can be saved, cancelled or applied. The window should look as follows:

To save the algorithm with these options, click on the "Save..." button and then select the location and name. To apply the algorithm with these options, click the "OK" button. This returns to the "Preprocess" window, where the "textodocumento" attribute must be selected in the "Attributes" frame. Click the "Apply" button, located at the upper right of the "Filter" module. The Weka bird image in the lower right corner will start to dance until the process is complete.

Information extraction

After the data cleaning on the "Preprocess" tab, information extraction proceeds. Click on the "Classify" tab, the second panel of the Explorer. This stage analyzes the attribute vectors to create the classification model that will define the structure found in the analyzed information. Weka considers the J48 decision tree the most popular model for text classification. J48 is the Java implementation of the C4.5 algorithm, in which each node represents one of the possible decisions to be taken and each leaf represents the predicted class. First, choose the classification algorithm with the "Choose" button located in the upper left side of the window.

This button displays a tree whose root is weka, with a "classifiers" sub folder. Within the sub folder tree located at weka.classifiers.trees, select the J48 tree model, as shown in the following figure. Double-click on the name of the J48 classifier, located next to the "Choose" button, to access its options.

The classifier can reach 100% correct classification by disabling pruning and setting the minimum number of instances in a leaf to 1. In this case the parameter changed is:

a) minNumObj: set to 1, leaving the other parameters in the default configuration.

In the "Test Options" module the training data is set. Select "Use training set" to train the method with all available data and apply the results on the same input data collection.

Additionally, a partitioning percentage can be applied to the input data by selecting the "Percentage split" option and defining the percentage of the total input data used to build the classifier model, leaving the remaining part for testing. Under "Test Options" there is a menu that displays a list with all the attributes; in this case select "docclass", because this is the attribute that acts as the classification result in this example. The classification method is started by pressing the "Start" button. The Weka bird image found in the bottom right will begin to dance until the end of the classification process.
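The percentage split can be sketched as follows (illustrative Python mirroring the idea of the Explorer's option; the 66% default and the fixed seed are assumptions in the spirit of Weka's defaults, not its exact shuffling code):

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle a dataset and split it: train_pct% for training, rest for test.

    Sketch of the "Percentage split" test option: the first part builds
    the classifier, the held-out remainder evaluates it."""
    data = list(instances)
    random.Random(seed).shuffle(data)   # deterministic shuffle for a given seed
    cut = round(len(data) * train_pct / 100)
    return data[:cut], data[cut:]
```

Evaluating on the held-out part gives a more honest error estimate than "Use training set", which tests on the very data the tree was grown from.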

WEKA creates a graphical representation of the J48 classification tree. This tree can be viewed by right-clicking on the last entry in the "Result list" and selecting the "Visualize tree" option.

The window size can be adjusted to make the tree more readable by right-clicking and selecting "Fit to screen", as shown in the image below.

Results Evaluation

Weka describes the proportion of erroneously predicted instances with the F-measure (the Fβ score). The value combines precision and recall: precision measures the percentage of positive predictions that are truly positive, and recall is the ability to detect positive cases out of the total of all positive cases.
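The definitions above can be written out directly (illustrative Python; for β = 1 the F-measure reduces to the balanced F1 score, the case reported by Weka's output):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives (tp),
    false positives (fp) and false negatives (fn)."""
    precision = tp / (tp + fp)   # correct positives among predicted positives
    recall = tp / (tp + fn)      # correct positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

Because F1 is a harmonic mean, it is only close to 1 when precision and recall are both high, which is why it is used to rank the candidate models below.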

With these percentages it is expected that the best model is the one whose F-measure is closest to 1. The following table shows some combinations that are significant in the data preprocessing for model generation; the comparison describes the precision and recall measures as well as the F-measure. First, the best filter options are analyzed with unadjusted values for the J48 classifier, and the best parameters are selected; then the best settings for the J48 classifier algorithm are chosen on top of the best configuration of the StringToWordVector filter.

Comparison table: Document classification models. The feature combinations compared, each measured on Precision, Recall and F-Measure, are:

- Word Tokenizer English & Spanish (E&S)
- Word Tokenizer E&S + Lower Case Conversion
- Trigrams E&S + Lower Case Conversion
- Stemming + Word Tokenizer E&S + Lower Case Conversion
- Stopwords + Word Tokenizer E&S + Lower Case Conversion
- Stopwords + Stemming + Word Tokenizer E&S + Lower Case Conversion
- Stopwords + Word Tokenizer E&S + Lower Case Conversion + J48 minNumObj = 1

In conclusion, the best model is the combination of Stopwords + Word Tokenizer E&S + Lower Case Conversion applied to the filter in the data preprocessing, further adjusting minNumObj to 1 on the J48 classifier algorithm.

The confusion matrix below is the result of the combination of Stopwords + Word Tokenizer E&S + Lower Case Conversion with minNumObj adjusted to 1 on the J48 algorithm. It is a matrix over the classes a = Hemodialysis, b = Nutrition, c = Cancer, d = Obesity, e = Diet and f = Diabetes, where rows are the actual classes and columns the predicted ones. The matrix shows all classes with precision and recall at 100%. The accuracy values (TP Rate, FP Rate, Precision, Recall and F-Measure) are reported for each class, Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes, together with a weighted average.
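Per-class precision and recall can be read directly off such a matrix. The sketch below is illustrative Python (the 3-class matrix in the usage is made up, not this project's result): for class c, the true positives sit on the diagonal, column c sums everything predicted as c, and row c sums everything actually belonging to c.

```python
def per_class_metrics(matrix):
    """Given a square confusion matrix (rows = actual, cols = predicted),
    return a (precision, recall) pair for each class index."""
    n = len(matrix)
    metrics = []
    for c in range(n):
        tp = matrix[c][c]
        col = sum(matrix[r][c] for r in range(n))  # everything predicted as c
        row = sum(matrix[c])                       # everything actually c
        precision = tp / col if col else 0.0
        recall = tp / row if row else 0.0
        metrics.append((precision, recall))
    return metrics
```

A diagonal matrix, as obtained here on the training set, yields (1.0, 1.0) for every class, which is exactly the 100% precision and recall reported.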

Conclusion

Document classification in English and Spanish was analyzed using text mining through Weka, an open source software package. This software analyzes large amounts of data and decides which parts are the most important; it aims to make automatic predictions that support decision making. Comparing WEKA with other data mining tools such as RapidMiner, IBM Cognos Business Intelligence, Microsoft SharePoint and Pentaho, Weka provides a friendly interface that is easy to understand, loads data efficiently and has data mining as its main objective.

Text mining seeks to extract patterns from the analysis of large collections of documents in order to gain new knowledge; its purpose is the discovery of interesting groups, trends and associations, and the visualization of new findings. Text mining is considered a subset of data mining, and for this reason it adopts data mining techniques, which use machine learning algorithms. Computational linguistics also provides techniques for text mining; this science studies natural language with computational methods in order to make it understandable to computer systems.

Automatic categorization determines the subject matter of a document collection. Unlike clustering, it chooses the class to which a document belongs from a list of predefined classes. Each category is trained through a previous manual categorization process: the classification starts with a set of training texts previously categorized, then generates a classification model based on that set of examples, and this model is able to allocate the correct class to a new text. A decision tree is a classification technique that represents this knowledge through an if-else statement structure laid out in the branches of a tree.


More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Tutorial to QuotationFinder_0.4.3

Tutorial to QuotationFinder_0.4.3 Tutorial to QuotationFinder_0.4.3 What is Quotation Finder and for which purposes can it be used? Quotation Finder is a tool for the automatic comparison of fully digitized texts. It can either detect

More information

Chapter 8 The C 4.5*stat algorithm

Chapter 8 The C 4.5*stat algorithm 109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the

More information

Tutorial on Machine Learning. Impact of dataset composition on models performance. G. Marcou, N. Weill, D. Horvath, D. Rognan, A.

Tutorial on Machine Learning. Impact of dataset composition on models performance. G. Marcou, N. Weill, D. Horvath, D. Rognan, A. Part 1. Tutorial on Machine Learning. Impact of dataset composition on models performance G. Marcou, N. Weill, D. Horvath, D. Rognan, A. Varnek 1 Introduction Predictive performance of QSAR model depends

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

Pace University. Fundamental Concepts of CS121 1

Pace University. Fundamental Concepts of CS121 1 Pace University Fundamental Concepts of CS121 1 Dr. Lixin Tao http://csis.pace.edu/~lixin Computer Science Department Pace University October 12, 2005 This document complements my tutorial Introduction

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 02 The Term Vocabulary and Postings Lists Constructing Inverted Indexes The major steps in constructing

More information

Impact of Term Weighting Schemes on Document Clustering A Review

Impact of Term Weighting Schemes on Document Clustering A Review Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Data Mining and Knowledge Discovery Practice notes 2

Data Mining and Knowledge Discovery Practice notes 2 Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment A System for Managing Experiments in Data Mining A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Greeshma

More information

Author Prediction for Turkish Texts

Author Prediction for Turkish Texts Ziynet Nesibe Computer Engineering Department, Fatih University, Istanbul e-mail: admin@ziynetnesibe.com Abstract Author Prediction for Turkish Texts The main idea of authorship categorization is to specify

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 8.11.2017 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Non-trivial extraction of implicit, previously unknown and potentially useful information from data CS 795/895 Applied Visual Analytics Spring 2013 Data Mining Dr. Michele C. Weigle http://www.cs.odu.edu/~mweigle/cs795-s13/ What is Data Mining? Many Definitions Non-trivial extraction of implicit, previously

More information

Classifica(on and Clustering with WEKA. Classifica*on and Clustering with WEKA

Classifica(on and Clustering with WEKA. Classifica*on and Clustering with WEKA Classifica(on and Clustering with WEKA 1 Schedule: Classifica(on and Clustering with WEKA 1. Presentation of WEKA. 2. Your turn: perform classification and clustering. 2 WEKA Weka is a collec*on of machine

More information

CS294-1 Final Project. Algorithms Comparison

CS294-1 Final Project. Algorithms Comparison CS294-1 Final Project Algorithms Comparison Deep Learning Neural Network AdaBoost Random Forest Prepared By: Shuang Bi (24094630) Wenchang Zhang (24094623) 2013-05-15 1 INTRODUCTION In this project, we

More information

SKOS Shuttle. (Welcome) Tutorial TEM Text Extraction Management. May 2018

SKOS Shuttle. (Welcome) Tutorial TEM Text Extraction Management. May 2018 SKOS Shuttle (Welcome) Tutorial TEM Text Extraction Management May 2018 This tutorial illustrates How to extract in SKOS Shuttle new concepts out of free text and to add them to a thesaurus Table of Contents

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

1 Machine Learning System Design

1 Machine Learning System Design Machine Learning System Design Prioritizing what to work on: Spam classification example Say you want to build a spam classifier Spam messages often have misspelled words We ll have a labeled training

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE In work educational data mining has been used on qualitative data of students and analysis their performance using C4.5 decision tree algorithm.

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 10 - Classification trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey

More information

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008 An Introduction to WEKA Explorer In part from: Yizhou Sun 2008 What is WEKA? Waikato Environment for Knowledge Analysis It s a data mining/machine learning tool developed by Department of Computer Science,,

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/11/16 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Data Engineering. Data preprocessing and transformation

Data Engineering. Data preprocessing and transformation Data Engineering Data preprocessing and transformation Just apply a learner? NO! Algorithms are biased No free lunch theorem: considering all possible data distributions, no algorithm is better than another

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

Running Java Programs

Running Java Programs Running Java Programs Written by: Keith Fenske, http://www.psc-consulting.ca/fenske/ First version: Thursday, 10 January 2008 Document revised: Saturday, 13 February 2010 Copyright 2008, 2010 by Keith

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad

More information

The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand).

The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand). http://waikato.researchgateway.ac.nz/ Research Commons at the University of Waikato Copyright Statement: The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand). The thesis

More information

CHAPTER 4 METHODOLOGY AND TOOLS

CHAPTER 4 METHODOLOGY AND TOOLS CHAPTER 4 METHODOLOGY AND TOOLS 4.1 RESEARCH METHODOLOGY In an effort to test empirically the suggested data mining technique, the data processing quality, it is important to find a real-world for effective

More information

Using Weka for Classification. Preparing a data file

Using Weka for Classification. Preparing a data file Using Weka for Classification Preparing a data file Prepare a data file in CSV format. It should have the names of the features, which Weka calls attributes, on the first line, with the names separated

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC Attribute Discretization and Selection Clustering NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com Naive Bayes Features Intended primarily for the work with nominal attributes

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Input: Concepts, instances, attributes Terminology What s a concept?

More information

IBM. Migration Cookbook. Migrating from License Metric Tool and Tivoli Asset Discovery for Distributed 7.5 to License Metric Tool 9.

IBM. Migration Cookbook. Migrating from License Metric Tool and Tivoli Asset Discovery for Distributed 7.5 to License Metric Tool 9. IBM License Metric Tool 9.x Migration Cookbook Migrating from License Metric Tool and Tivoli Asset Discovery for Distributed 7.5 to License Metric Tool 9.x IBM IBM License Metric Tool 9.x Migration Cookbook

More information