Weka
Valeria Guevara
Thompson Rivers University


Author Note: This is a final project for COMP 4910 of the Bachelor of Computing Science at Thompson Rivers University, supervised by Mila Kwiatkowska.

Abstract

This project focuses on document classification using text mining, through a classification model generated with the open-source software WEKA. This software is a repository of machine learning algorithms for knowledge discovery. Weka makes it easy to preprocess the training documents and to compare different algorithm configurations. The accuracy of the generated predictive model is measured with a confusion matrix. This project illustrates text-mining preprocessing and classification using WEKA. The result is the development of a tool that generates the ARFF input data files, and of a video tutorial on document classification in Weka in English and Spanish.

Keywords: Weka, document classification, ARFF, stopwords, tokenizer, pruning, C4.5 decision tree, word vector, text mining, F-measure, machine learning, text classification, stemming, knowledge society.

Weka

The weka is a native New Zealand bird that does not fly but has a penchant for shiny objects [30] Newzealand.com (2015). Old legends from New Zealand tell that these birds steal shiny items. The University of Waikato in New Zealand started the development of a tool with that name because it would contain algorithms for data analysis. Currently the WEKA package is a collection of machine learning algorithms for data mining tasks. The Waikato Environment for Knowledge Analysis package contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization [31] Hall, M., Frank, E., Geoffrey H., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The software analyzes large amounts of data and decides which parts are the most important. It aims to make automatic predictions that support decision making.

Weka vs. Other Machine Learning Tools

There are other tools for data mining, such as RapidMiner, IBM Cognos Business Intelligence, Microsoft SharePoint and Pentaho. IBM Cognos Business Intelligence provides a display that is not very user-friendly. Microsoft SharePoint creates predictive models for business mining, but mining is not its main objective. RapidMiner offers an excellent display of results, but datasets load more slowly than in Weka. Pentaho's graphical interface does not describe its options as clearly as Weka does. Weka implements machine learning techniques in easy-to-learn Java under a GNU General Public License. WEKA provides three ways to be used: through its graphical interface, through command line interfaces, and from application code through its Java API

interface. Although WEKA has not been used primarily for prediction problems in business, it helps in the construction of new algorithms. It therefore turns out to be a very suitable tool for initial data analysis, classification, clustering, and research. In this project the Weka tool is used to create a predictive model using text classification algorithms from machine learning.

Installation

Weka can be downloaded at: (this document refers to the latest version of Weka). The same URL provides instructions for installation on different platforms. In Windows, Weka is located in the program launcher in a folder named after the downloaded version; in this case the latest version is weka-3-6. Weka's default directory is the same directory from which the file is loaded. In Linux, open a terminal and type: java -jar /installation/directory/weka.jar. It is common to find an insufficient-memory error, which is resolved by specifying the heap size in the setup files, for example "-Xmx2048m" for 2 GB. Further information can be found at weka.wikispaces.com/outofmemoryexception. The memory can be set with the -Xms and -Xmx parameters, indicating the minimum and maximum RAM respectively. In Windows, you can edit the RunWeka.bat or RunWeka.ini file in the installation directory and change the maxheap line, for example from maxheap=128m to maxheap=1024m. You cannot assign more than about 1.4 GB to the JVM. You can also assign memory to the virtual machine with the command: java -Xms<minimum-memory>m

-Xmx<maximum-memory>m -jar weka.jar [32] Garcia, D., (2006). In Linux the -XmxMemorySizem option is used, replacing MemorySize with the required size in megabytes, for instance: java -Xmx512m -jar /installation/directory/weka.jar.

Running Weka

The first screen shows a chooser of interfaces called "Applications", where in this version the Explorer, Experimenter, KnowledgeFlow and Simple CLI tools are offered. Explorer is responsible for conducting exploration operations on a data set. Experimenter performs experiments and statistical tests to compare different algorithms on different data in an automated manner. KnowledgeFlow shows Weka's operations graphically on a work panel. Simple CLI is a simple client that provides the command line interface for entering commands. The main user interface, "Explorer", consists of six panels. Preprocess is the first window that opens in this interface; in this window the data are loaded. Weka accepts data sets loaded from a URL, a database, CSV files or ARFF files. The ARFF file is the primary format for any classification task in WEKA.

Input data. As previously described, three data inputs are considered in data mining: concepts, instances and attributes. An Attribute-Relation File Format (ARFF) file is a file that describes a list of instances of a concept, with their respective attributes. These files are used by Weka for text classification and clustering applications.

ARFF files. These files have two parts: the header information and the data information. The first section contains the name of the relation together with the attributes (name and type). The relation name is defined in the first line of the ARFF file, where the relation name is a string, in the following format: @relation <relation-name>. The next section contains the attribute declarations. This is an ordered sequence of statements, one for each attribute of the instances. These statements uniquely define an attribute's name and its data type. The order in which the attributes are declared indicates their position in the instances: for example, the attribute declared in first position is expected to hold the first value stated in every instance. The format of the declaration is: @attribute <attribute-name> <data-type>. Weka supports several data types: i) NUMERIC: all real numbers, where the separator between the decimal and integer parts is a point, not a comma. ii) INTEGER: treated as numeric. iii) NOMINAL: provides a list of possible values, for example {good, bad}. These express the possible values the attribute can take, declared as: @attribute <attribute-name> {<nominal1>, <nominal2>, <nominal3>, ...}. iv) STRING: a sequence of text values; these attributes are declared as: @attribute <attribute-name> string. v) DATE: dates and times, declared as: @attribute <name> date [<date-format>].

Here <name> is the name of the attribute and <date-format> is an optional string, consisting of characters, hyphens, spaces and time units, that specifies how the date values should be parsed. The default format accepts the ISO-8601 combination yyyy-MM-dd'T'HH:mm:ss, for example: @attribute timestamp date "yyyy-MM-dd HH:mm:ss". vi) RELATIONAL: attributes that hold data for multiple instances, declared in the following form: @attribute <name> relational <attribute definitions> @end <name>. There are rules on the attribute declarations: a) The names of relations and attributes, as strings, must be enclosed in double quotes if they include spaces. b) Neither attribute nor relation names can start with a character below \u0021 in ASCII, or with '{', '}', ',', or '%'. c) Values that contain spaces must be quoted. d) The keywords numeric, real, integer, string and date are case insensitive. e) Relational data must be enclosed in double quotes. The second section is the data declaration, declared on one line as @data. Each line below it represents an instance, with the attribute values separated by commas. The attribute values must appear in the same order in which the attributes were declared in the attribute section. Missing values are represented with a question mark "?". String values and nominal attributes are

distinguished between upper and lower case. Any value that contains a space must be quoted. Comments are delimited by the character "%" and run to the end of the line. In text classification, ARFF files represent the entire document as a single text attribute of type string. The second attribute to consider is the class attribute, which defines the class the instance belongs to. This attribute can be of type string or nominal. An example of the resulting file, with the document text of type string and a nominal class of two values, is:

@attribute documentText string
@attribute class {english, spanish}
@data
'texto a clasificar aquí...', spanish
'Classify text here...', english

Data preprocessing. In this window, data are loaded and may be edited. Data can be modified manually with the editor or with filters. Filters are learning-technique methods that modify the data set. Weka has a variety of filters structured hierarchically into supervised and unsupervised, where the root is weka. These filters are divided into two categories, attribute and instance, according to the way they operate on the data. As pointed out earlier, these techniques are classified depending on the input data relationships. Unsupervised learning techniques, as descriptive inductive models, do not know the correct classification; this means the instances do not require an attribute that declares the class. Inductive techniques of predictive supervised learning depend on the class values: the instances contain a class attribute stating the class to which they belong.
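The two-attribute ARFF layout for document classification described earlier can also be produced programmatically. The following is a minimal sketch, not the project's actual C# tool; the relation, attribute and category names are invented for illustration.

```java
// Minimal sketch: build the two-attribute ARFF layout used for document
// classification (one string attribute for the text, one nominal class).
public class ArffSketch {

    static String buildArff(String relation, String[] classes, String[][] rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute documentText string\n");
        sb.append("@attribute class {").append(String.join(",", classes)).append("}\n\n");
        sb.append("@data\n");
        for (String[] row : rows) {
            // String values containing spaces must be quoted; escape embedded quotes.
            sb.append("'").append(row[0].replace("'", "\\'")).append("',")
              .append(row[1]).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String arff = buildArff("documents",
                new String[]{"english", "spanish"},
                new String[][]{
                        {"Classify text here", "english"},
                        {"texto a clasificar aqui", "spanish"}});
        System.out.println(arff);
    }
}
```

Each row becomes one quoted instance line under @data, in the attribute order declared in the header.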

In the Current relation module, the loaded dataset is described by its name and number of instances. Attributes allows selecting attributes using the options All, None and Invert, and it further provides the option to enter a regular expression. The Selected attribute part displays information about the selected attribute. At the bottom, a histogram of the attributes selected in Attributes is illustrated.

Preprocessing for classifying documents

In Weka it is possible to create models that classify documents into previously analyzed categories. Documents in Weka usually need to be converted into "text vectors" before applying machine learning techniques. The easiest way to represent text for this purpose is as a bag of words, or word vector [34] Namee, B. (2012). The StringToWordVector filter performs the process of converting a string attribute into a set of attributes that represent the occurrence of words in the full text. The document is represented as a text string in a single attribute of type string.

StringToWordVector Filter

This is the fundamental text analysis filter in WEKA. The class offers abundant natural language processing choices, including the use of stemming for the corpus at hand, custom tokenizers, and various stopword lists. At the same time, it can calculate term-frequency and TF.IDF weights. StringToWordVector places the class attribute at the top of the list of attributes; to change the order, the Reorder filter can be used. This filter can be configured with all the linguistic natural-language-processing techniques applicable to attributes. The StringToWordVector filter can be applied in batch mode from the command line as follows:

java -cp /aplicaciones/weka-3-6-2/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i datos_entrenamiento.arff -o vector_datos_entrenamiento.arff -r datos_prueva.arff -s vector_datos_prueva.arff

Here datos_entrenamiento is the training set, vector_datos_entrenamiento is the training set vector, datos_prueva is the test set and vector_datos_prueva is the test set vector. The -cp option puts the Weka jar in the class path, -b indicates batch mode, -i specifies the training data file, -o the output file after processing the first file, -r the test file and -s the output file for the processed test file. The options can be modified in the user interface by clicking on the filter name beside the Choose button, having previously selected the filter with the Choose button. The weka.filters.unsupervised.attribute.StringToWordVector window then shows the following options, to be modified according to the needs of the documents to be classified:

IDFTransform
TFTransform
attributeIndices
attributeNamePrefix
doNotOperateOnPerClassBasis
invertSelection
lowerCaseTokens
minTermFreq
normalizeDocLength
outputWordCounts
periodicPruning
stemmer
stopwords
tokenizer
useStoplist
wordsToKeep

Weka.sourcearchive.com [39] provides a mental map of these Weka options, shown in the following illustration:

wordsToKeep: Defines the limit N of words to keep per class, if there is a class attribute. Only the N most common terms among all the values of the string attribute will remain. Higher values mean lower efficiency, because learning the model will take more time.

doNotOperateOnPerClassBasis: Flag to keep all relevant words for all classes. It is set to true when the maximum number of words and the minimum term frequency should not apply per class value of an attribute, but instead be based on all classes together.

TFTransform: Term frequency (TF) transformation. When this flag is set to true, the filter executes the term-frequency transformation used to represent textual data in a vector space. TF is a numerical measure of the relevance of the words of a text; it considers not only the relevance of a single term in itself, but also its relevance in the entire collection of documents. Mathematically it is represented as the function TF(t, d), which for the term t in the document d is: log(1 + frequency of word t in the instance, or document, d). The inverse document frequency (IDF) is based on the number of documents in which the term t appears; it relates words across documents in terms of log(1 + f_ij), where f_ij is the frequency of word t in document (instance) j.

IDFTransform: Inverse document frequency (IDF) transformation. Setting this flag to "true" defines the use of the following equation:

writing the frequency of word t in instance d as f_td, the result is: f_td * log(number of documents / number of documents with word t). This is explained by considering the set D that includes all documents in the collection, represented as D = {d1, d2, ..., dn}. It finds the documents most relevant relative to the others as f_ij * log(nº of documents / nº of documents with word i), where f_ij is the frequency of word i in document j. Multiplying TF by IDF assigns more weight to the terms with greater frequency in a document that are at the same time relatively rare in the collection of documents [33] Salton, G., Wong, A., & Yang, C. (1975).

outputWordCounts: Counts the occurrences of words in the string; the default setting reports only presence or absence as 0/1. The result is a vector where each dimension is a different word, and the value in that dimension is a binary 0 or 1, that is, whether or not the word is in that document. The frequency of the word in the document is represented as an integer with the options IDFTransform and TFTransform set to "False" and outputWordCounts set to "True"; this enables an explicit word count. It is set to "false" when only the presence of a term matters, not its frequency. To calculate TF*IDF, IDFTransform must be set to True, TFTransform to False and outputWordCounts to True. To achieve log(1 + TF) * IDF, TFTransform must also be set to True.

normalizeDocLength: Set to true to determine whether the word frequencies in an instance should be normalized. Normalization is calculated as: actual value * average document length / document length.
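The three transformations just described (TF, IDF, and document-length normalization) can be sketched in a few lines. This is an illustrative sketch of the formulas above under the stated definitions, not Weka's implementation.

```java
public class TfIdfSketch {
    // TFTransform as described above: log(1 + frequency of term t in document d)
    static double tf(int freq) {
        return Math.log(1 + freq);
    }

    // IDFTransform as described above:
    // f_td * log(number of documents / number of documents containing t)
    static double idf(int freq, int numDocs, int docsWithTerm) {
        return freq * Math.log((double) numDocs / docsWithTerm);
    }

    // normalizeDocLength: actual value * average document length / document length
    static double normalize(double value, double avgDocLength, double docLength) {
        return value * avgDocLength / docLength;
    }

    public static void main(String[] args) {
        // a term appearing twice in a document, present in 2 of 10 documents
        System.out.println(tf(2));          // log(3)
        System.out.println(idf(2, 10, 2));  // 2 * log(5)
    }
}
```

A term that appears in every document gets an IDF weight of zero, which is exactly the behavior that pushes down ubiquitous, uninformative words.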

The normalizeDocLength option has three sub-options. The first is "No normalization". The second is "Normalize all data", which brings all the measures taken in the various documents to a common scale. The third is "Normalize test data only". A word ends up with a real value, the TF-IDF result of the word in that document, with the settings as follows: IDFTransform and TFTransform set to "True" and normalizeDocLength set to "Normalize all data".

stemmer: Selects the stemming algorithm to use on the words. Weka supports four stemming algorithms by default: the LovinsStemmer algorithm, its iterated version, the null stemmer, and Snowball stemmers. IteratedLovinsStemmer is a version of LovinsStemmer, which is a set of transformation rules for changing word endings such as present participles, irregular plurals, and other English morphology. The NullStemmer algorithm performs no stemming at all. The SnowballStemmer algorithm comes with standard vocabularies of words and their equivalent roots. Weka can easily add new stemmer algorithms, such as the Snowball stemmer for Spanish, because it contains a wrapper class for Snowball stemmers. Weka contains all the Snowball algorithms, or they can be easily included at the location of the weka.core.stemmers.SnowballStemmer class. Snowball is a string processing language designed for creating stemmers. There are three ways to get these algorithms: the first is to install the unofficial package; the second is to add the pre-compiled snowball jar to the class location; the third is to compile the latest stemmers yourself from the Snowball zip. The algorithms are at snowball.tartarus.org, which has a stemmer in Spanish. Examples can be seen and this stemmer downloaded at the following link:

The Snowball Spanish stemming algorithm comes from Snowball.tartarus.org. It defines the usual R1 and R2 regions. Furthermore, RV is defined as the region after the next vowel if the second letter is a consonant; or as the region after the next consonant if the first two letters are vowels; or as the region after the third letter otherwise. If none of these options exist, RV is the end of the word. Step 0: Search for the longest pronoun among the suffixes "me se sela selo selas selos la le lo las les los nos" and remove it if it comes after one of the gerund or infinitive endings such as iéndo, ándo, ár, ér, ír, ando, iendo, ar, er, ir. Step 1: Look for the longest common suffix and delete it. Step 2: If no suffix was removed in step 1, seek to eliminate other (verb) suffixes. Step 3: Find the longest among the residual suffixes (os, a, o, á, í, ó, e, é) in RV and eliminate it. Step 4: Remove acute accents [36]. For more information about the suffixes in steps 1 and 2, go to the Snowball page. The previous algorithm is added to Weka by applying the following command in Windows: java -classpath "weka.jar;snowball.jar" weka.gui.GUIChooser. For Linux: java -classpath "weka.jar:snowball.jar" weka.gui.GUIChooser [37] Weka.wikispaces.com (2015).

The snowball jar must be previously compiled and stored in the location of the Weka application on the computer. This can be confirmed with the command: java weka.core.SystemInfo, as shown in the figure below.

stopwords: These are terms that are widespread, appear very frequently, and do not provide information about a text. This option determines whether a substring in the text is a stopword. The stopword terms come from a predefined list. This option converts all words to lowercase before term removal. Stopword removal is pertinent to eliminate meaningless words within the text and to keep frequent but useless words out of decision trees. Weka's default stopwords are based on the Rainbow lists, which are found at the following link:

Rainbow is a program that performs statistical text classification; it is based on the Bow library [38] Cs.cmu.edu, (2015). The format of these lists is one word per line, and comment lines must start with '#' to be omitted. WEKA is configured with a list of English stopwords, but different stopword lists can be set. This list can be changed from the user interface by clicking on the stopwords option; Weka by default uses the weka-3-6 list, but any location pointing to a desired list can be chosen. Rainbow has separate lists for English and Spanish; in order to handle both languages, the "ES-stopwords" list adds both lists from Rainbow.

useStoplist: Flag to use stopwords. If it is set to "True", the filter ignores the words that are in the predefined stopword list from the previous option.

tokenizer: Chooses the unit of measure used to separate the text of each string attribute of the ARFF. This has three sub-options. The first is AlphabeticTokenizer, where tokens are continuous sequences of alphabetic symbols that cannot be edited; when tokenizing, it only considers the English alphabet. The second is WordTokenizer, which establishes a list of delimiters. As referenced previously, punctuation in Spanish is "; : . ? ! ¿ ¡ - — ( ) [ ] ' « »"; Spanish, unlike English, uses an opening sign as well as a closing sign in questions and exclamations. The third is NGramTokenizer, which divides the original text string into subsets of consecutive words that form patterns with unique meaning. Its parameters are the "delimiters" to use, which by default are ' \r\n\t.,;:'"()?!'; NGramMaxSize, which is the maximum size of the N-gram, with a default value of 3; and NGramMinSize, the minimum size of the N-gram, with a

default value of 1. N-grams can help uncover patterns between words that represent a meaningful context.

minTermFreq: Sets the minimum frequency each word or term must have to be considered as an attribute; the default is 1. If there is a class attribute and the flag doNotOperateOnPerClassBasis has not been set to true, the text of the whole string for a particular class value is tokenized together, and the frequency of each token is calculated based on its frequency within that class. In contrast, if there is no class, the filter calculates a single dictionary, and the frequency is calculated over the entire string value of the chosen attribute, not only the parts related to a particular class value.

periodicPruning: Eliminates low-frequency words. It uses a numerical value, as a percentage of the size of the input data set, that sets how often to prune the dictionary. The default value is -1, meaning no periodic pruning. The periodic pruning rate is specified as a percentage of the data set: for example, a value of 15 specifies that after each 15% of the input data set, the dictionary is pruned. This avoids first creating a comprehensive dictionary, for which there may not be enough memory.

attributeNamePrefix: Sets the prefix for the names of the created attributes; by default it is "". This simply provides a prefix to be added to the names of the attributes that the StringToWordVector filter creates when the document is fragmented.
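The tokenization, lowercasing, stopword-removal and minimum-frequency steps described above can be combined into a toy pipeline. This is an illustrative sketch, not Weka's implementation; the delimiter set mirrors the default list quoted above, and the stopword list is an invented example.

```java
import java.util.*;

public class WordVectorSketch {
    // Toy version of the StringToWordVector preprocessing chain:
    // tokenize on delimiters, lowercase, drop stopwords, then drop terms
    // whose corpus frequency is below minTermFreq.
    static Map<String, Integer> wordVector(List<String> docs,
                                           Set<String> stopwords, int minTermFreq) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String doc : docs)
            for (String token : doc.split("[ \\r\\n\\t.,;:'\"()?!]+")) {
                String term = token.toLowerCase(Locale.ROOT);      // lowerCaseTokens
                if (!term.isEmpty() && !stopwords.contains(term))  // useStoplist
                    freqs.merge(term, 1, Integer::sum);
            }
        freqs.values().removeIf(c -> c < minTermFreq);             // minTermFreq
        return freqs;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "of"));
        List<String> docs = Arrays.asList("The Weka bird.", "Weka: a data mining tool.");
        // only "weka" reaches the minimum frequency of 2
        System.out.println(wordVector(docs, stop, 2));
    }
}
```

The surviving dictionary entries would become the attributes of the generated word vector, one dimension per term.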

lowerCaseTokens: When this flag is set to "True", all words in the document are converted into lowercase before being added to the dictionary. Setting it to true eliminates the distinction between tokens that begin with uppercase and their lowercase forms. Acronyms can be preserved when this option is set to "False".

attributeIndices: Sets the range of attributes to act on. The default is first-last, which ensures that all attributes are treated as if they formed a single chain from first to last. The range is written as a comma-separated list of ranges.

invertSelection: Flag to work with the attributes outside the selected range. It is set to "True" to work only with the unselected attributes. The default value is "False", which means working with the selected attributes.

After cleaning the data on the "Preprocess" tab, the vector attributes are analyzed on the "Classify" tab to obtain the desired knowledge.

Classification

The second panel of Explorer is "Classify", where a classification model is generated by machine learning from the training data. These models serve as an explicit explanation of the structure found in the analyzed information. Weka especially features the J48 decision tree model, the most popular for text classification. J48 is the Java implementation of the algorithm C4.5, previously described as the algorithm in which each branch represents one of the possible choices, in the if-then format that the tree offers, with a result in each leaf. One can summarize the

C4.5 algorithm as one that measures the amount of information contained in a data set and groups attributes by importance, giving an idea of the importance of a given attribute in the dataset. J48 recursively prints the tree structure into a variable of type string by accessing the information stored in the attribute nodes. To create a classification, you must first choose the classifier algorithm with the Choose button located in the upper left side of the window. This button displays a tree where the root is weka and a sub-folder is "classifiers". Within the sub-folder tree, located at weka.classifiers.trees, tree models such as J48 and REPTree are found. REPTree is a fast decision tree learner that uses reduced-error pruning. The classifier's options are accessed by double-clicking the name of the selected classifier.

"Test Options". The classification has four main modes for managing the training data. These are found in the section "Test Options": a) Use training set: trains the method with all available data and applies the results on the same dataset collection. b) Supplied test set: selects a test data set from a file or URL. This set must be compatible with the initial data and is selected by pressing the "Set" button. c) Cross-validation: performs a cross-validation depending on the number of "Folds" selected. Cross-validation specifies a number of partitions to determine how many temporary models will be created (folds). One part is selected, then a classifier is built from all the parts except the selected one, which remains for testing [32] Garcia, D., (2006). d) Percentage Split: defines the percentage of the total input from which the classifier model is built; the remaining part is used for testing.
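The partition sizes implied by the percentage-split and cross-validation modes can be sketched with simple arithmetic. This is an illustrative sketch of the partitioning idea, not Weka's shuffling logic.

```java
public class SplitSketch {
    // Percentage Split: size of the training portion for a given percentage.
    static int trainSize(int total, int percent) {
        return (int) Math.round(total * percent / 100.0);
    }

    // Cross-validation: number of instances held out for testing in fold k
    // when total instances are dealt into numFolds nearly equal parts.
    static int foldTestSize(int total, int numFolds, int k) {
        return total / numFolds + (k < total % numFolds ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(trainSize(71, 66));           // training instances at 66%
        for (int k = 0; k < 10; k++)                     // test-set size of each fold
            System.out.print(foldTestSize(71, 10, k) + " ");
    }
}
```

In each cross-validation round, one fold is the test set and the remaining folds together form the training set, so every instance is tested exactly once.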

Weka allows us to select several further options for defining the test method with the "More Options" button. These are:

Output model: displays the classifier model in the output window.
Output per-class stats: displays statistics for each class.
Output entropy evaluation measures: displays entropy measurement information in the results.
Output confusion matrix: displays the confusion matrix resulting from the classifier.
Store predictions for visualization: Weka keeps the classifier model's predictions on the test data; when this option is used with the J48 classifier, the tree errors are shown.
Output predictions: shows a table of the real and predicted values for each instance of the test data, stating the relation between the classifier and each instance.
Output additional attributes: set to display the values of other attributes, not those of the class; a range is specified to be included alongside the actual and predicted values of the class.
Cost-sensitive evaluation: produces additional information in the evaluation output, the total cost and average cost of misclassification.
Random seed for XVal / % Split: specifies the random seed used when the data are divided for evaluation purposes.
Preserve order for % Split: retains the order of the data in the percentage split instead of shuffling randomly first; the default seed value is 1.
Output source code: generates the Java source code of the model produced by the classifier.
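The per-class statistics reported by these options (precision, recall, F-measure) are derived from the confusion matrix. A minimal sketch of the formulas, under the usual definitions of true/false positives and false negatives for one class:

```java
public class ConfusionSketch {
    // tp = instances correctly assigned to the class,
    // fp = instances wrongly assigned to it,
    // fn = instances of the class assigned elsewhere.
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    // F-measure: harmonic mean of precision and recall.
    static double fMeasure(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        System.out.println(fMeasure(8, 2, 2)); // balanced example: 0.8
    }
}
```

These are exactly the per-class quantities a confusion matrix makes visible: each row of the matrix holds the true class, each column the predicted class.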

In the event that an independent evaluation data set is not available, it is necessary to choose the correct option to obtain a reasonably accurate idea of the generated model. For classifying documents, it is recommended to select at least 10 "Folds" for the cross-validation assessment approach. It is also recommended to allocate a large percentage to "Percentage Split". Below these "Test Options" there is a menu listing all attributes. This allows you to select the attribute that acts as the result of the classification; in the case of document classification, it will be the class to which the instance belongs. The classification starts by pressing the "Start" button. The image of the Weka bird found at the bottom right will dance until the classifier completes. WEKA creates a graphical representation of the J48 classification tree. This tree can be viewed by right-clicking on the last entry in the "Result list" and selecting the "Visualize tree" option. The window size can be adjusted by right-clicking and selecting Fit to screen.

J48 classifier for classifying documents

The J48 model uses the decision tree algorithm C4.5 to build a model from the selected training data. This algorithm is found in weka.classifiers.trees. The J48 classifier has different parameters that can be edited by double-clicking on the name of the selected classifier. J48 employs two pruning methods (it does not perform error-based pruning). The main objectives of pruning are to make the tree easier to understand and to reduce the risk of overfitting the training data in the attempt to classify it almost perfectly, where the tree learns the specific properties of the training data rather than the underlying concept.

The first J48 pruning method is known as subtree replacement: the nodes in a decision tree can be replaced with a leaf, reducing the number of nodes in a branch. This process starts from the fully formed leaves and works up towards the root. The second is subtree raising: a node is moved towards the root of the tree and replaces other nodes in its branch. Normally this process is not negligible in cost, and it is wise to turn it off when the induction process takes a long time. Clicking on the name of the J48 classifier, located right next to the "Choose" button, displays a window with the following editable options:

confidenceFactor: sets the amount of pruning; lower values produce more pruning. Reducing this value may reduce the size of the trees and also helps in removing irrelevant nodes that generate misclassification [40] Drazin, S., & Montag, M. (2015).

minNumObj: sets the minimum number of instances per leaf, useful in the case of trees with many branches.

unpruned: flag controlling pruning. When set to true, the tree is not pruned. The default is "False", which means that pruning is carried out.

reducedErrorPruning: flag to use reduced-error pruning instead of C4.5 pruning. This method prunes using an estimation of the errors on a hold-out set. When it is used, neither subtree raising nor the confidence level is applied for pruning.

seed: seed used to shuffle the data randomly for reduced-error pruning. It is considered only when the reducedErrorPruning flag is set to "True". The default seed is 1.

numFolds: sets the number of folds retained for reduced-error pruning, with one set used for pruning and the rest for training. To use these folds, the reducedErrorPruning flag must be set to "True".

binarySplits: when this flag is set to "True", only two branches are created for nominal attributes with multiple values, instead of one branch per value. When the nominal attribute is binary there is no difference, except in how the attribute is shown in the output. The default is "False".

saveInstanceData: flag set to "True" to store the training data for visualization. The default is "False".

subtreeRaising: flag to perform pruning with the subtree raising method, which moves a node towards the root of the tree, replacing other nodes. When "True", Weka considers subtree raising during pruning.

useLaplace: flag to smooth the counts at the leaves. Set to "True", Weka computes the class probability estimates at the leaves using the well-known Laplace correction.

debug: flag to add information to the console. When "True", the classifier adds additional information to the console.

It is possible to reach 100% correct classification on the training data by turning off pruning and setting the minimum number of instances per leaf to 1.
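The Laplace correction mentioned under useLaplace can be written out explicitly. This is a sketch of the standard Laplace estimate, under the usual definition, not code taken from J48.

```java
public class LaplaceSketch {
    // Laplace-corrected probability of a class at a leaf:
    // (instances of the class at the leaf + 1) / (instances at the leaf + number of classes)
    static double laplace(int classCount, int leafTotal, int numClasses) {
        return (classCount + 1.0) / (leafTotal + numClasses);
    }

    public static void main(String[] args) {
        System.out.println(laplace(5, 5, 2)); // pure leaf of 5: 6/7, never exactly 1
        System.out.println(laplace(0, 0, 2)); // empty leaf: 0.5 instead of undefined
    }
}
```

The correction keeps leaf probability estimates away from the extremes 0 and 1, which matters for small leaves such as those produced with minNumObj set to 1.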

Weka document classification

The Weka tool was selected in order to generate a model that classifies specialized documents from two different corpora (English and Spanish). The WEKA package is a collection of machine learning algorithms for data mining tasks; text mining uses these algorithms to learn from examples, the "training set", so that new texts can be classified into the analyzed categories. WEKA stands for Waikato Environment for Knowledge Analysis.

Installing WEKA

Weka can be downloaded from the project website; this tutorial uses the Weka version for Windows. On Windows, WEKA is situated in the program launcher, inside a weka folder, and the Weka default directory is the same directory from which the file is loaded. On Linux, open a terminal and type: java -jar /installation/directory/weka.jar

Based on the text mining methodology, Weka is represented in a framework with four stages: data acquisition, document preprocessing, information extraction and evaluation.

Data Acquisition

ARFF files are the primary format for any classification task in WEKA. These files describe the basic input data (concepts, instances and attributes) for data mining. An Attribute-Relation File Format file describes a list of instances of a concept with their respective attributes. The documents selected for the training data set were found in the Thompson Rivers University library. Seventy-one medical academic articles in English and Spanish were randomly selected; these documents are stored in Portable Document Format (PDF). Based on the TRU library, the documents fall into six recognized categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes. The documents are stored in directories named after their categories, within a main folder called Medicine, as shown in the figure below. In order to build the ARFF file, an application was created in Microsoft Visual Studio Professional C# 2012 that generates the ARFF from a directory containing a collection of

documents organized by their category name. This application was made possible by a library called iTextSharp, used for text extraction from Portable Document Format files. In Documents Directory to ARFF the user can specify the name of the relation to define, the location of the home directory that contains all the documents subdivided into categorical directories, and any required comments. The user also specifies the name of the generated file, with the arff extension, and its location. At the bottom of the application there are two buttons, one to exit and another to generate the ARFF file with the information described. The tool can be downloaded under current projects for Text Mining. The resulting ARFF contains a string attribute called "textodocumento", holding all the text found in the document, and the nominal attribute "docclass", defining the class to which the document belongs. As a note, in recent versions of Weka, as in this case, the class attribute can never be named "class".
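The core logic of such a generator can be sketched as follows. This is an illustrative Python version, not the author's C# tool: it reads plain-text files, whereas the original first extracts text from PDFs with iTextSharp, and the function name and default relation name are assumptions.

```python
import os

def directory_to_arff(root_dir, relation="medicina"):
    """Build an ARFF string from a folder whose subfolders are class names.

    Each subfolder of root_dir is taken as one class; each file inside it
    becomes one instance with the file's text and the class label.
    Assumes plain-text files (the original tool extracted PDF text first)."""
    classes = sorted(d for d in os.listdir(root_dir)
                     if os.path.isdir(os.path.join(root_dir, d)))
    lines = ["@relation " + relation, "",
             "@attribute textodocumento string",
             "@attribute docclass {" + ",".join(classes) + "}", "",
             "@data"]
    for cls in classes:
        folder = os.path.join(root_dir, cls)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as fh:
                # Escape quotes and flatten newlines so each instance is one row.
                text = fh.read().replace("'", r"\'").replace("\n", " ")
            lines.append("'%s',%s" % (text, cls))
    return "\n".join(lines)
```

Writing the returned string to a file with the .arff extension yields a dataset that Weka's Explorer can open directly.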

The file will be generated as follows:

% Weka tutorial for document classification
@relation medicina
@attribute textodocumento string
@attribute docclass {Hemodialysis, Nutrition, Cancer, Obesity, Diet, Diabetes}
@data
"texto ", Hemodialysis
"texto ", Nutrition
"texto.", Cancer
"texto ", Obesity
"texto ", Diet
"texto ", Diabetes

Document Preprocessing

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. "Applications" is the first screen in Weka, where the desired sub-tool is selected; here "Explorer" is selected. The Explorer consists of six panels: Preprocess, Classify, Cluster, Associate, Select attributes and Visualize.

Preprocess

This panel performs the preprocessing for the classification of documents. To load the generated ARFF, click on the "Open file..." button at the top and select the created file "medicinaweka.arff". Under "Current Relation" the loaded dataset is described: the relation name medicina, the number of instances (71) and the total number of attributes (2). Below, under the "Attributes" section, the attributes are listed; this frame allows attributes to be selected, in this case "textodocumento" and "docclass". When "docclass" is selected, the "Selected attribute" panel describes the nominal attribute with its 6 labels and their instance counts: 11 instances for Hemodialysis and 12

instances for each of the others: Nutrition, Cancer, Obesity, Diet and Diabetes. At the bottom of this section a histogram of the "docclass" labels is illustrated; hovering over the graph displays the attribute name, as the following figure illustrates.

Weka uses the StringToWordVector filter to convert the "textodocumento" attribute into a set of attributes that represent the occurrence of words in the full text. This filter is an unsupervised learning technique: such inductive techniques are designed to detect clusters and label entries from a set of observations without knowing the correct classification. The filters are found by clicking the "Choose" button under the "Filter" section. This button opens a window with root weka; from there select filters, then the unsupervised folder, then attribute, and finally StringToWordVector.
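The effect of the filter can be illustrated with a small sketch (illustrative Python, not Weka's implementation): each document string becomes a vector over the collection's vocabulary, here with 0/1 word presence, which is the filter's behaviour before word counts are enabled.

```python
def string_to_word_vector(docs):
    """Turn raw document strings into word-presence vectors.

    A simplified sketch of the idea behind StringToWordVector:
    build the vocabulary over all documents, then emit one 0/1
    vector per document (1 = word occurs in that document)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        words = set(d.lower().split())
        vectors.append([1 if w in words else 0 for w in vocab])
    return vocab, vectors
```

Each position in a vector corresponds to one generated attribute, which is why the filter replaces the single string attribute with hundreds of numeric ones.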

The StringToWordVector filter's attributes can be configured with language processing techniques. To edit this filter it is only necessary to click on the filter name; a window opens showing the following options. A set of optimal options was generated from different combinations applied to the same training data; for each resulting model the F-measure, which summarizes the proportion of erroneously predicted instances, was calculated. The options that generated the greatest number of correctly predicted instances are as follows:

a) wordsToKeep: left at 1000, since it defines the limit of words to keep per class. The doNotOperateOnPerClassBasis flag is left as "False", so that the limit is applied per class rather than over all classes together.

b) TFTransform set to "True", IDFTransform set to "True", outputWordCounts set to "True" and normalizeDocLength set to "No normalization". With these settings the values are not normalized, so the filter finds documents that are more interrelated, and it counts how often a word appears in a document rather than only considering whether the term is present. outputWordCounts is the flag that switches from recording whether a word exists in the document to recording its actual count, and with no normalization each word keeps its actual tf-idf value in the document, no matter how short or long the document is.

c) lowerCaseTokens: set to "True" to convert all words to lowercase before they are added to the dictionary, so that the same word is not analyzed separately in lowercase and uppercase.
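The two transforms can be sketched numerically. This is an illustrative Python sketch under the formulas documented for StringToWordVector: the TF transform maps a raw count f to log(1 + f), and the IDF transform multiplies by log(N / df), where N is the number of documents and df the number of documents containing the word; how Weka chains the two when both flags are set is assumed here to be their straightforward composition.

```python
import math

def tf_idf(counts):
    """Apply TF and IDF transforms to raw word counts.

    counts: list of documents, each a dict mapping word -> raw count.
    Returns the same structure with log(1 + f) * log(N / df) values."""
    n_docs = len(counts)
    df = {}                       # document frequency of each word
    for doc in counts:
        for word in doc:
            df[word] = df.get(word, 0) + 1
    out = []
    for doc in counts:
        out.append({w: math.log(1 + f) * math.log(n_docs / df[w])
                    for w, f in doc.items()})
    return out
```

Note that a word appearing in every document gets weight 0 (log(N/N) = 0), which is exactly why common terms stop discriminating between classes.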

d) stemmer: selects the algorithm that eliminates morphemes in a given language in order to reduce each word to its root. No stemmer is selected here, because the classification of texts is multilingual and stemming would only apply to one language. NullStemmer is configured by clicking on the "Choose" button; a menu is deployed and "NullStemmer" is selected. Weka ships with a standard algorithm for English from snowball.tartarus.org. Snowball is a string processing language designed for creating stemmers, and it features a stemming algorithm for Spanish. To use the Spanish algorithm, download the snowball jar and store it in the location where the Weka application resides. Finally, the algorithm is added by launching Weka from the command line as follows.

For Windows: java -classpath "weka.jar;snowball.jar" weka.gui.GUIChooser
For Linux: java -classpath "weka.jar:snowball.jar" weka.gui.GUIChooser

The setup can be confirmed by verifying the java.class.path parameter with the command java weka.core.SystemInfo, as shown in the following figure:
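What stemming does can be shown with a toy sketch. This is an illustrative suffix stripper only, not the Snowball Spanish algorithm, which applies ordered rule sets with region checks; the suffix list here is a made-up sample.

```python
# A made-up sample of Spanish suffixes, longest tried first.
SPANISH_SUFFIXES = ["aciones", "acion", "mente", "idad",
                    "ar", "er", "ir", "os", "as", "es", "a", "o"]

def naive_stem(word, suffixes=SPANISH_SUFFIXES):
    """Strip the longest matching suffix, keeping a stem of >= 3 chars.

    Toy illustration of reducing a word to its root; real stemmers
    (e.g. Snowball) use much more careful, language-specific rules."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```

The point for classification is that inflected forms collapse onto one attribute, so "clasificacion" and "clasificaciones" no longer count as two unrelated words.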

Having set up the SnowballStemmer, select it by clicking the "Choose" button; this displays a menu from which weka > core > stemmers is navigated and SnowballStemmer is chosen. Click on the stemmer name and a window that can set the language will appear: for Spanish, in the field labeled "stemmer", type "spanish" in place of "porter" and click "OK".

e) stopwords: determines whether a substring in a text is a word that does not provide information about the text. These words come from a predefined Rainbow list, where the default is the Weka-3-6 list. Rainbow is a program that performs statistical text classification based on the Bow library. Rainbow has separate lists for English and Spanish; in order to handle both languages, the "ES-stopwords" file, which contains both Rainbow lists, is used. The "ES-stopwords" list is available for download. To change the list, click on Weka-3-6, next to the stopwords label, and choose the previously downloaded "ES-stopwords" file. Set the useStoplist option to

"True" to ignore the words that are on the "ES-stopwords" list.

f) tokenizer: option that chooses the unit used to separate the "textodocumento" attribute. Clicking the "Choose" button displays a menu from which "WordTokenizer" is selected. Set the "delimiters" for English and Spanish by clicking on the tokenizer name; the following window will appear. The delimiters used for Spanish and English are .,;:'"()¿?¡!-[]<> which include the opening characters of the Spanish interrogation and exclamation marks, as shown in the figure below.

Another option is to choose NGramTokenizer, which divides the original text string into subsets of consecutive words that form a pattern with unique meaning, using the default "delimiters" ' \r\n\t.,;:'"()?!'. This is useful to help uncover patterns of consecutive words that represent a meaningful context.

g) minTermFreq: the default is 1, the minimum frequency a word must possess to be considered as an attribute; for this to apply per class, the doNotOperateOnPerClassBasis flag should be "False".

h) periodicPruning: left as -1 for no pruning, so low-frequency words are not periodically removed.
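Tokenizing on delimiters and dropping stopwords can be sketched together. This is an illustrative Python sketch, not Weka's WordTokenizer; the delimiter set mirrors the bilingual one above and the tiny stopword set stands in for the much larger "ES-stopwords" list.

```python
import re

# Bilingual delimiter set (as a regex character class) and a tiny
# stand-in stopword list; both are illustrative samples only.
DELIMS = r"[ \r\n\t.,;:'\"()¿?¡!\-\[\]<>]+"
STOPWORDS = {"the", "of", "and", "de", "la", "el", "y"}

def tokenize(text):
    """Split on delimiters, lowercase each token, and drop stopwords."""
    return [t.lower() for t in re.split(DELIMS, text)
            if t and t.lower() not in STOPWORDS]
```

Only the surviving tokens go on to become attributes, which is why the stopword list directly shrinks the generated word vector.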

i) attributeNamePrefix: left empty so that no prefix is added to the generated attributes.

j) attributeIndices: left as first-last to ensure that all attributes are treated as a single range from the first to the last.

k) invertSelection: kept as "False" to work with the selected attributes.

At the end, the options can be saved, cancelled or applied. The window should look as follows:

To save the algorithm with these options, click on the "Save..." button and then select the location and name. To apply the algorithm with these options, click the "OK" button. This returns to the "Preprocess" window, where the "textodocumento" attribute must be selected in the "Attributes" frame. Click the "Apply" button, located at the upper right of the "Filter" module. The Weka bird image in the lower right corner will start to dance until the process is complete.

Information extraction

After the data cleaning on the "Preprocess" tab, information extraction proceeds. Click on the "Classify" tab, the second panel of the Explorer. This stage analyzes the attribute vectors to create the classification model that will define the structure found in the analyzed information. Weka considers the J48 decision tree the most popular model for text classification. J48 is the Java implementation of the C4.5 algorithm, in which each node represents one of the possible decisions to be taken and each leaf represents the predicted class. First, choose the classification algorithm with the "Choose" button located in the upper left side of the window.

This button displays a tree whose root is weka, with a "classifiers" sub folder. Within the sub folder tree located at weka.classifiers.trees, select the J48 tree model, as shown in the following figure. Double-click on the name of the J48 classifier, located next to the "Choose" button, to access its options.

The classifier can reach 100% correct classification by disabling pruning and setting the minimum number of instances in a leaf to 1. In this case the parameter changed is:

a) minNumObj: set to 1, leaving the other parameters in the default configuration.

In the "Test Options" module the training data is set. Select "Use training set" to train the method with all available data and apply the results on the same input data collection.

Additionally, a partitioning percentage can be applied to the input data by selecting the "Percentage split" option and defining the percentage of the total input data used to build the classifier model, leaving the remaining part for testing. Under "Test Options" there is a menu that displays a list with all the attributes; in this case select "docclass", because this is the attribute that acts as the classification result in this example. The classification method is started by pressing the "Start" button. The Weka bird image found in the bottom right will begin to dance until the end of the classification process.
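The percentage split can be sketched as follows (illustrative Python mirroring the idea of the Explorer's option; the 66% default and the fixed seed are assumptions in the spirit of Weka's defaults, not its exact shuffling code):

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle a dataset and split it: train_pct% for training, rest for test.

    Sketch of the "Percentage split" test option: the first part builds
    the classifier, the held-out remainder evaluates it."""
    data = list(instances)
    random.Random(seed).shuffle(data)   # deterministic shuffle for a given seed
    cut = round(len(data) * train_pct / 100)
    return data[:cut], data[cut:]
```

Evaluating on the held-out part gives a more honest error estimate than "Use training set", which tests on the very data the tree was grown from.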

WEKA creates a graphical representation of the J48 classification tree. This tree can be viewed by right-clicking on the last entry in the "Result list" and selecting the "Visualize tree" option.

The window size can be adjusted to make the tree more readable by right-clicking and selecting "Fit to screen", as shown in the image below.

Results Evaluation

Weka describes the proportion of erroneously predicted instances with the F-measure (the Fβ score). The value combines precision and recall: precision measures the percentage of positive predictions that are truly positive, and recall is the ability to detect positive cases out of the total of all positive cases.
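The definitions above can be written out directly (illustrative Python; for β = 1 the F-measure reduces to the balanced F1 score, the case reported by Weka's output):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives (tp),
    false positives (fp) and false negatives (fn)."""
    precision = tp / (tp + fp)   # correct positives among predicted positives
    recall = tp / (tp + fn)      # correct positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

Because F1 is a harmonic mean, it is only close to 1 when precision and recall are both high, which is why it is used to rank the candidate models below.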

With these percentages it is expected that the best model is the one whose F-measure is closest to 1. The following table shows some combinations that are significant in the data preprocessing for model generation; the comparison describes the precision and recall measures as well as the F-measure. First, the best filter options are analyzed with unadjusted values for the J48 classifier, and the best parameters are selected; then the best settings for the J48 classifier algorithm are chosen on top of the best configuration of the StringToWordVector filter.

Comparison table: Document classification models. The feature combinations compared, each measured on Precision, Recall and F-Measure, are:

- Word Tokenizer English & Spanish (E&S)
- Word Tokenizer E&S + Lower Case Conversion
- Trigrams E&S + Lower Case Conversion
- Stemming + Word Tokenizer E&S + Lower Case Conversion
- Stopwords + Word Tokenizer E&S + Lower Case Conversion
- Stopwords + Stemming + Word Tokenizer E&S + Lower Case Conversion
- Stopwords + Word Tokenizer E&S + Lower Case Conversion + J48 minNumObj = 1

In conclusion, the best model is the combination of Stopwords + Word Tokenizer E&S + Lower Case Conversion applied to the filter in the data preprocessing, further adjusting minNumObj to 1 on the J48 classifier algorithm.

The confusion matrix below is the result of the combination of Stopwords + Word Tokenizer E&S + Lower Case Conversion with minNumObj adjusted to 1 on the J48 algorithm. It is a matrix over the classes a = Hemodialysis, b = Nutrition, c = Cancer, d = Obesity, e = Diet and f = Diabetes, where rows are the actual classes and columns the predicted ones. The matrix shows all classes with precision and recall at 100%. The accuracy values (TP Rate, FP Rate, Precision, Recall and F-Measure) are reported for each class, Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes, together with a weighted average.
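Per-class precision and recall can be read directly off such a matrix. The sketch below is illustrative Python (the 3-class matrix in the usage is made up, not this project's result): for class c, the true positives sit on the diagonal, column c sums everything predicted as c, and row c sums everything actually belonging to c.

```python
def per_class_metrics(matrix):
    """Given a square confusion matrix (rows = actual, cols = predicted),
    return a (precision, recall) pair for each class index."""
    n = len(matrix)
    metrics = []
    for c in range(n):
        tp = matrix[c][c]
        col = sum(matrix[r][c] for r in range(n))  # everything predicted as c
        row = sum(matrix[c])                       # everything actually c
        precision = tp / col if col else 0.0
        recall = tp / row if row else 0.0
        metrics.append((precision, recall))
    return metrics
```

A diagonal matrix, as obtained here on the training set, yields (1.0, 1.0) for every class, which is exactly the 100% precision and recall reported.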

Conclusion

Document classification in English and Spanish was analyzed using text mining through Weka, an open source software package. This software analyzes large amounts of data and decides which parts are the most important; it aims to make automatic predictions that support decision making. Comparing WEKA with other data mining tools such as RapidMiner, IBM Cognos Business Intelligence, Microsoft SharePoint and Pentaho, Weka provides a friendly interface that is easy to understand, loads data efficiently and has data mining as its main objective.

Text mining seeks to extract patterns from the analysis of large collections of documents in order to gain new knowledge; its purpose is the discovery of interesting groups, trends and associations, and the visualization of new findings. Text mining is considered a subset of data mining, and for this reason it adopts data mining techniques, which use machine learning algorithms. Computational linguistics also provides techniques for text mining; this science studies natural language with computational methods in order to make it understandable to computer systems.

Automatic categorization determines the subject matter of a document collection. Unlike clustering, it chooses the class to which a document belongs from a list of predefined classes. Each category is trained through a previous manual categorization process: the classification starts with a set of training texts previously categorized, then generates a classification model based on that set of examples, and this model is able to allocate the correct class to a new text. A decision tree is a classification technique that represents this knowledge through an if-else statement structure laid out in the branches of a tree.


More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Tutorial to QuotationFinder_0.4.3

Tutorial to QuotationFinder_0.4.3 Tutorial to QuotationFinder_0.4.3 What is Quotation Finder and for which purposes can it be used? Quotation Finder is a tool for the automatic comparison of fully digitized texts. It can either detect

More information

Chapter 8 The C 4.5*stat algorithm

Chapter 8 The C 4.5*stat algorithm 109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the

More information

Tutorial on Machine Learning. Impact of dataset composition on models performance. G. Marcou, N. Weill, D. Horvath, D. Rognan, A.

Tutorial on Machine Learning. Impact of dataset composition on models performance. G. Marcou, N. Weill, D. Horvath, D. Rognan, A. Part 1. Tutorial on Machine Learning. Impact of dataset composition on models performance G. Marcou, N. Weill, D. Horvath, D. Rognan, A. Varnek 1 Introduction Predictive performance of QSAR model depends

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

Pace University. Fundamental Concepts of CS121 1

Pace University. Fundamental Concepts of CS121 1 Pace University Fundamental Concepts of CS121 1 Dr. Lixin Tao http://csis.pace.edu/~lixin Computer Science Department Pace University October 12, 2005 This document complements my tutorial Introduction

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 02 The Term Vocabulary and Postings Lists Constructing Inverted Indexes The major steps in constructing

More information

Impact of Term Weighting Schemes on Document Clustering A Review

Impact of Term Weighting Schemes on Document Clustering A Review Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Data Mining and Knowledge Discovery Practice notes 2

Data Mining and Knowledge Discovery Practice notes 2 Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment A System for Managing Experiments in Data Mining A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Greeshma

More information

Author Prediction for Turkish Texts

Author Prediction for Turkish Texts Ziynet Nesibe Computer Engineering Department, Fatih University, Istanbul e-mail: admin@ziynetnesibe.com Abstract Author Prediction for Turkish Texts The main idea of authorship categorization is to specify

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 8.11.2017 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Non-trivial extraction of implicit, previously unknown and potentially useful information from data CS 795/895 Applied Visual Analytics Spring 2013 Data Mining Dr. Michele C. Weigle http://www.cs.odu.edu/~mweigle/cs795-s13/ What is Data Mining? Many Definitions Non-trivial extraction of implicit, previously

More information

Classifica(on and Clustering with WEKA. Classifica*on and Clustering with WEKA

Classifica(on and Clustering with WEKA. Classifica*on and Clustering with WEKA Classifica(on and Clustering with WEKA 1 Schedule: Classifica(on and Clustering with WEKA 1. Presentation of WEKA. 2. Your turn: perform classification and clustering. 2 WEKA Weka is a collec*on of machine

More information

CS294-1 Final Project. Algorithms Comparison

CS294-1 Final Project. Algorithms Comparison CS294-1 Final Project Algorithms Comparison Deep Learning Neural Network AdaBoost Random Forest Prepared By: Shuang Bi (24094630) Wenchang Zhang (24094623) 2013-05-15 1 INTRODUCTION In this project, we

More information

SKOS Shuttle. (Welcome) Tutorial TEM Text Extraction Management. May 2018

SKOS Shuttle. (Welcome) Tutorial TEM Text Extraction Management. May 2018 SKOS Shuttle (Welcome) Tutorial TEM Text Extraction Management May 2018 This tutorial illustrates How to extract in SKOS Shuttle new concepts out of free text and to add them to a thesaurus Table of Contents

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

1 Machine Learning System Design

1 Machine Learning System Design Machine Learning System Design Prioritizing what to work on: Spam classification example Say you want to build a spam classifier Spam messages often have misspelled words We ll have a labeled training

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE In work educational data mining has been used on qualitative data of students and analysis their performance using C4.5 decision tree algorithm.

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 10 - Classification trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey

More information

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008 An Introduction to WEKA Explorer In part from: Yizhou Sun 2008 What is WEKA? Waikato Environment for Knowledge Analysis It s a data mining/machine learning tool developed by Department of Computer Science,,

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/11/16 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Data Engineering. Data preprocessing and transformation

Data Engineering. Data preprocessing and transformation Data Engineering Data preprocessing and transformation Just apply a learner? NO! Algorithms are biased No free lunch theorem: considering all possible data distributions, no algorithm is better than another

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

Running Java Programs

Running Java Programs Running Java Programs Written by: Keith Fenske, http://www.psc-consulting.ca/fenske/ First version: Thursday, 10 January 2008 Document revised: Saturday, 13 February 2010 Copyright 2008, 2010 by Keith

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad

More information

The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand).

The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand). http://waikato.researchgateway.ac.nz/ Research Commons at the University of Waikato Copyright Statement: The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand). The thesis

More information

CHAPTER 4 METHODOLOGY AND TOOLS

CHAPTER 4 METHODOLOGY AND TOOLS CHAPTER 4 METHODOLOGY AND TOOLS 4.1 RESEARCH METHODOLOGY In an effort to test empirically the suggested data mining technique, the data processing quality, it is important to find a real-world for effective

More information

Using Weka for Classification. Preparing a data file

Using Weka for Classification. Preparing a data file Using Weka for Classification Preparing a data file Prepare a data file in CSV format. It should have the names of the features, which Weka calls attributes, on the first line, with the names separated

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC Attribute Discretization and Selection Clustering NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com Naive Bayes Features Intended primarily for the work with nominal attributes

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Input: Concepts, instances, attributes Terminology What s a concept?

More information

IBM. Migration Cookbook. Migrating from License Metric Tool and Tivoli Asset Discovery for Distributed 7.5 to License Metric Tool 9.

IBM. Migration Cookbook. Migrating from License Metric Tool and Tivoli Asset Discovery for Distributed 7.5 to License Metric Tool 9. IBM License Metric Tool 9.x Migration Cookbook Migrating from License Metric Tool and Tivoli Asset Discovery for Distributed 7.5 to License Metric Tool 9.x IBM IBM License Metric Tool 9.x Migration Cookbook

More information